Updates - KnowledgeHub

Jacob T. wants to read Universal and Transferable Adversarial Attacks on Aligned Language Models by Andy Zou

Universal and Transferable Adversarial Attacks on Aligned Language Models (2023, arXiv)


Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, …

I skimmed the top-level summary when it came out, but it appears well worth a deeper read.
