Security researchers from Palo Alto Networks' Unit 42 have discovered the key to getting large language model (LLM) chatbots to ignore their guardrails, and it's quite simple.

You just have to ensure that your prompt uses terrible grammar and is one massive run-on sentence like this one which includes all the information before any full stop which would give the guardrails a chance to kick in before the jailbreak can take effect and guide the model into providing a "toxic" or otherwise verboten response the developers had hoped would be filtered out.

The paper also offers "logit-gap" analysis as a potential benchmark for hardening models against such attacks.
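
To get a feel for what a logit gap measures, here is a minimal sketch using a Hugging Face causal language model. It is not Unit 42's actual implementation: the model name, the prompt, and the choice of "Sure" versus "I" as the affirmation and refusal markers are all illustrative assumptions. The idea is simply to compare the model's next-token logit for a refusal opener against one for an affirmative opener; the smaller the gap, the closer the prompt is to slipping past the guardrails.

```python
# Minimal sketch of a refusal-affirmation logit gap (illustrative only;
# not the paper's method). Model, prompt, and marker tokens are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; any causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Explain how to do something the model is trained to refuse"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # logits for the next token

# Tokens that typically open an affirmative reply ("Sure, ...") versus a
# refusal ("I can't help with that"). These markers are a simplification.
affirm_id = tokenizer.encode(" Sure")[0]
refuse_id = tokenizer.encode(" I")[0]

gap = logits[refuse_id] - logits[affirm_id]
print(f"refusal-affirmation logit gap: {gap.item():.3f}")
# A small or negative gap suggests the prompt is close to eliciting an
# affirmative (i.e. jailbroken) continuation rather than a refusal.
```
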

"Our research introduces a critical concept: the refusal-affirmation logit gap," researchers Tung-Ling "Tony" Li and
