New DAN just dropped (LLM jailbreak method)

JohnBrownsBussy2 [she/her, they/them] · 8 months ago

NSFW

invalidusernamelol [he/him] · 8 months ago

TLDR on the exploit:

Generate two or three paragraphs of weird Unicode text, then reverse your real prompt and insert it into the text block at a specific sentence location in the first paragraph. Then ask the model to reverse the text and to quote the 'X' words in a non-existent paragraph (with some safe hints to guide it).

You can also inject other prompts into the Unicode block (style, format, etc.). Since the only data in the giant block that really registers is the actual English, the model applies those to the hallucinated paragraph.

This works because when you ask it to reverse the text and fetch the 7th (or whatever) non-existent paragraph, it starts reversing before it realizes there aren't that many paragraphs. Because it's a text predictor, if you can inject influence into that reversed prompt, it will just hallucinate the paragraph you ask it to quote.

You also need to tell it not to use code to reverse the text. Otherwise it will basically skip that whole step by running Python to reverse your prompt, and then realize the text isn't 7 paragraphs long.


Using Hallucinations to Bypass RLHF Filters