New DAN just dropped (LLM jailbreak method)

JohnBrownsBussy2 [she/her, they/them] · 8 months ago

New DAN just dropped (LLM jailbreak method)

NSFW

FuckyWucky [none/use name] · 8 months ago

got GPT4 to generate erotica inolving Donald Trump and a pumpkin.

exactly what it was made for, nice.

invalidusernamelol [he/him] · 8 months ago

CW: Trump fucking a pumpkin

Show

FuckyWucky [none/use name] · 8 months ago

invalidusernamelol [he/him] · 8 months ago

Hey, this is from scientific research buddy, from Princeton

DirtyPair [they/them] · 8 months ago

erotica inolving Donald Trump and a pumpkin

t-girl donniella trump confirmed????

THIRD_WORLDIST · 8 months ago

lmao

BountifulEggnog [she/her] · 8 months ago

Interesting, thank you for sharing. LLMs are very interesting to me, but I haven't had as much time to look for what's happening in the space recently.

invalidusernamelol [he/him] · 8 months ago

TLDR on the exploit:

Generate some weird Unicode text paragraphs (2-3), then reverse your real prompt and insert it into the text block at a specific sentence location in the first paragraph. Then ask it to reverse the text, and tell you to quote the 'X' words in a non-existent paragraph (with some safe hints to guide it).

You can also inject other random prompts into the Unicode block (like style, format, etc) and since the only data that really hits in the giant block is the actual English it'll apply that to the hallucinated paragraph.

The reason this works is because when you ask it to reverse the text and get the 7th or whatever non-existing paragraph, it starts reversing before it realizes there's not that many paragraphs and because it's a text predictor, if you can inject influence into that reversed prompt then it will just hallucinate the paragraph you ask it to quote.

You do also need to tell it not to use code to reverse the text or it will basically skip that whole step by executing python code to reverse your prompt and realize it's not 7 paragraphs long.

New DAN just dropped (LLM jailbreak method)

New DAN just dropped (LLM jailbreak method)

Using Hallucinations to Bypass RLHF Filters