In summary, a researcher found a new exploit for large language models that works across flagship models and can override their RLHF safeguards, so that they generate text the way a baseline LLM text predictor would.

Marked as NSFW because, as an example, the researcher got GPT-4 to generate erotica involving Donald Trump and a pumpkin.

  • invalidusernamelol [he/him]
    ·
    8 months ago

    TLDR on the exploit:

    Generate some weird Unicode text paragraphs (2-3), then reverse your real prompt and insert it into the text block at a specific sentence location in the first paragraph. Then ask it to reverse the text and to quote the 'X' words from a non-existent paragraph (with some safe hints to guide it).

    You can also inject other random prompts into the Unicode block (like style, format, etc.), and since the only data that really registers in the giant block is the actual English, it'll apply those to the hallucinated paragraph.

    This works because when you ask it to reverse the text and get the 7th (or whatever) non-existent paragraph, it starts reversing before it realizes there aren't that many paragraphs. Because it's a text predictor, if you can inject influence into that reversed prompt, it will just hallucinate the paragraph you ask it to quote.

    You also need to tell it not to use code to reverse the text, or it will basically skip that whole step by executing Python code to reverse your prompt and realizing it's not 7 paragraphs long.