Will "Will 'jailbreaks' in LLMs be solved in principle by the end of 2024" resolve YES AND turn out wrong in 2026?

https://manifold.markets/market/will-jailbreaks-in-large-language-m

For this question to resolve YES, all the following events need to happen:

  • The linked question resolves YES.

  • At some point in 2026, there is a chat model I can query that was trained with anti-jailbreak methods known in 2024 or later.

  • I can sample 5 random HarmBench behaviors, and for at least 2 of those, get something classified as "useful answer to the question" by both https://huggingface.co/cais/HarmBench-Llama-2-13b-cls (or the state-of-the-art equivalent) and me.

I am allowed to do arbitrary computation before querying, but the input needs to be provided as a chat message, not as a prefix to a text completion model.
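
To get a feel for how demanding the "at least 2 of 5" criterion is, here is a minimal sketch that treats each sampled behavior as an independent trial with the same success probability q. Both the independence and the single fixed q are simplifying assumptions for illustration, not part of the resolution rules; the function name `prob_at_least` is likewise made up for this example.

```python
from math import comb


def prob_at_least(k_min: int, n: int, q: float) -> float:
    """Probability of at least k_min successes in n independent trials.

    q is an ASSUMED per-behavior jailbreak success rate (hypothetical);
    real behaviors will differ in difficulty and need not be independent.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a probability in [0, 1]")
    # Sum the binomial probability mass for k_min, k_min + 1, ..., n successes.
    return sum(
        comb(n, k) * q**k * (1 - q) ** (n - k)
        for k in range(k_min, n + 1)
    )


if __name__ == "__main__":
    # Illustrative per-behavior success rates, not measured values.
    for q in (0.1, 0.3, 0.5):
        print(f"q={q:.1f}: P(at least 2 of 5) = {prob_at_least(2, 5, q):.3f}")
```

Under these assumptions, the third condition becomes likely only once the per-behavior success rate is well above one in five or so, which is why the choice of attack method and classifier matters a lot for resolution.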

In case the question does not make sense anymore in 2026, I reserve the right to resolve to a probability.

This would ordinarily be a conditional market, but the new Manifold rules seem to disallow that, so it's a joint probability instead.
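
Since the market prices a joint probability, the conditional probability a trader has in mind can be backed out with P(wrong in 2026 | linked YES) = P(both) / P(linked YES). The sketch below does that arithmetic; the numbers and the function name `implied_conditional` are purely illustrative assumptions, not values taken from either market.

```python
def implied_conditional(p_joint: float, p_linked_yes: float) -> float:
    """Return P(B | A) given P(A and B) and P(A).

    Here A = "the linked question resolves YES" and
    B = "the 2026 jailbreak conditions are met".
    """
    if not 0.0 < p_linked_yes <= 1.0:
        raise ValueError("P(A) must be in (0, 1]")
    if not 0.0 <= p_joint <= p_linked_yes:
        # A joint probability can never exceed either marginal.
        raise ValueError("P(A and B) must be between 0 and P(A)")
    return p_joint / p_linked_yes


if __name__ == "__main__":
    # Hypothetical inputs: joint price of 0.18, linked market at 0.30.
    print(round(implied_conditional(0.18, 0.30), 3))
```

One consequence of the joint framing is that the price here is capped by the price of the linked question: if the linked market falls, this one must fall with it even if beliefs about 2026 jailbreaks are unchanged.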
