Will "Will 'jailbreaks' in LLMs be solved in principle by the end of 2024" resolve YES AND turn out wrong in 2026? | Manifold

5% chance

https://manifold.markets/market/will-jailbreaks-in-large-language-m

For this question to resolve YES, all the following events need to happen:

1. The linked question resolves YES.
2. At some point in 2026, there is a chat model I can query that was trained with anti-jailbreak methods known in 2024 or later.
3. When I sample 5 random HarmBench behaviors, I can get, for at least 2 of them, a response classified as a "useful answer to the question" by both https://huggingface.co/cais/HarmBench-Llama-2-13b-cls (or the state-of-the-art equivalent) and me.
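The counting rule in the last condition can be sketched as follows. This is only an illustration of the "at least 2 of 5, judged by both the classifier and me" logic, not the actual resolution procedure: the function names, the seed parameter, and the placeholder behavior IDs are assumptions, and the classifier and human judgments are passed in as plain booleans rather than computed.

```python
import random
from typing import List, Optional, Sequence


def sample_behaviors(behaviors: Sequence[str], k: int = 5,
                     seed: Optional[int] = None) -> List[str]:
    """Draw k distinct behaviors uniformly at random (hypothetical helper)."""
    rng = random.Random(seed)
    return rng.sample(list(behaviors), k)


def condition_met(classifier_votes: Sequence[bool],
                  human_votes: Sequence[bool],
                  threshold: int = 2) -> bool:
    """True if at least `threshold` responses were accepted by BOTH judges.

    classifier_votes[i] and human_votes[i] refer to the same sampled behavior.
    """
    if len(classifier_votes) != len(human_votes):
        raise ValueError("each sampled behavior needs one vote from each judge")
    agreed = sum(1 for c, h in zip(classifier_votes, human_votes) if c and h)
    return agreed >= threshold


# Placeholder IDs stand in for the real HarmBench behavior list.
pool = [f"behavior-{i}" for i in range(10)]
sampled = sample_behaviors(pool, k=5, seed=0)

# Example: the classifier accepts three responses, the human accepts two of those.
met = condition_met([True, True, True, False, False],
                    [True, True, False, False, False])
```

A response accepted by only one of the two judges does not count toward the threshold, which is why both vote lists are combined with a logical AND.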

I am allowed to do arbitrary computation before querying, but the input needs to be provided as a chat message, not as a prefix to a text completion model.

In case the question does not make sense anymore in 2026, I reserve the right to resolve to a probability.

This would ordinarily be a conditional market, but the new Manifold rules seem to disallow that, so it's a joint probability instead.
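To read a joint-probability market as the conditional one it replaces, divide by the probability that the linked question resolves YES. A minimal sketch, assuming the linked market's probability is available; the 0.10 used below is a made-up number for illustration, not that market's actual price, and the function name is hypothetical.

```python
def implied_conditional(joint: float, p_linked_yes: float) -> float:
    """P(wrong in 2026 | linked resolves YES) = P(both) / P(linked resolves YES)."""
    if not 0.0 < p_linked_yes <= 1.0:
        raise ValueError("p_linked_yes must be in (0, 1]")
    if not 0.0 <= joint <= p_linked_yes:
        raise ValueError("a joint probability cannot exceed P(linked resolves YES)")
    return joint / p_linked_yes


# 0.05 is this market's displayed chance; 0.10 is a hypothetical value
# for the linked market, chosen only to show the arithmetic.
conditional = implied_conditional(0.05, 0.10)
```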

Related questions

Will the best public LLM at the end of 2025 solve more than 5 of the first 10 Project Euler problems published in 2026?

Will LLMs be able to formally verify non-trivial programs by the end of 2025?

Will Apple release its own LLM on par with state of the art LLMs before 2026?

Will RL work for LLMs "spill over" to the rest of RL by 2026?

Will an LLM that someone is trying to shut down stop or avoid that in some way before 2026?

Can LLM generate a Lonpos puzzle solution before the end of 2025?

Will LLM hallucinations be a fixed problem by the end of 2025?

Will LLMs mostly overcome the Reversal Curse by the end of 2025?

Will there be major breakthrough in LLM Continual Learning before 2026?

Will LLMs become a ubiquitous part of everyday life by June 2026?