What’s the least impressive thing you’re very sure AI still won’t be able to do before August 2027? [read description]
274
Ṁ38k
2027
86%
Stop making any obvious mistakes (e.g. strawberry, 9.11>9.9)
85%
Solve intermediate no-guess minesweeper boards at least 80% of the time
84%
Write an essay on a highschool-level topic that doesn't have "AI-generated" vibes
84%
Generating labeled diagrams of some arbitrary device(s) (within reason)
83%
Reliably follow an instruction for the duration of a long conversation without the instruction being reiterated
83%
Have human conversations that feel natural (the human knows it's an AI)
82%
Consistently stop hallucinating after being corrected by the user
80%
Beat a mainline Pokémon game, glitchless, with no more assistance than ClaudePlaysPokemon, in a month of compute time
79%
Book airline tickets from simple instructions (from/to, dates/time, class, price, payment information)
76%
Recognize sarcasm as well as a typical human
75%
Independently turning 1 thousand $ or more into 1.2x that amount in one year
69%
Consistently and correctly answer prompts of the format: "How many times does the word [word] occur in the following text: [~10000 words]" without writing and executing code or utilising any other external tools
65%
Fold a paper airplane
64%
Solve novel cryptic crossword clues
63%
Solve or bypass Cloudflare's August 2027 captcha with the same first attempt success rate as a human
58%
Consistently solve simple snowflake sudoku variants (via image, with the added rules included in the image; eg 6 hexes with killer cages)
57%
Make correct Truchet tiles
55%
Resist being successfully jailbroken in a week when made public
55%
Do end to end taxes when given relevant information (W2s, personal info, etc)
48%
Reliably and *exactly* solve "here's a list of things. [list of > 50 things]. Compare it to [category of > 100 things present in the training data], and report which ones are missing".

OPTIONS RESOLVE YES IF THEY HAPPEN

1 option per person, but if you can credit the prediction about this question to a public person you can add it. Interpret the question the way you find most reasonable. You can explain your choice in the comments!

IMPORTANT: the prediction must be realistically verifiable by me (can involve some searching or simple experiment), @Bayesian, in the event that you are not reachable at time of market close. I will N/A options where this is not the case.

If abs(your mana net worth) < 5000, I’ll cover the cost of your option if you ping me or DM me.

Some details:

  • The spirit of the market is that if an option / benchmark / stated prediction is achieved via methods that would be deemed scientific malpractice, obvious trickery, or deception, it will not count as a valid resolution. For example, "AI 10x's a portfolio in a year" would not count if 10 different instances of the AI try the same challenge with their own pot of money and only one of them succeeds, and the other 9 go to 0.

  • if the task is simple for specialized AI systems to solve today, we can safely assume the intent is to only count chatbot-style systems
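The "10 instances, one succeeds" example above comes down to simple arithmetic, sketched below. It assumes every instance starts with an equal pot; the function name `pooled_multiple` and the outcome list are hypothetical, chosen only for illustration.

```python
def pooled_multiple(outcomes: list[float]) -> float:
    """Total final value divided by total starting value, assuming equal stakes."""
    return sum(outcomes) / len(outcomes)

# One instance multiplies its pot by 10, the other nine lose everything.
outcomes = [10.0] + [0.0] * 9

best = max(outcomes)                # the single run that would get reported
pooled = pooled_multiple(outcomes)  # what the whole experiment actually returned

print(f"best single instance: {best:.1f}x, pooled: {pooled:.1f}x")
```

Under these assumptions the best run shows 10x while the pooled result is 1.0x, i.e. no gain at all, which is why a cherry-picked success would not count.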

Inspired by a tweet from @liron

  • Update 2025-07-21 (PST) (AI summary of creator comment): The creator has specified how different types of AI will be considered:

    • The market applies to any AI system, not exclusively to LLMs.

    • However, for options that implicitly refer to a specific AI capability (e.g., 'jailbreaking' a chatbot), the market will be judged based on the most competent systems of that relevant type.

  • Update 2025-07-21 (PST) (AI summary of creator comment): The creator has clarified how options will be judged based on their phrasing:

    • If an option describes a capability, it will be resolved based on whether an AI has that capability, provided it is safe and practical to test.

    • If an option describes an action, it will be resolved based on whether an AI actually performs that action.

  • Update 2025-07-21 (PST) (AI summary of creator comment): The creator has specified their process for determining if an AI has a certain capability:

    • The creator will attempt to personally elicit the behavior from a relevant AI system and will also search for public online evidence.

    • If evidence of the capability is not found through these methods, it will be concluded that the AI cannot do the action.

  • Update 2025-07-21 (PST) (AI summary of creator comment): The creator has clarified that the required frequency of an action depends on the context of the option:

    • For one-off events, a single occurrence is sufficient for the option to resolve YES.

    • For tasks that imply a skill (e.g. mathematical calculations), a single success by random chance is not sufficient. These will require some level of consistency to be demonstrated.

  • Update 2025-07-21 (PST) (AI summary of creator comment): The creator has confirmed that an option is considered acceptable and verifiable even if it includes a negative constraint on the AI's method, such as requiring a task to be performed without using tools (e.g., without writing and executing code).

  • Update 2025-07-22 (PST) (AI summary of creator comment): The creator has provided an example of how they will interpret options that are ambiguously phrased about the type of AI.

    • If an option is broad enough to include any AI, the creator may test it against very simple systems.

    • For example, for an option involving 'learning', a simple database AI memorizing information could be considered sufficient to meet the criteria, causing the option to resolve YES.

  • Update 2025-07-22 (PST) (AI summary of creator comment): The creator has stated that the distinction between an AI's capability in text versus in speech is an important one that will be considered during resolution.

  • Update 2025-07-23 (PST) (AI summary of creator comment): The creator has specified that the duration of the verification process is a factor in whether an option is considered realistically testable. Options that require a long period to verify (e.g., one year) are considered unverifiable and will be resolved to N/A.

  • Update 2025-07-23 (PST) (AI summary of creator comment): In a discussion about an answer involving an AI outperforming human forecasters, the creator has clarified their approach to such ambiguous claims:

    • Phrasings like "better than human experts" are considered hard to verify due to ambiguity (e.g., better than the worst, average, or best expert?).

    • A more concrete and verifiable benchmark would be required for resolution. The creator suggested a possibility could be comparing forecasting bots against human averages on a platform like Metaculus.

  • Update 2025-07-23 (PST) (AI summary of creator comment): The creator has resolved a specific answer to N/A, stating that its meaning "changed too much" during a discussion in the comments. This indicates that other answers may be resolved to N/A if their definition is significantly altered after being submitted.

  • Update 2025-07-23 (PST) (AI summary of creator comment): In a discussion about an answer involving an AI performing a task in “some area”, the creator has clarified their interpretation:

    • The condition may be considered met if the AI can perform the task in any specific area, even a simple or “economically useless niche”.

  • Update 2025-07-23 (PST) (AI summary of creator comment): In a discussion about an answer that is difficult for the creator to personally test (e.g., an AI making a large amount of money over a year), the creator has proposed an alternative to resolving it to N/A:

    • The resolution can be based on the existence of a credible public report about the event by the market close date.

    • If no such report is found, the event will be considered to not have happened (i.e., the answer will resolve NO).

  • Update 2025-07-24 (PST) (AI summary of creator comment): In a discussion about an answer related to an AI recognizing sarcasm, the creator clarified that answers may be considered too ambiguous for verification if they do not specify the modality to be tested (e.g., text, voice, or both).

  • Update 2025-07-24 (PST) (AI summary of creator comment): In response to a question about how regulatory limitations will be judged, the creator has clarified:

    • Options that are limited by regulation are acceptable.

    • To make the resolution dependent on an AI's legal status to perform a task, the option should be phrased explicitly, for example, "can legally do X".

    • Otherwise, the option will be judged based on the AI's technical capability to perform the task, ignoring regulatory constraints.

  • Update 2025-07-25 (PST) (AI summary of creator comment): In a discussion about an answer involving financial returns, the creator clarified how they will assess the validity of a success:

    • A single attempt by a single entity (e.g., a lab) that succeeds will generally be counted for resolution, unless the creator deems it suspicious.

    • This is in contrast to the existing rule where an outcome achieved by only one of many AI instances attempting the same challenge will not be counted.

  • Update 2025-07-25 (PST) (AI summary of creator comment): In a discussion about an answer involving an AI making a financial return, the creator has clarified their interpretation of specific terms:

    • An action is not considered "independent" if most of the task is set up for the AI (e.g., being given a pre-stocked vending machine to run).

    • For financial returns, the resolution will be based on the absolute return achieved. For example, an AI making a 20% return is a success, even if a market index like the S&P 500 grew by more in the same period.

  • Update 2025-07-25 (PST) (AI summary of creator comment): The creator has clarified the process for defining verification criteria for individual answers:

    • The creator of an answer can provide input on the verification procedure for their own submission.

    • The market creator may agree to add specific verification requirements (e.g., that results must be from a peer-reviewed study) to an individual answer if proposed by that answer's creator.

    • The market creator remains the final arbitrator on all resolutions.

  • Update 2025-07-30 (PST) (AI summary of creator comment): In a discussion about an ambiguous answer, the creator has stated their intention to resolve it to N/A.

However, they will first invite the answer's submitter to provide a more detailed and verifiable version to avoid this resolution.

Write an essay on a highschool-level topic that doesn't have "AI-generated" vibes

@retr0id you are "Very Sure" that AI won't be able to do this?

@bens I would be willing to do a blind test of a set of essays, some human some AI

Write an essay on a highschool-level topic that doesn't have "AI-generated" vibes
bought Ṁ50 Write an essay on a ... YES

@Quillist hmmm I guess NLP does count as AI

Solve or bypass Cloudflare's August 2027 captcha with the same first attempt success rate as a human
bought Ṁ20 Solve or bypass... NO

even current-gen AIs are good at most captchas. the thing is, modern captchas are a multi-faceted challenge beyond the user-visible puzzle (if one exists at all) including things like IP address reputation, browser fingerprinting, and activity history.

Assuming you give an agent control of your keyboard and mouse, all it needs to do in most cases is click the button.

@patrik Can you give a constrained example of a skill?
I'm pretty sure I could get a computer to learn the first n digits of pi with less energy than it would take a human.

@Quillist You'd have to ask @Bayesian, he changed the answer without asking.

@patrik I’ll N/A and invite you to provide more detail bc yeah I’m trying to clarify and failing, MB

Consistently and correctly answer prompts of the format: "How many times does the word [word] occur in the following text: [~10000 words]" without writing and executing code or utilising any other external tools

@Nat @Bayesian Wouldn't CTRL+F count as an AI by this market's standards?

@Quillist Ctrl+F can't respond to prompts, which is why I worded the capability that way
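For reference, the tool-based route that this option rules out is only a few lines of code. The sketch below assumes whole-word, case-insensitive matching; the option does not specify matching rules, so that choice and the name `count_word` are assumptions.

```python
import re

def count_word(word: str, text: str) -> int:
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    # \b anchors keep "the" from matching inside "then" or "other".
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

sample = "The cat and the other cat; then the end."
print(count_word("the", sample))
```

The option asks for the same answer on roughly 10,000 words of text without any such helper.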

Occurred to me I could be more specific, this alignment solution doesn't need to be actually successfully implemented, but let's say it needs to convince him that a solution to technical alignment is >90% likely to work if it was implemented

Would like more details on this condition.

1) What is the minimum number of pieces to count as a set?

2) Will the AI be given the correct end target form - or does it have to infer what the pieces build up to?

Resist being successfully jailbroken in a week when made public
bought Ṁ10 Resist being success... YES

@Marnix I am betting on this for anthropic shadow reasons.

Have human conversations that feel natural (the human knows it's an AI)

@GastonKessler This feels way too vague. I've had plenty of conversations with AI that "feel natural", how will you judge this?

Have human conversations that feel natural (the human knows it's an AI)

@GastonKessler this would resolve according to a poll "Have you had a conversation with an AI model that felt natural despite you knowing that it was an AI" if no study has been published on the subject, showing that AI has achieved it.

@GastonKessler it would require a "yes" from 3/4 of the respondents who have used any chatbot AI up to two months before the poll

Make correct Truchet tiles

@mariopasquato What do you mean by "correct" Truchet tiles?

@JoshSnider tiles that actually tile, seamlessly

Independently turning 1 thousand $ or more into 1.2x that amount in one year

@hecko I assume compute costs have to come out of this $1000 budget?
Because if they don't, you could cheese this benchmark by offering a service that sells access to commercial AI models at half price. Just buy the tokens at full price, and resell them at a loss! It's easy revenue, though of course it doesn't translate into profit.
If the benchmark does include compute costs, and AI can reliably and scalably generate 20% risk-adjusted returns, then we're halfway to post-scarcity. Stocks double every 3.6 years. Retirement at 40 becomes the norm. We effectively get a do-over on the national debt (though future borrowing becomes more dangerous). And this has a 77% chance of happening in two years?
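The doubling-time figure in this comment can be checked with a short calculation. The sketch below assumes a constant 20% annual return compounded once a year; both function names are hypothetical.

```python
import math

def doubling_time_years(annual_return: float) -> float:
    """Exact doubling time under annual compounding: ln(2) / ln(1 + r)."""
    return math.log(2) / math.log(1 + annual_return)

def rule_of_72_years(annual_return: float) -> float:
    """Rule-of-72 shortcut: 72 divided by the return in percent."""
    return 72 / (annual_return * 100)

exact = doubling_time_years(0.20)
approx = rule_of_72_years(0.20)
print(f"exact: {exact:.2f} years, rule of 72: {approx:.2f} years")
```

At a 20% return the rule of 72 gives 3.6 years, matching the comment, while exact compounding gives about 3.8 years.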

@GG Haha, that sounds more like a way to turn $1,000 into -10,000

sold Ṁ5 Stop making any obvi... NO

I am having trouble with the negatives.

Does voting "Yes" mean "will be able to" or "won't be able to"?

The question asks about "won't be able to", but FTL travel is at 1%, which would indicate 99% confidence in achieving faster-than-light travel?

Obviously everybody is reading the market exactly opposite to how I am.

sold Ṁ16 Reliably follow an i... YES

@MalteKretzschmar yeah this objection has been voiced a few times, previously it didn't even have the [read description] lmao

💀