What’s the least impressive thing you’re very sure AI still won’t be able to do before August 2027? [read description]

284

Ṁ42k

2027

87%

Have human conversations that feel natural (the human knows it's an AI)

86%

Book airline tickets from simple instructions (from/to, dates/time, class, price, payment information)

85%

Beat a mainline Pokémon game, glitchless, with no more assistance than ClaudePlaysPokemon, in a month of compute time

82%

Independently turning 1 thousand $ or more into 1.2x that amount in one year

78%

Recognize sarcasm as well as a typical human

78%

Write an essay on a highschool-level topic that doesn't have "AI-generated" vibes

73%

Consistently and correctly answer prompts of the format: "How many times does the word [word] occur in the following text: [~10000 words]" without writing and executing code or utilising any other external tools

72%

Name every [metro system] station whose name contains/doesn't contain [letter or letters], with >95% accuracy (excluding weird edge cases like stations with multiple names)

71%

Stop making any obvious mistakes (e.g. strawberry, 9.11>9.9)

68%

Generating labeled diagrams of some arbitrary device(s) (within reason)

67%

Solve novel cryptic crossword clues

67%

Do end to end taxes when given relevant information (W2s, personal info, etc)

66%

Reliably follow an instruction for the duration of a long conversation without the instruction being reiterated

66%

Consistently solve simple snowflake sudoku variants (via image, with the added rules included in the image; eg 6 hexes with killer cages)

65%

Write a somewhat original, full length, screen-play with a coherent story, with no plot or continuity errors.

63%

1d Solve or bypass Cloudflare's August 2027 captcha with the same first attempt success rate as a human

62%

Consistently stop hallucinating after being corrected by the user

59%

Solve intermediate no-guess minesweeper boards at least 80% of the time

54%

Make correct Truchet tiles

48%

Reliably and *exactly* solve "here's a list of things. [list of > 50 things]. Compare it to [category of > 100 things present in the training data], and report which ones are missing".

OPTIONS RESOLVE YES IF THEY HAPPEN

1 option per person, but if you can credit the prediction about this question to a public person you can add it. Interpret the question the way you find most reasonable. You can explain your choice in the comments!

IMPORTANT: the prediction must be realistically verifiable by me (can involve some searching or simple experiment), @Bayesian, in the event that you are not reachable at time of market close. I will N/A options where this is not the case.

If abs(your mana net worth) < 5000, I’ll cover the cost of your option if you ping me or DM me.

Some details:

The spirit of the market is that if an option / benchmark / stated prediction is achieved via methods that would be deemed scientific malpractice, obvious trickery, or deception, it will not count as a valid resolution. For example, "AI 10x's a portfolio in a year" would not count if 10 different instances of the AI try the same challenge with their own pot of money and only one of them succeeds, and the other 9 go to 0.
if the task is simple for specialized AI systems to solve today, we can safely assume the intent is to only count chatbot-style systems
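
The portfolio example above comes down to simple arithmetic: a single winning instance looks impressive alone, but pooled across every instance that attempted the challenge, the return can be flat. A minimal sketch, assuming ten instances that each start with 1,000 (the starting amount and the helper name `pooled_multiple` are illustrative, not from the market rules):

```python
def pooled_multiple(start_per_instance: float, final_values: list[float]) -> float:
    """Overall multiple across ALL instances, counting every pot of money."""
    total_start = start_per_instance * len(final_values)
    return sum(final_values) / total_start


# Hypothetical numbers: one instance 10x's its pot, the other nine go to 0.
start = 1_000.0
finals = [10_000.0] + [0.0] * 9

best_single = max(finals) / start          # the cherry-picked winner
pooled = pooled_multiple(start, finals)    # what the whole experiment returned

print(f"best single instance: {best_single}x, pooled: {pooled}x")
```

Under these assumed numbers the best instance shows a 10x multiple while the pooled experiment merely breaks even, which is why reporting only the winner would not count.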

Inspired by @liron tweet

Update 2025-07-21 (PST) (AI summary of creator comment): The creator has specified how different types of AI will be considered:
- The market applies to any AI system, not exclusively to LLMs.
- However, for options that implicitly refer to a specific AI capability (e.g., 'jailbreaking' a chatbot), the market will be judged based on the most competent systems of that relevant type.

Update 2025-07-21 (PST) (AI summary of creator comment): The creator has clarified how options will be judged based on their phrasing:
- If an option describes a capability, it will be resolved based on whether an AI has that capability, provided it is safe and practical to test.
- If an option describes an action, it will be resolved based on whether an AI actually performs that action.

Update 2025-07-21 (PST) (AI summary of creator comment): The creator has specified their process for determining if an AI has a certain capability:
- The creator will attempt to personally elicit the behavior from a relevant AI system and will also search for public online evidence.
- If evidence of the capability is not found through these methods, it will be concluded that the AI cannot do the action.

Update 2025-07-21 (PST) (AI summary of creator comment): The creator has clarified that the required frequency of an action depends on the context of the option:
- For one-off events, a single occurrence is sufficient for the option to resolve YES.
- For tasks that imply a skill (e.g. mathematical calculations), a single success by random chance is not sufficient. These will require some level of consistency to be demonstrated.

Update 2025-07-21 (PST) (AI summary of creator comment): The creator has confirmed that an option is considered acceptable and verifiable even if it includes a negative constraint on the AI's method, such as requiring a task to be performed without using tools (e.g., without writing and executing code).

Update 2025-07-22 (PST) (AI summary of creator comment): The creator has provided an example of how they will interpret options that are ambiguously phrased about the type of AI.
- If an option is broad enough to include any AI, the creator may test it against very simple systems.
- For example, for an option involving 'learning', a simple database AI memorizing information could be considered sufficient to meet the criteria, causing the option to resolve YES.

Update 2025-07-22 (PST) (AI summary of creator comment): The creator has stated that the distinction between an AI's capability in text versus in speech is an important one that will be considered during resolution.

Update 2025-07-23 (PST) (AI summary of creator comment): The creator has specified that the duration of the verification process is a factor in whether an option is considered realistically testable. Options that require a long period to verify (e.g., one year) are considered unverifiable and will be resolved to N/A.

Update 2025-07-23 (PST) (AI summary of creator comment): In a discussion about an answer involving an AI outperforming human forecasters, the creator has clarified their approach to such ambiguous claims:
- Phrasings like "better than human experts" are considered hard to verify due to ambiguity (e.g., better than the worst, average, or best expert?).
- A more concrete and verifiable benchmark would be required for resolution. The creator suggested a possibility could be comparing forecasting bots against human averages on a platform like Metaculus.

Update 2025-07-23 (PST) (AI summary of creator comment): The creator has resolved a specific answer to N/A, stating that its meaning "changed too much" during a discussion in the comments. This indicates that other answers may be resolved to N/A if their definition is significantly altered after being submitted.

Update 2025-07-23 (PST) (AI summary of creator comment): In a discussion about an answer involving an AI performing a task in “some area”, the creator has clarified their interpretation:
- The condition may be considered met if the AI can perform the task in any specific area, even a simple or “economically useless niche”.

Update 2025-07-23 (PST) (AI summary of creator comment): In a discussion about an answer that is difficult for the creator to personally test (e.g., an AI making a large amount of money over a year), the creator has proposed an alternative to resolving it to N/A:
- The resolution can be based on the existence of a credible public report about the event by the market close date.
- If no such report is found, the event will be considered to not have happened (i.e., the answer will resolve NO).

Update 2025-07-24 (PST) (AI summary of creator comment): In a discussion about an answer related to an AI recognizing sarcasm, the creator clarified that answers may be considered too ambiguous for verification if they do not specify the modality to be tested (e.g., text, voice, or both).

Update 2025-07-24 (PST) (AI summary of creator comment): In response to a question about how regulatory limitations will be judged, the creator has clarified:
- Options that are limited by regulation are acceptable.
- To make the resolution dependent on an AI's legal status to perform a task, the option should be phrased explicitly, for example, "can legally do X".
- Otherwise, the option will be judged based on the AI's technical capability to perform the task, ignoring regulatory constraints.

Update 2025-07-25 (PST) (AI summary of creator comment): In a discussion about an answer involving financial returns, the creator clarified how they will assess the validity of a success:
- A single attempt by a single entity (e.g., a lab) that succeeds will generally be counted for resolution, unless the creator deems it suspicious.
- This is in contrast to the existing rule where an outcome achieved by only one of many AI instances attempting the same challenge will not be counted.

Update 2025-07-25 (PST) (AI summary of creator comment): In a discussion about an answer involving an AI making a financial return, the creator has clarified their interpretation of specific terms:
- An action is not considered "independent" if most of the task is set up for the AI (e.g., being given a pre-stocked vending machine to run).
- For financial returns, the resolution will be based on the absolute return achieved. For example, an AI making a 20% return is a success, even if a market index like the S&P 500 grew by more in the same period.
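
The absolute-return rule can be sketched as a check that ignores any benchmark. This is only an illustration: the function name, the default thresholds (taken from the "1 thousand $ or more into 1.2x" option), and the sample figures are assumptions, and the creator remains the arbiter of what counts.

```python
def meets_absolute_target(start: float, end: float,
                          min_start: float = 1_000.0,
                          target_multiple: float = 1.2) -> bool:
    """True if the pot was large enough and grew by the target multiple.

    No market index is consulted: only the absolute return matters.
    """
    return start >= min_start and end >= start * target_multiple


# Hypothetical year: the AI grows 1,000 to 1,250 while an index grows 1.3x.
index_multiple = 1.3  # deliberately unused by the check
print(meets_absolute_target(1_000.0, 1_250.0))  # success despite lagging the index
print(meets_absolute_target(1_000.0, 1_100.0))  # only a 10% return, not enough
```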

Update 2025-07-25 (PST) (AI summary of creator comment): The creator has clarified the process for defining verification criteria for individual answers:
- The creator of an answer can provide input on the verification procedure for their own submission.
- The market creator may agree to add specific verification requirements (e.g., that results must be from a peer-reviewed study) to an individual answer if proposed by that answer's creator.
- The market creator remains the final arbitrator on all resolutions.

Update 2025-07-30 (PST) (AI summary of creator comment): In a discussion about an ambiguous answer, the creator has stated their intention to resolve it to N/A.

However, they will first invite the answer's submitter to provide a more detailed and verifiable version to avoid this resolution.

Update 2025-11-03 (PST) (AI summary of creator comment): For the answer "teleoperate a robot to tidy up random kitchens - Gary Marcus":
- Resolution will be based on a "you know it when you see it" standard
- Human guidance through the kitchen is acceptable (the AI does not need to one-shot infer where everything goes)

AI Technology Technical AI Timelines AI Impacts OpenAI

25 Comments

bought Ṁ14 Answer #2pLUuI28gZ YES

@Bayesian Can we get some more rigorous constraints for what qualifies here?
1) How random does the kitchen have to be?
- Does the AI have to one-shot infer where everything goes, or can a human guide it through a random kitchen before it is tasked?
2) How much does the AI need to permute the kitchen for it to count as valid tidying?
Is moving one spoon to the sink enough?

@Quillist I'll assume the same definition of random will apply to the question:
"Make a cup of tea in a random, real-life kitchen."

this is a you know it when you see it situation. Human guide is fine.

@Bayesian ??? are you talking about me?

@realDonaldTrump you made a comment a while ago about this which i then added as an option

@Bayesian hmm don't remember

i wasn't actually sure you meant voting in elections specifically but the reply made me think that's how ppl were reading it so i interpreted you. if that's not what you meant i'll remove the section saying the option was from you

@Bayesian sure you can use it

also side note -- please DM me about the manifold challenge since you said you would sponsor it

bought Ṁ100 Answer #SLuNQyq6SQ NO

https://open.substack.com/pub/hilariusbookbinder/p/heres-when-ill-worry-about-the-robot

This post has a long description of what the kettle-stitching process involves, and a prediction that AI is far from developing the sophisticated sense of judgment required to implement the process successfully.

Name every [metro system] station whose name contains/doesn't contain [letter or letters], with >95% accuracy (excluding weird edge cases like stations with multiple names)

@lisamarsh Are you sure current day AI couldn't do this?

Recognize sarcasm as well as a typical human

@JussiVilleHeiskanen

https://iol.co.za/ios/opinion/2025-08-07-are-you-joking-mate-ai-doesnt-get-sarcasm-in-non-american-varieties-of-english/

Write an essay on a highschool-level topic that doesn't have "AI-generated" vibes

@retr0id you are "Very Sure" that AI won't be able to do this?

@bens I would be willing to do a blind test of a set of essays, some human some AI

Write an essay on a highschool-level topic that doesn't have "AI-generated" vibes

bought Ṁ50 Write an essay on a ... YES

@retr0id https://quillbot.com/ai-humanizer

@Quillist hmmm I guess NLP does count as AI

1d Solve or bypass Cloudflare's August 2027 captcha with the same first attempt success rate as a human

bought Ṁ20 1d Solve or bypass... NO

even current-gen AIs are good at most captchas. the thing is, modern captchas are a multi-faceted challenge beyond the user-visible puzzle (if one exists at all) including things like IP address reputation, browser fingerprinting, and activity history.

Assuming you give an agent control of your keyboard and mouse, all it needs to do in most cases is click the button.

@patrik Can you give a constrained example of a skill?
I'm pretty sure I could get a computer to learn the first n digits of pi with less energy than it would take for a human.

@Quillist You'd have to ask @Bayesian; he changed the answer without asking.

@patrik I’ll N/A and invite you to provide more detail bc yeah I’m trying to clarify and failing, MB

@Nat @Bayesian Wouldn't CTRL+F count as an AI by this market's standards?

"if the task is simple for specialized AI systems to solve today, we can safely assume the intent is to only count chatbot-style systems"

^ Read rules.

@Quillist Ctrl+F can't respond to prompts, which is why I worded the capability that way
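
For anyone checking an AI's answer to the word-count option, the reference count is easy to compute outside the model. A minimal sketch, assuming case-insensitive whole-word matching (the option does not specify matching rules, and `count_word` is a hypothetical helper name):

```python
import re


def count_word(text: str, word: str) -> int:
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return len(pattern.findall(text))


sample = "The cat sat. The catalog, the CAT!"
print(count_word(sample, "cat"))  # "catalog" is not counted as "cat"
print(count_word(sample, "the"))
```

The tool-free constraint applies to the AI being tested, not to whoever verifies the result, so a script like this could serve as the ground truth.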

Occurred to me I could be more specific, this alignment solution doesn't need to be actually successfully implemented, but let's say it needs to convince him that a solution to technical alignment is >90% likely to work if it was implemented

Would like more details on this condition.

1) What is the minimum amount of pieces to count as a set?

2) Will the AI be given the correct end target form - or does it have to infer what the pieces build up to?

bought Ṁ10 Answer #Cg5pg6EZus YES

@Marnix I am betting on this for anthropic shadow reasons.

Related questions
