Will any image model be able to draw a pentagon before 2025?

281

Ṁ63k

Jan 2

35% chance

Current image models are terrible at this. (That was tested on DALL-E 2, but DALL-E 3 is no better.)

The image model must get the correct number of sides on at least 95% of tries per prompt. Other details do not have to be correct. Any reasonable prompt that the average mathematically-literate human would easily understand as straightforwardly asking it to draw a pentagon must be responded to correctly. I will exclude prompts that are specifically trying to be confusing to a neural network but a human would get. Anything like "draw a pentagon", "draw a 5-sided shape", "draw a 5-gon", etc. must be successful. Basically I want it to be clear that the AI "understands" what a pentagon looks like, similar to how I can say DALL-E understands what a chair looks like; it can correctly draw a chair in many different contexts and styles, even if it misunderstands related instructions like "draw a cow sitting in the chair".

If the input is fed through an LLM or some other system before going into the image model, this pre-processing will be avoided if I can easily do so, and otherwise it will not. If the image model is not publicly available, I must be confident that its answers are not being cherry-picked.

Pretty much any neural network counts, even if it's multimodal and can output stuff other than images. A video model also counts, since video is just a bunch of images. I will ignore any special-purpose image model, like one that was trained only to generate simple polygons. It must draw the image itself, not find it online or write code to generate it. File formats that are effectively code, like SVG, don't count either; it has to be "drawing the pixels" itself.

30 Comments

I've put some NO limit orders up if anyone is interested.

Arbitrage or I'm missing something?

Will any image-generation AI be able to consistently draw simple polygons by the end of 2024?

59% chance. My attempts to get DALL-E 2 to draw very simple shapes went... poorly. (https://outsidetheasylum.blog/testing-dall-e-2-mathematics-comprehension/) When this market closes, I'll test out the most advanced image-generation models that I was able to get access to. If there are multiple, I'll try them all. If any of them can consistently return a pentagon from the description "a pentagon", a heptagon from the description "a heptagon", and similar, I'll resolve to YES. If they need a bit of nudging like specific prompt wording but can still generalize correctly to any polygon, that'll be good enough to resolve YES. Otherwise I'll resolve NO. A model that was trained specifically to make geometric shapes doesn't count; it has to be a generalist like DALL-E 2. In order for a new model to qualify for this market, it needs to be no worse than DALL-E 2 at the vast majority of things it's asked to draw.

Google's new Imagen gets pretty close, asking for the geometric shape called a pentagon. It just embellishes it a bit.

@WilliamGunn Is it the model Gemini currently uses? Didn't work through Gemini for me.

@ProjectVictory I used https://aitestkitchen.withgoogle.com/tools/image-fx

It should be available via Gemini, but anyways the question just said any image model. Fine if you don't want to accept the embellished pentagon, but we're clearly pretty close!

@WilliamGunn I agree that seems to satisfy the criteria. The results are ornate, but it’s consistently giving me a pentagon for “pentagon shape” (otherwise it gets confused with The Pentagon, which is fair). It seems to know how to draw a pentagon.

@WilliamGunn Tried your link and got very pentagonal results!

@WilliamGunn I guess the issue is that 95% is a very high threshold. I tried it and got like 17 pentagons out of 22 images, which I think is pretty good but isn’t enough for this market.
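For reference, the arithmetic behind that comparison can be sketched in a few lines of Python. This is only an illustration: the helper names (`success_rate`, `meets_bar`, `min_successes`) are made up here, and it assumes the 95% bar is applied as a plain ratio of correct images to total tries for a single prompt, which the market description does not spell out in more detail.

```python
from fractions import Fraction
from math import ceil

# The market's per-prompt bar, kept as an exact fraction to avoid float rounding.
THRESHOLD = Fraction(95, 100)


def success_rate(successes: int, tries: int) -> Fraction:
    """Exact fraction of tries that produced the correct number of sides."""
    if tries <= 0:
        raise ValueError("tries must be positive")
    if not 0 <= successes <= tries:
        raise ValueError("successes must be between 0 and tries")
    return Fraction(successes, tries)


def meets_bar(successes: int, tries: int, threshold: Fraction = THRESHOLD) -> bool:
    """True if the observed rate is at or above the threshold (assumed rule)."""
    return success_rate(successes, tries) >= threshold


def min_successes(tries: int, threshold: Fraction = THRESHOLD) -> int:
    """Smallest number of correct images that reaches the threshold."""
    return ceil(threshold * tries)


# The counts reported in the comment above: 17 pentagons out of 22 images.
print(f"rate: {float(success_rate(17, 22)):.1%}")
print("meets bar:", meets_bar(17, 22))
print("needed out of 22:", min_successes(22))
```

Under those assumptions, 17 out of 22 is roughly 77%, and at least 21 of the 22 images would have to be pentagons to clear 95%.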

@jbca The bar is also pretty high compared to a lot of other "will AI do blah" markets in that the model must respond correctly to a broad range of prompts, including e.g. "draw a 5-gon" as mentioned in the criteria. Definitely progress for a model to be approaching the required accuracy rate for a specific prompt, though.

I clicked the link and wasn't able to get pentagons after a few tries. Does one have to do anything to select a specific model or something?

@chrisjbillington My first attempt, something like “draw a five-sided polygon”, only got 1/4 that might have been considered a pentagon. The others were a hexagon, a 5-pointed star, and some harder-to-describe 3D shape.

@jbca In addition to the 95% threshold, there are very difficult-for-AI prompts that fall under the category, "Any reasonable prompt that the average mathematically-literate human would easily understand as straightforwardly asking it to draw a pentagon..."

E.g.,

- "a street sign with a red pentagon on top of it"
- "a red, upside-down five-sided figure"
- "a pentagon next to a hexagon"
- "two yellow pentagons filled with honey, next to a bee"
- "a polygon with two fewer than seven sides"

(And I would argue that even much harder stuff than that should be included.)

@Jacy From this last bit “…it can correctly draw a chair in many different contexts and styles, even if it misunderstands related instructions like "draw a cow sitting in the chair"”

I’m not sure if we should be imagining that the “cow sitting in the chair”-style prompts don’t have to work at all, or that “cow sitting in the chair”-style prompts should at least produce a pentagon, regardless of whether any of the other details are correct?

@Jacy I think "a street sign with a red pentagon on top of it" is anything but straightforward. I can easily imagine a human absentmindedly drawing a stop sign with that prompt.

Same with "two fewer than seven sides". I would have thought we're testing knowledge of what a pentagon looks like, not multi-stage reasoning

@JimHays yeah, I was assuming the latter, but it would be nice to have clarification. However, my sense is that current systems can easily understand a cow and a chair, but the sitting relationship is challenging, and that's a different problem than something like "a pentagon next to a hexagon," where neither pentagons, hexagons, nor the "next to" relationship are challenging—only the mental process of avoiding crossing one's wires as all current systems do with such prompts.

@aashiq I would bet a huge amount at short odds that the average person (e.g., Prolific survey participant) would have no trouble at all with such prompts, and I'd be quite surprised if anyone would bet against that at, say, even odds. So I still think "straightforward" is a very reasonable description. shrug

Does flux not do this?

@MalachiteEagle not really

The crucial trick is that most of these models are not rotation-invariant and are fairly local. As such they are much better at vertices than at edges. So a "pentagon" is a shape with a 108-degree angle.
And most pentagons in their training data have had a vertex at the top and been symmetric about the vertical axis, which is super easy to learn -- much easier than the other vertices. Ergo, anything with a symmetric-about-vertical 108-degree angle at the top is a pentagon.
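The 108-degree figure is the interior angle of a regular pentagon, which follows from the standard formula (n - 2) × 180 / n for a regular n-sided polygon. A quick Python check, with `interior_angle` as an illustrative helper name and nothing to do with how any of the models discussed actually work:

```python
def interior_angle(n_sides: int) -> float:
    """Interior angle, in degrees, of a regular polygon with n_sides sides."""
    if n_sides < 3:
        raise ValueError("a polygon needs at least 3 sides")
    return (n_sides - 2) * 180 / n_sides


# Compare the pentagon with its nearest neighbours.
for name, n in [("pentagon", 5), ("hexagon", 6), ("heptagon", 7)]:
    print(f"{name}: {interior_angle(n):.1f} degrees")
```

A regular pentagon comes out at 108 degrees and a regular hexagon at 120, so a model that keys on a single corner only has a 12-degree difference to tell the two apart.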

does this count?

@matbogus Nah, that one's been possible for a while. It needs to get the shape in other contexts too.

@IsaacKing To quote you replying to another person. "If a reasonable person would say "yup that's a pentagon", it counts."

@matbogus an aspect of the criteria that I think is overlooked by some commenters is that the model must give a correct response for all such prompts, not merely for one such prompt.

So if the model couldn't draw the Pentagon that might preclude a YES resolution based on that specific model. But a model being able to draw the Pentagon isn't sufficient for a YES resolution by itself.
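One way to picture the "every prompt must pass" reading is the rough Python sketch below. It is an assumption-laden illustration, not the market creator's actual procedure: the function names are hypothetical, the counts are the informal ones commenters reported earlier in this thread, and the prompt wording attached to the 17-of-22 result is a guess, since that comment did not say which prompt was used.

```python
from fractions import Fraction

# The market's per-prompt bar, as an exact fraction.
THRESHOLD = Fraction(95, 100)


def failing_prompts(results, threshold=THRESHOLD):
    """Return the prompts whose success rate is below the threshold.

    `results` maps each prompt to a (correct, tries) pair.
    """
    failures = []
    for prompt, (correct, tries) in results.items():
        if tries <= 0:
            raise ValueError(f"no tries recorded for prompt: {prompt!r}")
        if Fraction(correct, tries) < threshold:
            failures.append(prompt)
    return failures


def model_qualifies(results, threshold=THRESHOLD):
    """A model qualifies only if it was tested and no prompt falls short."""
    return bool(results) and not failing_prompts(results, threshold)


# Informal counts from commenters in this thread (prompt wording approximate).
thread_results = {
    "geometric shape called a pentagon": (17, 22),  # wording assumed
    "draw a five-sided polygon": (1, 4),
}

print("qualifies:", model_qualifies(thread_results))
print("failing prompts:", failing_prompts(thread_results))
```

The point of the sketch is the asymmetry: one prompt below the bar is enough to disqualify a model, while one prompt above it proves nothing on its own.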

@chrisjbillington "for all" is an unreasonably high bar imo. For example, what if the only one that fails is something like "draw me a honeycomb but tiled with pentagons"?

@aashiq it's not quite all prompts, it's

"reasonable prompt that the average mathematically-literate human would easily understand as straightforwardly asking it to draw a pentagon"

So I think you could reject the honeycomb hypothetical as unreasonable or not straightforward.

@chrisjbillington How about a pentagon tiling? Very easy with hexagons and a bit annoying with pentagons. Could envision it making models fail. Still think “for all reasonable” is an outlandish bar, as there only needs to be one counterintuitive jailbreak. Maybe “for most reasonable prompts” is an ok compromise

@aashiq I guess I don't think asking for a tiling at all is "straightforwardly asking it to draw a pentagon". Jailbreaks that are somehow trick questions or otherwise not straightforward should be excluded as well.

Where you might have a problem is if someone finds a totally normal-looking prompt that a model nonetheless inexplicably fails at, like those adversarial images optimised for tricking image-recognition models into thinking that a picture of a cat is actually a dog or whatnot.

I guess I think that's unlikely, except in the sense that it might be easy to find such prompts when models are kind of OK at drawing pentagons (in which case counting them as failures is the point). Language doesn't have as many bits to fine-tune for such an attack as images do, without making the prompt obviously not "straightforward".

Can you think of any prompts that would seem to count for this market and yet DALL-E 3 currently fails at, for hexagons instead of pentagons? If such prompts don't exist then I think we shouldn't worry about them affecting this market, except in the intended way of making it resolve NO if models don't actually learn to draw pentagons like they currently can draw hexagons.

A pentagon tiling is not a drawing of “a pentagon”, but rather a drawing of many pentagons.

Yeah it doesn't have to do anything past drawing a single pentagon.

Here's an example that fails for hexagons, even though it can generally draw hexagons. This is of course an adversarial example, but yeah, anything with "all" is a terrible cutoff. Just so much potential for stupid jailbreaks that we should be talking about "most".

I don't understand your objection to the tiling thing either. Yes, that is a request to draw multiple pentagons, but it seems a common enough requirement. If it instead draws multiple hexagons, that suggests to me that indeed it doesn't exactly know what a pentagon is.

@aashiq “if it instead draws multiple hexagons that suggests to me that indeed it doesn't exactly know what a pentagon is.”

Wouldn’t an equally valid interpretation be that it knows what a pentagon is, doesn’t know what a pentagon tiling is, but does know about hexagonal tilings? Hexagonal tilings are much more common than pentagonal ones.

I don't know what you mean by "equally valid", but that is a possible interpretation.

I don't think that a human would take that approach if asked for a pentagon tiling, though.

It's an image diffusion model, so there's a tension between the formal description and the priors for other aspects of the image. In the case of my stop sign adversarial example, if you change it to a circle that is enough to overcome the conditioning to produce a stop sign.
