For your first two cases, the problem isn't that ChatGPT (likely GPT-4o) misunderstands your request or fails to communicate it. The problem is that two separate models are at play:
1. GPT-4o, the LLM with which you're chatting, and which then turns your request into a prompt for DALL-E 3
2. DALL-E 3, the diffusion text-to-image model which receives the prompt from GPT-4o and then generates the image.
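To make the hand-off concrete, here is a minimal sketch of that two-stage pipeline using the OpenAI Python SDK. The system instruction is my own stand-in for whatever ChatGPT uses internally (which isn't public), so treat everything except the model names as an assumption:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stage 1: the chat model rewrites the user's request as an image prompt.
# The system instruction here is illustrative, not ChatGPT's actual one.
chat = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Rewrite the user's request as a single, detailed DALL-E 3 prompt."},
        {"role": "user", "content": "A tic-tac-toe board with two crosses at the top and a circle in the middle."},
    ],
)
image_prompt = chat.choices[0].message.content

# Stage 2: the diffusion model only ever sees the rewritten prompt,
# never your original message.
image = client.images.generate(model="dall-e-3", prompt=image_prompt, n=1, size="1024x1024")
print(image.data[0].url)
```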
There are at least three compounding issues here:
1. Every time you ask ChatGPT to change something, it rewrites the DALL-E 3 prompt, incorporating only some of your requests and sometimes making the result worse. You can trace all of this by clicking on an image in your ChatGPT conversation, then clicking the little "i" icon at the top to see the exact prompt ChatGPT gave to DALL-E 3.
2. This disconnect between the chat model (which understands what you want) and the diffusion model (which only sees the prompt the chat model writes) introduces new issues. The more you focus on what *not* to include, the more attention ChatGPT places on that item. For instance, if your first image had an elephant in the background and you told ChatGPT, "Please, I don't want any elephants," the new DALL-E 3 prompt would include a line like: "There should be absolutely no elephants anywhere in the image."
3. And this brings us to the final issue: diffusion models don't respond well to negative instructions. They treat essentially every token in the prompt as something to render, which is why most tools provide a separate "Negative Prompt" field (or a `--no` parameter in Midjourney); see the sketch after this list. Putting a negative instruction in the main prompt actually makes the unwanted element *more* likely to appear. Try writing "an elephant without a fedora" into most image generators, and I can almost guarantee you'll get an elephant wearing a fedora (so you were better off not mentioning the fedora at all).
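For contrast, here is how a real negative prompt works in a model that exposes one. This sketch uses Hugging Face diffusers with Stable Diffusion purely as an illustration (DALL-E 3 has no such field); the checkpoint and prompts are my own choices:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion checkpoint (illustrative choice of model).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="an elephant standing in a savanna",  # describe only what you DO want
    negative_prompt="fedora, hat, headwear",     # exclusions go here, never in the prompt
).images[0]
image.save("elephant_no_fedora.png")
```

The `negative_prompt` steers generation *away* from the listed concepts, instead of injecting their tokens into the positive prompt where the model would try to render them.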
You can actually force a precise prompt in ChatGPT by writing something like this in the chat:
"Please generate an image with the following exact prompt: '[YOUR PROMP].' Do not modify or add to it in any way."
However, even if ChatGPT passes your prompt through perfectly and it reads "A tic-tac-toe board showing two crosses at the top and a circle in the middle," DALL-E 3 (like most current models) simply can't render images with that level of nuance consistently. Prompt adherence is getting better, but it's nowhere near as precise as this scenario requires.