I also created a small editing suite for myself where I can draw bounding boxes on images when they aren’t perfect, and have them fixed: either just with a prompt, or by feeding them to Claude as an image and having it write the prompt to fix the issue for me (as a workflow on the API). It’s been quite a lot of fun to figure out what works. I am incredibly impressed by where this is all going.
Once you do have good storyboards, you can easily do start-to-end GenAI video generation (hopping from scene to scene), bring them to life, and build your own small visual animated universes.
It's intentionally hostile and inconsiderate.
Which is not to say don’t be creative; I applaud all creativity. But you should also be very critical of what you are doing.
It's pretty easy to get something decent. It's really hard to get something good. I share my creations with some close friends and some are like "that's hot!" but are too fixated on breasts to realize that the lighting or shadow is off. Other friends do call out the bad lighting.
You may be like "it's just porn, why care about consistent lighting?" and the answer for me is that I'm doing all this to learn how everything works. How to fine-tune weights, write prompts, use IP-Adapter, etc. Once I have a firm understanding of this stuff, then I will probably be able to make stuff that's actually useful to society. Unlike that Coke commercial.
But it's impressive that this billion dollar company didn't have one single person say "hey it's shitty, make it better."
AI is shitty in its own new, unique ways. And people don't like new. They want the old, polished shittiness they are used to.
It's only a matter of time before we get experienced AI filmmakers. I think we already have them, actually. It's clear that Coke does not employ them though.
Imagine if you gave everyone a free guitar and people just started posting their electric guitar noodlings on social media after playing for 5 minutes.
It is not a judgement on the guitar. If anything it is a judgement on social media and the stupidity of the social media users who get worked up about someone creating "slop" after playing guitar for 5 minutes.
What did you expect them to sound like, Steve Vai?
That's my entire point. Artists were fine with everybody making "art" as long as everybody except them (with their hard fought skill and dedication) achieved toddler level of output quality. As soon as everybody could truly get even close to the level of actual art, not toddler art, suddenly there's a horrible problem with all the amateur artists using the tools that are available to them to make their "toddler" art.
Maybe a little mode collapse away from pale ugliness, not quite getting to the hints of unnatural and corpse-like features of a vampire - interesting what the limitations are. You'd probably have to spend quite a lot of time zeroing in, but Google's image models are supposed to have allowed smooth traversal of those feature spaces generally.
I see where you are coming from...
Are you talking about Automatic1111 / ComfyUI inpainting masks? Because Nano doesn't accept bounding boxes as part of its API unless you just stuffed the literal X/Y coordinates into the raw prompt.
You could do something where you draw a bounding box and, when you get the response back from Nano, mask that section back over the original image, using a decent upscaler as necessary in the event that Nano had to reduce the size of the original image down to ~1MP.
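A minimal sketch of that paste-back step with Pillow; the box coordinates are illustrative, and a real upscaler would replace the plain LANCZOS resize:

    from PIL import Image

    def paste_back(original_path, edited_path, box):
        """Paste the edited region back over the original image.

        `box` is the (left, upper, right, lower) bounding box the user drew,
        in the original image's coordinates. Assumes the edited image shows
        the same framing as the original, possibly at a lower resolution
        (e.g. if the model downscaled the input to ~1MP)."""
        original = Image.open(original_path)
        edited = Image.open(edited_path)

        # Bring the edit back up to the original's size if the model downscaled it.
        # A dedicated upscaler would do better than plain LANCZOS here.
        if edited.size != original.size:
            edited = edited.resize(original.size, Image.LANCZOS)

        # Crop the edited region and paste it over the same spot in the original.
        original.paste(edited.crop(box), (box[0], box[1]))
        return original

    paste_back("original.png", "nano_output.png", box=(220, 140, 480, 400)).save("composited.png")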
It also works well if you draw a bb on the original image, then ask Claude for a meta-prompt to deconstruct the changes into a much more detailed prompt, and then send the original image without the bbs for changes. It really depends on the changes you need, and how long you're willing to wait.
- normal image editing response: 12-14s
- image editing response with Claude meta-prompting: 20-25s
- image editing response with Claude meta-prompting as well as image deconstructing and re-constructing the prompt: 40-60s
(I use Replicate though, so the actual API may be much faster).
This way you can also move into new views of a scene by zooming the image in and out on the same aspect-ratio canvas and asking it to generatively fill the white borders. So you can go from a tight inside shot to viewing the same scene from outside a house window, or from inside the car to outside the car.
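A minimal sketch of that zoom-out setup with Pillow; the zoom factor and the prompt are illustrative:

    from PIL import Image

    def zoom_out_canvas(image_path, zoom=0.5):
        """Shrink the frame and center it on a white canvas of the original size,
        leaving white borders for the model to generatively fill."""
        img = Image.open(image_path)
        w, h = img.size
        small = img.resize((int(w * zoom), int(h * zoom)), Image.LANCZOS)
        canvas = Image.new("RGB", (w, h), "white")
        canvas.paste(small, ((w - small.width) // 2, (h - small.height) // 2))
        return canvas

    zoom_out_canvas("inside_shot.png").save("outpaint_input.png")
    # Then prompt something like: "Fill in the white border so we are viewing
    # this same scene from outside a house window."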
I also had similarly mixed results with Nano Banana, especially around asking it to “fix/restore” things (a character’s hand was an anatomical mess, for example).
I keep hearing advocates of AI video generation talking at length about how easy the tools are to use and how great the results are, but I've yet to see anyone produce something meaningful that's coherent, consistent, and doesn't look like total slop.
You need talented people to make good stuff, but at this time most of them still fear the new tools.
> Bots in the Hall
* voices don't match the mouth movements
* mouth movements are poorly animated
* hand/body movements are "fuzzy" with weird artifacts
* characters stare in the wrong direction when talking
* characters never move
* no scenes over 3 seconds in length between cuts
> Neural Viz
* animations and backgrounds are dull
* mouth movements are uncanny
* "dead eyes" when showing any emotions
* text and icons are poorly rendered
> The Meat Dept video for Igorrr's ADHD
This one I can excuse a bit since it's a music video, and for the sake of "artistic interpretation", but:
* continuation issues between shots
* inconsistent visual style across shots
* no shots longer than 4 seconds between cuts
* rendered text is illegible/nonsensical
* movement artifacts
Bounding boxes: I actually send an image with a red box around where the requested change is needed, and 8 out of 10 times it works well. But if it doesn't work, I use Claude to make the prompt more refined. The Claude API call that I make can see the image and the prompt, and it understands the layering system. This is one of the 3 ways I edit; there is another one where I just send the prompt to Claude without it looking at the image. Right now this all feels like dial-up, with a minimum of $0.035 per image generation ($0.0001 if I just use a LoRA, though) and a minimum wait of 12-14 seconds on each edit/generation.
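The red-box annotation step itself is simple; a minimal sketch with Pillow, with illustrative coordinates and file names:

    from PIL import Image, ImageDraw

    def annotate_with_red_box(image_path, box, out_path="annotated.png"):
        """Draw a red rectangle around the region that needs fixing, so the edit
        model (or Claude, when writing the refined prompt) sees exactly where
        the change should happen."""
        img = Image.open(image_path).convert("RGB")
        ImageDraw.Draw(img).rectangle(box, outline="red", width=6)  # (left, top, right, bottom)
        img.save(out_path)
        return out_path

    annotate_with_red_box("scene.png", box=(320, 180, 560, 420))
    # Sent with a prompt like: "Fix the hand inside the red box, then remove
    # the red box from the final image."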
Who would have thought that we would reach this uncharted territory, with so many opportunities for pioneering and innovation? Back in 2019 it felt like nothing was new under the sun; today it feels like there is a whole new world under the sun for us to explore!
> I want to bring awareness to the dangers of dressing up like a seal while surfboarding (i.e. wearing black wetsuits, arms hanging over the board). Create a scene from the perspective of a shark looking up from the bottom of the ocean into a clear blue sky with silhouettes of a seal and a surfer and fishing boat with line dangling in the water and show how the shark contemplates attacking all these objects because they look so similar.
I haven't found a model yet that can process that description, or any variation of it, into a usable scene that makes sense visually to anyone older than a 1st grader. They will never place the seal, surfer, shark, or boat in the correct location to make sense visually. Typically everyone is under water, and the sizing of everything is wrong. You tell them the image is wrong, to place the person on top of the water, and they can't. Can someone please link to a model that is capable, or tell me what I am doing wrong? How can you claim to process words into images in a repeatable way when these systems can't deal with multiple constraints at once?
My intention was solely to support the parent in the face of prevalent general critique of what he dabbles in.
I added a CLI to it (using Gemini CLI) and submitted a PR, you can run that like so:
GEMINI_API_KEY="..." \
uv run --with https://github.com/minimaxir/gemimg/archive/d6b9d5bbefa1e2ffc3b09086bc0a3ad70ca4ef22.zip \
python -m gemimg "a racoon holding a hand written sign that says I love trash"
Result in this comment: https://github.com/minimaxir/gemimg/pull/7#issuecomment-3529...
Is this just a manual copy/paste into a gist with some HTML/CSS styling, or do you have a custom tool à la amp-code that does this more easily?
I made a video about building that here: https://simonwillison.net/2025/Oct/23/claude-code-for-web-vi...
It works much better with Claude Code and Codex CLI because they don't mess around with scrolling in the same way as Gemini CLI does.
I agree that a project.scripts would be good but that's a decision for the maintainer to take on separately!
I'm exceptionally excited about Chinese editing models. They're getting closer and closer to NanoBanana in terms of robustness, and they're open source. This means you can supply masks and kernels and do advanced image operations, integrate them into visual UIs, etc.
You can even fine tune them and create LoRAs that will do the style transferring tasks that Nano Banana falls flat on.
I don't like how closed the frontier US models are, and I hope the Chinese kick our asses.
That said, I love how easy it'll be to distill Nano Banana into a new model. You can pluck training data right out of it: ((any image, any instruction) -> completion) tuples.
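A minimal sketch of what harvesting such tuples might look like, assuming the google-genai SDK with GEMINI_API_KEY set; the model id, helper function, and file layout are my assumptions:

    import json
    from pathlib import Path
    from PIL import Image
    from google import genai

    client = genai.Client()  # picks up GEMINI_API_KEY from the environment

    def collect_tuple(image_path, instruction, out_dir="distill_data"):
        """Run one edit and append the (input image, instruction) -> output tuple."""
        out = Path(out_dir)
        out.mkdir(exist_ok=True)
        response = client.models.generate_content(
            model="gemini-2.5-flash-image",          # assumed model id
            contents=[instruction, Image.open(image_path)],
        )
        for i, part in enumerate(response.candidates[0].content.parts):
            if part.inline_data is not None:
                out_path = out / f"{Path(image_path).stem}_{i}.png"
                out_path.write_bytes(part.inline_data.data)
                with open(out / "tuples.jsonl", "a") as f:
                    f.write(json.dumps({"input": image_path,
                                        "instruction": instruction,
                                        "output": str(out_path)}) + "\n")

    collect_tuple("kitchen.png", "replace the tile backsplash with exposed brick")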
For imagegen, agreed. But for textgen, Kimi K2 thinking is by far the best chat model at the moment from my experience so far. Not even "one of the best", the best.
It has frontier level capability and the model was made very tastefully: it's significantly less sycophantic and more willing to disagree in a productive, reasonable way rather than immediately shutting you out. It's also way more funny at shitposting.
I'll keep using Claude a lot for multimodality and artifacts but much of my usage has shifted to K2. Claude's sycophancy in particular is tiresome. I don't use ChatGPT/Gemini because they hide the raw thinking tokens, which is really cringe.
Also, yesterday I asked it a question and after the answer it complained about its poorly written system prompt to me.
They're really torturing their poor models over there.
Adobe's conference last week points to the future of image gen. Visual tools where you mold images like clay. Hands on.
Comfy appeals to the 0.01% that like toolkits like TouchDesigner, Nannou, and ShaderToy.
If that's true, it seems worth getting past the 'cumbersome' aspects. This tech may not put Hollywood out of business, but it's clear that the process of filmmaking won't be recognizable in 10 years if amateurs can really do this in their basements today.
I was trying to create a simple "mascot logo" for my pet project. I first created an account on Kittl [0] and even paid for one month but it was quite cumbersome to generate images until I figured out I could just use the nano banana api myself.
Took me 4 prompts to ai-slop a small Python script I could run with uv that would generate a specified number of images for a given prompt (along the way I discovered some of the insights the author shows in their post). The resulting logo [1] was pretty much what I imagined. I manually added some text and played around with hue/saturation in Kittl (since I already paid for it :)) et voilà.
Feeding back the logo to iterate over it worked pretty nicely and it even spit out an "abstract version" [2] of the logo for favicons and stuff without a lot of effort.
All in all this took me 2 hours and around $2 (excluding the 1 month Kittl subscription) and I would've never been able to draw something like that in Illustrator or similar.
[0] https://www.kittl.com/ [1] https://github.com/sidneywidmer/yass/blob/master/client/publ... [2] https://github.com/sidneywidmer/yass/blob/master/client/publ...
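A minimal sketch of that kind of uv-runnable batch script, assuming the google-genai SDK and a GEMINI_API_KEY in the environment; the model id and output layout are assumptions:

    # /// script
    # dependencies = ["google-genai"]
    # ///
    """Generate N images for a prompt: uv run generate.py "<prompt>" <count>"""
    import sys
    from pathlib import Path
    from google import genai

    def main():
        prompt, count = sys.argv[1], int(sys.argv[2])
        client = genai.Client()  # needs GEMINI_API_KEY in the environment
        out = Path("generated")
        out.mkdir(exist_ok=True)
        for n in range(count):
            response = client.models.generate_content(
                model="gemini-2.5-flash-image",  # assumed model id
                contents=[prompt],
            )
            for part in response.candidates[0].content.parts:
                if part.inline_data is not None:
                    (out / f"image_{n:03d}.png").write_bytes(part.inline_data.data)

    if __name__ == "__main__":
        main()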
> Nano Banana supports a context window of 32,768 tokens: orders of magnitude above T5’s 512 tokens and CLIP’s 77 tokens.
In my pipeline for generating highly complicated images (particularly comics [1]), I take advantage of this by sticking a Mistral 7b LLM in-between that takes a given prompt as an input and creates 4 variations of it before sending them all out.
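A minimal sketch of that in-between step, assuming the Mistral 7B sits behind an OpenAI-compatible endpoint such as Ollama; the base_url, model name, and system prompt are placeholders:

    from openai import OpenAI

    # Local Mistral 7B behind an OpenAI-compatible endpoint (e.g. Ollama).
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

    def expand_prompt(prompt, n=4):
        """Ask the small LLM for n more-detailed variations of one image prompt."""
        response = client.chat.completions.create(
            model="mistral",
            messages=[
                {"role": "system", "content": "Rewrite the user's image prompt into "
                 "distinct, more detailed variations. Return one variation per line."},
                {"role": "user", "content": f"Prompt: {prompt}\nGive {n} variations."},
            ],
        )
        lines = [l.strip() for l in response.choices[0].message.content.splitlines() if l.strip()]
        return lines[:n]

    for variation in expand_prompt("a three-panel comic of a skeleton ordering coffee"):
        print(variation)  # each of these gets sent on to the image model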
> Surprisingly, Nano Banana is terrible at style transfer even with prompt engineering shenanigans, which is not the case with any other modern image editing model.
This is true - though I find it works better by providing a minimum of two images. The first image is intended to be transformed, and the second image is used as "stylistic aesthetic reference". This doesn't always work since you're still bound by the original training data, but it is sometimes more effective than attempting to type out a long flavor text description of the style.
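A minimal sketch of the two-image approach, assuming the google-genai SDK; the model id and prompt wording are assumptions:

    from PIL import Image
    from google import genai

    client = genai.Client()  # GEMINI_API_KEY in the environment

    response = client.models.generate_content(
        model="gemini-2.5-flash-image",  # assumed model id
        contents=[
            Image.open("photo_to_transform.png"),   # image to be transformed
            Image.open("style_reference.png"),      # stylistic aesthetic reference
            "Redraw the first image in the style of the second image. Keep the "
            "composition and subjects of the first image unchanged; use the second "
            "image only as a stylistic aesthetic reference.",
        ],
    )
    for part in response.candidates[0].content.parts:
        if part.inline_data is not None:
            with open("styled.png", "wb") as f:
                f.write(part.inline_data.data)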
This looks like it's caused by 99% of the relative directions in image descriptions being given from the viewer's point of view, and 99% of the ones that aren't referring to a human rather than to a skull-shaped pancake.
To demonstrate this weakness with the same prompts as the article, see the link below, which shows that it is a model weakness and not just a language ambiguity:
For some offline character JSON prompts I ended up adding an additional "any mentions of left and right are from the character's perspective, NOT the camera's perspective" to the prompt, which did seem to improve success.
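For illustration, the clause can just ride along in the character JSON before it is serialized into the prompt (these fields are made up):

    import json

    # Illustrative character definition; only the final "note" clause is the fix described above.
    character = {
        "name": "Aria",
        "pose": "raising her left hand, head turned to her right",
        "note": ("Any mentions of left and right are from the character's "
                 "perspective, NOT the camera's perspective."),
    }
    prompt = "Generate the character described by this JSON:\n" + json.dumps(character, indent=2)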
>Nano Banana is terrible at style transfer even with prompt engineering shenanigans
My context: I'm kind of fixated on visualizing my neighborhood as it would have appeared in the 18th century. I've been doing it in Sketchup, and then in Twinmotion, but neither of those produce "photorealistic" images... Twinmotion can get pretty close with a lot of work, but that's easier with modern architecture than it is with the more hand-made, brick-by-brick structures I'm modeling out.
As different AI image generators have emerged, I've tried them all in an effort to add the proverbial rough edges to snapshots of the models I've created, and it was not until Nano Banana that I ever saw anything even remotely workable.
Nano Banana manages to maintain the geometry of the scene, while applying new styles to it. Sometimes I do this with my Twinmotion renders, but what's really been cool to see is how well it takes a drawing, or engraving, or watercolor - and with as simple a prompt as "make this into a photo" it generates phenomenal results.
Similarly to the Paladin/Starbucks/Pirate example in the link though, I find that sometimes I need to misdirect a little bit, because if I'm peppering the prompt with details about the 18th century, I sometimes get a painterly image back. Instead, I'll tell it I want it to look like a photograph of a well preserved historic neighborhood, or a scene from a period film set in the 18th century.
As fantastic as the results can be, I'm not abandoning my manual modeling of these buildings and scenes. However, Nano Banana's interpretation of contemporary illustrations has helped me reshape how I think about some of the assumptions I made in my own models.
That is why I always call technical writers "documentation engineers," why I call diplomats "international engineers," why I call managers "team engineers," and why I call historians "hindsight engineers."
Despite needing deep knowledge of how a plane's inner workings function, a pilot is still a pilot and not an aircraft engineer.
Just because you know how human psychology works when it comes to making purchase decisions, and you are good at applying that to sell things, doesn't mean you're a sales engineer.
Giving something a fake name to make it seem more complicated or aspirational than it actually is makes you a bullshit engineer, in my opinion.
So Prompt Philosopher/Communicator?
It's actually fairly difficult to put to words any specific enough vision such that it becomes understandable outside of your own head. This goes for pretty much anything, too.
I give it,
Reposition the text bubble to be coming from the middle character.
DO NOT modify the poses or features of the actual characters.
Now sure, specs are hard. Gemini removed the text bubble entirely. Whatever, let's just try again: Place a speech bubble on the image. The "tail" of the bubble should make it appear that the middle (red-headed) girl is talking. The speech bubble should read "Hide the vodka." Use a Comic Sans like font. DO NOT place the bubble on the right.
DO NOT modify the characters in the image.
There's only one red-head in the image; she's the middle character. We get a speech bubble, correctly positioned, but with a sans-serif, Arial-ish font, not Comic Sans. It reads "Hide the vokda" (sic). The facial expression of the middle character has changed. Yes, specs are hard. Defining a spec is hard. But Gemini struggles to follow the specification given. Whole sessions are like this, an absolute struggle to get basic directions followed.
You can even see here that I & the author have started to learn the SHOUT AT IT rule. I suppose I should try more bulleted lists. Someone might learn, through experimentation, "okay, the AI has these hidden idiosyncrasies that I can abuse to get what I want", but … that's not a good thing; that's just an undocumented API with a terrible UX.
(¹because that is what the AI on a previous step generated. No, that's not what was asked for. I am astounded TFA generated an NYT logo for this reason.)
For anything, even back in the "classical" search days.
"This got searched verbatim, every time"
W*ldcards were handy
and so on...
Now, you get a 'system prompt' which is a vague promise that no really this bit of text is special you can totally trust us (which inevitably dies, crushed under the weight of an extended context window).
Unfortunately(?), I think this bug/feature has gotta be there. It's the price for the enormous flexibility. Frankly, I'd not be mad if we had less control - my guess is that in not too many years we're going to look back on RLHF and grimace at our draconian methods. Yeah, if you're only trying to build a "get the thing I intend done" machine I guess it's useful, but I think the real power in these models is in their propensity to expose you to new ideas and provide a tireless foil for all the half-baked concepts that would otherwise not get room to grow.
Discounting the testing around the character JSON which became extremely expensive due to extreme iteration/my own stupidity, I'd wager it took about $5 total including iteration.
This is a very different fuzzy interface compared to programming languages.
There will be techniques better or worse at interfacing.
This is what the term prompt engineering is alluding to since we don’t have the full suite of language to describe this yet.
now you can really use natural language, and people want to debate you about how poor they are at articulating shared concepts, amazing
it's like the people are regressing and the AI is improving
I first extract all the entities from the text, generate characters from an art style, and then start stitching them together into individual illustrations. It works much better with NB than anything else I tried before.
That sounds interesting. Could you share?
A 1024x1024 image seems to cost about 3ct to generate.
I had no idea that the context window was so large. I’d been instinctively keeping my prompts small because of experience with other models. I’m going to try much more detailed prompts now!
AI can't do that (yet?).
Yet when I ask it some simple tasks, like making a 16:9 image instead of a square one, it ends up putting a 16:9 picture on a white background that pads it back out to a square.
When I ask it to include text, and then in a second request ask it to redo the image while changing just one visual element, it ends up breaking the text it had previously gotten right.
It's getting better at flattering people and telling them how clever and right they are than at actually doing the task.
Not (knowingly) used an llm for a long time. Is the above true?
Very cool post, thanks for sharing!
In my experience, nano-banana will:
- make massive, seemingly random edits to images
- adjust image scale
- make very fine-grained but pervasive detail changes obvious in an image diff
For instance, I have found that nano-banana will sporadically add a (convincing) fireplace to a room or new garage behind a house. This happens even with explicit "ALL CAPS" instructions not to do so. This happens sporadically, even when the temperature is set to zero, and makes it impossible to build a reliable app.
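For what it's worth, one way to at least flag these unintended edits is to diff the before/after images; a minimal sketch with Pillow, where the threshold is arbitrary:

    from PIL import Image, ImageChops

    def changed_region(before_path, after_path, threshold=12):
        """Return the bounding box of pixels whose luminance difference exceeds
        `threshold` (None if the two images are effectively identical)."""
        before = Image.open(before_path).convert("RGB")
        after = Image.open(after_path).convert("RGB")
        if after.size != before.size:
            # The model rescaled the output; resize back before diffing.
            after = after.resize(before.size, Image.LANCZOS)
        diff = ImageChops.difference(before, after).convert("L")
        mask = diff.point(lambda p: 255 if p > threshold else 0)
        return mask.getbbox()

    print(changed_region("room_before.png", "room_after.png"))
    # Compare this box against the region you intended to edit; a surprise
    # fireplace or garage shows up as a much larger changed area than expected.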
Has anyone had a better experience?
Things like: Convert the people to clay figures similar to what one would see in a claymation.
And it would think it did it, but I could not perceive any change.
After several attempts, I added "Make the person 10 years younger". Suddenly it made a clay figure of the person.
[0] https://www.lux.camera/content/images/size/w1600/2024/09/IMG...
Looks like specific f-stops don't actually make a difference for stable diffusion at least: https://old.reddit.com/r/StableDiffusion/comments/1adgcf3/co...
No, that simply is not true. If you actually compare the before and after, you can see it still regenerates all the details in the "unchanged" areas. Texture, lighting, sharpness, even scale: it's all different, even if varyingly similar to the original.
Sure, they're cute for casual edits, but it really pains me when people suggest these things are suitable replacements for actual photo editing. Especially when it comes to people, or details outside their training data, there's a lot of nuance that can be lost as it regenerates them, no matter how you prompt.
Even if you
I didn't expect that. I would have definitely counted that as a "probably real" tally mark if grading an image.
I figured that if you write the text in Google Docs and share the screenshot with Nano Banana, it will not make any spelling mistakes.
So something like "can you write my name on this Wimbledon trophy, both images are attached, use them" will work.
That's on my list of blog-post-worthy things to test, namely text rendering to image in Python directly and passing both input images to the model for compositing.
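A minimal sketch of the first half of that, rendering the exact string with Pillow before handing both images to the model for compositing; the font fallback and prompt wording are illustrative:

    from PIL import Image, ImageDraw, ImageFont

    def render_text_image(text, size=(1024, 256)):
        """Render the exact string onto a plain white image."""
        img = Image.new("RGB", size, "white")
        draw = ImageDraw.Draw(img)
        try:
            font = ImageFont.truetype("DejaVuSans.ttf", 72)  # any installed font works
        except OSError:
            font = ImageFont.load_default()
        draw.text((20, 90), text, fill="black", font=font)
        return img

    render_text_image("Your Name Here").save("exact_text.png")
    # Then send exact_text.png together with the target photo and a prompt like:
    # "Write the text from the first image onto the trophy in the second image,
    #  matching its perspective and lighting. Do not change the spelling."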
But it is still generating it with a prompt
> Logo: "A simple, modern logo with the letters 'G' and 'A' in a white circle."
My idea was to do it manually so that there are no probabilities involved.
Though your idea of using Python is the same.
"YOU WILL BE PENALIZED FOR USING THEM"
That is disconcerting.
www.brandimagegen.com
if you want a premium account to try out, you can find my email in my bio!!
It’s pretty good, but one conspicuous thing is that most of the blueberries are pointing upwards.
(Do we say we software engineered something?)
You CREATED something, and I like to think that creating things that I love and enjoy and that others can love and enjoy makes creating things worth it.
I've noticed a lot of this misinformation floating around lately, and I can't help but wonder if it's intentional?
okay, look at imagen 4 ultra:
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
In this link, Imagen is first instructed to render the verbatim prompt “the result of 4+5”, which shows that text, and then is not so instructed, which renders “4+5=9”.
Is Imagen thinking?
Let's compare to gemini 2.5 flash image (nano banana):
look carefully at the system prompt here: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
Gemini is instructed to reply in images first and, if it thinks, to think using the image thinking tags. It seemingly cannot be prompted to show the verbatim text "the result of 4+5" without showing the answer "4+5=9". Of course it can show whatever exact text you want; the question is, does it do prompt rewriting (no) or something else (yes)?
compare to ideogram, with prompt rewriting: https://ideogram.ai/g/GRuZRTY7TmilGUHnks-Mjg/0
without prompt rewriting: https://ideogram.ai/g/yKV3EwULRKOu6LDCsSvZUg/2
We can do the same exercises with Flux Kontext for editing versus Flash-2.5, if you think that editing is somehow unique in this regard.
Is prompt rewriting "thinking"? My point is, this article can't answer that question without dElViNg into the nuances of what multi-modal models really are.