Generative AI · November 20, 2025

Google launches Nano Banana Pro on Gemini 3 for team image workflows

Google’s Nano Banana Pro looks like a serious image model for production work

Google has released Nano Banana Pro, a new image generation model built on Gemini 3. The notable part is where Google seems to want this used: it's aimed at the work teams actually ship.

The upgrades are practical. Better text rendering across languages. Camera and lighting controls that sound closer to creative tooling than prompt gambling. Web search inside the generation flow. Support for multiple reference images, object-heavy scenes, and identity consistency across several people. The pricing makes the intent clear too: this is a premium model for final assets, not a cheap sandbox.

That matters because image generation has had the same problem for the last two years. Models got very good at style, mood, and visual novelty. They stayed unreliable at the dull, necessary stuff: readable text, stable layouts, consistent faces, and outputs that hold up when a team tries to turn one good result into a repeatable pipeline.

Google is trying to fix that.

What Google shipped

Nano Banana Pro extends the original Nano Banana in a few specific ways:

  • Resolution goes beyond the old 1024x1024 ceiling to 1080p, 2K, and 4K
  • Text rendering is better across styles, fonts, and languages
  • Web search can feed the model facts and references for grounded visual outputs
  • Scene controls include camera angle, lighting, focus, depth of field, and color grading
  • Compositional control covers up to 14 objects
  • Reference handling supports up to six high-fidelity reference shots
  • Identity consistency can maintain resemblance for up to five people

The model is available through the Gemini API, Google AI Studio, and Google’s newer Antigravity IDE. In the Gemini app, Google is making Nano Banana Pro the default, though free users only get a limited number of generations before they fall back to the original model.

Pricing is the clearest signal that Google sees this as production software:

  • $0.139 per 1080p or 2K image
  • $0.24 per 4K image

That’s expensive if you're generating throwaway concepts all day. It’s fine if you're producing approved marketing assets, localized storefront images, ecommerce product visuals, or in-app graphics where one bad output costs more than the render.

Why text rendering matters more than the 4K headline

Plenty of vendors can claim high resolution now. Clean text is still hard.

Diffusion models have historically been weak at typography for the same reason they’re good at painterly image synthesis. They predict pixels well enough to suggest letters, but not reliably enough to handle kerning, baseline alignment, small text, or non-Latin scripts. That’s why so many generated posters and product banners still fall apart into warped nonsense once the text gets dense or multilingual.

If Google has actually improved this, the use cases change:

  • promo banners with actual product names
  • educational cards with readable labels
  • multilingual ads and storefront assets
  • UI mock visuals with text that isn’t embarrassing
  • diagrams and infographics that don’t need manual cleanup every time

That matters more than 4K output. Upscaling is a solved problem. Broken typography poisons a workflow.

For engineering teams, better text rendering cuts down on handoffs. If designers still have to rebuild every generated visual in Figma because the lettering is unusable, the model stays stuck in ideation. If the text survives, it can move into automated asset pipelines.

The likely architecture

Google hasn’t published a full technical paper yet, but the product shape is easy enough to read.

This looks like a controller-model stack. Gemini 3 likely handles high-level reasoning, scene planning, retrieval, and prompt expansion. A diffusion-based image model does the rendering. That split makes sense. LLMs are good at structured decomposition and dense constraints. Image models are good at synthesis once those constraints are explicit.

A request like “create a recipe flash card with accurate ingredient quantities, warm kitchen lighting, readable Arabic and English text, and a 35mm editorial photo style” involves several separate jobs:

  1. retrieve or verify factual content
  2. turn vague style language into scene parameters
  3. plan layout and text placement
  4. preserve script-specific typography rules
  5. generate the image with all of that conditioning

A plain diffusion model won’t handle that cleanly on its own.
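Google hasn't published the internals, so treat the controller-plus-renderer split as a working assumption. Here is a minimal sketch of that pattern with entirely hypothetical names (plan_scene, render_image, a caller-supplied retrieve function); it shows how the five jobs above divide between a reasoning layer and a rendering layer, not how Google actually implements it.

```python
from dataclasses import dataclass


@dataclass
class ScenePlan:
    """Explicit constraints the controller hands to the image model."""
    facts: dict          # retrieved or verified content (ingredient quantities, dates)
    camera: dict         # angle, lens, depth of field
    lighting: str
    text_blocks: list    # script, content, and placement for each text region
    style: str


def plan_scene(prompt: str, retrieve) -> ScenePlan:
    """Controller step: decompose a vague prompt into explicit conditioning (jobs 1-4)."""
    facts = retrieve(prompt)                          # 1. retrieve or verify factual content
    return ScenePlan(
        facts=facts,
        camera={"angle": "eye level", "lens": "35mm", "dof": "shallow"},  # 2. style -> parameters
        lighting="warm kitchen, window light",
        text_blocks=[                                 # 3. layout and 4. script-specific typography
            {"script": "Arabic", "text": facts.get("ingredients_ar", ""), "region": "left"},
            {"script": "Latin", "text": facts.get("ingredients_en", ""), "region": "right"},
        ],
        style="35mm editorial photo",
    )


def render_image(plan: ScenePlan) -> bytes:
    """Renderer step (job 5): the diffusion model call, stubbed here."""
    return b""                                        # stand-in for the actual image model
```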

The controls Google exposes also suggest stronger conditioning paths. When vendors mention camera angle, depth of field, and lighting direction, there are usually two possibilities. One is a richer constraint mechanism, similar in spirit to ControlNet. The other is some kind of learned 3D-aware latent representation that helps the model respect viewpoint and focal properties. Google may not be doing explicit 3D reconstruction, but the outputs suggest stronger geometric priors than a generic text-to-image setup.

The identity consistency feature points to another likely ingredient: identity embeddings extracted from reference images and fed back into generation. That’s the standard way to reduce face drift across shots. Google’s claim of keeping up to five people consistent is ambitious. It needs testing because multi-person consistency usually breaks once poses, lighting, or occlusion get messy.
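The open-tooling version of that pattern is well established (IP-Adapter-style face conditioning), even if Google's mechanism differs. A rough sketch with a stubbed embedder, assuming a face-recognition backbone produces the identity vectors; the drift metric at the end is the kind of check that the five-person claim should be tested against:

```python
import numpy as np


def embed_face(image: np.ndarray) -> np.ndarray:
    """Stand-in for a face-embedding backbone (ArcFace-style); returns a unit vector."""
    rng = np.random.default_rng(int(image.sum()) % (2**32))
    vec = rng.normal(size=512)
    return vec / np.linalg.norm(vec)


def identity_conditioning(references: dict) -> dict:
    """Pool each person's reference shots into one identity vector for conditioning."""
    pooled = {}
    for person, shots in references.items():
        vecs = np.stack([embed_face(img) for img in shots])
        mean = vecs.mean(axis=0)
        pooled[person] = mean / np.linalg.norm(mean)
    return pooled


def identity_drift(generated_crop: np.ndarray, reference_vec: np.ndarray) -> float:
    """Cosine distance between a generated face crop and its reference identity.
    Useful as a regression metric when testing multi-person consistency."""
    return float(1.0 - embed_face(generated_crop) @ reference_vec)
```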

Web search inside image generation is useful and risky

The web search feature is one of the more interesting additions because a lot of image prompts are really data requests in disguise.

People ask for a study card, recipe board, product explainer, or travel graphic and expect the image to contain current facts. If the model can fetch that information first, summarize it, and then render the result, the output gets grounded instead of decorative.

Useful, yes. Safe by default, no.

Any system that mixes retrieval with generation can absorb junk from the web, including prompt injection, hidden instructions, bad source data, and licensing problems. A page doesn’t have to look malicious to poison a generation chain. For regulated or brand-sensitive workflows, open-web retrieval is probably the wrong default. Internal RAG over approved content is safer and easier to audit.

Senior teams should treat this the same way they treat tool-using LLMs elsewhere: sanitize inputs, use allowlists, keep provenance on retrieved data, and don’t assume the search layer is harmless because the output is “just an image.”
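A minimal sketch of what that looks like in practice, with placeholder domains and a deliberately crude sanitizer (real input filtering needs far more than one regex); the point is that allowlisting happens before anything reaches the prompt, and provenance travels with each snippet:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from urllib.parse import urlparse
import re

ALLOWED_DOMAINS = {"docs.internal.example.com", "brand.example.com"}   # approved sources only


@dataclass
class RetrievedFact:
    url: str              # provenance stays attached to the data
    text: str
    fetched_at: str


def sanitize(text: str) -> str:
    """Crude illustration: strip markup and obvious instruction-shaped phrasing."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"(?i)ignore (all|previous) instructions.*", "", text)
    return " ".join(text.split())


def allowed(url: str) -> bool:
    return urlparse(url).hostname in ALLOWED_DOMAINS


def retrieve_for_image(pages: list) -> list:
    """Filter (url, html) pairs before anything conditions a generation."""
    facts = []
    for url, page in pages:
        if not allowed(url):
            continue                                  # open-web results never reach the prompt
        facts.append(RetrievedFact(url=url,
                                   text=sanitize(page),
                                   fetched_at=datetime.now(timezone.utc).isoformat()))
    return facts
```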

Pricing points to tiered workflows

The cost structure is sensible, but it shapes how teams will actually use the model.

At $0.24 per 4K image, a batch of 100 finals costs $24. That’s reasonable. It also means you probably don’t want Nano Banana Pro as your first-pass ideation engine unless the quality is worth the burn. Most teams will end up with a layered pipeline:

  • cheap model for broad exploration
  • Nano Banana Pro at 2K for review rounds
  • Nano Banana Pro at 4K for approved finals

That pattern is already common in text generation and video. It’s becoming standard in image stacks too.
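A sketch of that routing, using the Pro prices quoted above; the draft-tier price and the model names are placeholders for whatever cheap exploration model a team already runs:

```python
# The Pro prices below are the ones quoted above; the draft tier is a placeholder.
PRICING = {
    ("draft", "1k"): 0.02,
    ("review", "2k"): 0.139,
    ("final", "4k"): 0.24,
}

ROUTES = {
    "draft": ("cheap-image-model", "1k"),
    "review": ("nano-banana-pro", "2k"),
    "final": ("nano-banana-pro", "4k"),
}


def batch_cost(stage: str, count: int) -> float:
    _, resolution = ROUTES[stage]
    return PRICING[(stage, resolution)] * count


if __name__ == "__main__":
    # 300 explorations, 60 review renders, 100 approved finals
    total = sum(batch_cost(s, n) for s, n in [("draft", 300), ("review", 60), ("final", 100)])
    print(f"estimated spend: ${total:.2f}")   # 6.00 + 8.34 + 24.00 = $38.34
```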

Runtime matters as much as price. If a model is heavier, your throughput planning changes. Cache seeds. Save prompt templates. Store control parameters. Keep references versioned. If you need reproducibility for a campaign refresh or localization pass, that metadata becomes part of the asset.
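One way to make that concrete is a small record stored next to every approved asset. The field names here are ours, not part of any API; the idea is simply that a campaign refresh or localization pass can replay the same generation:

```python
import hashlib
import json
from dataclasses import dataclass, asdict, field


@dataclass
class GenerationRecord:
    prompt_template: str          # versioned template name, not the expanded prompt
    template_vars: dict           # the values that filled the template
    seed: int                     # cached seed for deterministic reruns
    control_params: dict          # camera, lighting, depth of field, grading
    reference_ids: list = field(default_factory=list)   # versioned reference images
    model: str = "nano-banana-pro"
    resolution: str = "2k"

    def save(self, path: str) -> str:
        blob = json.dumps(asdict(self), sort_keys=True, indent=2)
        with open(path, "w") as f:
            f.write(blob)
        return hashlib.sha256(blob.encode()).hexdigest()  # fingerprint to store with the asset


record = GenerationRecord(
    prompt_template="promo_banner_v3",
    template_vars={"product": "Example Kettle", "locale": "ar-EG"},
    seed=20251120,
    control_params={"camera": "eye level", "lighting": "soft key left", "dof": "shallow"},
    reference_ids=["ref-kettle-front-v2", "ref-kettle-side-v2"],
)
```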

Google still has a provenance gap

Google watermarks its AI-generated images with SynthID and is adding SynthID detection to the Gemini app so users can check whether an image is AI-generated. That's useful, within limits. SynthID survives common edits reasonably well, but watermark detection is not the same as broad provenance interoperability.

The missing piece is C2PA support, or at least any explicit mention of it.

That matters because enterprise buyers increasingly want standards-based content credentials that move across tools and vendors. Adobe has pushed harder here. Google hasn’t matched that part yet. If your pipeline spans multiple creative and publishing systems, SynthID alone won’t solve provenance or compliance. You’ll need parallel tooling until Google supports wider standards.

Where this leaves Google

The field is sorting itself by strengths.

Midjourney still has an aesthetic edge for a lot of stylized output. Adobe Firefly has the stronger enterprise story around creative workflows and provenance standards. OpenAI’s DALL·E 3 pushed text rendering forward earlier than most.

Google’s angle is different. It’s pairing a strong model with an LLM controller, retrieval, API access, and its own developer stack. That’s a coherent strategy. If it works in practice, Google has a better shot at winning actual pipelines instead of benchmark chatter and social demos.

That’s where the money is.

What developers should test first

If you're evaluating Nano Banana Pro for a real team, don’t start with “make a cool poster.” Start with the annoying cases:

  • multilingual text, especially Arabic, CJK, and Devanagari
  • multi-person reference consistency under pose changes
  • object-heavy scenes with overlapping elements
  • deterministic reruns using saved parameters
  • retrieval safety with web search enabled
  • latency and cost under batch load
  • brand compliance and editability after generation

A strong prompt won’t save a weak pipeline. You still need templates, reference management, approval steps, and provenance handling around the model.
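If it helps, here is a skeleton for running those cases as a repeatable suite rather than ad hoc prompting. The generate callable stands in for whatever client call your stack uses, and the lambda checks are placeholders for real assertions (OCR, identity drift, object counts):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable    # e.g. OCR the text region, compare identity embeddings, count objects


def run_suite(generate: Callable, cases: list) -> dict:
    """Run each case through the image client and record pass/fail per case."""
    return {case.name: case.check(generate(case.prompt)) for case in cases}


CASES = [
    EvalCase("arabic_banner",
             "Storefront banner with an Arabic headline and the product name rendered exactly",
             check=lambda img: len(img) > 0),          # replace with an OCR-based check
    EvalCase("two_people_pose_change",
             "The same two people as the reference shots, new poses and lighting",
             check=lambda img: len(img) > 0),          # replace with an identity-drift check
    EvalCase("crowded_scene",
             "Twelve labelled objects on a workbench, overlapping but all identifiable",
             check=lambda img: len(img) > 0),          # replace with a detection/count check
]
```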

Google’s pitch is credible because it lines up with those needs. The model will still miss on edge cases. They all do. But this release points in a better direction: less generative party trick, more production tool. For teams waiting for image models to grow up a bit, that’s the part worth watching.
