Generative AI · December 23, 2025

Meta reportedly plans Mango image-video model and Avocado coding model for 2026

Meta’s next AI bet looks a lot like OpenAI’s, with a sharper focus on video and code

Meta is reportedly building two new flagship models for a first-half 2026 release: Mango, an image and video model, and Avocado, a text model aimed at coding. The details come from internal remarks reported by The Wall Street Journal. If the report is accurate, Meta is shifting away from generic chatbot catch-up and toward multimodal generation, visual reasoning, and code-heavy agents.

That’s a sensible place to compete. It’s also expensive, messy, and easy to oversell.

Meta’s AI story has felt scattered for a while. Big hiring, reorgs, talent churn. Meta AI reaches a huge audience because it sits inside Facebook, Instagram, WhatsApp, and Messenger, but distribution and technical leadership are different things. If Mango and Avocado ship as described, Meta is trying to close that gap with two products aimed at markets that still have room to move: high-end video generation and coding systems that can work across real software projects.

Why Mango matters

Another image model doesn’t move the market much in 2026. Video still does.

The hard part in video generation is consistency over time. Characters drift. Lighting changes for no clear reason. Motion gets rubbery. Objects disappear between frames. That’s why the interesting part of the Mango report isn’t the phrase “image and video model.” It’s Meta’s reported interest in world models that can understand visual input, reason over it, and plan actions without seeing every edge case during training.

That suggests a system with ambitions beyond prompt-to-video demos.

A plausible Mango stack looks pretty familiar on the generative side. Think a latent-space system built around a diffusion transformer or related flow-based architecture, with video compressed through a learned tokenizer such as a VAE or VQ scheme. Raw video is a memory disaster. Tokenizing frames lets the model run spatiotemporal attention over compressed representations instead of brute-forcing full-resolution sequences.
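
To make that concrete, here is a toy sketch of that stack: a strided 3D convolution standing in for the learned tokenizer, and one attention block running over the flattened latent grid. The sizes and compression ratios are illustrative assumptions, not anything reported about Mango.

```python
# Toy sketch of a latent video pipeline: tokenizer -> spatiotemporal attention.
# All sizes are illustrative assumptions, not a described Meta architecture.
import torch
import torch.nn as nn

class ToyVideoTokenizer(nn.Module):
    """Compresses video 4x in time and 8x in space via a strided 3D conv."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.encode = nn.Conv3d(3, latent_dim, kernel_size=(4, 8, 8), stride=(4, 8, 8))

    def forward(self, video):            # video: (B, 3, T, H, W)
        return self.encode(video)        # latents: (B, C, T/4, H/8, W/8)

class SpatioTemporalBlock(nn.Module):
    """Attention over the flattened (time x space) token grid."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, latents):                        # (B, C, t, h, w)
        tokens = latents.flatten(2).transpose(1, 2)    # (B, t*h*w, C)
        x = self.norm(tokens)
        out, _ = self.attn(x, x, x)
        return tokens + out

video = torch.randn(1, 3, 16, 256, 256)      # 16 frames of 256x256 RGB
latents = ToyVideoTokenizer()(video)         # (1, 256, 4, 32, 32)
tokens = SpatioTemporalBlock()(latents)      # 4*32*32 = 4096 tokens, vs ~1M raw pixels per frame stack
print(latents.shape, tokens.shape)
```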

That still leaves the consistency problem. Meta likely needs some mix of the following (a toy sketch of the first item follows the list):

  • temporal conditioning to track identity and scene state across long sequences
  • motion-aware losses so movement stays physically plausible
  • representation learning that teaches structure before style
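
Here is the toy conditioning sketch promised above: a denoiser that sees an identity embedding and the previous window's latents alongside the noisy input. The wiring is an assumption about how such conditioning could work, not a description of Meta's system.

```python
# Toy temporal conditioning: the denoiser is given an identity embedding and
# the previous clip's latent state along with the noisy input. Purely an
# assumption about how such conditioning *could* be wired up.
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    def __init__(self, latent_dim=64, id_dim=32):
        super().__init__()
        # Condition = identity embedding + diffusion timestep, projected once.
        self.proj_cond = nn.Linear(id_dim + 1, latent_dim)
        self.net = nn.Sequential(
            nn.Linear(latent_dim * 2, 512), nn.SiLU(), nn.Linear(512, latent_dim)
        )

    def forward(self, noisy, prev, identity, t):
        # noisy/prev: (B, N, latent_dim); identity: (B, id_dim); t: (B, 1)
        cond = self.proj_cond(torch.cat([identity, t], dim=-1)).unsqueeze(1)
        x = torch.cat([noisy, prev], dim=-1)   # previous window anchors the scene
        return self.net(x) + cond              # predicted noise, biased by condition

denoiser = ConditionedDenoiser()
noise_pred = denoiser(
    torch.randn(2, 16, 64),   # noisy latent tokens
    torch.randn(2, 16, 64),   # latents from the previous window (temporal anchor)
    torch.randn(2, 32),       # character/identity embedding
    torch.rand(2, 1),         # diffusion timestep
)
print(noise_pred.shape)       # (2, 16, 64)
```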

That last point matters. Meta has spent years pushing JEPA-style predictive learning, where a model learns latent structure by predicting missing or future representations instead of pixels. If Mango picks up that approach, it could be stronger on perception and planning than text-to-video systems tuned mostly for cinematic output.
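
A minimal sketch of the JEPA idea, with simple linear encoders as stand-ins: the model regresses the target encoder's representation of held-out frames, never touching pixels, and the target branch updates by exponential moving average. This is the general recipe in miniature, not Meta's V-JEPA code.

```python
# Toy JEPA-style objective: predict the *representation* of future frames
# from context, rather than reconstructing pixels.
import torch
import torch.nn as nn
import torch.nn.functional as F

context_encoder = nn.Linear(1024, 256)   # stand-in for a real video encoder
target_encoder = nn.Linear(1024, 256)    # EMA copy of the context encoder
predictor = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))

context_frames = torch.randn(8, 1024)    # flattened features of visible frames
future_frames = torch.randn(8, 1024)     # flattened features of held-out frames

with torch.no_grad():                    # targets come from the frozen branch
    target = target_encoder(future_frames)

pred = predictor(context_encoder(context_frames))
loss = F.smooth_l1_loss(pred, target)    # latent-space regression, no pixel loss
loss.backward()

# The target encoder tracks the context encoder as an exponential moving
# average, which keeps the latent targets from collapsing.
ema = 0.996
for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
    p_t.data.mul_(ema).add_(p_c.data, alpha=1 - ema)
```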

Engineers should still be skeptical of the phrase “world model.” It has become a catch-all. In practice, it usually means a model learns a compact latent simulation of an environment and uses that state for forecasting, planning, or action selection. In robotics or games, that’s fairly concrete. In consumer AI products, it gets vague fast. Meta’s near-term use is probably closer to software agents that can interpret app screens, UI states, and rendered environments.
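
In code, that concrete reading is small. A minimal sketch, assuming toy linear components: encode an observation, roll the latent state forward under candidate actions, and score the imagined outcomes.

```python
# Minimal reading of "world model": encode an observation into latent state,
# roll the state forward under candidate actions, pick the best-scoring one.
# Every component here is a toy assumption, not a described Meta system.
import torch
import torch.nn as nn

encoder = nn.Linear(128, 32)              # observation -> latent state
dynamics = nn.Linear(32 + 4, 32)          # (state, action) -> next state
value = nn.Linear(32, 1)                  # how good a predicted state looks

def plan(observation, candidate_actions, horizon=3):
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        state = encoder(observation)
        for _ in range(horizon):          # imagine the future in latent space
            state = torch.tanh(dynamics(torch.cat([state, action])))
        score = value(state).item()
        if score > best_score:
            best_action, best_score = action, score
    return best_action

obs = torch.randn(128)                    # e.g. an embedded app screen
actions = [torch.randn(4) for _ in range(5)]
print(plan(obs, actions))
```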

That would already be a real improvement over current multimodal assistants, which can see images but still struggle to reason through visual state reliably.

Avocado is easier to place

Meta’s coding model is easier to understand and probably easier to sell.

If Avocado is meant to compete seriously, the target is no longer single-function completion. Copilot, Claude Code, OpenAI’s coding stack, and a pile of smaller tools already do that. A 2026 coding model has to work across files, understand repo context, run tools, execute tests, and recover when its first guess is wrong.

That points to a standard but still demanding architecture: a large decoder-only transformer, probably with MoE routing to keep inference costs under control, trained on filtered code corpora, commit histories, issue threads, docs, and internal repositories. The data mix matters as much as the parameter count. Coding models get better when they see not just final code, but the path to final code: diffs, bug reports, failed tests, code review comments, dependency upgrades.
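
A toy version of the MoE piece, with assumed sizes and a naive routing loop (real systems use fused dispatch kernels): the router sends each token to 2 of 8 expert MLPs, so most parameters stay cold at inference time.

```python
# Toy top-k mixture-of-experts layer. Sizes and routing are illustrative
# assumptions about this class of model, not Avocado's actual design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, dim=512, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, dim)
        logits = self.router(x)                    # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):             # dispatch tokens to experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(ToyMoE()(tokens).shape)                      # (16, 512)
```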

The model is only half the product. The rest is the loop around it.

A serious coding system in 2026 needs:

  • sandboxed code execution
  • unit test generation and repair cycles
  • retrieval over internal docs and APIs
  • calls to linters, static analyzers, package registries, and security scanners
  • long-context handling that holds up under multi-file edits

So when vendors say a model is “better at coding,” the useful question is whether it uses tools well. Raw next-token prediction still matters. The practical gains come from execution-aware generation. The model writes code, runs it, sees failures, revises, and proposes a patch that survives contact with reality. Without that loop, benchmark scores can flatter badly.
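
Schematically, that loop is simple. A sketch assuming the model call and patch application are injected as callables, with pytest as the reality check:

```python
# The execute-and-revise loop described above. `generate_patch` and
# `apply_patch` are hypothetical callables supplied by the caller; the test
# run is a real subprocess call to pytest.
import subprocess
from typing import Callable

def run_tests(repo_dir: str) -> tuple[bool, str]:
    """Run the test suite and return (passed, combined output)."""
    result = subprocess.run(
        ["pytest", "-x", "-q"], cwd=repo_dir,
        capture_output=True, text=True, timeout=600,
    )
    return result.returncode == 0, result.stdout + result.stderr

def fix_until_green(
    generate_patch: Callable[[str, str], str],   # (task, feedback) -> diff text
    apply_patch: Callable[[str], None],          # applies the diff to the repo
    task: str,
    repo_dir: str,
    max_attempts: int = 4,
) -> bool:
    feedback = ""
    for _ in range(max_attempts):
        apply_patch(generate_patch(task, feedback))
        passed, output = run_tests(repo_dir)
        if passed:
            return True       # the patch survived contact with reality
        feedback = output     # failures become context for the next attempt
    return False
```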

Watch the usual tests like HumanEval, MBPP, and RepoBench, but pay more attention to repo-level change sets, patch acceptance rates, runtime error frequency, and how often the system can complete a migration without human cleanup. Those are the numbers engineering leaders care about.
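
Those numbers are cheap to compute once you log attempts. A minimal sketch, with a made-up record format; the point is that the metrics come from real runs, not a static benchmark harness.

```python
# Operational coding metrics from a run log. The record format is an assumed
# starting point, not a standard.
from dataclasses import dataclass

@dataclass
class AttemptRecord:
    patch_merged: bool        # did a human accept the change?
    tests_passed: bool        # did CI go green?
    runtime_error: bool       # did the generated code crash when executed?

def summarize(runs: list[AttemptRecord]) -> dict[str, float]:
    n = len(runs)
    return {
        "patch_acceptance_rate": sum(r.patch_merged for r in runs) / n,
        "test_pass_rate": sum(r.tests_passed for r in runs) / n,
        "runtime_error_rate": sum(r.runtime_error for r in runs) / n,
    }

log = [AttemptRecord(True, True, False), AttemptRecord(False, False, True)]
print(summarize(log))   # {'patch_acceptance_rate': 0.5, ...}
```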

A broader agent stack

Taken together, Mango and Avocado look like parts of a broader system.

Mango would handle perception and generation across images and video. Avocado would handle language, code, and probably tool orchestration. Add planning, memory, and execution, and you have the outline of an agent platform.

That lines up with where the market has landed after two years of inflated agent claims. Pure language agents are brittle. They hallucinate state, lose context, and struggle with interfaces. Better systems split the work, as the control-loop sketch after this list illustrates:

  • one model or encoder sees the world
  • one model reasons over text, code, and plans
  • a controller decides which tools to call
  • evaluators or critics check whether the result is valid
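
That split fits in a short control loop. A sketch where every role is an injected callable, showing the division of labor rather than any vendor's agent framework:

```python
# The four-role split above as a minimal control loop. Every component is a
# hypothetical callable supplied by the caller.
from typing import Any, Callable

def run_agent(
    perceive: Callable[[Any], str],      # vision model: world -> description
    reason: Callable[[str, str], str],   # text/code model: state -> plan step
    act: Callable[[str], Any],           # controller: plan step -> tool call
    validate: Callable[[Any], bool],     # critic: is the result acceptable?
    goal: str,
    observation: Any,
    max_steps: int = 10,
) -> Any:
    for _ in range(max_steps):
        state = perceive(observation)    # one model sees the world
        step = reason(goal, state)       # one model plans over text and code
        result = act(step)               # a controller picks and calls tools
        if validate(result):             # an evaluator checks validity
            return result
        observation = result             # otherwise loop with the new state
    raise RuntimeError("goal not reached within step budget")
```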

Meta’s reported direction fits that pattern. So do Meta’s strengths. The company has distribution, massive infrastructure, consumer products full of image and video data, and enough internal software complexity to justify a coding assistant built for real engineering workflows. What it doesn’t have is the same developer mindshare as OpenAI or Anthropic. Avocado looks aimed squarely at that problem.

Where the costs pile up

Video training is brutal. Storage, bandwidth, curation, labeling, filtering, safety review, rights management, eval pipelines. Every part gets harder once you move past static images. If Mango is serious, Meta needs a large clean video corpus and a credible story on licensing and provenance.

Inference is expensive too. High-quality video generation burns compute in a way text products mostly don’t. Even with latent compression, sparse attention, quantization, and fast kernels like FlashAttention-3, serving video at scale is a very different economic problem from serving chat completions.
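
Some back-of-envelope arithmetic shows why, using assumed (not reported) compression ratios:

```python
# Token counts for one short clip, with assumed compression ratios.
frames = 5 * 24                 # a 5-second clip at 24 fps
height, width = 720, 1280       # 720p

spatial_compression = 8         # assumed VAE downsampling per side
temporal_compression = 4        # assumed frame grouping
patch = 2                       # assumed 2x2 patchification in the transformer

latent_frames = frames // temporal_compression                  # 30
tokens_per_frame = (height // spatial_compression // patch) * \
                   (width // spatial_compression // patch)      # 45 * 80 = 3600
video_tokens = latent_frames * tokens_per_frame

print(video_tokens)   # 108000 latent positions for one 5-second clip
# A typical chat completion is on the order of 1k tokens: a ~100x gap before
# accounting for the heavier per-token cost of iterative video sampling.
```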

Developers building on these systems should expect vendor APIs to hide most of that complexity while passing through plenty of the cost.

For coding, the bottleneck is less about generation cost and more about workflow integration. A model that edits code but can’t access your build graph, CI checks, API schemas, package policies, and security rules won’t earn much trust. Enterprises will want sandboxes, audit logs, access controls, SBOM checks, license scanning, and secrets detection before they let a coding agent touch production repositories.

That’s standard due diligence. Generated code that compiles is easy. Generated code that meets policy is harder.

What teams should do now

If you build developer tools, creative tools, or internal AI platforms, Meta’s reported roadmap is a decent signal. More capable multimodal and code-centric APIs are coming, whether from Meta or someone else.

Treat prompts like configs

For image and video workflows, free-form prompting runs out of road quickly. Teams need structured controls for character identity, camera movement, style constraints, shot continuity, and asset references. If Mango ships with better video coherence, the teams that benefit most will be the ones with solid production pipelines.
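
A sketch of what "prompts as configs" can look like, with field names that are assumptions about what a production pipeline would pin down, not any vendor's schema:

```python
# Structured shot controls as a typed, versionable config object.
from dataclasses import dataclass, field

@dataclass
class ShotConfig:
    prompt: str                               # the creative brief, still free text
    character_refs: list[str] = field(default_factory=list)  # asset IDs for identity
    camera: str = "static"                    # e.g. "static", "dolly-in", "pan-left"
    style_constraints: list[str] = field(default_factory=list)
    continuity_from: str | None = None        # previous shot ID, for scene state
    seed: int = 0                             # pinned for reproducible re-renders

shot_02 = ShotConfig(
    prompt="The courier crosses the rainy street toward the kiosk",
    character_refs=["char/courier_v3"],
    camera="dolly-in",
    style_constraints=["35mm", "overcast"],
    continuity_from="shot_01",
    seed=42,
)
```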

Store richer multimodal context

Video systems improve when you keep frame-level metadata, captions, embeddings, timestamps, and object tags. If your data layer still treats media as blobs with filenames, that’s going to hurt.
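
A minimal frame-level record, as an assumed starting point rather than a standard:

```python
# One record per sampled frame, instead of treating media as opaque blobs.
from dataclasses import dataclass

@dataclass
class FrameRecord:
    asset_id: str
    frame_index: int
    timestamp_s: float
    caption: str                 # human- or model-written description
    embedding: list[float]       # vector for similarity search
    object_tags: list[str]       # e.g. ["person", "bicycle"]

record = FrameRecord(
    asset_id="clip_0042",
    frame_index=120,
    timestamp_s=5.0,
    caption="Courier stops at the kiosk",
    embedding=[0.12, -0.08, 0.33],   # truncated for illustration
    object_tags=["person", "kiosk"],
)
```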

Build evals around actual failures

For video, that means temporal consistency, identity preservation, motion realism, and human preference tests. For code, it means patch acceptance, test pass rates, security findings, and latency under repo-scale context. Benchmark theater is cheap. Operational evaluation takes work and pays off.
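
One of the video checks is easy to make concrete. A sketch scoring temporal consistency as mean cosine similarity between consecutive frame embeddings, with random placeholders standing in for a real vision encoder:

```python
# Temporal consistency as mean cosine similarity between consecutive frame
# embeddings. A real eval would embed frames with a pretrained vision encoder.
import numpy as np

def temporal_consistency(frame_embeddings: np.ndarray) -> float:
    """frame_embeddings: (num_frames, dim), one vector per sampled frame."""
    a = frame_embeddings[:-1]
    b = frame_embeddings[1:]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return float(cos.mean())   # near 1.0 = stable; drops flag drift or flicker

frames = np.random.randn(120, 512)   # placeholder embeddings for one clip
print(temporal_consistency(frames))
```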

Plan for provenance

If Mango produces convincing media, enterprises will ask for C2PA signing, watermark handling, moderation controls, and audit trails. Fair enough. Synthetic media without provenance turns into a governance problem quickly.

What Meta still has to prove

Meta can build strong models. That part isn’t in doubt.

The harder question is whether it can turn those models into products that developers and businesses choose on merit, not because they’re preloaded inside social apps. Open-weight credibility helps. Massive distribution helps too. Neither one guarantees a strong coding assistant or a dependable video platform.

Mango could matter if it narrows the gap between video generation and actual visual reasoning. Avocado could matter if it works like a software engineer with tools, not a chatbot with syntax highlighting.

That’s the bar now. Anything short of that disappears into the next benchmark cycle.

What to watch

Creative output quality is only one part of adoption. Rights, review workflows, brand control, and editability matter just as much. Teams should separate impressive generation from repeatable production use.
