Mirelo raises $41M to fix the most obvious flaw in AI video: it still sounds dead
AI video looks a lot better than it did a year ago. The audio still lags behind. Plenty of clips sound cheap, and plenty ship with no sound at all.
Berlin startup Mirelo has raised a $41 million seed round from Index Ventures and Andreessen Horowitz to work on that gap. Its models watch a video, parse what happens frame by frame, and generate synced sound effects. Footsteps, fabric movement, engine noise, impacts, room tone, weather. The boring pieces that make a scene feel physical.
That focus makes sense. A lot of AI audio startups are trying to cover voice, music, dubbing, and sound design in one shot. Mirelo is starting with Foley and spot effects. For now, that looks like the smarter wedge. Video-to-SFX is hard, but it’s still a narrower problem than music, and users notice the absence immediately.
Why this matters now
The surge in AI-generated video has created a pretty specific production headache. Teams can crank out decent visuals quickly, but audio still means manual editing, stock libraries, and timing work. That slows down the whole loop.
If you’re shipping short-form video, previs, game cutscenes, or ad variants, silent clips stand out for the wrong reason. Audio carries weight, distance, rhythm, and mood. A punch with no impact, a door with no latch sound, footsteps that miss the contact frame, and the scene starts to feel fake fast.
Mirelo CEO CJ Simon-Gabriel told TechCrunch, "Sound is 50% of the movie-going experience." It’s a familiar line, but he’s right on the substance. The same visuals can read completely differently depending on the sound design, and most AI video systems have treated that layer as an afterthought.
What Mirelo is building
Mirelo released Mirelo SFX v1.5 earlier this year for video-to-SFX generation. The company has also put models on Fal.ai and Replicate, which is a sensible move for a small team. It gets developers using the product without forcing them into a full platform first.
The near-term business model looks pretty clear:
- API access for developers and studios
- A freemium tier for lighter use
- A creator plan around €20/month
- A forthcoming product called Mirelo Studio, likely a browser-based workspace for creators and editors
The company reportedly has around 10 people and plans to double or triple headcount in 2026 across research, product, and go-to-market.
$41 million is a big seed round, but it fits the market. Investors are still writing large checks for multimodal AI teams with a clear entry point, especially when the product can become an API business before it tries to grow into a full creative suite.
The hard part
A usable video-to-SFX system has to do at least four things well:
- Understand what happens in the clip
- Detect when each event happens
- Choose or synthesize the right sound
- Align the sound tightly enough that people don’t notice drift
That last part matters more than a lot of demo videos admit.
People will forgive some visual weirdness. They’re far less forgiving about sync. If a foot lands and the sound trails even a little, the illusion breaks. Teams evaluating products in this category should care less about vague claims around “audio quality” and more about onset timing, consistency, and whether outputs are editable.
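One way to make that concrete when evaluating a vendor: compare detected audio onsets against hand-annotated visual event times. Here's a minimal sketch using librosa; the file path and event timestamps are placeholders, and a tighter evaluation would use your own annotated test clips.

```python
import numpy as np
import librosa

# Hand-annotated times (seconds) where visual events land in the clip,
# e.g. footfall contact frames marked in an NLE. Placeholder values.
visual_events = np.array([0.52, 1.10, 1.71, 2.33])

# Load the generated SFX track and detect audio onsets.
y, sr = librosa.load("generated_sfx.wav", sr=None)
onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")

# For each visual event, find the nearest detected onset and measure drift.
errors_ms = []
for t in visual_events:
    nearest = onsets[np.argmin(np.abs(onsets - t))]
    errors_ms.append((nearest - t) * 1000.0)

# Roughly one frame at 24 fps (~42 ms) is a reasonable starting tolerance.
print(f"mean abs drift: {np.mean(np.abs(errors_ms)):.1f} ms")
print(f"worst case:     {np.max(np.abs(errors_ms)):.1f} ms")
```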
A typical stack probably includes:
- A spatiotemporal video encoder such as TimeSformer, ViViT, or a 3D CNN to read motion across frames
- Event segmentation that marks actions with timestamps
- Cross-modal conditioning, potentially using embeddings in the style of CLAP
- Audio generation or audio-token decoding with models built around codecs like EnCodec or DAC
- A post-process layer for mixing, loudness, reverb, and spatial placement
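Nothing about Mirelo's internals is public, but the overall shape of that pipeline is easy to sketch. Everything below is hypothetical: the component names are stand-ins for whichever encoder, segmenter, and decoder a team actually picks.

```python
from dataclasses import dataclass

@dataclass
class SoundEvent:
    label: str        # e.g. "footstep_concrete"
    start_s: float    # onset time within the clip
    duration_s: float
    intensity: float  # 0..1, drives gain and sample choice

def video_to_sfx(frames, fps, encoder, segmenter, sound_model, mixer):
    """Hypothetical end-to-end pass; every argument is a stand-in component.

    encoder:     spatiotemporal model (TimeSformer/ViViT-style) -> clip features
    segmenter:   turns features into timestamped SoundEvents
    sound_model: retrieves or generates audio per event
    mixer:       loudness, reverb, and placement into final stems
    """
    features = encoder(frames, fps)            # motion-aware embeddings
    events = segmenter(features)               # list[SoundEvent]
    clips = [sound_model(e) for e in events]   # one waveform per event
    return mixer(clips, events)                # aligned, mixed stems
```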
Mirelo probably uses some mix of retrieval and generation. That’s the practical choice.
Pure generation makes for nice demos, but retrieval is still more reliable for common events. If the system sees a car door slam, a glass clink, or sneakers on pavement, using a high-quality library asset and adapting it is perfectly reasonable. Generation helps when the scene is unusual, the timing is odd, or the requested acoustic texture isn’t in the library.
That hybrid setup also helps on cost. Full waveform generation for every tiny event gets expensive quickly.
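A plausible routing rule, sketched under the assumption that the system scores each detected event against a labeled library before deciding. The `search` and `synthesize` calls and the 0.82 cutoff are invented for illustration.

```python
def produce_sfx(event, library, generator, min_similarity=0.82):
    """Hypothetical router: prefer a library asset, fall back to generation.

    library.search() and generator.synthesize() are stand-ins, and the
    similarity cutoff is an arbitrary illustrative threshold.
    """
    asset, score = library.search(event.label, top_k=1)
    if score >= min_similarity:
        # Common, well-covered events: adapt a known-good recording.
        return asset.time_stretch(event.duration_s).gain(event.intensity)
    # Unusual events or odd timing: pay the cost of full generation.
    return generator.synthesize(event)
```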
Specialization helps, up to a point
Mirelo’s pitch combines specialization with rights-aware data sourcing. The company says it trains on public and purchased sound libraries and is signing revenue-sharing deals with artists.
That matters for two reasons.
First, training data provenance is now a product risk, not a footnote. Audio startups that can explain where their training and retrieval assets come from have a cleaner story for enterprise buyers.
Second, good Foley data is messy. The problem isn’t collecting “door sounds.” It’s collecting enough variation with useful labels: wood versus metal, interior versus exterior, heavy door versus cabinet, close mic versus room mic, soft close versus slam. That taxonomy becomes part of the product.
If Mirelo has built a well-labeled, rights-aware SFX dataset, that’s valuable. Whether it stays a moat is less clear. Bigger companies can buy libraries, sign licensing deals, or fold video-to-sound into broader media stacks. Sony, Tencent, Kuaishou’s Kling AI, and ElevenLabs are already around this territory.
So yes, specialization helps. It won’t be enough on its own unless the product gets sticky.
What developers should watch before using the API
The demo version of this product is easy to understand. Upload a clip, get stems back. The production version is where it gets annoying.
Sync drift and input quality
Variable frame rate social video can throw sync off. Normalize inputs: MP4 or MOV with a constant frame rate is the safer option. If your pipeline ingests random creator uploads, add preprocessing to standardize fps and strip strange timing metadata before inference.
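A minimal preprocessing pass with ffmpeg via subprocess. The 30 fps target is an arbitrary choice, and `-map_metadata -1` strips container metadata wholesale, which is blunt but effective against odd timing tags.

```python
import subprocess

def normalize_video(src: str, dst: str, fps: int = 30) -> None:
    """Re-encode to a constant frame rate and drop container metadata.

    VFR social uploads are forced to CFR so downstream frame indices
    map cleanly to timestamps. The target fps is a pipeline choice.
    """
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", src,
            "-vf", f"fps={fps}",     # resample to a constant frame rate
            "-map_metadata", "-1",   # strip timing/rotation metadata
            "-c:a", "copy",          # leave any source audio untouched
            dst,
        ],
        check=True,
    )

normalize_video("creator_upload.mov", "normalized.mp4")
```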
Audio output format
For real video work, 48 kHz should be the baseline. Separate stem export matters too. If a service only returns a flattened stereo mix, it’s much less useful in editing software and game pipelines.
Latency
Offline generation is fine for creators. Real-time or near-real-time use in games is a different problem.
Generating polished, tightly aligned SFX on the fly inside Unity or Unreal is still hard. You likely need a lightweight inference path, caching for common cues, or a hybrid setup where the model predicts events and a lower-latency engine handles playback. If a vendor claims “real-time AI Foley” for gameplay today, ask about frame budgets and platform targets.
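One hybrid pattern, sketched with invented names: run the heavy model offline to pre-render common cues, and let the engine do a plain lookup at runtime, which fits inside a per-frame audio budget.

```python
class CueCache:
    """Hypothetical pre-rendered SFX cache for runtime playback."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(label: str, surface: str, intensity: float):
        # Quantize intensity so near-identical cues share one entry.
        return (label, surface, round(intensity * 4))

    def put(self, label, surface, intensity, waveform):
        self._store[self._key(label, surface, intensity)] = waveform

    def get(self, label, surface, intensity):
        return self._store.get(self._key(label, surface, intensity))

# Offline:  cache.put("footstep", "gravel", 0.50, model_output)
# Runtime:  wav = cache.get("footstep", "gravel", 0.45)  # same bucket
```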
Cost and serving
Video encoders plus audio decoders aren’t cheap to run. For higher-quality inference, A100 or H100 class GPUs make sense. For prosumer workloads, L4 and A10 can work if batching is tuned well.
Chunking long videos is standard practice. You need overlap between segments or you’ll end up with audible seams, broken ambience, and clipped events.
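Here's a sketch of the stitch step with numpy, assuming per-chunk audio arrays at a fixed sample rate generated from video segments that overlap by a known amount. The half-second overlap and linear crossfade are arbitrary starting points.

```python
import numpy as np

def stitch_chunks(chunks, sr=48000, overlap_s=0.5):
    """Crossfade overlapping audio chunks to avoid audible seams.

    Assumes adjacent chunks were generated from video segments that
    overlap by `overlap_s` seconds, so they share that much audio.
    """
    n = int(sr * overlap_s)
    fade_out = np.linspace(1.0, 0.0, n)
    fade_in = 1.0 - fade_out

    out = chunks[0].astype(np.float64)
    for nxt in chunks[1:]:
        nxt = nxt.astype(np.float64)
        # Blend the shared region, then append the rest of the next chunk.
        out[-n:] = out[-n:] * fade_out + nxt[:n] * fade_in
        out = np.concatenate([out, nxt[n:]])
    return out
```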
Controls
A serious product here needs more than a generate button. Editors want controls for surface type, intensity, environment, sync tightness, and spatial feel. A useful API should accept metadata like wood_floor, rain, wide_shot, or indoor_reverb and return something deterministic enough to iterate on.
Without that, it’s a novelty feature.
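What that control surface could look like at the API level. This is an invented request shape, not Mirelo's documented API; the endpoint, field names, and seed behavior are all assumptions about what a serious product would expose.

```python
import requests

# Invented endpoint and fields, shown only to illustrate the kind of
# control surface editors need; not Mirelo's actual API.
resp = requests.post(
    "https://api.example.com/v1/sfx",
    headers={"Authorization": "Bearer <token>"},
    json={
        "video_url": "https://cdn.example.com/clip.mp4",
        "controls": {
            "surface": "wood_floor",
            "environment": "indoor_reverb",
            "weather": "rain",
            "shot": "wide_shot",
            "sync_tightness": "strict",
        },
        "output": {"sample_rate": 48000, "stems": True, "format": "wav"},
        "seed": 42,  # determinism, so iterations are comparable
    },
    timeout=120,
)
stems = resp.json()["stems"]  # hypothetical response field
```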
Where this lands first
The clearest early wins are in workflows where speed matters more than handcrafted sound design.
- UGC and social tools will probably adopt auto-SFX as a background feature.
- Previs and animation teams can use it to rough in sound much earlier.
- Ad production gets cheaper when every visual variant can ship with matching audio.
- Game studios will likely use it first for prototyping and content iteration, not final runtime playback.
That last point matters. In games, this looks better suited to authoring pipelines in the near term than live inference during gameplay.
The bet
Mirelo is betting that AI video will create a new baseline expectation: if something moves on screen, it should make sound, and that sound should hit on the right frame.
That’s a sensible bet. It also leaves the company exposed to bundling. If major AI media platforms turn synced SFX into a checkbox feature, standalone vendors will need better controls, better quality, cleaner licensing, or tighter workflow integration to stay relevant.
Still, this is one of the more grounded AI media startups in the market. The problem is real. The product direction makes sense. And unlike a lot of multimodal demo bait, this solves something users notice the second it’s missing.