Mirelo raises $41M to fix the most obvious flaw in AI video: it still sounds dead
AI video looks a lot better than it did a year ago. The audio still lags behind. Plenty of clips sound cheap, and plenty ship with no sound at all.
Berlin startup Mirelo has raised a $41 million seed round from Index Ventures and Andreessen Horowitz to work on that gap. Its models watch a video, parse what happens frame by frame, and generate synced sound effects. Footsteps, fabric movement, engine noise, impacts, room tone, weather. The boring pieces that make a scene feel physical.
That focus makes sense. A lot of AI audio startups are trying to cover voice, music, dubbing, and sound design in one shot. Mirelo is starting with Foley and spot effects. For now, that looks like the smarter wedge. Video-to-SFX is hard, but it’s still a narrower problem than music, and users notice the absence immediately.
Why this matters now
The surge in AI-generated video has created a pretty specific production headache. Teams can crank out decent visuals quickly, but audio still means manual editing, stock libraries, and timing work. That slows down the whole loop.
If you’re shipping short-form video, previs, game cutscenes, or ad variants, silent clips stand out for the wrong reason. Audio carries weight, distance, rhythm, and mood. A punch with no impact, a door with no latch sound, footsteps that miss the contact frame, and the scene starts to feel fake fast.
Mirelo CEO CJ Simon-Gabriel told TechCrunch, "Sound is 50% of the movie-going experience." It’s a familiar line, but he’s right on the substance. The same visuals can read completely differently depending on the sound design, and most AI video systems have treated that layer as an afterthought.
What Mirelo is building
Mirelo released Mirelo SFX v1.5 earlier this year for video-to-SFX generation. The company has also put models on Fal.ai and Replicate, which is a sensible move for a small team. It gets developers using the product without forcing them into a full platform first.
The near-term business model looks pretty clear:
- API access for developers and studios
- A freemium tier for lighter use
- A creator plan around €20/month
- A forthcoming product called Mirelo Studio, likely a browser-based workspace for creators and editors
The company reportedly has around 10 people and plans to double or triple headcount in 2026 across research, product, and go-to-market.
$41 million is a big seed round, but it fits the market. Investors are still writing large checks for multimodal AI teams with a clear entry point, especially when the product can become an API business before it tries to grow into a full creative suite.
The hard part
A usable video-to-SFX system has to do at least four things well:
- Understand what happens in the clip
- Detect when each event happens
- Choose or synthesize the right sound
- Align the sound tightly enough that people don’t notice drift
That last part matters more than a lot of demo videos admit.
People will forgive some visual weirdness. They’re far less forgiving about sync. If a foot lands and the sound trails even a little, the illusion breaks. Teams evaluating products in this category should care less about vague claims around “audio quality” and more about onset timing, consistency, and whether outputs are editable.
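One way to make that concrete when evaluating a vendor: compare detected audio onsets against hand-annotated visual event times. Here's a minimal sketch using librosa; the file path and event timestamps are placeholders, and a tighter evaluation would use your own annotated test clips.

```python
import numpy as np
import librosa

# Hand-annotated times (seconds) where visual events land in the clip,
# e.g. footfall contact frames marked in an NLE. Placeholder values.
visual_events = np.array([0.52, 1.10, 1.71, 2.33])

# Load the generated SFX track and detect audio onsets.
y, sr = librosa.load("generated_sfx.wav", sr=None)
onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")

# For each visual event, find the nearest detected onset and measure drift.
errors_ms = []
for t in visual_events:
    nearest = onsets[np.argmin(np.abs(onsets - t))]
    errors_ms.append((nearest - t) * 1000.0)

# Roughly one frame at 24 fps (~42 ms) is a reasonable starting tolerance.
print(f"mean abs drift: {np.mean(np.abs(errors_ms)):.1f} ms")
print(f"worst case:     {np.max(np.abs(errors_ms)):.1f} ms")
```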
A typical stack probably includes:
- A spatiotemporal video encoder such as TimeSformer, ViViT, or a 3D CNN to read motion across frames
- Event segmentation that marks actions with timestamps
- Cross-modal conditioning, potentially using embeddings in the style of CLAP
- Audio generation or audio-token decoding with models built around codecs like EnCodec or DAC
- A post-process layer for mixing, loudness, reverb, and spatial placement
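Nothing about Mirelo's internals is public, but the overall shape of that pipeline is easy to sketch. Everything below is hypothetical: the component names are stand-ins for whichever encoder, segmenter, and decoder a team actually picks.

```python
from dataclasses import dataclass

@dataclass
class SoundEvent:
    label: str        # e.g. "footstep_concrete"
    start_s: float    # onset time within the clip
    duration_s: float
    intensity: float  # 0..1, drives gain and sample choice

def video_to_sfx(frames, fps, encoder, segmenter, sound_model, mixer):
    """Hypothetical end-to-end pass; every argument is a stand-in component.

    encoder:     spatiotemporal model (TimeSformer/ViViT-style) -> clip features
    segmenter:   turns features into timestamped SoundEvents
    sound_model: retrieves or generates audio per event
    mixer:       loudness, reverb, and placement into final stems
    """
    features = encoder(frames, fps)            # motion-aware embeddings
    events = segmenter(features)               # list[SoundEvent]
    clips = [sound_model(e) for e in events]   # one waveform per event
    return mixer(clips, events)                # aligned, mixed stems
```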
Mirelo probably uses some mix of retrieval and generation. That’s the practical choice.
Pure generation makes for nice demos, but retrieval is still more reliable for common events. If the system sees a car door slam, a glass clink, or sneakers on pavement, using a high-quality library asset and adapting it is perfectly reasonable. Generation helps when the scene is unusual, the timing is odd, or the requested acoustic texture isn’t in the library.
That hybrid setup also helps on cost. Full waveform generation for every tiny event gets expensive quickly.
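A plausible routing rule, sketched under the assumption that the system scores each detected event against a labeled library before deciding. The `search` and `synthesize` calls and the 0.82 cutoff are invented for illustration.

```python
def produce_sfx(event, library, generator, min_similarity=0.82):
    """Hypothetical router: prefer a library asset, fall back to generation.

    library.search() and generator.synthesize() are stand-ins, and the
    similarity cutoff is an arbitrary illustrative threshold.
    """
    asset, score = library.search(event.label, top_k=1)
    if score >= min_similarity:
        # Common, well-covered events: adapt a known-good recording.
        return asset.time_stretch(event.duration_s).gain(event.intensity)
    # Unusual events or odd timing: pay the cost of full generation.
    return generator.synthesize(event)
```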
Specialization helps, up to a point
Mirelo’s pitch combines specialization with rights-aware data sourcing. The company says it trains on public and purchased sound libraries and is signing revenue-sharing deals with artists.
That matters for two reasons.
First, training data provenance is now a product risk, not a footnote. Audio startups that can explain where their training and retrieval assets come from have a cleaner story for enterprise buyers.
Second, good Foley data is messy. The problem isn’t collecting “door sounds.” It’s collecting enough variation with useful labels: wood versus metal, interior versus exterior, heavy door versus cabinet, close mic versus room mic, soft close versus slam. That taxonomy becomes part of the product.
If Mirelo has built a well-labeled, rights-aware SFX dataset, that’s valuable. Whether it stays a moat is less clear. Bigger companies can buy libraries, sign licensing deals, or fold video-to-sound into broader media stacks. Sony, Tencent, Kuaishou’s Kling AI, and ElevenLabs are already around this territory.
So yes, specialization helps. It won’t be enough on its own unless the product gets sticky.
What developers should watch before using the API
The demo version of this product is easy to understand. Upload a clip, get stems back. The production version is where it gets annoying.
Sync drift and input quality
Variable frame rate social video can throw sync off. Normalize inputs: MP4 or MOV with a constant frame rate is the safer option. If your pipeline ingests random creator uploads, add preprocessing to standardize fps and strip strange timing metadata before inference.
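A minimal preprocessing pass with ffmpeg via subprocess. The 30 fps target is an arbitrary choice, and `-map_metadata -1` strips container metadata wholesale, which is blunt but effective against odd timing tags.

```python
import subprocess

def normalize_video(src: str, dst: str, fps: int = 30) -> None:
    """Re-encode to a constant frame rate and drop container metadata.

    VFR social uploads are forced to CFR so downstream frame indices
    map cleanly to timestamps. The target fps is a pipeline choice.
    """
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", src,
            "-vf", f"fps={fps}",     # resample to a constant frame rate
            "-map_metadata", "-1",   # strip timing/rotation metadata
            "-c:a", "copy",          # leave any source audio untouched
            dst,
        ],
        check=True,
    )

normalize_video("creator_upload.mov", "normalized.mp4")
```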
Audio output format
For real video work, 48 kHz should be the baseline. Separate stem export matters too. If a service only returns a flattened stereo mix, it’s much less useful in editing software and game pipelines.
Latency
Offline generation is fine for creators. Real-time or near-real-time use in games is a different problem.
Generating polished, tightly aligned SFX on the fly inside Unity or Unreal is still hard. You likely need a lightweight inference path, caching for common cues, or a hybrid setup where the model predicts events and a lower-latency engine handles playback. If a vendor claims “real-time AI Foley” for gameplay today, ask about frame budgets and platform targets.
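One hybrid pattern, sketched with invented names: run the heavy model offline to pre-render common cues, and let the engine do a plain lookup at runtime, which fits inside a per-frame audio budget.

```python
class CueCache:
    """Hypothetical pre-rendered SFX cache for runtime playback."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(label: str, surface: str, intensity: float):
        # Quantize intensity so near-identical cues share one entry.
        return (label, surface, round(intensity * 4))

    def put(self, label, surface, intensity, waveform):
        self._store[self._key(label, surface, intensity)] = waveform

    def get(self, label, surface, intensity):
        return self._store.get(self._key(label, surface, intensity))

# Offline:  cache.put("footstep", "gravel", 0.50, model_output)
# Runtime:  wav = cache.get("footstep", "gravel", 0.45)  # same bucket
```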
Cost and serving
Video encoders plus audio decoders aren’t cheap to run. For higher-quality inference, A100 or H100 class GPUs make sense. For prosumer workloads, L4 and A10 can work if batching is tuned well.
Chunking long videos is standard practice. You need overlap between segments or you’ll end up with audible seams, broken ambience, and clipped events.
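Here's a sketch of the stitch step with numpy, assuming per-chunk audio arrays at a fixed sample rate generated from video segments that overlap by a known amount. The half-second overlap and linear crossfade are arbitrary starting points.

```python
import numpy as np

def stitch_chunks(chunks, sr=48000, overlap_s=0.5):
    """Crossfade overlapping audio chunks to avoid audible seams.

    Assumes adjacent chunks were generated from video segments that
    overlap by `overlap_s` seconds, so they share that much audio.
    """
    n = int(sr * overlap_s)
    fade_out = np.linspace(1.0, 0.0, n)
    fade_in = 1.0 - fade_out

    out = chunks[0].astype(np.float64)
    for nxt in chunks[1:]:
        nxt = nxt.astype(np.float64)
        # Blend the shared region, then append the rest of the next chunk.
        out[-n:] = out[-n:] * fade_out + nxt[:n] * fade_in
        out = np.concatenate([out, nxt[n:]])
    return out
```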
Controls
A serious product here needs more than a generate button. Editors want controls for surface type, intensity, environment, sync tightness, and spatial feel. A useful API should accept metadata like wood_floor, rain, wide_shot, or indoor_reverb and return something deterministic enough to iterate on.
Without that, it’s a novelty feature.
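What that control surface could look like at the API level. This is an invented request shape, not Mirelo's documented API; the endpoint, field names, and seed behavior are all assumptions about what a serious product would expose.

```python
import requests

# Invented endpoint and fields, shown only to illustrate the kind of
# control surface editors need; not Mirelo's actual API.
resp = requests.post(
    "https://api.example.com/v1/sfx",
    headers={"Authorization": "Bearer <token>"},
    json={
        "video_url": "https://cdn.example.com/clip.mp4",
        "controls": {
            "surface": "wood_floor",
            "environment": "indoor_reverb",
            "weather": "rain",
            "shot": "wide_shot",
            "sync_tightness": "strict",
        },
        "output": {"sample_rate": 48000, "stems": True, "format": "wav"},
        "seed": 42,  # determinism, so iterations are comparable
    },
    timeout=120,
)
stems = resp.json()["stems"]  # hypothetical response field
```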
Where this lands first
The clearest early wins are in workflows where speed matters more than handcrafted sound design.
- UGC and social tools will probably adopt auto-SFX as a background feature.
- Previs and animation teams can use it to rough in sound much earlier.
- Ad production gets cheaper when every visual variant can ship with matching audio.
- Game studios will likely use it first for prototyping and content iteration, not final runtime playback.
That last point matters. In games, this looks better suited to authoring pipelines in the near term than live inference during gameplay.
The bet
Mirelo is betting that AI video will create a new baseline expectation: if something moves on screen, it should make sound, and that sound should hit on the right frame.
That’s a sensible bet. It also leaves the company exposed to bundling. If major AI media platforms turn synced SFX into a checkbox feature, standalone vendors will need better controls, better quality, cleaner licensing, or tighter workflow integration to stay relevant.
Still, this is one of the more grounded AI media startups in the market. The problem is real. The product direction makes sense. And unlike a lot of multimodal demo bait, this solves something users notice the second it’s missing.