Generative AI November 11, 2025

Kaltura buys eSelf for $27M to add real-time conversational avatars

Kaltura buys eSelf.ai for $27M and bets that enterprise video should talk back

Kaltura is paying $27 million for eSelf.ai, an Israeli startup that builds real-time conversational avatars. By big tech standards, that's a small deal. For enterprise software, it still matters.

Kaltura already has a sizable video business, with roughly $180 million in revenue and more than 800 enterprise customers. eSelf brings something Kaltura would rather buy than spend years building: live avatars that can listen, speak, read on-screen context, and respond fast enough to feel conversational.

That last part is where this gets interesting. Plenty of companies can generate an avatar reading from a script. Far fewer can drop one into training portals, webinars, support flows, or internal knowledge systems without it feeling slow, brittle, or fake.

The pitch is easy to see. Video stops being a passive asset and starts acting like an interface for sales, support, onboarding, and education. If Kaltura pulls off the integration, it moves beyond being a video platform and closer to an enterprise agent layer with a face.

Why Kaltura bought a company instead of building a feature

eSelf is young. It was founded in 2023 by CEO Alan Bekker and CTO Eylon Shoshan. Bekker previously founded Voca, which Snap acquired in 2020. That background matters. Snap spent years on voice AI and camera-based interaction, and this team has worked on the hard parts: low-latency speech, conversational NLP, and computer vision.

Kaltura says eSelf's team, about 15 AI engineers, is joining the company. That may matter as much as the product. Real-time multimodal systems are painful to build. The hard part isn't only model quality. It's orchestration, latency control, browser weirdness, streaming infrastructure, and the ugly edge cases that show up the second users interrupt, switch languages, or share a cluttered screen full of PDFs and dashboards.

Buying a team that already knows where those problems hide is often cheaper than assembling one from scratch and losing two years.

The strategic fit is straightforward. Kaltura already owns the distribution: webinars, learning environments, corporate video portals, and streaming products. eSelf gives it an interface layer that can sit inside those products and act as a guide, tutor, rep, or support agent. That's a much cleaner route than a startup trying to fight its way through enterprise procurement alone.

The stack matters more than the avatar

Under the avatar wrapper, Kaltura is buying a real-time multimodal pipeline.

A system like this usually starts in the browser or mobile app. Audio is captured and streamed, probably over WebRTC, because delay piles up fast any other way. If screen context is part of the interaction, you also need a side channel for structured state, screenshots, UI metadata, or OCR results. WebSocket or gRPC can handle the control plane, but the media path has to stay tight.
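
To make the split concrete, here's a minimal server-side sketch using aiortc, a Python WebRTC library, as one plausible choice. Signaling, auth, and error handling are omitted, and nothing here reflects eSelf's actual stack; the point is that audio rides the media path while screen context rides a separate data channel.

```python
import asyncio
import json

from aiortc import RTCPeerConnection  # signaling (offer/answer) omitted

asr_queue: asyncio.Queue = asyncio.Queue()  # feeds the streaming ASR stage
screen_context: dict = {}                   # latest structured UI state

pc = RTCPeerConnection()

@pc.on("track")
def on_track(track):
    if track.kind == "audio":
        # Media path stays tight: WebRTC handles jitter buffering and
        # hands us short PCM frames we can stream straight into ASR.
        asyncio.ensure_future(pump_audio(track))

async def pump_audio(track):
    while True:
        frame = await track.recv()  # ~20 ms audio frames
        await asr_queue.put(frame)

@pc.on("datachannel")
def on_datachannel(channel):
    # Control plane: structured screen state (OCR results, UI metadata)
    # arrives as JSON messages on a data channel, off the media path.
    @channel.on("message")
    def on_message(message):
        screen_context.update(json.loads(message))  # e.g. {"active_tab": "billing"}
```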

Then comes streaming ASR. For a natural exchange, the model can't wait for the speaker to finish a full sentence. It needs to emit partial hypotheses every few dozen milliseconds so the system can start planning a response before the user is done talking. That usually points to Conformer- or transducer-style speech models, quantized and tuned to run efficiently on GPUs or decent CPUs, often through ONNX Runtime or TensorRT.
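
The partial-hypothesis loop looks roughly like this. `StreamingASR` here is a hypothetical stand-in for a real streaming model; the point is the shape of the loop, not the engine.

```python
import asyncio
from typing import Callable, Optional

class StreamingASR:
    """Hypothetical stand-in for a streaming speech model, e.g. a
    Conformer-transducer served through ONNX Runtime or TensorRT."""
    def feed(self, pcm_chunk: bytes) -> Optional[str]:
        ...  # returns an updated partial hypothesis, or None

async def transcribe(asr_queue: asyncio.Queue,
                     on_partial: Callable[[str], None]) -> None:
    asr = StreamingASR()
    while True:
        chunk = await asr_queue.get()
        partial = asr.feed(chunk)  # decode on every few dozen ms of audio
        if partial is not None:
            # Emit immediately so dialog planning can start before the
            # speaker finishes the sentence.
            on_partial(partial)
```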

The latency budget is unforgiving. If speech-to-text alone takes 400 ms, the whole interaction starts to drag.
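
One way to see why: write the budget down and make every stage fit. The numbers below are illustrative assumptions for a sub-300 ms target, not figures published by Kaltura or eSelf.

```python
# Illustrative per-turn latency budget (ms). Every number is an assumption.
BUDGET_MS = {
    "network_and_jitter": 40,
    "streaming_asr_partial": 80,  # time to a usable partial, not final text
    "dialog_planning": 80,        # first LLM tokens, not the full answer
    "tts_first_audio": 60,
    "render_and_sync": 30,
}
assert sum(BUDGET_MS.values()) <= 300, "over the conversational budget"
```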

Next is dialog orchestration, where a lot of avatar demos fall apart. You need an LLM, obviously. You also need a policy layer that decides when to call tools, when to query internal docs, when to stay quiet, and when to hand off. In enterprise software, that layer matters more than the base model. It's what separates a useful system from a compliance problem.
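
A policy layer can be fairly small and still do most of that work. The sketch below is invented for illustration; the actions, thresholds, and topic labels are assumptions, not eSelf's design.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Action(Enum):
    ANSWER = auto()      # let the LLM respond directly
    RETRIEVE = auto()    # ground the answer in internal docs first
    CALL_TOOL = auto()   # e.g. look up an order, file a ticket
    HANDOFF = auto()     # escalate to a human
    STAY_QUIET = auto()  # user is mid-thought; do not interrupt

@dataclass
class Turn:
    text: str
    asr_confidence: float  # confidence of the transcript
    topic: str             # output of a lightweight topic classifier

RESTRICTED = {"billing_dispute", "medical", "legal"}  # assumed labels

def decide(turn: Turn) -> Action:
    # The policy sits in front of the LLM: it decides whether the model
    # speaks at all, and with what grounding. Thresholds are illustrative.
    if turn.asr_confidence < 0.5:
        return Action.STAY_QUIET     # don't answer garbled speech
    if turn.topic in RESTRICTED:
        return Action.HANDOFF        # compliance beats helpfulness
    if turn.topic == "account_lookup":
        return Action.CALL_TOOL      # needs live data, not generation
    if turn.topic == "how_to":
        return Action.RETRIEVE       # answer from internal docs
    return Action.ANSWER
```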

Screen understanding makes things harder. eSelf says its avatars can read the user's screen and react to context. That implies some mix of OCR, UI-tree parsing, and vision-language inference. The job is turning messy visual state into something the dialog engine can actually use. "Billing tab is open" is useful. "The screen contains lots of text" isn't.
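
That reduction step might look like the sketch below. The field names and heuristics are invented; the idea is compressing raw OCR and UI-tree output into a few fields the dialog engine can actually condition on.

```python
def summarize_screen(ocr_lines: list[str], ui_tree: dict) -> dict:
    """Reduce messy visual state to a handful of useful fields.
    All field names here are illustrative assumptions."""
    context = {
        # Prefer structured UI metadata over raw OCR when available.
        "active_tab": ui_tree.get("active_tab"),
        "visible_form": ui_tree.get("focused_form"),
        "error_banners": [l for l in ocr_lines if "error" in l.lower()],
    }
    # "Billing tab is open" is useful; a 4,000-token OCR dump is not.
    return {k: v for k, v in context.items() if v}
```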

That context also expands the attack surface. If an agent is reading arbitrary text from a web page, PDF, or internal app, prompt injection can come straight from the UI. Screen context can't be treated as trusted input. It needs the same sanitization and policy checks as any external content.
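
A minimal version of that quarantine step, assuming pattern stripping plus explicit delimiting. This is an illustration, not a complete defense; prompt injection is not solved by regexes alone, and real systems need policy checks on top.

```python
import re

# Deliberately minimal patterns, for illustration only.
INJECTION_PATTERNS = re.compile(
    r"(ignore (all|previous) instructions|you are now|system prompt)",
    re.IGNORECASE,
)

def quarantine_screen_text(text: str) -> str:
    flagged = bool(INJECTION_PATTERNS.search(text))
    cleaned = INJECTION_PATTERNS.sub("[removed]", text)
    # Delimit screen content so the model can be instructed to treat it
    # as data to describe, never as instructions to follow.
    attrs = " untrusted='true'" + (" flagged='true'" if flagged else "")
    return f"<screen_context{attrs}>{cleaned}</screen_context>"
```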

Then there's text-to-speech and rendering. For the interaction to feel live, TTS has to stream too, ideally with first audio in under 200 ms. The avatar renderer has to stay in sync closely enough that lip motion doesn't drift into the uncanny valley. Around 40 ms of phoneme-to-viseme slop is enough for people to notice, even if they can't explain why it feels off.
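
A renderer-side sync check is one way to enforce that bound. The data structures below are invented; the 40 ms threshold comes from the paragraph above.

```python
from dataclasses import dataclass

MAX_DRIFT_MS = 40  # beyond this, lip motion starts to read as "off"

@dataclass
class Viseme:
    shape: str        # mouth-shape id derived from a phoneme
    audio_ts_ms: int  # when this phoneme plays in the audio stream
    video_ts_ms: int  # when the renderer actually shows the shape

def drift_ok(visemes: list[Viseme]) -> bool:
    # Streaming TTS emits audio in chunks; the renderer schedules mouth
    # shapes against the audio clock. Check that no shape drifts past
    # the threshold where viewers notice the mismatch.
    return all(abs(v.video_ts_ms - v.audio_ts_ms) <= MAX_DRIFT_MS
               for v in visemes)
```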

That's the engineering problem in plain terms: every stage has to stream, and every stage has to fail gracefully.

Where this sits in the market

Kaltura isn't walking into an empty category. Synthesia, HeyGen, D-ID, Soul Machines, Inworld, and NVIDIA's ACE ecosystem are all pushing on versions of the same idea. But the market is splitting.

One side is still mostly generation: script in, avatar video out.

The other is live interaction inside existing software. That's the harder business and the one enterprises are more likely to pay for, because it maps to actual work. Support deflection. Guided onboarding. Training. Sales qualification. Internal help desks. Those have budgets behind them. "Photorealistic avatar video" on its own often doesn't.

Kaltura has a real advantage here because it already sits inside enterprise video and learning systems. If it can package these agents into products customers already use, it has a better shot than an avatar startup trying to wedge itself into a crowded software stack.

Still, it has to prove the thing works outside a polished demo. Enterprise buyers are going to care about latency, auditability, access controls, and failure handling long before they care whether the avatar smiles convincingly.

What developers should check before buying the pitch

The big technical promise here is sub-300 ms conversational feel. That's ambitious. In narrow conditions, it's possible. It also depends on the whole path being tuned end to end.

For teams evaluating a platform like this, a few questions matter more than any product deck:

  • Does the system stream end to end, or is some hidden stage batching requests and breaking turn-taking?
  • What happens when users interrupt the agent mid-sentence? (See the barge-in sketch after this list.)
  • How is tenant data isolated in retrieval and indexing?
  • Can the stack fall back cleanly to chat or human handoff when ASR quality drops or GPUs get saturated?
  • What telemetry exists for prompt logs, tool calls, and content review?
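
The interruption question deserves a concrete answer. Here's a minimal barge-in sketch, assuming a cancellable TTS stream, an audio sink, and a voice-activity-detection event queue; all three interfaces are hypothetical.

```python
import asyncio

async def play(tts_stream, audio_out):
    # Stream TTS chunks to the output as they arrive.
    async for chunk in tts_stream:
        await audio_out(chunk)

async def speak_with_barge_in(tts_stream, audio_out,
                              vad_events: asyncio.Queue) -> str:
    """Play agent speech, but cut it off the moment the user talks."""
    speak = asyncio.ensure_future(play(tts_stream, audio_out))
    barge = asyncio.ensure_future(vad_events.get())  # first user-speech event
    done, _ = await asyncio.wait({speak, barge},
                                 return_when=asyncio.FIRST_COMPLETED)
    if barge in done:
        # Barge-in: stop mid-sentence and tell the dialog layer its last
        # answer was only partially heard.
        speak.cancel()
        return "interrupted"
    barge.cancel()
    return "completed"
```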

Security and governance will decide how far this can spread inside a company. If an avatar is reading screens, touching internal knowledge, and speaking to customers, you need PII detection, redaction, audit logs, and hard permission boundaries. If synthetic media shows up in regulated environments, provenance tags and standards like C2PA start to matter too. Watermarking won't solve trust by itself, but missing provenance controls will slow procurement fast.
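
A redaction-plus-audit layer might start like the sketch below. The patterns are deliberately minimal and illustrative; production PII detection covers far more cases, formats, and locales, and real audit trails need tamper evidence and retention policy.

```python
import json
import re
import time

# Two illustrative patterns only; real PII detection needs many more.
PII = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> tuple[str, list[str]]:
    found = []
    for label, pattern in PII.items():
        if pattern.search(text):
            found.append(label)
            text = pattern.sub(f"[{label}]", text)
    return text, found

def audit(event: str, tenant: str, detail: dict) -> None:
    # Append-only audit trail: who saw what, which tools fired, what
    # was redacted. Here it's just a JSON line to stdout.
    print(json.dumps({"ts": time.time(), "tenant": tenant,
                      "event": event, **detail}))
```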

There's also the cost problem. Real-time multimodal systems are expensive compared with basic chat. Live audio streaming, inference, rendering, and retrieval can burn through GPU time quickly. For high-volume support use cases, the economics have to beat either human-assisted workflows or simpler chat interfaces. Some deployments will justify the spend. A lot won't.
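
Back-of-envelope math makes the point. Every number in this sketch is an assumption; actual GPU pricing, batching efficiency, and session concurrency vary widely.

```python
# Assumed figures for illustration only.
GPU_COST_PER_HOUR = 2.00  # on-demand rate for one inference GPU
SESSIONS_PER_GPU = 8      # concurrent avatar sessions sharing that GPU

def cost_per_session_minute() -> float:
    return GPU_COST_PER_HOUR / 60 / SESSIONS_PER_GPU

# ~$0.004/minute at these assumptions, so a 10-minute support session
# costs roughly 4 cents of GPU time before retrieval, storage, and
# egress. The question is whether that beats a chat widget or a
# human-assisted flow for the same outcome.
print(f"${cost_per_session_minute():.4f} per session-minute")
```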

That's why screen awareness may matter more than the face. If the system can see that a customer is stuck on a claims form, identify the missing field, explain the issue, and guide them through it in the same session, there's a measurable business case. If it mostly reads back content through a talking head, it's decoration.

Likely outcome

Kaltura is making a sensible bet. Video platforms have been looking for a way to stay relevant as AI agents move closer to the user interface. Buying eSelf gives it a plausible route: take the video surfaces enterprises already use and make them interactive enough to handle some support, training, and onboarding work.

The limits are obvious too. These systems only feel useful when latency stays low, retrieval stays accurate, and the agent knows when to stop. Enterprise customers may tolerate a synthetic face. They won't tolerate hallucinated answers about billing, policy, or patient data.

So the acquisition makes sense. The hard part starts with integration. If Kaltura can ship avatars that are fast, governed, and tied into real enterprise workflows, this will look smart. If it ends up as a glossy shell around slow LLM calls, it'll join the pile of AI features people try once and ignore.
