Artificial intelligence · May 12, 2026

Thinking Machines Lab introduces interaction models for real-time AI dialogue

Thinking Machines is chasing the missing piece in AI voice: interruption

Thinking Machines Lab, the startup founded by former OpenAI CTO Mira Murati, has announced a research effort called interaction models. The short version: an AI model that can listen and respond at the same time.

That may sound minor, but it cuts against how most assistants still work. You speak or type. The model waits. It generates. You wait. Even voice assistants with polished audio interfaces often behave like chatbots with speech bolted on.

Thinking Machines’ first model in this line, TML-Interaction-Small, is designed for full-duplex interaction, where input processing and output generation run concurrently. The company says the model can respond in 0.40 seconds, close to the timing of natural conversation and faster than comparable systems from OpenAI and Google in its published benchmarks.

The catch is obvious: nobody outside the company can use it yet. Thinking Machines says a limited research preview will arrive in the next few months, with broader access planned later this year.

For now, this is a technical claim with interesting implications, not a product developers can build against.

Why full-duplex matters

Most language models are built around completed input. A user sends a prompt, the model consumes it, then produces tokens. Streaming output makes the answer appear faster, but the interaction pattern remains mostly sequential.

Human conversation is messier. People interrupt, clarify, backchannel, trail off, restart, and adjust based on verbal cues. If someone says, “Actually, wait,” you stop. If they correct you mid-sentence, you adapt. If they sound confused, you slow down or rephrase.

Current AI voice systems struggle because the model usually isn’t listening in a meaningful way while it speaks. Many systems stitch together several components:

  • automatic speech recognition
  • a language model
  • text-to-speech
  • a voice activity detector
  • interruption handling logic
  • sometimes a separate turn-taking model

That pipeline can work, but it adds latency and failure points. The assistant talks over the user, stops too eagerly, or waits awkwardly because it can’t tell whether the person is pausing or done.
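
For contrast, here is a minimal sketch of what that stitched pipeline looks like in code. Every function is a hypothetical placeholder, not any vendor's API; the point is the sequential hand-offs, each adding latency.

  # A half-duplex voice pipeline: each stage blocks on the previous one.
  # All functions are hypothetical placeholders.
  def handle_turn(audio_stream):
      utterance = wait_for_end_of_speech(audio_stream)  # VAD guesses the user is done
      text = transcribe(utterance)                      # ASR pass adds latency
      reply = generate_reply(text)                      # LLM sees only completed input
      audio = synthesize(reply)                         # TTS pass adds more latency
      play(audio)                                       # interruption needs extra logic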

Thinking Machines is arguing that interactivity should be modeled directly rather than patched around the edges. That’s the useful idea.

If the model can ingest partial user input while generating its own response, it can make small decisions continuously: keep talking, stop, revise, ask a question, or acknowledge an interruption. Voice agents need that behavior in support calls, pair-programming sessions, tutoring, medical intake, sales qualification, and robotics.

The difference between 900 milliseconds and 400 milliseconds looks small in a benchmark table. In conversation, it’s noticeable. Long pauses make an assistant feel sluggish, even when the answer is good. Fast responses with bad timing feel rude. The target is timing, not latency alone.

The technical bet: interactivity as model behavior

Thinking Machines hasn’t released the model or full implementation details, so its claims warrant caution. The direction is still clear enough.

A full-duplex model has to manage input and output streams at the same time. That creates constraints ordinary chat completion doesn’t have. The model has to reason over incomplete input. It may need to revise its plan while already speaking. It also has to decide whether a new audio segment is a meaningful interruption, background noise, a hesitation, or a user trying to take the floor.

That’s a modeling problem, not just a UI problem.

Traditional LLM serving assumes a fairly clean request-response lifecycle. Full-duplex interaction pushes the system toward continuous inference. Tokens or audio frames arrive while tokens or speech are already going out. State management becomes much harder.
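
To make that concrete, here is a rough asyncio sketch of a continuous serving loop. The model methods and action names are assumptions for illustration; Thinking Machines has published no such interface.

  import asyncio

  # A full-duplex loop: input ingestion and output generation run
  # concurrently. ingest_frame, next_action, and replan are invented names.
  async def full_duplex_loop(model, mic, speaker):
      async def ingest():
          async for frame in mic:            # audio keeps arriving mid-response
              model.ingest_frame(frame)      # state updates while the model speaks

      async def respond():
          while True:
              action = model.next_action()   # a small decision on every step
              if action.kind == "speak":
                  await speaker.play(action.audio)
              elif action.kind == "stop":    # user took the floor
                  speaker.cancel()
              elif action.kind == "revise":  # new input invalidated the plan
                  speaker.cancel()
                  model.replan()
              await asyncio.sleep(0)         # yield so ingest() can run

      await asyncio.gather(ingest(), respond())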

For developers, that raises practical questions:

  • How does the model expose interruptions through an API?
  • Can applications control barge-in behavior?
  • Does the model produce text, audio, semantic events, or all three?
  • How does it handle partial transcripts?
  • Can it revise an answer already being spoken?
  • What happens when the user interrupts with a correction that invalidates the previous response?

These are normal spoken-interaction cases, not edge cases.
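
One plausible answer is a typed event stream. The sketch below is a guess at what such events might carry; none of these names come from a published interface.

  from dataclasses import dataclass

  # Illustrative events a full-duplex API might emit. All names invented.
  @dataclass
  class PartialTranscript:
      text: str          # what the model has heard so far
      stable: bool       # earlier words may still be revised if False

  @dataclass
  class Interruption:
      at_ms: int         # when the user started speaking over the model
      confidence: float  # barge-in vs. backchannel ("mm-hm") vs. noise

  @dataclass
  class ResponseRevised:
      dropped_text: str  # speech the model abandoned mid-utterance
      reason: str        # e.g. "user_correction"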

A coding assistant is a good example. Suppose the agent starts explaining a Kubernetes deployment issue and the developer cuts in: “No, this is running on ECS, not Kubernetes.” A turn-based system may stop, discard state, and begin again. A better interaction model should absorb the correction, drop the wrong path, and continue without making the exchange feel like a reset.

That would be genuinely useful. It’s also hard.

Benchmarks are promising, but incomplete

Thinking Machines says TML-Interaction-Small responds in 0.40 seconds and beats comparable models from OpenAI and Google on its published latency benchmarks. If that holds up in real use, it’s a strong number.

Voice AI latency benchmarks can be slippery, though. You need to know exactly what’s being measured. Is it time from end-of-speech to first audio? Time from user interruption to model stop? Time from partial input to useful semantic response? Does it include speech recognition and speech synthesis? Is the model running in a lab, on optimized infrastructure, or across real client networks?
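
Written as code, two of those definitions diverge immediately. The event log below is invented for illustration:

  # Two different "latency" numbers computed from one hypothetical event log.
  events = {
      "user_speech_end_ms": 1000,
      "first_audio_out_ms": 1400,
      "user_barge_in_ms":   5200,
      "model_stopped_ms":   5350,
  }

  response_latency = events["first_audio_out_ms"] - events["user_speech_end_ms"]
  stop_latency = events["model_stopped_ms"] - events["user_barge_in_ms"]
  print(response_latency, stop_latency)  # 400 150 -- same system, different metrics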

Engineers care about those distinctions. A model-level response time of 400 milliseconds can turn into a much worse user experience once you add:

  • network round trips
  • audio encoding and decoding
  • speech synthesis latency
  • safety filters
  • retrieval calls
  • tool execution
  • logging and compliance checks
  • mobile device constraints
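
A back-of-the-envelope budget shows how quickly those layers swamp the headline number. All figures here are illustrative, not measurements:

  # Illustrative end-to-end budget around a 400 ms model response.
  budget_ms = {
      "model_response": 400,      # the benchmarked figure
      "network_round_trip": 80,
      "audio_encode_decode": 30,
      "speech_synthesis": 120,
      "safety_filters": 50,
      "retrieval_call": 150,      # only on some turns
  }
  print(sum(budget_ms.values()))  # 830 ms as the user hears it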

The best demo path is rarely the production path.

That doesn’t make the research unimportant. It means the number is a signal, not a procurement metric. If Thinking Machines can keep the interaction loop tight after adding tool use, safety controls, and deployment overhead, it has something serious.

Why developers should care

Most AI product teams have spent the past two years wrapping chat models in interfaces: copilots, support bots, internal knowledge assistants, workflow agents. The limits are familiar. The model may be powerful, but the interaction still feels like filling out forms with better autocomplete.

Full-duplex models could change application design where timing matters.

Customer support agents could interrupt only when needed, confirm details while listening, and avoid the painful “please wait while I process that” cadence. AI tutors could catch confusion earlier. Developer tools could feel more collaborative during debugging sessions, especially when paired with screen or IDE context. Data analysis agents could let analysts correct assumptions mid-query instead of waiting for a full wrong answer to finish.

The API design will matter as much as the model. Developers don’t want a black box that randomly decides when to talk. They’ll need controls.

A serious developer platform for interaction models should expose:

  • interruption events
  • confidence scores for turn-taking
  • partial intent detection
  • configurable latency versus accuracy trade-offs
  • hooks for tool calls
  • transcript state
  • policy controls for when the model may speak over a user
  • observability for timing and dropped turns

Without those primitives, full-duplex interaction could become a neat demo that’s painful to integrate.
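
Some of that could surface as plain configuration. The sketch below uses invented parameter names to show the shape of the controls, not a real API:

  # Hypothetical per-session controls; every key is invented.
  session_config = {
      "barge_in": "allow",               # or "ignore", "duck_volume"
      "turn_take_confidence": 0.8,       # how sure before claiming the floor
      "max_response_latency_ms": 600,    # past this, favor speed over accuracy
      "model_may_interrupt_user": False, # policy: never talk over people
      "emit_partial_transcripts": True,  # expose state for observability
  }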

Cost is another concern. Continuous listening and generation may use more compute than turn-based chat. If the model is always maintaining active state, teams will need clear pricing, concurrency limits, and session billing. Voice agents already have awkward unit economics in some settings. A smoother conversation loop won’t help if the cost per resolved ticket blows up.

Security and safety get messier

Full-duplex interaction also complicates safety. A turn-based chatbot can inspect a completed user message before responding. Streaming systems already weaken that boundary. Full-duplex systems go further because the model may begin acting before the user has finished.

That creates risks.

A user might start with benign language and then add a disallowed instruction. Background speech could be misread as input. Prompt injection gets stranger in multimodal or voice contexts, especially if the system is connected to tools. If the model can interrupt, it can also interrupt at the wrong time and steer a user before it has enough context.

For enterprise deployments, auditability becomes a bigger issue. Teams will want to know why an agent stopped, continued, changed direction, or invoked a tool. Logs can’t store only final prompts and responses. They need timing, partial inputs, model state changes, and event traces.
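
A single logged exchange might need to look something like the sketch below; the schema is invented, purely to show what "more than prompt and response" means in practice.

  # Invented audit record for one exchange in a full-duplex session.
  trace = {
      "session_id": "s-0042",
      "events": [
          {"t_ms": 0,    "type": "user_audio_start"},
          {"t_ms": 850,  "type": "partial_transcript", "text": "cancel my"},
          {"t_ms": 900,  "type": "model_speech_start"},   # model spoke early
          {"t_ms": 1400, "type": "user_barge_in"},
          {"t_ms": 1450, "type": "model_stopped", "reason": "barge_in"},
          {"t_ms": 2100, "type": "tool_call", "name": "lookup_account"},
      ],
  }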

That’s a lot of operational plumbing.

Privacy is another obvious concern. Systems that listen continuously, even for good interaction reasons, need clear boundaries around capture, retention, and processing. “Always listening” makes security teams reach for a policy document, and they’re right to do it.

The research preview caveat

Thinking Machines is still in pre-release mode. The company isn’t making TML-Interaction-Small public today. A limited research preview is expected in the coming months, with broader release planned later in 2026.

That matters because interactive AI is brutally sensitive to real-world conditions. Accents, noisy rooms, low-quality microphones, overlapping speakers, network jitter, domain-specific vocabulary, and impatient users all expose weaknesses that may not show up in clean benchmarks.

The company’s idea is credible. The benchmark claims are worth watching. Until outside developers can test the model in messy applications, it’s too early to say whether Thinking Machines has solved the interaction problem or built an impressive controlled demo.

The direction makes sense. The next step for AI assistants is likely better timing, interruption handling, repair, and shared context, not prettier chat windows.

If Thinking Machines can make that behavior native to the model and expose it through usable developer primitives, it could have a meaningful edge. Otherwise, full-duplex will remain an impressive phrase attached to systems that still make people wait their turn.
