Conntour just raised $7M to make security video searchable, and the hard part is compute
Conntour has raised a $7 million seed round from General Catalyst, Y Combinator, SV Angel, and Liquid 2 Ventures to build an AI search layer for security video systems. The pitch fits in a sentence: ask a plain-English question across live or recorded camera feeds and get back relevant clips, answers, and incident summaries.
The harder part is the one that counts. Conntour says it can run across thousands of camera feeds and monitor up to 50 feeds on a single Nvidia RTX 4090. If that holds up in production, it matters. Video search is no longer waiting on model capability. It’s waiting on cost, latency, and the usual deployment mess.
Lots of AI video startups can demo “find a person in a red jacket.” Far fewer can do it across a big camera fleet without turning inference into a budget problem.
Why this round matters
Conntour is less than two years old, and the round reportedly closed in 72 hours. That says plenty about investor appetite. There’s also an early traction signal that carries more weight. The company says it already has public sector and enterprise customers, including Singapore’s Central Narcotics Bureau.
That matters because surveillance buyers don’t care about slick demos. They care about uptime, false positives, deployment options, and whether the product fits into the systems they already have.
Conntour says its platform can run on premises, in the cloud, or in a hybrid setup. It can plug into existing video management systems or replace them. In this market, that’s basic survival. Plenty of organizations can’t ship raw security footage to the cloud because of compliance, bandwidth, or politics. A product built around centralized cloud inference has already cut off part of the market.
Search is becoming the interface
The old model for video systems is still everywhere: fixed rule-based alerts, endless camera grids, and operators scrubbing timelines after something goes wrong. It’s clumsy and expensive.
Natural-language search changes the interface. A user types something like:
find people in red sneakers passing a backpack in the lobby after 9 p.m.
That sounds like a language-model problem. It is, partly. Mostly it’s a systems problem.
The request has to be turned into an execution plan: detect people, identify bags, filter by time and camera, infer a handoff event, then rank possible matches. Run a heavy vision-language model on every frame from every stream and the GPU bill gets stupid fast. Serious vendors are landing on roughly the same pattern: cheap always-on indexing, expensive verification only when needed.
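To make that concrete, here's a minimal sketch of what a decomposed query plan might look like. The class and field names are invented for illustration; Conntour hasn't published its internal schema. The point is the split between filters the index can answer cheaply and checks that need per-clip inference.

```python
from dataclasses import dataclass


@dataclass
class Predicate:
    """One atomic condition extracted from the query text."""
    kind: str    # e.g. "object", "attribute", "relation"
    value: str   # e.g. "person", "red sneakers", "handoff"


@dataclass
class QueryPlan:
    """Hypothetical decomposition of 'red sneakers passing a backpack
    in the lobby after 9 p.m.' All names here are illustrative."""
    cameras: list[str]                  # scope: which feeds to search
    time_range: tuple[str, str]         # scope: when
    cheap_filters: list[Predicate]      # resolvable from the index alone
    expensive_checks: list[Predicate]   # need per-clip model inference


plan = QueryPlan(
    cameras=["lobby-1", "lobby-2"],
    time_range=("2025-01-10T21:00", "2025-01-11T06:00"),
    cheap_filters=[Predicate("object", "person"),
                   Predicate("object", "backpack")],
    expensive_checks=[Predicate("attribute", "red sneakers"),
                      Predicate("relation", "handoff")],
)
```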
That seems to be where Conntour is putting its effort. Good call.
How this probably works
Conntour hasn’t published a full architecture, but the feature set points to a familiar pipeline.
At ingest, the system likely pulls streams over RTSP or ONVIF, uses hardware decode such as NVDEC for H.264 and H.265, and normalizes frame rates and resolutions. That layer sounds boring right up until it breaks.
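A minimal ingest sketch, assuming OpenCV for brevity. A production pipeline would route decode through NVDEC via FFmpeg or GStreamer rather than software decode, but the decimate-and-normalize step looks roughly the same either way. The URL and target rate are placeholders.

```python
import cv2

RTSP_URL = "rtsp://camera.example/stream1"   # placeholder URL
TARGET_FPS = 6                               # assumed sampling rate

cap = cv2.VideoCapture(RTSP_URL)
native_fps = cap.get(cv2.CAP_PROP_FPS) or 30  # fall back if unreported
stride = max(1, round(native_fps / TARGET_FPS))

frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break                      # stream dropped; real code would reconnect
    if frame_idx % stride == 0:
        frame = cv2.resize(frame, (960, 540))  # normalize resolution
        # hand off to the detection/indexing stage here
    frame_idx += 1
cap.release()
```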
After that, scale usually means selective processing. Run lightweight detectors and trackers at reduced FPS, probably with FP16 or INT8 quantization, and maintain tracklets with timestamps, regions, and basic attributes. Color, motion, object class, maybe rough clothing features. Enough to build a searchable index without paying for full-model inference on every frame.
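A tracklet record in this kind of index might look like the sketch below. The schema is hypothetical, not Conntour's; the point is how little per-object state you need to keep before any heavy model runs.

```python
from dataclasses import dataclass, field


@dataclass
class Tracklet:
    """One tracked object across consecutive sampled frames.
    Field names are illustrative."""
    track_id: int
    camera_id: str
    object_class: str                 # "person", "bag", ...
    start_ts: float
    end_ts: float
    boxes: list[tuple[int, int, int, int]] = field(default_factory=list)
    dominant_color: str | None = None  # cheap attribute for coarse search

    def update(self, ts: float, box: tuple[int, int, int, int]) -> None:
        """Extend the tracklet with one more sampled detection."""
        self.end_ts = ts
        self.boxes.append(box)
```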
Then compute multimodal embeddings for selected frames or events, likely with something in the CLIP family or a smaller variant, and store them in a vector index such as FAISS or Milvus. Add metadata for camera ID, time range, and region of interest. That gives you a fast candidate retrieval layer.
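A candidate retrieval layer along those lines takes surprisingly little code with FAISS. The embeddings below are random placeholders standing in for CLIP-style vectors, and the metadata schema is invented for illustration:

```python
import numpy as np
import faiss

DIM = 512                       # typical CLIP-style embedding size
index = faiss.IndexFlatIP(DIM)  # inner product == cosine after L2 norm

# Placeholder embeddings for indexed events; real ones would come from
# a CLIP-family encoder run on selected frames.
event_vecs = np.random.rand(1000, DIM).astype("float32")
faiss.normalize_L2(event_vecs)
index.add(event_vecs)

# Metadata kept in a parallel store, keyed by FAISS row id.
metadata = [{"camera_id": f"cam-{i % 20}", "ts": 1700000000 + i}
            for i in range(1000)]

query_vec = np.random.rand(1, DIM).astype("float32")  # encoded query text
faiss.normalize_L2(query_vec)
scores, ids = index.search(query_vec, 10)
candidates = [metadata[i] for i in ids[0]]  # time/camera filters come next
```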
The query path probably looks like this, with a code sketch after the list:
- parse the natural-language request into predicates and filters
- hit the index first to retrieve likely candidates
- run relation checks or heavier classifiers only on those candidates
- feed event summaries or captions into an LLM for reranking and explanation
- return clips with confidence scores and timeline context
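Stitched together, the path might look like the function below. Every callable is injected as a stand-in, since none of Conntour's internals are public; the shape is what matters:

```python
def answer_query(text, parse_query, vector_search, verify, rerank):
    """Hypothetical end-to-end query path. Every callable is injected
    as a stand-in; the plan reuses the QueryPlan shape sketched earlier."""
    plan = parse_query(text)

    # 1. Cheap retrieval: hit the vector index, then metadata filters.
    t0, t1 = plan.time_range
    hits = [h for h in vector_search(plan, k=200)
            if h["camera_id"] in plan.cameras and t0 <= h["ts"] <= t1]

    # 2. Expensive relation checks run only on the shortlist.
    verified = [h for h in hits
                if verify(h, plan.expensive_checks) >= 0.5]

    # 3. The LLM reranks structured event summaries, never raw pixels.
    return rerank(plan, [h["summary"] for h in verified])
```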
That last point matters. If the LLM is looking at captions or structured event data instead of raw pixels, compute stays under control and the language model is less likely to become the direct control surface for alarms.
That’s solid engineering. It’s also the safer product choice.
The 50-feeds-per-4090 claim is plausible, with caveats
Fifty feeds on one RTX 4090 sounds aggressive. It’s also believable if you read the fine print that startups rarely put in the headline.
A 4090 can handle hundreds of lightweight detections per second at moderate resolutions if you use TensorRT, batching, quantization, frame skipping, and aggressive ROI cropping. If each stream is sampled at 4 to 8 FPS for baseline detection and tracking, 50 streams lands in the low hundreds of effective frames per second. That’s workable for compact models.
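The arithmetic is easy to sanity-check. The per-frame latency below is an assumed figure for a compact TensorRT-optimized detector, not a measurement:

```python
# Back-of-envelope check on the 50-feeds-per-4090 claim, using the
# assumptions in the text (figures are illustrative, not measured).
streams = 50
sample_fps = 6                 # midpoint of the 4-8 FPS baseline sampling
effective_fps = streams * sample_fps
print(effective_fps)           # 300 frames/second to run detection on

# A compact detector at ~3 ms/frame with TensorRT and batching gives
# a rough upper bound on sustainable throughput:
ms_per_frame = 3
max_fps = 1000 / ms_per_frame
print(max_fps >= effective_fps)  # True: ~333 FPS budget vs 300 needed
```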
The caveats are where the business lives:
- static scenes are cheap, crowded scenes are not
- coarse search is cheap, high-recall fine-grained search is not
- one well-placed camera can beat several bad ones
- “sneakers” and “backpack handoff” are much harder than “person enters door”
So yes, the throughput claim can be real. It depends on what’s always running, what gets deferred to on-demand verification, and how ugly the footage is. Any buyer should ask those questions early.
This is why model routing matters so much. The winner here may be the company with the best query planner, not the flashiest single model.
Confidence scores matter
Conntour says it returns confidence scores so operators can judge how trustworthy a match is, especially on low-quality video. That sounds modest. It’s actually one of the better product choices in the stack.
Security video is messy. Compression artifacts, bad lighting, weird angles, old cameras, motion blur. If the system presents a clean answer box without exposing uncertainty, people start treating fuzzy retrieval like fact. In law enforcement, corporate investigations, and similar settings, that gets dangerous fast.
Confidence scoring doesn’t solve the problem by itself. Calibration is hard. Most model scores are too confident out of the box, and combining detector confidence, attribute classification, and relationship inference into one number can get fuzzy quickly. Still, showing uncertainty beats pretending the system knows.
The safer pattern is clear enough: keep automated alarms tied to deterministic rules and calibrated thresholds, then use LLM-style components for search, summarization, and operator assistance. Conntour appears to be thinking in that direction. It should.
Why developers should care
Even if you never work in surveillance, the architecture pattern is worth watching.
This is what a lot of production multimodal systems look like now:
- lightweight always-on processing
- embeddings plus metadata indexes
- query decomposition into predicates
- selective escalation to heavier models
- language models reasoning over structured outputs, not raw media
- hybrid edge and cloud execution
You see the same shape in industrial inspection, sports analytics, retail operations, and robotics. Video becomes queryable when the system treats compute as scarce instead of assuming every question deserves full-fat inference.
There’s also a plain lesson for infra teams. Fancy foundation models don’t carry the product on their own. You still need stream ingestion, codec handling, batching strategy, event schemas, RBAC, audit logging, and sane retention policies. In enterprise deployments, those boring parts often decide whether the thing survives contact with reality.
Privacy and governance are part of the product
Conntour already sells into the public sector, so the governance questions are immediate. Searchable surveillance footage is a different product from passive video retention. It lowers the cost of asking invasive questions, and that changes how the system gets used.
A serious platform needs audit trails for every query, role-based access control, SSO integration, and strict scoping around which users can search which cameras and time windows. Hybrid deployment helps, but it doesn’t answer the underlying issue. Easier search puts more power behind fewer keystrokes.
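In code, the minimum viable version of that scoping-plus-audit discipline is small. Role names, scope fields, and the log format below are all illustrative:

```python
from dataclasses import dataclass
import json
import time

@dataclass
class Scope:
    """Per-role search scope; field names are illustrative."""
    cameras: set[str]
    max_lookback_days: int

ROLE_SCOPES = {"lobby-security": Scope({"lobby-1", "lobby-2"}, 7)}

def authorize_and_log(role: str, cameras: list[str], lookback_days: int,
                      query: str) -> bool:
    scope = ROLE_SCOPES.get(role)
    allowed = (scope is not None
               and set(cameras) <= scope.cameras
               and lookback_days <= scope.max_lookback_days)
    # Every query attempt, allowed or not, lands in the audit trail.
    print(json.dumps({"ts": time.time(), "role": role, "query": query,
                      "cameras": cameras, "allowed": allowed}))
    return allowed

authorize_and_log("lobby-security", ["lobby-1"], 3, "red sneakers after 9pm")
```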
That’s not a reason to write off the product. It is a reason to treat safety controls as core product work, not paperwork for compliance.
What to watch next
The immediate question is whether Conntour can keep recall and latency in a good place as queries get harder and customer environments get messier.
Can it handle thousands of feeds without collapsing into brittle heuristics? Can it integrate cleanly with incumbent VMS platforms? Can it show buyers where the confidence scores come from? Can it keep GPU use predictable once users start asking harder questions all day?
Those are product questions, but they’re technical questions too. They separate a sharp demo from infrastructure people will trust.
The funding round suggests investors think this market is opening up now. They’re probably right. Vision-language models have improved enough that a search box over video feels practical. The hard part is turning that into a system that’s fast, cheap enough to run, and honest about what it can’t really see.