Conntour just raised $7M to make security video searchable, and the hard part is compute
Conntour has raised a $7 million seed round from General Catalyst, Y Combinator, SV Angel, and Liquid 2 Ventures to build an AI search layer for security video systems. The pitch fits in a sentence: ask a plain-English question across live or recorded camera feeds and get back relevant clips, answers, and incident summaries.
The harder part is the one that counts. Conntour says it can run across thousands of camera feeds and monitor up to 50 feeds on a single Nvidia RTX 4090. If that holds up in production, it matters. Video search is no longer waiting on model capability. It’s waiting on cost, latency, and the usual deployment mess.
Lots of AI video startups can demo “find a person in a red jacket.” Far fewer can do it across a big camera fleet without turning inference into a budget problem.
Why this round matters
Conntour is less than two years old, and the round reportedly closed in 72 hours. That says plenty about investor appetite. There’s also an early traction signal that carries more weight. The company says it already has public sector and enterprise customers, including Singapore’s Central Narcotics Bureau.
That matters because surveillance buyers don’t care about slick demos. They care about uptime, false positives, deployment options, and whether the product fits into the systems they already have.
Conntour says its platform can run on premises, in the cloud, or in a hybrid setup. It can plug into existing video management systems or replace them. In this market, that’s basic survival. Plenty of organizations can’t ship raw security footage to the cloud because of compliance, bandwidth, or politics. A product built around centralized cloud inference has already cut off part of the market.
Search is becoming the interface
The old model for video systems is still everywhere: fixed rule-based alerts, endless camera grids, and operators scrubbing timelines after something goes wrong. It’s clumsy and expensive.
Natural-language search changes the interface. A user types something like:
find people in red sneakers passing a backpack in the lobby after 9 p.m.
That sounds like a language-model problem. It is, partly. Mostly it’s a systems problem.
The request has to be turned into an execution plan: detect people, identify bags, filter by time and camera, infer a handoff event, then rank possible matches. Run a heavy vision-language model on every frame from every stream and the GPU bill gets stupid fast. Serious vendors are landing on roughly the same pattern: cheap always-on indexing, expensive verification only when needed.
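To make that concrete, here's a minimal sketch of what a decomposed query plan might look like. The class and field names are invented for illustration; Conntour hasn't published its internal schema. The point is the split between filters the index can answer cheaply and checks that need per-clip inference.

```python
from dataclasses import dataclass


@dataclass
class Predicate:
    """One atomic condition extracted from the query text."""
    kind: str    # e.g. "object", "attribute", "relation"
    value: str   # e.g. "person", "red sneakers", "handoff"


@dataclass
class QueryPlan:
    """Hypothetical decomposition of 'red sneakers passing a backpack
    in the lobby after 9 p.m.' All names here are illustrative."""
    cameras: list[str]                  # scope: which feeds to search
    time_range: tuple[str, str]         # scope: when
    cheap_filters: list[Predicate]      # resolvable from the index alone
    expensive_checks: list[Predicate]   # need per-clip model inference


plan = QueryPlan(
    cameras=["lobby-1", "lobby-2"],
    time_range=("2025-01-10T21:00", "2025-01-11T06:00"),
    cheap_filters=[Predicate("object", "person"),
                   Predicate("object", "backpack")],
    expensive_checks=[Predicate("attribute", "red sneakers"),
                      Predicate("relation", "handoff")],
)
```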
That seems to be where Conntour is putting its effort. Good call.
How this probably works
Conntour hasn’t published a full architecture, but the feature set points to a familiar pipeline.
At ingest, the system likely pulls streams over RTSP or ONVIF, uses hardware decode such as NVDEC for H.264 and H.265, and normalizes frame rates and resolutions. That layer sounds boring right up until it breaks.
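A minimal ingest sketch, assuming OpenCV for brevity. A production pipeline would route decode through NVDEC via FFmpeg or GStreamer rather than software decode, but the decimate-and-normalize step looks roughly the same either way. The URL and target rate are placeholders.

```python
import cv2

RTSP_URL = "rtsp://camera.example/stream1"   # placeholder URL
TARGET_FPS = 6                               # assumed sampling rate

cap = cv2.VideoCapture(RTSP_URL)
native_fps = cap.get(cv2.CAP_PROP_FPS) or 30  # fall back if unreported
stride = max(1, round(native_fps / TARGET_FPS))

frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break                      # stream dropped; real code would reconnect
    if frame_idx % stride == 0:
        frame = cv2.resize(frame, (960, 540))  # normalize resolution
        # hand off to the detection/indexing stage here
    frame_idx += 1
cap.release()
```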
After that, scale usually means selective processing. Run lightweight detectors and trackers at reduced FPS, probably with FP16 or INT8 quantization, and maintain tracklets with timestamps, regions, and basic attributes. Color, motion, object class, maybe rough clothing features. Enough to build a searchable index without paying for full-model inference on every frame.
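A tracklet record in this kind of index might look like the sketch below. The schema is hypothetical, not Conntour's; the point is how little per-object state you need to keep before any heavy model runs.

```python
from dataclasses import dataclass, field


@dataclass
class Tracklet:
    """One tracked object across consecutive sampled frames.
    Field names are illustrative."""
    track_id: int
    camera_id: str
    object_class: str                 # "person", "bag", ...
    start_ts: float
    end_ts: float
    boxes: list[tuple[int, int, int, int]] = field(default_factory=list)
    dominant_color: str | None = None  # cheap attribute for coarse search

    def update(self, ts: float, box: tuple[int, int, int, int]) -> None:
        """Extend the tracklet with one more sampled detection."""
        self.end_ts = ts
        self.boxes.append(box)
```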
Then compute multimodal embeddings for selected frames or events, likely with something in the CLIP family or a smaller variant, and store them in a vector index such as FAISS or Milvus. Add metadata for camera ID, time range, and region of interest. That gives you a fast candidate retrieval layer.
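A candidate retrieval layer along those lines takes surprisingly little code with FAISS. The embeddings below are random placeholders standing in for CLIP-style vectors, and the metadata schema is invented for illustration:

```python
import numpy as np
import faiss

DIM = 512                       # typical CLIP-style embedding size
index = faiss.IndexFlatIP(DIM)  # inner product == cosine after L2 norm

# Placeholder embeddings for indexed events; real ones would come from
# a CLIP-family encoder run on selected frames.
event_vecs = np.random.rand(1000, DIM).astype("float32")
faiss.normalize_L2(event_vecs)
index.add(event_vecs)

# Metadata kept in a parallel store, keyed by FAISS row id.
metadata = [{"camera_id": f"cam-{i % 20}", "ts": 1700000000 + i}
            for i in range(1000)]

query_vec = np.random.rand(1, DIM).astype("float32")  # encoded query text
faiss.normalize_L2(query_vec)
scores, ids = index.search(query_vec, 10)
candidates = [metadata[i] for i in ids[0]]  # time/camera filters come next
```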
The query path probably looks like this, with a code sketch after the list:
- parse the natural-language request into predicates and filters
- hit the index first to retrieve likely candidates
- run relation checks or heavier classifiers only on those candidates
- feed event summaries or captions into an LLM for reranking and explanation
- return clips with confidence scores and timeline context
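Stitched together, the path might look like the function below. Every callable is injected as a stand-in, since none of Conntour's internals are public; the shape is what matters:

```python
def answer_query(text, parse_query, vector_search, verify, rerank):
    """Hypothetical end-to-end query path. Every callable is injected
    as a stand-in; the plan reuses the QueryPlan shape sketched earlier."""
    plan = parse_query(text)

    # 1. Cheap retrieval: hit the vector index, then metadata filters.
    t0, t1 = plan.time_range
    hits = [h for h in vector_search(plan, k=200)
            if h["camera_id"] in plan.cameras and t0 <= h["ts"] <= t1]

    # 2. Expensive relation checks run only on the shortlist.
    verified = [h for h in hits
                if verify(h, plan.expensive_checks) >= 0.5]

    # 3. The LLM reranks structured event summaries, never raw pixels.
    return rerank(plan, [h["summary"] for h in verified])
```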
That last point matters. If the LLM is looking at captions or structured event data instead of raw pixels, compute stays under control and the language model is less likely to become the direct control surface for alarms.
That’s solid engineering. It’s also the safer product choice.
The 50-feeds-per-4090 claim is plausible, with caveats
Fifty feeds on one RTX 4090 sounds aggressive. It’s also believable if you read the fine print that startups rarely put in the headline.
A 4090 can handle hundreds of lightweight detections per second at moderate resolutions if you use TensorRT, batching, quantization, frame skipping, and aggressive ROI cropping. If each stream is sampled at 4 to 8 FPS for baseline detection and tracking, 50 streams lands in the low hundreds of effective frames per second. That’s workable for compact models.
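The arithmetic is easy to sanity-check. The per-frame latency below is an assumed figure for a compact TensorRT-optimized detector, not a measurement:

```python
# Back-of-envelope check on the 50-feeds-per-4090 claim, using the
# assumptions in the text (figures are illustrative, not measured).
streams = 50
sample_fps = 6                 # midpoint of the 4-8 FPS baseline sampling
effective_fps = streams * sample_fps
print(effective_fps)           # 300 frames/second to run detection on

# A compact detector at ~3 ms/frame with TensorRT and batching gives
# a rough upper bound on sustainable throughput:
ms_per_frame = 3
max_fps = 1000 / ms_per_frame
print(max_fps >= effective_fps)  # True: ~333 FPS budget vs 300 needed
```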
The caveats are where the business lives:
- static scenes are cheap, crowded scenes are not
- coarse search is cheap, high-recall fine-grained search is not
- one well-placed camera can beat several bad ones
- “sneakers” and “backpack handoff” are much harder than “person enters door”
So yes, the throughput claim can be real. It depends on what’s always running, what gets deferred to on-demand verification, and how ugly the footage is. Any buyer should ask those questions early.
This is why model routing matters so much. The winner here may be the company with the best query planner, not the flashiest single model.
Confidence scores matter
Conntour says it returns confidence scores so operators can judge how trustworthy a match is, especially on low-quality video. That sounds modest. It’s actually one of the better product choices in the stack.
Security video is messy. Compression artifacts, bad lighting, weird angles, old cameras, motion blur. If the system presents a clean answer box without exposing uncertainty, people start treating fuzzy retrieval like fact. In law enforcement, corporate investigations, and similar settings, that gets dangerous fast.
Confidence scoring doesn’t solve the problem by itself. Calibration is hard. Most model scores are too confident out of the box, and combining detector confidence, attribute classification, and relationship inference into one number can get fuzzy quickly. Still, showing uncertainty beats pretending the system knows.
The safer pattern is clear enough: keep automated alarms tied to deterministic rules and calibrated thresholds, then use LLM-style components for search, summarization, and operator assistance. Conntour appears to be thinking in that direction. It should.
Why developers should care
Even if you never work in surveillance, the architecture pattern is worth watching.
This is what a lot of production multimodal systems look like now:
- lightweight always-on processing
- embeddings plus metadata indexes
- query decomposition into predicates
- selective escalation to heavier models
- language models reasoning over structured outputs, not raw media
- hybrid edge and cloud execution
You see the same shape in industrial inspection, sports analytics, retail operations, and robotics. Video becomes queryable when the system treats compute as scarce instead of assuming every question deserves full-fat inference.
There’s also a plain lesson for infra teams. Fancy foundation models don’t carry the product on their own. You still need stream ingestion, codec handling, batching strategy, event schemas, RBAC, audit logging, and sane retention policies. In enterprise deployments, those boring parts often decide whether the thing survives contact with reality.
Privacy and governance are part of the product
Conntour already sells into the public sector, so the governance questions are immediate. Searchable surveillance footage is a different product from passive video retention. It lowers the cost of asking invasive questions, and that changes how the system gets used.
A serious platform needs audit trails for every query, role-based access control, SSO integration, and strict scoping around which users can search which cameras and time windows. Hybrid deployment helps, but it doesn’t answer the underlying issue. Easier search puts more power behind fewer keystrokes.
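In code, the minimum viable version of that scoping-plus-audit discipline is small. Role names, scope fields, and the log format below are all illustrative:

```python
from dataclasses import dataclass
import json
import time

@dataclass
class Scope:
    """Per-role search scope; field names are illustrative."""
    cameras: set[str]
    max_lookback_days: int

ROLE_SCOPES = {"lobby-security": Scope({"lobby-1", "lobby-2"}, 7)}

def authorize_and_log(role: str, cameras: list[str], lookback_days: int,
                      query: str) -> bool:
    scope = ROLE_SCOPES.get(role)
    allowed = (scope is not None
               and set(cameras) <= scope.cameras
               and lookback_days <= scope.max_lookback_days)
    # Every query attempt, allowed or not, lands in the audit trail.
    print(json.dumps({"ts": time.time(), "role": role, "query": query,
                      "cameras": cameras, "allowed": allowed}))
    return allowed

authorize_and_log("lobby-security", ["lobby-1"], 3, "red sneakers after 9pm")
```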
That’s not a reason to write off the product. It is a reason to treat safety controls as core product work, not paperwork for compliance.
What to watch next
The immediate question is whether Conntour can keep recall and latency in a good place as queries get harder and customer environments get messier.
Can it handle thousands of feeds without collapsing into brittle heuristics? Can it integrate cleanly with incumbent VMS platforms? Can it show buyers where the confidence scores come from? Can it keep GPU use predictable once users start asking harder questions all day?
Those are product questions, but they’re technical questions too. They separate a sharp demo from infrastructure people will trust.
The funding round suggests investors think this market is opening up now. They’re probably right. Vision-language models have improved enough that a search box over video feels practical. The hard part is turning that into a system that’s fast, cheap enough to run, and honest about what it can’t really see.