Artificial Intelligence · June 16, 2025

YC Spring 2025 Demo Day: The AI Startups Worth Watching Beyond Chatbot Wrappers

YC Spring 2025 shows where AI startups are getting serious

Y Combinator’s Spring 2025 batch has the usual stack of "AI for X" startups, but a few point to something more durable than another chatbot wrapper. The interesting companies fall into three groups: better chips, better agent infrastructure, and better ways to get software discovered as search shifts under everyone’s feet.

That matters because the AI stack is getting expensive, messy, and harder to trust. Model quality still matters. So do evals, retrieval quality, thermal limits, enterprise connectors, and whether your product shows up in an LLM-generated answer. YC’s better companies are working on those awkward, practical layers.

The standout names here are Atum, Anvil, LLM Data Company, Auctor, Den, Cactus, and Eloquent. Different products, same direction. The market is moving past generic assistants and toward systems that do one job well, fit real workflows, and can be measured.

Atum is making the hardest bet in the batch

The boldest technical swing here is Atum’s push into monolithic 3D chip design.

The problem is familiar. Traditional planar scaling is slowing around the 3nm era. Foundries can still squeeze out gains, but the free lunch is over. If you want better performance per watt for AI workloads, you increasingly have to change the physical architecture.

Atum’s design stacks transistors vertically in a 3D die using through-silicon vias and wafer-to-wafer bonding. The details that matter:

  • fine-pitch TSVs in the 1 to 2 micron range
  • inter-layer latency around 5 picoseconds
  • logic in lower tiers
  • SRAM or DRAM-style cache placed above, closer to compute units
  • microfluidic cooling channels between layers

That last part is where a lot of 3D chip concepts run into physics. Stacking compute shortens wiring and improves memory access, but it also traps heat. If Atum can make embedded cooling work in production silicon, it matters. If not, this stays a very smart lab project.
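The trade-off can be sketched with back-of-envelope numbers. The sketch below compares data-movement delay and energy for a planar route across a die versus a short vertical route through a TSV. All constants and lengths are illustrative assumptions, not Atum figures; the point is only that vertical distance is orders of magnitude shorter than lateral distance.

```python
# Back-of-envelope comparison of data movement for a planar vs. stacked
# memory path. All numbers are illustrative assumptions, not Atum figures.

def wire_delay_ps(length_um: float, ps_per_um: float = 0.1) -> float:
    """Wire delay, assuming roughly 0.1 ps per micron (illustrative)."""
    return length_um * ps_per_um

def movement_energy_pj(length_um: float, pj_per_mm: float = 0.1) -> float:
    """Energy to move one bit, assuming ~0.1 pJ per mm of wire (illustrative)."""
    return (length_um / 1000.0) * pj_per_mm

# Planar: cache sits millimetres away across the die.
planar_um = 5000.0   # 5 mm lateral route
# Stacked: cache sits one tier up through a fine-pitch TSV.
stacked_um = 50.0    # ~50 um vertical route

print(f"planar  delay: {wire_delay_ps(planar_um):.0f} ps, "
      f"energy: {movement_energy_pj(planar_um):.3f} pJ/bit")
print(f"stacked delay: {wire_delay_ps(stacked_um):.0f} ps, "
      f"energy: {movement_energy_pj(stacked_um):.4f} pJ/bit")
```

Under these toy assumptions the vertical path lands in the same few-picosecond range as the inter-layer latency claimed above, which is why the stacking bet is attractive if the heat problem can be solved.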

The claimed result is 2x to 3x performance per watt versus comparable 2D chips at the same node. That’s credible enough to take seriously, especially for AI inference and accelerator-heavy workloads where memory locality and interconnect latency matter as much as raw transistor count. It’s also the kind of gain that gets attention when power budgets are becoming a harder constraint than software teams like to admit.

For developers, none of this changes your code tomorrow. It does change the assumptions behind deployment over the next few years. A workable 3D stack with better local memory behavior could favor models and runtimes built for higher-bandwidth on-package memory and tighter compute-memory coupling. Edge inference also gets more interesting if "small but fast" hardware stops looking so compromised.

Still, semiconductor startups live and die on manufacturing reality. Yield, thermal reliability, packaging complexity, and cost per wafer will decide whether Atum becomes real infrastructure or a footnote.

Agents are getting more specific

A big chunk of the YC cohort is building vertical AI agents. That category sounds tired, but some of these companies are finally dealing with the parts generic copilots skip.

Den targets knowledge workers. Vesence goes after lawyers. Cactus is aimed at solo operators. Eloquent focuses on financial operations. Underneath the branding, the pattern is straightforward: retrieval-heavy systems with domain tuning, custom indexes, SaaS connectors, and workflow logic wrapped in an interface that doesn’t require the user to think like an LLM engineer.

That’s where useful products usually live. Legal work has different retrieval and citation needs than finance ops. Solopreneurs need CRM, calls, invoices, and payments tied together. Knowledge work systems need permissions, provenance, and decent search across ugly internal data.

The hard part is that vertical agents inherit both model problems and enterprise software problems. Hallucinations are bad enough in general productivity tools. In legal or finance contexts, they can do real damage. The winners here won’t be the teams with the slickest chat UI. They’ll be the ones with the strongest retrieval layer, clean audit trails, and workflow orchestration that doesn’t fall apart under pressure.
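A minimal sketch of what that retrieval layer has to do, independent of domain: enforce permissions at retrieval time and attach provenance to every fragment that reaches the model. The names and keyword scoring below are hypothetical; a real system would use a vector index rather than word overlap.

```python
# Sketch of a domain-agent retrieval layer with access control and
# provenance. Scoring is a stand-in for a real vector search.
from dataclasses import dataclass

@dataclass
class Doc:
    doc_id: str
    text: str
    allowed_roles: set[str]

def retrieve(query: str, docs: list[Doc], role: str, k: int = 3) -> list[dict]:
    """Return top-k permitted docs, each carrying its source id for audit."""
    q_terms = set(query.lower().split())
    hits = []
    for d in docs:
        if role not in d.allowed_roles:
            continue  # permissions enforced before anything reaches the model
        score = len(q_terms & set(d.text.lower().split()))
        if score:
            hits.append((score, d))
    hits.sort(key=lambda h: -h[0])
    # Every answer fragment keeps its doc_id, so citations can be audited.
    return [{"doc_id": d.doc_id, "score": s, "text": d.text} for s, d in hits[:k]]

docs = [
    Doc("contract-7", "termination clause requires 30 day notice", {"legal"}),
    Doc("invoice-12", "invoice payment due in 30 day terms", {"finance"}),
]
print(retrieve("termination notice clause", docs, role="legal"))
```

The design choice worth copying is that permissions and provenance live inside retrieval, not in a wrapper around the chat UI, so a compliance review can trace any output back to a document the user was allowed to see.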

That’s why Auctor matters more than it first appears to. It’s a low-code orchestrator for systems like SAP and ServiceNow, meant to shrink enterprise automation rollouts from months to days. Plenty of engineers hear "low-code" and tune out. That’s lazy. Integration is where a lot of AI deployments stall. If Auctor can template the dull connector work and give business teams guardrails without creating another shadow IT problem, that’s useful.

The trade-off is obvious enough. Low-code orchestration speeds up delivery, but abstraction layers can hide failure modes until they hit production. Teams adopting tools like this need observability, versioned workflows, and explicit rollback paths. Otherwise you get a polished demo and a miserable quarter.
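Those guardrails are simple to state in code. The sketch below shows versioned, immutable workflow definitions with an explicit rollback path; the API is hypothetical and not Auctor's.

```python
# Hypothetical orchestrator guardrail: every published workflow definition
# is an immutable version, and rollback is a first-class operation.
class WorkflowStore:
    def __init__(self) -> None:
        self._versions: list[dict] = []

    def publish(self, definition: dict) -> int:
        """Append a new immutable version; returns its version number."""
        self._versions.append(dict(definition))
        return len(self._versions)

    def current(self) -> dict:
        return self._versions[-1]

    def rollback(self) -> dict:
        """Drop the latest version and return to the previous one."""
        if len(self._versions) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._versions.pop()
        return self.current()

store = WorkflowStore()
store.publish({"steps": ["extract", "approve", "post_to_sap"]})
store.publish({"steps": ["extract", "post_to_sap"]})  # approval step removed
store.rollback()  # restore the version with the approval step
print(store.current())
```

The point is not the storage mechanics but the contract: a business team can ship changes fast, and an engineer can get back to the last known-good definition without archaeology.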

LLM Data Company is working on the boring part that keeps getting expensive

One of the strongest ideas here is LLM Data Company, focused on automated evaluation for AI agents.

This is overdue.

Everyone building agents talks about benchmarks, but a lot of teams still evaluate with vibes, ad hoc prompts, and a handful of hand-checked transcripts. That doesn’t scale. It also won’t survive a compliance review.

LLM Data Company’s setup includes:

  • critic models tuned to score outputs
  • reward hooks for reinforcement-style optimization
  • benchmark suites for domain tasks like code generation or customer service
  • metrics around accuracy, latency, and policy violations

A sample workflow in the source material evaluates a banking customer support agent and returns a summary like Accuracy: 92.3%, Latency: 420ms, Compliance Violations: 0.

That kind of pipeline belongs in CI/CD for any serious agent product. Not as the only signal, because model-based judges can be noisy or biased, but as a baseline for regression testing. If a prompt change, retrieval tweak, or model swap dents compliance or spikes latency, you want to catch that before users do.
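A minimal sketch of such a CI gate, using the metric shape from the sample summary (accuracy, latency, compliance violations). The judge interface and thresholds are illustrative assumptions, not LLM Data Company's actual API.

```python
# Sketch of an automated eval gate for an agent, run as a CI step.
# Thresholds and the judge callable are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class EvalResult:
    accuracy: float             # fraction of benchmark tasks judged correct
    p95_latency_ms: float       # 95th-percentile response latency
    compliance_violations: int  # count of policy-violating outputs

def run_benchmark(cases, agent, judge) -> EvalResult:
    """Score an agent over benchmark cases with a critic-model judge."""
    latencies, correct, violations = [], 0, 0
    for case in cases:
        answer, ms = agent(case["input"])
        latencies.append(ms)
        verdict = judge(case, answer)  # e.g. a tuned critic model
        correct += verdict["correct"]
        violations += verdict["violation"]
    latencies.sort()
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    return EvalResult(correct / len(cases), p95, violations)

def gate(result: EvalResult) -> bool:
    """Fail the build if quality regresses past (illustrative) thresholds."""
    return (result.accuracy >= 0.90
            and result.p95_latency_ms <= 500
            and result.compliance_violations == 0)
```

Wired into CI, `gate` turns a prompt change, retrieval tweak, or model swap into a pass/fail signal instead of a surprise in production.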

There’s a broader shift underneath this. Agent development is starting to look like normal software engineering, just with probabilistic parts. More test harnesses. More benchmark suites. More red-team inputs. Less magical thinking.

The risk is false confidence. LLM judges can miss edge cases, overrate fluent nonsense, or mirror the preferences of the model family they came from. Teams should treat automated evals as one layer in a stack that still includes human review in high-risk domains. But if you don’t have automated evals at all, you’re flying blind.

Anvil is built for the new search mess

Anvil is one of the more practical startups in the batch because it tackles a shift every publisher and product team is already feeling: search traffic is being rerouted through chat interfaces and AI-generated summaries.

Its pitch is "SEO for LLM search," which sounds a little goofy until you look at the mechanics. The company is working with:

  • structured prompt metadata
  • dynamic snippet generation
  • APIs that feed document embeddings into vector stores such as Pinecone or Weaviate
  • integrations with systems tied to ChatGPT and Gemini-style discovery flows

The sample implementation uses JSON-LD on a technical article to expose headline, description, keywords, author, and publisher metadata in a machine-friendly format. That part isn’t new. What’s changing is the consumer. Instead of just helping classic search engines rank pages, the metadata can influence how generative systems retrieve, summarize, and cite content.
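A sketch of that JSON-LD block, emitted from Python. The field values are placeholders; the `@context`, `@type`, and property names follow the public schema.org vocabulary.

```python
# Sketch of article metadata as JSON-LD, the format the sample
# implementation uses. Values are placeholders; the schema.org
# types and property names are real.
import json

article_jsonld = {
    "@context": "https://schema.org",
    "@type": "TechArticle",
    "headline": "How 3D chip stacking changes AI inference",
    "description": "A practical look at stacked silicon for AI workloads.",
    "keywords": "3D chips, AI inference, TSV",
    "author": {"@type": "Person", "name": "Jane Doe"},
    "publisher": {"@type": "Organization", "name": "Example Tech Blog"},
}

# Embedded in the page as <script type="application/ld+json">...</script>,
# this is what both classic crawlers and LLM retrieval pipelines parse.
print(json.dumps(article_jsonld, indent=2))
```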

There’s an uncomfortable implication here. Publishing for humans and publishing for machine intermediaries are becoming separate disciplines. Engineering teams will need content systems that produce clean structured data, stable canonical pages, and snippets that survive compression by LLM interfaces. If your docs, changelogs, or support content are scattered and inconsistent, AI search will flatten them into generic mush.

Anvil is betting companies will pay to avoid that. They probably will.

What developers should take from this batch

A few things stand out.

First, RAG is no longer optional glue code for vertical AI products. If you’re building domain assistants, retrieval quality is part of the product. Chunking strategy, indexing, access controls, freshness, and observability matter as much as your prompt template.
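One of those knobs, sketched: fixed-size chunking with overlap, so retrieval doesn't lose context at chunk boundaries. The sizes are illustrative; production systems often chunk on semantic or structural units instead of raw word counts.

```python
# Illustrative chunking for a RAG index: overlapping word windows so a
# sentence split across a boundary still appears whole in some chunk.
def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of `size` words."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]
```

Tuning `size` and `overlap` against your eval suite, rather than guessing, is exactly the kind of retrieval-quality work the paragraph above calls part of the product.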

Second, evals need to move earlier in the build cycle. If your team is still manually spot-checking outputs before release, you’re behind. Add benchmark tasks, automated reviewers, and latency checks before product complexity forces the issue.

Third, enterprise integration is still the tax on AI deployment. The startups that cut through SAP, ServiceNow, CRM systems, and file silos are working on the actual bottleneck. Fancy models are easy to demo. Reliable data flow across old systems is where budgets disappear.

Fourth, hardware constraints are back in the conversation. Atum’s work is a reminder that model software isn’t the only moving layer. Thermal limits, memory proximity, and packaging choices will shape what inference looks like over the next few years, especially at the edge.

Finally, distribution is changing. If LLM interfaces become a primary way people find products and technical content, metadata and retrieval visibility start to matter almost as much as page rank once did.

YC’s Spring 2025 batch doesn’t prove any of these companies will win. YC batches never do. But it does show where sharper founders are spending time: less on generic chat wrappers, more on evals, orchestration, retrieval, and hardware that can carry the load. That’s a healthier signal than another hundred copilots with fresh logos.
