Artificial Intelligence | September 2, 2025

TechCrunch Disrupt 2025 puts AI infrastructure and applications on one stage

TechCrunch Disrupt’s AI agenda gets serious about infra and code quality

TechCrunch Disrupt 2025 is putting two parts of the AI market next to each other, and the pairing makes sense.

One is Greenfield Partners with its “AI Disruptors 60” list, a snapshot of startups across AI infrastructure, applications, and go-to-market. The other is JetBrains CEO Kirill Skrygan, who plans to argue against the “vibe coding” mood and push for AI tools that favor code quality, reliability, and precision inside the IDE.

That combination says something about where the conversation has moved. Less attention on demo clips and benchmark bragging. More on whether these systems hold up in production.

For engineering teams, that's the part that matters.

Where the market has moved

The easy phase of the generative AI boom has passed. Wiring up a frontier model and cutting a slick product video was never the hard part. The hard part starts when the system hits real traffic, security review, procurement, and a codebase that already has rules.

The same problems keep showing up:

  • inference cost models that collapse under real usage
  • retrieval systems that go stale quickly
  • copilots that write code fast and break things just as fast
  • compliance and audit requirements nobody wanted to deal with until the buying process started

That’s why these two Disrupt sessions are worth watching together. Greenfield’s list should show where infrastructure money and operator attention are settling. JetBrains is focused on a problem the coding-assistant market still handles badly: getting AI-generated code to behave like code meant for a real software team, with tests, architecture, review standards, and security constraints.

A lot of vendors still sell speed. Serious teams care about defect density, change failure rate, mean time to recovery (MTTR), and whether a generated patch survives review without wasting half the team’s day.

That’s a stricter standard, and a healthier one.

Why the infra side matters now

If Greenfield’s “AI Disruptors 60” is useful, it should show where the practical AI stack is settling.

That stack is getting easier to recognize. On inference, tools like vLLM and TensorRT-LLM matter because they squeeze more work out of expensive GPUs. KV cache reuse, paged attention, and kernel fusion are now basic economics. Providers need that throughput or the cost model falls apart.
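
To see why that throughput is “basic economics,” a back-of-envelope sketch helps. Every number below (GPU price, tokens per second before and after optimization) is an assumed figure for illustration, not a benchmark.

```python
# Back-of-envelope math for why serving throughput drives inference cost.
# All numbers are illustrative assumptions, not measured benchmarks.
GPU_COST_PER_HOUR = 4.00          # assumed hourly price for one inference GPU
BASELINE_TOKENS_PER_SEC = 1_500   # assumed naive serving throughput
OPTIMIZED_TOKENS_PER_SEC = 6_000  # assumed throughput with batching + paged attention

def cost_per_million_tokens(tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return GPU_COST_PER_HOUR / tokens_per_hour * 1_000_000

print(f"baseline:  ${cost_per_million_tokens(BASELINE_TOKENS_PER_SEC):.2f} per 1M tokens")
print(f"optimized: ${cost_per_million_tokens(OPTIMIZED_TOKENS_PER_SEC):.2f} per 1M tokens")
```

Same GPU, same model; the only thing that changed is how much work the serving stack extracts per hour.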

Then you get the serving layer. Triton, Ray Serve, and KServe are becoming standard machinery for autoscaling and multi-model routing. Below that, storage and data movement often matter more than model cleverness. That’s why VAST Data belongs in the conversation. Retrieval-heavy inference is often limited by memory bandwidth and I/O before raw FLOPs become the issue.
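
As a rough illustration of what that serving layer does, here is a minimal Ray Serve-style sketch. The embedding model and response shape are placeholders, and autoscaling and routing policy would be configured around deployments like this rather than inside them.

```python
# A minimal sketch of a serving-layer deployment with Ray Serve.
# The "model" is a placeholder; the point is the replicated, routable unit.
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=2)
class Embedder:
    def __init__(self):
        self.model_name = "placeholder-embedding-model"  # assumed model handle

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        # Real code would run the embedding model here; this echoes the shape.
        return {"model": self.model_name, "n_texts": len(payload.get("texts", []))}

serve.run(Embedder.bind())
```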

A lot of AI coverage still misses this. Better models don’t solve everything. In practice, data gravity beats model gravity more often than people like to admit.

If your embeddings are stale, your hybrid search is sloppy, or your retrieval path adds too much latency, the application feels dumb regardless of how good the base model is. A startup that keeps hot embeddings close to compute, cuts cross-rack traffic, and manages freshness well may have a stronger product than one wrapping the same API with slightly better prompts.

That also explains the renewed attention on:

  • vector search methods like HNSW and IVF-PQ
  • hybrid retrieval that mixes sparse and dense signals (a fusion sketch follows below)
  • freshness-aware caches
  • schema-aware chunking for RAG
  • policy engines and lineage tracking for regulated environments

It’s less flashy. It’s where production systems succeed or fail.
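
Reciprocal rank fusion is one common way to combine sparse and dense rankings without calibrating their raw scores against each other. A minimal sketch, with the two input rankings hard-coded for illustration:

```python
# Hybrid retrieval via reciprocal rank fusion: merge a keyword ranking and an
# embedding ranking by rank position rather than raw score.
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["doc_a", "doc_c", "doc_b"]   # keyword / BM25-style order
dense  = ["doc_b", "doc_a", "doc_d"]   # embedding-similarity order
print(reciprocal_rank_fusion([sparse, dense]))
```

In a real system the two rankings would come from a sparse index and a vector index (HNSW, IVF-PQ, or similar); the fusion step itself stays this small.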

JetBrains has a point about coding assistants

The coding assistant market has spent two years chasing output volume. Whole files. Large rewrites. Fast autocomplete. Great conference demos, then cleanup work the next morning.

JetBrains’ argument is narrower and stronger. If AI sits inside the IDE and uses actual program structure, it can do better than plain text generation.

That means feeding the model structured information from systems like PSI (the Program Structure Interface) and UAST (the Unified Abstract Syntax Tree), not just raw repository text. Symbols, types, scopes, call graphs, test relationships. The context tells the model what the code is doing, not just what it statistically resembles.
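
To make the contrast concrete, here is a hypothetical sketch of what symbol-aware context could look like before it reaches the model. The field names are illustrative, not JetBrains’ actual PSI or UAST types.

```python
# A hypothetical structured-context payload: the model sees resolved symbols,
# signatures, and relationships instead of a flat blob of repository text.
from dataclasses import dataclass, field

@dataclass
class SymbolContext:
    qualified_name: str                       # e.g. "billing.invoice.Invoice.total"
    signature: str                            # resolved parameter and return types
    defined_in: str                           # file path
    callers: list[str] = field(default_factory=list)
    callees: list[str] = field(default_factory=list)
    covering_tests: list[str] = field(default_factory=list)

def build_prompt_context(symbols: list[SymbolContext], max_items: int = 20) -> str:
    """Serialize the most relevant symbols into the model's context window."""
    lines = []
    for s in symbols[:max_items]:
        lines.append(f"{s.qualified_name} :: {s.signature}  (defined in {s.defined_in})")
        if s.covering_tests:
            lines.append(f"  tests: {', '.join(s.covering_tests)}")
    return "\n".join(lines)
```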

That changes quality in obvious ways:

  • fewer hallucinated imports
  • fewer wrong method signatures
  • fewer edits that compile locally and fail in CI
  • better patch targeting when a change touches multiple files and tests

This is one of the more important splits in AI coding right now. General-purpose copilots still treat code too much like prose. That works for a while. Then the lack of semantic grounding shows up as bad refactors, subtle breakage, and “helpful” edits that ignore the architecture.

JetBrains has a real advantage here because IDE-native tooling usually has richer local context than browser assistants or chat-first products. If the system can retrieve repository-specific interfaces, map usages, synthesize tests, run linters, and reject invalid candidates before the user sees them, it starts to feel like an engineering tool instead of autocomplete with a lot of confidence.

Vendors don’t always like saying this plainly, but it’s true: a coding model without strong constraints generates a lot of expensive noise.

The better version of AI coding may look slower

There’s a trade-off, and it’s worth stating directly. Quality-focused AI coding can look slower in a demo.

If you generate diff-first patches, run static analysis, execute tests in a sandbox, and feed failures back into the prompt loop, you add latency. If you manage token budgets carefully and fetch only the hot parts of a repo instead of cramming the whole codebase into context, you may generate less text each time.

That’s fine. The goal is better accepted changes, fewer rollbacks, and less review churn.
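
As a rough sketch of that loop, assuming hypothetical propose_patch and apply_in_sandbox helpers for the model call and workspace isolation, and ruff plus pytest as the quality gates:

```python
# A quality-gated patch loop: propose a diff, run static analysis and tests in
# a sandbox, and feed failures back into the next attempt.
import subprocess

def propose_patch(task: str, repo: str, feedback: str) -> str:
    """Hypothetical: call the coding model with the task, repo context, and prior failures."""
    raise NotImplementedError

def apply_in_sandbox(repo: str, diff: str) -> str:
    """Hypothetical: copy the repo to a temp dir, apply the diff, return that path."""
    raise NotImplementedError

def check(workdir: str) -> tuple[bool, str]:
    """Run a linter and the test suite; return (ok, combined output)."""
    out = []
    for cmd in (["ruff", "check", "."], ["pytest", "-q"]):
        proc = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
        out.append(proc.stdout + proc.stderr)
        if proc.returncode != 0:
            return False, "\n".join(out)
    return True, "\n".join(out)

def patch_loop(task: str, repo: str, max_attempts: int = 3) -> str | None:
    feedback = ""
    for _ in range(max_attempts):
        diff = propose_patch(task, repo, feedback)
        workdir = apply_in_sandbox(repo, diff)
        ok, report = check(workdir)
        if ok:
            return diff          # only clean patches reach human review
        feedback = report        # failures go back into the prompt
    return None
```

The specific tools don’t matter; any linter and test runner fits, as long as failures feed back into the next attempt instead of landing on a reviewer’s desk.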

For teams evaluating these tools, the useful questions are pretty ordinary:

  • Does pass@k improve on internal tasks (see the estimator sketch after this list)?
  • Does review turnaround improve or get worse?
  • Does AI assistance lower reopen rates?
  • What happens to test coverage on AI-generated patches?
  • Can the tool respect code ownership and approval gates?
  • Does it keep prompts and outputs inside your security boundary?

Those are procurement questions now because they were engineering questions first.
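
On the first question, the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), is easy to track in-house. A minimal sketch, where n is the number of samples per task and c is how many of them passed the task’s tests:

```python
# Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 samples per task, 3 of them passing, estimate pass@5
print(round(pass_at_k(10, 3, 5), 3))
```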

Governance is part of the system

This is where the Greenfield and JetBrains threads line up again. Production AI systems need governance the same way production software needs observability and access control. Vendors spent too long acting like governance could wait.

It can’t.

Teams doing this seriously are already adding:

  • Open Policy Agent (OPA)-style policy checks for tool permissions
  • prompt and output filtering for secrets and PII
  • provenance records for prompts, context, model versions, and generated diffs (a record sketch follows this list)
  • audit trails that map inputs to outputs
  • model and artifact signing
  • software bill of materials (SBOM)-style tracking for model dependencies and deployment artifacts
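
As a sketch of what one of those provenance records might contain (field names are illustrative, not a standard schema):

```python
# A minimal provenance record for an AI-assisted change: enough to answer
# "which prompt, which context, which model, which diff, who approved it."
import hashlib
import json
import time
from dataclasses import dataclass, asdict

def digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

@dataclass
class ProvenanceRecord:
    prompt_sha256: str
    context_doc_ids: list[str]
    model_version: str
    diff_sha256: str
    approved_by: str
    timestamp: float

def record_change(prompt: str, context_ids: list[str], model: str,
                  diff: str, approver: str) -> str:
    rec = ProvenanceRecord(digest(prompt), context_ids, model,
                           digest(diff), approver, time.time())
    return json.dumps(asdict(rec))   # real systems would sign and append-only store this
```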

Part of that comes from regulation. The EU AI Act is pushing traceability and accountability higher up the list. Part of it is standard enterprise discipline. If an AI system can propose code changes, query internal docs, touch infrastructure, or generate customer-facing content, somebody is going to ask who approved what and why.

Fair enough.

What engineering leaders should take from this

If you run platform, developer productivity, or applied AI teams, the takeaway is straightforward: the stack is hardening around systems that can be measured, constrained, and audited.

A few practical implications follow.

First, build the context pipeline properly. Naive fixed-size chunking is still one of the worst habits in RAG-heavy systems, especially for code. Repositories need symbol-aware indexing, dependency graphs, test mappings, and partial reindexing on commit so retrieval doesn’t fill up with stale context.
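
As a minimal illustration of symbol-aware chunking, here is a sketch for Python sources using only the standard library. A real pipeline would need per-language parsers, dependency graphs, and incremental reindexing on commit.

```python
# Symbol-aware chunking: split a Python file along function and class
# boundaries instead of fixed-size text windows.
import ast
from dataclasses import dataclass

@dataclass
class CodeChunk:
    path: str
    symbol: str        # name of the function or class
    start_line: int
    source: str

def chunk_python_file(path: str) -> list[CodeChunk]:
    text = open(path, encoding="utf-8").read()
    tree = ast.parse(text)
    chunks = []
    for node in tree.body:  # top-level definitions only, for brevity
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            src = ast.get_source_segment(text, node)
            if src:
                chunks.append(CodeChunk(path, node.name, node.lineno, src))
    return chunks
```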

Second, treat inference as an SLO-bound service, not a black box. Routing tasks between smaller and larger models, sharing KV cache where possible, applying backpressure, and protecting p99 latency all matter. So do the unit economics. Track tokens/user/day, cache hit rates, and cost per accepted code change.
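
A small sketch of that bookkeeping, with the event fields and the per-token price as assumptions:

```python
# Unit-economics summary for an AI coding service: tokens per user per day,
# cache hit rate, and cost per accepted change. Event schema is assumed.
from collections import defaultdict

PRICE_PER_1K_TOKENS = 0.002   # assumed blended inference price

def summarize(events: list[dict]) -> dict:
    tokens_by_user: dict[str, int] = defaultdict(int)
    cache_hits = total = accepted = 0
    spend = 0.0
    for e in events:
        total += 1
        tokens_by_user[e["user"]] += e["tokens"]
        spend += e["tokens"] / 1000 * PRICE_PER_1K_TOKENS
        cache_hits += e.get("cache_hit", False)
        accepted += e.get("accepted", False)
    return {
        "tokens_per_user": dict(tokens_by_user),
        "cache_hit_rate": cache_hits / total if total else 0.0,
        "cost_per_accepted_change": spend / accepted if accepted else float("inf"),
    }
```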

Third, fit AI outputs into the workflows teams already use. Suggestions should land as diffs with rationale, tied into GitHub or GitLab checks, with clear references to the symbols, docs, and tests behind them. Chat-only tooling tends to age badly in disciplined teams.

And stop rewarding raw output. Reward accepted patches, fewer regressions, and faster recovery when things break.

That’s where the market is going. Greenfield’s startup list will probably reinforce it on the infrastructure side. JetBrains is making the same case inside the IDE.

The loudest AI products still sell fluency. The more useful ones respect constraints. For teams shipping real systems, that’s the direction worth paying attention to.
