Llm May 30, 2025

TechCrunch Sessions: AI 2025 focuses on custom LLMs, multimodal systems, and cost

TechCrunch has started the one-week countdown to TechCrunch Sessions: AI 2025, scheduled for June 5 at UC Berkeley’s Zellerbach Hall. What stands out in the lineup is the agenda: custom LLMs, hybrid deployment, multimodal systems, tuning economics, a...

TechCrunch Sessions: AI 2025 focuses on custom LLMs, multimodal systems, and cost

TechCrunch Sessions: AI is focusing on the questions that matter for custom models

TechCrunch has started the one-week countdown to TechCrunch Sessions: AI 2025, scheduled for June 5 at UC Berkeley’s Zellerbach Hall. What stands out in the lineup is the agenda: custom LLMs, hybrid deployment, multimodal systems, tuning economics, and the policy and security problems that show up once a model moves past the demo stage.

That framing is a lot better than another day of leaderboard talk.

The names are familiar. Speakers from Google Cloud, Anthropic, OpenAI, and Twelve Labs are set to cover deployment, tuning, embeddings, and infrastructure. There’s also a live startup session, “So You Think You Can Pitch,” where founders get feedback from VCs on benchmarks, data strategy, and go-to-market plans. That could be useful if it stays grounded. Too many AI pitch reviews drift into pricing fiction and vague distribution talk. The source material suggests this one will spend more time on model and data decisions, which is where the hard parts usually are.

Why this agenda matters

The industry spent the last two years treating frontier model size like the main story. For most engineering teams, it isn’t. The harder problem is getting quality, latency, compliance, and cost into the same system without breaking something else.

That’s why the event’s focus on frontier model deployment and practical scale-up strategies makes sense.

If you’re shipping AI in a real product, you’re probably working through some version of this:

  • Fine-tune or prompt?
  • Hosted API or self-hosted weights?
  • On-prem data or a cloud endpoint?
  • Add retrieval or keep the system simpler?
  • Quantize and accept some quality loss, or keep paying the GPU bill?

Those are engineering decisions with business consequences. They also don’t have clean universal answers, which is why vendor-heavy conference talks can be a waste of time. The useful sessions will be the ones that admit trade-offs.

Technical topics worth watching

A few parts of the agenda line up directly with what teams are building right now.

LoRA and QLoRA still make sense for custom work

The conference materials point to LoRA and QLoRA, and that tracks. These low-rank adaptation methods are still the most accessible way to tune a model for a specific domain without paying for full retraining.

The appeal is straightforward: lower memory use, faster iteration, and solid task performance when the base model is already good enough. The source cites cost reductions above 90%, which is directionally consistent with why teams use these methods.

There’s a limit, though, and it gets glossed over all the time. Parameter-efficient fine-tuning works best when the task is fairly narrow and the data is clean. If the base model is missing core behavior, LoRA won’t fix it. Teams that use it as a substitute for better data usually end up with a brittle system.

RAG is still everywhere

The event also puts weight on retrieval-augmented generation, with references to vector stores like FAISS and ChromaDB. No surprise there. RAG is still the default move when a model needs fresher or more domain-specific context than pretraining can provide.

The pipeline is familiar:

  1. embed documents
  2. store vectors in an index
  3. embed the query
  4. retrieve the nearest chunks
  5. place those chunks into the prompt
  6. generate a response

That’s easy enough to prototype. The sample code in the source uses a FAISS index and a seq2seq model to show the pattern. Fine for teaching. Still far from a production system.

In production, RAG usually fails in mundane places:

  • bad chunking
  • retrieval that returns related but useless passages
  • prompt windows full of noise
  • missing or misleading citations
  • stale indexes
  • latency spikes because retrieval and generation were tuned separately

The better RAG discussions in 2025 are about ranking, metadata filters, hybrid search, cache policy, and evaluation. If the conference gets into that, it’ll be worth paying attention.

Multimodal work is getting practical

The agenda’s mention of multi-modal fusion, including vision transformers paired with language models, is one of the stronger signals here. With Twelve Labs involved, video understanding and search are probably part of the discussion.

That’s where a lot of practical AI work is headed: systems that can interpret screenshots, clips, diagrams, audio, and logs together.

It’s still expensive and awkward. Video indexing at scale turns into a storage and latency problem fast. Evaluation is messy too. Text benchmarks are shaky enough. Multimodal benchmarks are worse, and polished demos can hide very narrow task definitions.

Hybrid infrastructure is the normal case now

One of the better signs in the conference brief is the attention to hybrid on-prem/cloud architectures.

That matters because plenty of enterprise AI deployments have already run into the limits of pure cloud inference. It can get too expensive, too slow, or too awkward from a data-governance standpoint. Full on-prem deployment has its own problems: GPU capacity, MLOps overhead, and staffing.

So teams end up with some hybrid version of this:

  • keep sensitive data or retrieval systems on-prem
  • call cloud-hosted models for burst capacity or premium tasks
  • send lower-value inference to quantized local models
  • keep fallback paths when an external API degrades or changes pricing

It’s messy. It’s also standard.

For tech leads, the architecture question starts to look a lot like traffic engineering. You’re balancing sovereignty, p95 latency, throughput, incident response, and cost per request. Conferences often avoid that level of detail because it’s less flashy than benchmark slides. It’s still the work.

Security and policy are back on the agenda

The event also includes panels on international AI policy, defense applications, and production safeguards like red-teaming and adversarial testing. Good. Those topics were underplayed at a lot of AI events until model misuse, prompt injection, and compliance pressure forced them back in.

For engineers, the point is simple: LLM security controls have to be part of the system design.

If you’re building RAG systems, prompt injection is a design problem. If you’re fine-tuning on proprietary data, lineage and access control matter before training starts. If you’re deploying into regulated environments, auditability and fallback behavior need to be specified early.

The defense angle will make some people uneasy, fairly enough. But defense and public-sector deployments tend to force harder discussions about robustness, traceability, and failure modes. Commercial teams should listen. The same issues show up there too.

The pitch session may be one of the more honest parts

The startup pitch segment sounds secondary, but it’s worth watching because it exposes what investors are actually screening for now.

A year ago, a lot of AI fundraising still ran on broad claims about copilots, agents, and market size. That pitch has worn thin. Investors now want evidence that a team understands:

  • where its training or retrieval data comes from
  • what its inference stack costs per request
  • whether the product depends on a single model vendor
  • how performance degrades under load
  • what happens when the model is wrong

That’s a healthy shift. It pushes companies toward real systems thinking.

If founders are getting live feedback on benchmarks, data strategy, and architecture viability, that may end up being the most revealing part of the event.

What senior engineers should watch for

For developers and AI leads, the value here is pretty straightforward.

If you’re working on large-model deployment, the useful sessions will be the ones with concrete discussion of quantization, memory-efficient attention, and the actual limits of serving 70B+ parameter models. If you’re running retrieval systems, you want specifics on embedding choice, nearest-neighbor tuning, indexing overhead, and eval. If you own architecture, you want numbers on cost, latency, and governance trade-offs.

A few questions are worth carrying into any talk like this:

  • What fails under production traffic?
  • How is the system evaluated beyond benchmark screenshots?
  • How much of the stack depends on one vendor API?
  • What are the latency and memory costs after retrieval, guardrails, and observability are added?
  • What does rollback look like when model behavior shifts?

Those questions separate serious engineering from conference filler.

TechCrunch Sessions: AI looks strongest where it stays close to deployment reality. Custom models, retrieval, multimodal indexing, hybrid infrastructure, red-teaming, and cost discipline are the right subjects. If the speakers stay concrete, this should be useful for people who are actually shipping AI systems.

Keep going from here

Useful next reads and implementation paths

If this topic connects to a real workflow, these links give you the service path, a proof point, and related articles worth reading next.

Relevant service
AI agents development

Design agentic workflows with tools, guardrails, approvals, and rollout controls.

Related proof
AI support triage automation

How AI-assisted routing cut manual support triage time by 47%.

Related article
What OpenAI's GPT-4.5 immigration case reveals about AI staffing risk

A researcher who worked on GPT-4.5 at OpenAI reportedly had their green card denied after 12 years in the US and now plans to keep working from Canada. That is an immigration story. It's also a staffing, operations, and systems problem for any compan...

Related article
OpenAI retires GPT-4o as sycophancy concerns remain unresolved

OpenAI is discontinuing access to GPT-4o along with GPT-5, GPT-4.1, GPT-4.1 mini, and o4-mini. The one worth focusing on is GPT-4o. OpenAI is retiring one of its most widely used multimodal models while questions about sycophancy still hang over it. ...

Related article
Cohere acquires Ottogrid to address the data pipeline gap in enterprise AI

Cohere has acquired Ottogrid, a Vancouver startup that builds automated market research workflows. That may sound narrow. It maps directly to a problem that still trips up plenty of enterprise AI deployments: generating text is easy, feeding systems ...