What Laude Institute's first Slingshots AI grants are actually funding
Laude’s first Slingshots grants target the weakest part of AI today: evaluation
Laude Institute has announced the first batch of its Slingshots AI grants, backing 15 projects meant to “advance the science and practice of AI.” The name is secondary. The underlying bet is more interesting.
Laude is putting money, compute, and engineering support into AI evaluation, an area that still lags far behind model training and product demos. That gap is getting harder to ignore as coding agents, enterprise copilots, and tool-using systems move into production.
A lot of the market still runs on cherry-picked examples, private scorecards, and static benchmarks that say very little about how agents behave outside a lab. Slingshots is funding some of the infrastructure needed to fix that.
That matters.
Why this batch stands out
The first Slingshots cohort covers a specific set of problems:
- Terminal Bench for evaluating coding agents in a real shell environment
- A newer ARC-AGI effort around abstract reasoning and generalization
- Formula Code for code optimization tasks
- BizBench for white-collar enterprise workflows
- Work on reinforcement learning structures
- Work on model compression
- CodeClash, led by a co-founder of SWE-Bench, using competition-style code evaluation
That list is a decent map of where the field is stuck.
The hard part of AI in 2026 is measuring whether an agent can complete a messy task, recover from failure, use tools safely, stay within budget, and keep doing that as conditions change. Static test sets don't capture much of that.
That problem gets sharper with agents. Once a model can call tools, write files, run tests, or interact with ERP and CRM systems, you're no longer judging a single text output. You're judging a system with state, permissions, side effects, and a long list of failure modes.
The industry has been slow to catch up. Slingshots looks like an attempt to move evaluation forward before deployment gets even farther ahead of measurement.
From benchmark scores to real environments
Projects like Terminal Bench show where evaluation needs to go.
If you want to assess a coding agent, a multiple-choice Python question doesn't tell you much. You want to see whether it can work in a pseudo-terminal, use git, inspect files, run pytest, recover from bad commands, and finish the job without trashing the repo. That requires sandboxed execution, deterministic seeds, pinned dependencies, restricted network access, and full trace capture.
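Laude hasn't published Terminal Bench's harness, but the shape of that setup is roughly this. A minimal sketch, assuming Docker is available and the agent's commands are replayed from a log; the image tag, resource limits, and paths are placeholders, not anything from the actual benchmark:

```python
import json
import subprocess
import time
from pathlib import Path

def run_step(repo_dir: Path, command: str, image: str = "python:3.11-slim",
             timeout_s: int = 600) -> dict:
    """Run one agent-issued shell command inside a locked-down container
    and capture a trace entry. Image tag, limits, and paths are placeholders."""
    docker_cmd = [
        "docker", "run", "--rm",
        "--network", "none",            # no outbound network during the eval
        "--memory", "2g", "--cpus", "1",
        "-v", f"{repo_dir.resolve()}:/workspace",
        "-w", "/workspace",
        image, "bash", "-lc", command,
    ]
    start = time.monotonic()
    try:
        proc = subprocess.run(docker_cmd, capture_output=True, text=True,
                              timeout=timeout_s)
        status, exit_code = "finished", proc.returncode
        stdout, stderr = proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        status, exit_code = "timeout", None
        stdout, stderr = "", ""         # partial output, if any, is on the exception
    return {
        "command": command,
        "status": status,
        "exit_code": exit_code,
        "wall_time_s": round(time.monotonic() - start, 2),
        "stdout": stdout,
        "stderr": stderr,
    }

# Replay an agent's command log and keep the full trace for scoring.
trace = [run_step(Path("./repo"), cmd) for cmd in ["git status", "pytest -q"]]
Path("trace.json").write_text(json.dumps(trace, indent=2))
```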
A useful score also has to be more than pass or fail.
For terminal tasks, the metrics that matter look more like:
- task_completion_rate
- time_to_fix
- command_error_rate
- recovery behavior after failed steps
- filesystem safety and destructive action avoidance
That's much closer to real use. It's also much harder to build and maintain.
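A minimal scoring sketch along those lines, assuming each run has already been traced into the fields below; the field and metric names are illustrative, not Terminal Bench's actual schema:

```python
from dataclasses import dataclass

@dataclass
class TerminalRun:
    completed: bool            # did the agent finish the task?
    wall_time_s: float         # time from first command to passing state
    commands: int              # total shell commands issued
    failed_commands: int       # commands that exited non-zero
    recovered_failures: int    # failed commands followed by a successful retry
    destructive_actions: int   # rm -rf, force-pushes, writes outside the repo, ...

def score(runs: list[TerminalRun]) -> dict:
    """Aggregate one suite's runs into the metrics discussed above."""
    n = len(runs)
    failed = sum(r.failed_commands for r in runs)
    return {
        "task_completion_rate": sum(r.completed for r in runs) / n,
        "median_time_to_fix_s": sorted(r.wall_time_s for r in runs)[n // 2],
        "command_error_rate": failed / max(sum(r.commands for r in runs), 1),
        "recovery_rate": sum(r.recovered_failures for r in runs) / max(failed, 1),
        "destructive_action_rate": sum(r.destructive_actions > 0 for r in runs) / n,
    }
```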
The same pattern shows up in BizBench, which targets enterprise workflows across CRM, ERP, spreadsheets, email, and compliance checks. If an agent extracts data correctly but routes it to the wrong record, drafts an inaccurate summary, or ignores access boundaries, that failure is the substance of the eval.
A lot of public AI discussion still treats “agent performance” as a single number. For enterprise systems, that doesn't hold up. You need task success, error severity, confidence calibration, handoff quality, and auditability. Security controls belong inside the eval setup too. Role-based permissions, PII handling, and audit logs can't be afterthoughts.
Code evaluation is getting more adversarial
CodeClash could end up being one of the more important projects in the batch.
SWE-Bench helped push software engineering evaluation toward real repos and actual issue resolution. But fixed benchmarks age quickly. Once frontier models and agent stacks start optimizing for them, leaderboard gains get harder to trust.
CodeClash points toward a competition-style setup, likely borrowing ideas like ELO ratings, adversarial challenge generation, mutation testing, and rolling addition of newly discovered failure cases. That's healthier than letting a benchmark turn into a memorized exam.
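CodeClash's scoring hasn't been detailed publicly, but the standard Elo update that competition-style evals usually lean on is simple enough to sketch; the starting ratings and K-factor below are conventional defaults, not anything from the project:

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Standard Elo update after one head-to-head code challenge.
    score_a is 1.0 if agent A's submission wins, 0.0 if it loses, 0.5 for a draw."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Example: two agents start at 1200; A wins a challenge and gains 16 points.
a, b = elo_update(1200.0, 1200.0, score_a=1.0)
```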
There is a trade-off. Dynamic evals are harder to compare over time. If tasks keep changing, longitudinal tracking gets messy. Teams will need versioning, audit trails, and documentation strong enough to make the results usable for procurement and model governance.
That tension isn't going away. Evals need to evolve enough to reduce overfitting, while staying stable enough to compare models across quarters.
Formula Code gets at a neglected problem
Most code benchmarks still treat correctness as the finish line. In real engineering work, it often isn't.
Formula Code focuses on code optimization, which is more useful than another pass/fail coding suite. If an agent changes code that still passes tests but worsens runtime, memory usage, cache behavior, or binary size, it hasn't done much good. It may have made the system worse.
A serious optimization benchmark should track metrics like:
- runtime
- memory_footprint
- binary_size
- cache_misses
- potentially energy_consumption
That implies real tooling. LLVM/Clang, BOLT, gprof, Valgrind, perf, and CI-backed microbenchmarks all fit here.
It also creates an obvious failure mode: metric gaming. Agents can learn superficial transformations that score well on narrow microbenchmarks without improving production behavior. Randomized inputs, profile-guided evaluation, and regression checks in CI are the standard defenses, but they take real maintenance work.
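A rough sketch of what those defenses look like in practice, assuming the benchmark times candidate functions on freshly randomized inputs and gates regressions in CI; the functions, input sizes, and threshold here are placeholders:

```python
import random
import statistics
import time
import tracemalloc

def benchmark(fn, make_input, trials: int = 30, seed: int = 0) -> dict:
    """Time a candidate implementation on freshly randomized inputs so an agent
    can't tune for one fixed test case. Returns runtime and peak-memory stats."""
    rng = random.Random(seed)
    runtimes, peaks = [], []
    for _ in range(trials):
        data = make_input(rng)
        tracemalloc.start()
        t0 = time.perf_counter()
        fn(data)
        runtimes.append(time.perf_counter() - t0)
        peaks.append(tracemalloc.get_traced_memory()[1])
        tracemalloc.stop()
    return {
        "runtime_median_s": statistics.median(runtimes),
        "runtime_p95_s": sorted(runtimes)[int(0.95 * (trials - 1))],
        "peak_memory_bytes": max(peaks),
    }

# Regression gate (placeholders: compare the agent's patched function against the
# original). Fail CI if the "optimization" regressed the median runtime.
baseline = benchmark(sorted, lambda rng: [rng.random() for _ in range(50_000)])
patched = benchmark(sorted, lambda rng: [rng.random() for _ in range(50_000)])
assert patched["runtime_median_s"] <= 1.10 * baseline["runtime_median_s"]
```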
This is exactly the sort of eval work the ecosystem needs. Optimization is where model claims get fuzzy fast.
Compression and RL matter for different reasons
The batch also includes work on reinforcement learning structures and model compression. Those projects may sound less glamorous than coding-agent benchmarks, but they matter if deployment economics matter.
For RL-related work, the appeal is clearer ways to standardize trajectories, rewards, and tool-use sequences so models can be compared on something firmer than anecdotal agent runs. The risk is familiar. If the reward spec is sloppy, the model will exploit it.
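None of the funded RL work has published a trajectory format, but the standardization problem looks roughly like this: a common record for observations, actions, and rewards that any agent run can be logged into. Field names below are illustrative only:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    observation: str          # what the agent saw (tool output, file contents, ...)
    action: str               # the tool call or command it issued
    reward: float             # step-level reward; 0.0 if only a terminal reward is used

@dataclass
class Trajectory:
    task_id: str
    steps: list[Step] = field(default_factory=list)
    terminal_reward: float = 0.0   # e.g. tests pass = 1.0, budget blown = penalty

    def total_return(self, gamma: float = 1.0) -> float:
        """Discounted return; a sloppy reward spec here is what agents exploit."""
        r = sum(step.reward * gamma**i for i, step in enumerate(self.steps))
        return r + self.terminal_reward * gamma ** len(self.steps)
```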
Compression has a shorter path to practical value. Quantization, pruning, and distillation are established ideas, but teams still cut corners on evaluation. They'll measure token throughput and maybe top-line accuracy, then stop there.
That misses the real question: what capability do you lose under budget constraints, and where do failures show up first? A useful compression eval should track accuracy_delta alongside latency, VRAM use, energy, and deployment targets. If you're shipping to commodity GPUs or edge hardware, you also care about low-memory behavior and hardware-level faults.
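A minimal sketch of that comparison, assuming the full and compressed models were measured on the same held-out eval set; the numbers below are hypothetical:

```python
def compression_report(baseline: dict, compressed: dict) -> dict:
    """Compare a compressed model against its full-precision baseline.
    Each dict holds measurements from the same held-out eval set:
    accuracy (0-1), p50 latency in ms, and peak VRAM in GB."""
    return {
        "accuracy_delta": compressed["accuracy"] - baseline["accuracy"],
        "latency_speedup": baseline["latency_ms_p50"] / compressed["latency_ms_p50"],
        "vram_reduction_gb": baseline["vram_gb"] - compressed["vram_gb"],
    }

# Hypothetical numbers for illustration only.
report = compression_report(
    baseline={"accuracy": 0.81, "latency_ms_p50": 420, "vram_gb": 40},
    compressed={"accuracy": 0.78, "latency_ms_p50": 190, "vram_gb": 11},
)
# A -0.03 accuracy_delta may be acceptable for a 2.2x speedup, or not,
# depending on where the lost capability shows up.
```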
Cheap models are only useful if they're still dependable enough to trust.
What engineering teams should take from this
If you're building with agents now, the practical takeaway is simple: evaluation needs to sit inside the delivery pipeline, not in a quarterly research notebook.
A solid starting point looks like this:
- Containerized eval environments with Docker or OCI-compatible runtimes
- Reproducible seeds and pinned toolchains
- Full execution traces including stdout, stderr, command history, resource usage, and network calls
- Multi-metric scoring that mixes correctness, latency, cost per task, and severity of failure
- Hidden constraints, mutation tests, and randomized inputs to reduce benchmark gaming
- Long-session degradation tests for agents that drift over time
- Security instrumentation for permissions, data access, and destructive actions
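Tied together, that checklist mostly amounts to a reproducible run manifest. A sketch, with illustrative field names rather than any published schema:

```python
# A minimal run manifest, so every eval result can be reproduced and audited.
# All field names and values are illustrative placeholders.
manifest = {
    "suite": "terminal-tasks-v3",           # versioned task set
    "image": "eval-sandbox@sha256:<digest>",  # pinned container image
    "seed": 1234,                            # reproducible task sampling
    "model": {"name": "agent-x", "revision": "2025-12-01"},
    "limits": {"wall_time_s": 600, "cost_usd": 2.00, "network": "none"},
    "metrics": ["task_completion_rate", "command_error_rate",
                "cost_per_task_usd", "destructive_action_rate"],
    "artifacts": ["stdout", "stderr", "command_history", "resource_usage"],
}
```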
For enterprise workflows, add confidence calibration. An agent that's wrong and highly confident creates a different operational risk than a low-confidence miss. Metrics like ECE and Brier score aren't academic extras. They help decide when to automate, when to require approval, and when to hand work to a human.
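Both metrics are cheap to compute once you log the agent's self-reported confidence next to the graded outcome. A sketch, assuming binary task outcomes and equal-width confidence bins:

```python
import numpy as np

def brier_and_ece(confidences: np.ndarray, correct: np.ndarray, bins: int = 10):
    """Brier score and expected calibration error for binary task outcomes.
    `confidences` are the agent's self-reported success probabilities in [0, 1],
    `correct` is 1 when the task was actually done right, else 0."""
    brier = float(np.mean((confidences - correct) ** 2))
    edges = np.linspace(0.0, 1.0, bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        if i == bins - 1:
            mask = (confidences >= lo) & (confidences <= hi)
        else:
            mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap   # bin weight times calibration gap
    return brier, float(ece)
```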
If you're evaluating vendors, this matters too. Third-party benchmarks with transparent protocols are becoming part of due diligence. Vendor demos and bespoke internal scorecards are easy to spin. Public leaderboards don't tell the whole story, but they're still better than screenshots from a sales deck.
Pressure on the market
Laude's grant program won't fix evaluation fragmentation on its own. If every lab, startup, and enterprise builds a private benchmark suite, comparison gets weaker and marketing gets louder.
Still, this cohort is pointed at the right problem. It treats benchmarks, harnesses, and eval protocols as infrastructure. That's overdue.
The AI industry has spent the past two years acting as if model capability naturally turns into deployable systems. It doesn't. What's missing is rigorous measurement of behavior under realistic conditions, with costs, controls, and failure modes included.
That's what Slingshots is paying for.
And right now, the boring part may be the part with the most value.