ScaleOps just raised $130M on a simple bet: most AI clusters are badly underused
ScaleOps has raised a $130 million Series C at an $800 million valuation, with Insight Partners leading and Lightspeed, NFX, Glilot Capital Partners, and Picture Capital also participating. The headline is funding. The actual point is simpler: companies keep paying for GPU capacity they don't fully use.
That waste is easy to find in Kubernetes. Static requests and limits. Hand-tuned autoscaling rules. Nodes sized for worst-case traffic that rarely shows up. AI workloads make it worse. Inference traffic is uneven. GPU types vary across fleets. MIG partitions get stranded. Teams overprovision because missing p95 latency or hitting OOMs in production is worse than wasting money.
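For a sense of scale, here is a back-of-envelope sketch. Every number in it is an illustrative assumption, not a figure from ScaleOps or any customer; plug in your own fleet's size, rate, and utilization:

```python
# Every number here is an illustrative assumption, not ScaleOps or customer data.
nodes = 40                       # assumed GPU nodes in the fleet
gpus_per_node = 8
rate_per_gpu_hour = 2.50         # assumed blended $/GPU-hour
avg_utilization = 0.35           # assumed fleet-wide average

gpu_hours_per_month = nodes * gpus_per_node * 24 * 30
idle_hours = gpu_hours_per_month * (1 - avg_utilization)
print(f"idle GPU-hours/month: {idle_hours:,.0f}")                         # 149,760
print(f"spend on idle capacity: ${idle_hours * rate_per_gpu_hour:,.0f}")  # $374,400
```

At these assumed numbers, a modest fleet burns well over $300K a month on capacity that does nothing. The exact figure matters less than the shape: idle hours scale with fleet size, and nobody's dashboard screams about them.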
ScaleOps sells an autonomous control plane for Kubernetes that continuously adjusts compute, memory, storage, networking, and GPU allocation in real time. The company says it can cut cloud and AI infrastructure costs by up to 80%.
Treat that number like marketing until customers prove it out. The problem underneath it is real, and clearly big enough to support a company at this size.
Why the timing makes sense
For the past two years, AI infrastructure conversations have mostly been about supply. How many GPUs can you get? Which region still has H100s? Can you lock in enough capacity for training?
Now buyers are looking harder at utilization. A lot of enterprises already have expensive clusters. They want those clusters doing more work without wrecking service quality. In many cases, that's smarter than buying another pile of instances.
Inference is where this gets painful fast. Training is expensive, but operationally it's often easier to reason about. You know the topology. You know the data path. You can usually plan around preemption risk and throughput targets.
Inference is messier. Traffic swings around. Models change. Batch sizes shift. A deployment that looks fine on Tuesday morning can start missing latency targets after a product launch, a new tenant, or a change in prompt patterns. With static resource settings, teams either carry too much idle headroom or spend their time putting out fires.
That's the opening ScaleOps is chasing.
Kubernetes still leaves too much day-2 work to people
Kubernetes gives teams plenty of knobs. It doesn't make those knobs easy to run safely in production, especially with mixed AI workloads.
Most shops still have some version of this:
- HPA for replica count
- static CPU and memory requests and limits
- cluster autoscaler or Karpenter for nodes
- dashboards from Prometheus, Datadog, or New Relic
- someone on the platform team making judgment calls when things drift
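In a setup like that, the gap between requested and used capacity is usually already measurable in Prometheus. A minimal sketch, assuming kube-state-metrics and cAdvisor metrics are being scraped; the endpoint is hypothetical:

```python
import requests  # plain HTTP client; assumes a reachable Prometheus

PROM = "http://prometheus.example.internal:9090"  # hypothetical endpoint

def instant(query: str) -> float:
    """Run an instant PromQL query and sum the returned series."""
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return sum(float(sample["value"][1]) for sample in results)

# Requested CPU (from kube-state-metrics) vs. actual usage (from cAdvisor).
requested = instant('sum(kube_pod_container_resource_requests{resource="cpu"})')
used = instant('sum(rate(container_cpu_usage_seconds_total[5m]))')

print(f"CPU requested: {requested:.1f} cores, used: {used:.1f} cores")
print(f"request utilization: {used / requested:.0%}")
```

Running numbers like these before any vendor conversation gives you a baseline to hold the "up to 80%" claims against.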
It works until it starts costing too much, or breaks in a way nobody enjoys debugging.
The common failure mode is conservative over-allocation. Engineers size pods for peaks. They reserve extra memory to avoid OOM kills. They pin workloads to node pools and leave gaps elsewhere in the cluster. On GPU fleets, they live with fragmentation because fixing placement logic is risky and time-consuming.
ScaleOps is aiming at a bigger job than cost reporting or node autoscaling. It wants to sit in the control loop and change cluster state based on workload behavior and service targets. That category already has company. Cast AI, Kubecost, and Spot all cover real ground around cost visibility, bin packing, or node efficiency.
ScaleOps' pitch is that those pieces need to be coordinated. Resize the pod. Repack it onto a better node. Shift the MIG slice. Add or remove capacity. Check latency and error budgets after each move. Keep going.
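The shape of that loop is easy to sketch, even though the hard part is everything inside the stubs. Nothing below is ScaleOps' API; every function is a placeholder, and the point is the act-verify-revert structure:

```python
import random
import time

def plan_next_action():
    """Stub: pick the next move (resize, repack, re-slice, scale)."""
    return {"kind": "resize", "cpu_request": "500m"}

def capture_state():
    return {"cpu_request": "750m"}      # stub: snapshot for rollback

def apply_action(action):
    print(f"applying {action}")         # stub: would patch the cluster here

def revert(snapshot):
    print(f"reverting to {snapshot}")   # stub: restore the previous spec

def observed_p95_ms():
    return random.uniform(80, 140)      # stub: would query real telemetry

SLO_P95_MS = 120.0

for _ in range(3):                      # a few iterations for illustration
    action = plan_next_action()
    snapshot = capture_state()
    apply_action(action)
    time.sleep(1)                       # real systems wait for metrics to settle
    if observed_p95_ms() > SLO_P95_MS:
        revert(snapshot)                # check health after each move, back off
```

Everything interesting lives in the stubs: what to change next, how long to wait, and when a regression is real rather than noise.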
That's a harder product to build. If it works, it's also the kind of thing enterprises will pay for.
What the platform probably has to do
ScaleOps doesn't publish a detailed architecture, but a system making the changes it describes needs a few obvious parts.
First is telemetry. CPU and memory data from cAdvisor isn't enough by itself. You need pod-level and node-level signals, probably with eBPF for better runtime visibility. On GPUs, you'd expect NVIDIA DCGM data, likely through nvidia-dcgm-exporter, plus awareness of MIG profiles, fragmentation, and device availability. You also need application signals, not just infrastructure metrics. p95 latency, error rates, queue depth, maybe even request classes if the goal is to preserve service-level objectives instead of chasing raw utilization.
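As a sketch, that signal set might boil down to a handful of PromQL queries, reusing the `instant` helper from the earlier snippet. The GPU metric names come from nvidia-dcgm-exporter; the latency and error series assume your applications expose standard histograms and counters:

```python
# Reuses instant() from the earlier sketch. GPU metrics are real
# nvidia-dcgm-exporter names; the application series are assumed.
SIGNALS = {
    "gpu_util":   'avg(DCGM_FI_DEV_GPU_UTIL)',       # GPU utilization, from DCGM
    "gpu_mem_mb": 'avg(DCGM_FI_DEV_FB_USED)',        # GPU framebuffer in use
    "p95_s": 'histogram_quantile(0.95, '
             'sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
    "error_rate": 'sum(rate(http_requests_total{code=~"5.."}[5m]))',
}

for name, query in SIGNALS.items():
    print(f"{name}: {instant(query):.2f}")
```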
Then there's the decision engine. That means changing requests and limits with production-safe guardrails, coordinating with or replacing parts of HPA, and making node lifecycle decisions similar to Karpenter or cluster autoscaler. For AI workloads, GPU-aware placement is where things get tricky. A bad placement choice can strand shards or create cold-start pain when large model weights have to be pulled again.
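One plausible guardrail style, sketched with made-up policy numbers: target observed p95 usage plus headroom, but clamp how far any single step can move, so one bad estimate can't starve a workload in a single pass:

```python
def next_cpu_request(current_m: int, observed_p95_m: int,
                     headroom: float = 1.3, max_step: float = 0.2) -> int:
    """Right-size a CPU request (millicores) toward p95 usage plus headroom,
    never moving more than max_step per iteration. Policy numbers are
    illustrative assumptions, not ScaleOps defaults."""
    target = int(observed_p95_m * headroom)   # usage plus safety margin
    lower = int(current_m * (1 - max_step))   # clamp the per-step change
    upper = int(current_m * (1 + max_step))
    return max(lower, min(upper, target))

# Example: a pod requests 2000m but its p95 usage is only 600m.
print(next_cpu_request(2000, 600))   # -> 1600, not 780: shrink gradually
```

The clamp is the interesting part. An optimizer that jumps straight to the "right" number is an optimizer that occasionally jumps straight into an incident.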
Then comes actuation. The system has to patch Deployment or StatefulSet specs, adjust replica behavior, move workloads, add taints or expand node pools, and back off if service health degrades. A serious implementation probably uses canary-style rollouts for profile changes, plus policy controls along the lines of OPA Gatekeeper so the optimizer can't break organizational rules.
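With the official Kubernetes Python client, the actuation step itself is small; the canaries, policy checks, and rollback around it are where the engineering lives. All names and values below are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()   # or load_incluster_config() when running in-cluster
apps = client.AppsV1Api()

# Strategic-merge patch: containers are merged by name, so only the CPU
# request changes. Note that patching the pod template triggers a rollout.
patch = {"spec": {"template": {"spec": {"containers": [{
    "name": "api",                              # placeholder container name
    "resources": {"requests": {"cpu": "1600m"}},
}]}}}}

apps.patch_namespaced_deployment(name="checkout", namespace="shop", body=patch)
```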
None of this is new in isolation. Getting it to work together, safely, on messy production clusters is the hard part.
Spotting waste is easy enough. The real test is changing resources without causing latency spikes, OOMs, or noisy-neighbor issues a few minutes later.
AI makes the scheduling problem worse
A lot of cloud optimization tools were built around ordinary web workloads. Stateless services. Predictable scaling curves. CPU-heavy constraints.
AI serving breaks those assumptions.
Stacks like vLLM, TGI, or TensorRT-LLM depend on dynamic batching, memory-heavy KV caches, and large model artifacts that make cold starts expensive. A deployment can look underutilized by standard metrics and still be one bad scheduling decision away from a latency cliff. Pack too aggressively and p95 suffers. Pull a model onto the wrong node and startup time jumps. Split GPU capacity into awkward MIG slices and expensive hardware gets stranded.
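The KV cache is a good example of why standard metrics mislead here. A rough estimate, using the standard per-token formula and an assumed Llama-2-7B-like shape, shows how fast memory headroom disappears:

```python
# Rough KV-cache estimate: 2 (K and V) x layers x kv_heads x head_dim x bytes
# per token. Model shape is an assumed Llama-2-7B-like config.
layers, kv_heads, head_dim = 32, 32, 128
dtype_bytes = 2                                  # fp16
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes

batch, context = 32, 4096                        # assumed serving shape
kv_total_gb = kv_bytes_per_token * batch * context / 1024**3
print(f"{kv_bytes_per_token / 1024:.0f} KiB/token, "
      f"{kv_total_gb:.1f} GiB of KV cache at batch={batch}, ctx={context}")
# -> 512 KiB/token, 64.0 GiB: most of an 80GB card, before model weights
```

A GPU showing moderate utilization can still be one extra tenant or one longer context away from evicting cache and falling off the latency cliff.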
Training has its own problems. Gang scheduling matters. Placement around NVLink or other high-speed interconnects matters. Preemption hurts. Throughput matters more than latency. Any control plane claiming it can optimize both training and inference needs to know the difference. A generic policy engine applied to everything will make bad decisions.
That makes ScaleOps founder Yodar Shafrir's background relevant. He came from Run:ai, which spent years dealing with GPU scheduling problems in the real world. That doesn't guarantee execution, but it does suggest the company understands how shared GPU fleets actually go wrong.
The competitive pressure is real
There are obvious overlaps with cloud-native tooling. AWS, GCP, and Azure already automate parts of node management. Kubernetes keeps adding better schedulers, autoscalers, and queueing options. Karpenter, Volcano, and Kueue cover meaningful parts of the stack. FinOps tools explain where the money goes. Buyers can reasonably ask why they need another control plane.
ScaleOps needs a strong answer. The pitch has to be application-aware autonomy across environments, especially on mixed GPU estates and multi-cloud setups. If the product is mostly a cleaner wrapper around existing autoscaling primitives, that won't be enough. If it can actually coordinate SLOs, scheduling, GPU fragmentation, and cost policy better than native services, then it has a case.
The timing helps. NVIDIA bought Run:ai in 2024, and that pushed some enterprise buyers toward independent orchestration layers. Plenty of companies don't want their workload control plane tied too tightly to one vendor's stack, even if most of their hardware is NVIDIA.
What engineering teams should check before buying
The pitch is easy to like. The operating model is less tidy.
Autonomous infrastructure only works when the objective function is right. In plain English, if you haven't defined service goals clearly, the optimizer will chase the wrong outcome. Cost reduction without p95 latency targets is asking for trouble. So is changing pod sizing in a GitOps-heavy environment without deciding who owns live state.
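The difference is easy to show in miniature. Both scoring functions below are illustrative sketches, but only the second one refuses to "win" by breaking the SLO:

```python
def naive_score(cost_per_hour: float) -> float:
    return -cost_per_hour                  # cheaper is always better

def slo_aware_score(cost_per_hour: float, p95_ms: float, slo_ms: float) -> float:
    if p95_ms > slo_ms:
        return float("-inf")               # an SLO breach disqualifies the move
    return -cost_per_hour

# The naive objective prefers the cheap-but-slow option; the SLO-aware one doesn't.
print(naive_score(40.0) > naive_score(55.0))                              # True
print(slo_aware_score(40.0, 310, 250) > slo_aware_score(55.0, 180, 250))  # False
```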
A serious evaluation should stay focused on a few boring questions:
- How does it reconcile with Argo CD or Flux when it mutates resources live? (One common pattern is sketched after this list.)
- Can it operate in a tightly scoped namespace before touching cluster-wide node decisions?
- What rollback behavior exists when latency or error budgets degrade?
- How much RBAC power does it need?
- For GPU fleets, how does it handle MIG profile standardization, CUDA and driver compatibility, and fragmentation over time?
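On the first question, one common pattern is telling Argo CD to ignore the specific fields a live optimizer owns, so the two controllers stop fighting over them. A hedged sketch via the Kubernetes Python client; the Application name and field path are placeholders, and you should check it against your own Argo CD setup:

```python
from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()

# ignoreDifferences stops Argo CD from flagging drift on these fields.
# Self-heal setups also need the RespectIgnoreDifferences sync option,
# or Argo will still revert live changes on sync.
patch = {"spec": {"ignoreDifferences": [{
    "group": "apps",
    "kind": "Deployment",
    "jsonPointers": ["/spec/template/spec/containers/0/resources"],
}]}}

custom.patch_namespaced_custom_object(
    group="argoproj.io", version="v1alpha1", plural="applications",
    namespace="argocd", name="checkout",        # placeholder Application name
    body=patch,
)
```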
Security matters. Any system that can patch workloads, resize pools, and alter scheduling policy is deeply privileged. That's operationally useful and organizationally risky. Platform teams should treat it like a control-plane dependency, not a dashboard add-on.
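Making that privilege surface explicit helps the conversation. A namespace-scoped trial Role might look like the sketch below; the verbs and resources are illustrative, not a vendor requirement, and anything touching node pools needs cluster-scoped rights on top of this:

```python
from kubernetes import client, config

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()

role = client.V1Role(
    metadata=client.V1ObjectMeta(name="optimizer-trial", namespace="shop"),
    rules=[
        # Read and mutate workload specs: "patch" is the powerful verb here.
        client.V1PolicyRule(
            api_groups=["apps"],
            resources=["deployments", "statefulsets"],
            verbs=["get", "list", "watch", "patch"],
        ),
        # Delete pods so they can be repacked onto better nodes.
        client.V1PolicyRule(
            api_groups=[""],
            resources=["pods"],
            verbs=["get", "list", "watch", "delete"],
        ),
    ],
)

rbac.create_namespaced_role(namespace="shop", body=role)
```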
And then there's trust. Engineers aren't attached to manual tuning for sentimental reasons. They've just seen enough "autonomous" systems optimize the metric on the sales slide and quietly make production harder to understand. ScaleOps has to show that its safety rails hold up under real load.
What investors are betting on
ScaleOps says it has customers including Adobe, Wiz, DocuSign, Salesforce, and Coupa, and reports 450% year-over-year growth. If those deployments are broad and sticky, the funding is easy to understand. Companies are under pressure to improve AI unit economics without slowing product work. A tool that materially raises GPU and cluster utilization can pay for itself quickly.
The harder question is durability. Infrastructure companies inside the control loop can become very valuable, but only if they earn trust over time and stay ahead of features cloud providers absorb into native platforms. ScaleOps has a believable wedge because Kubernetes resource tuning for AI workloads is still too manual, too fragmented, and too easy to get wrong.
For platform teams, the takeaway is straightforward. Before asking for more GPUs, check whether the cluster is actually using the ones you already have. A lot of the time, it isn't.