September 29, 2025

Polars maker raises $21M as the open source DataFrame market matures

Polars raises $21M and takes a real shot at Spark’s territory

Polars, the company behind the open source DataFrame library of the same name, has raised €18 million, about $21 million, in a Series A led by Accel. Bain Capital Ventures and angel investors also joined.

For plenty of developers, that sounds like funding for a very fast Pandas alternative. The company’s actual ambition is larger. It wants to turn Polars from a strong single-node analytics engine into a full data platform. The two pieces that matter are Polars Cloud, a managed service for running Polars at scale, and Polars Distributed, now in public beta, for multi-node workloads that stretch into petabyte territory.

That’s a much harder business than building a fast DataFrame library. It’s also where infrastructure companies make money.

Why Polars got attention

Polars already solved a problem that Python data teams know well. Pandas is nice to use until the dataset no longer fits comfortably in memory, or performance falls apart under row-wise Python code and repeated materialization. Spark handles scale, but it comes with overhead, a steeper mental model, and plenty of JVM-era baggage, even when users stay in the Python API.

Polars caught on because it offered a cleaner way through that. You keep a DataFrame-style API, but the engine underneath is Rust, Arrow-native, lazily planned, vectorized, and multithreaded by default. In practice, it often feels closer to a query engine than an old-school in-memory Python library.

That design has landed. The company says Polars has passed 24 million downloads and is already in production across finance, life sciences, and logistics. Those are serious workloads.

The technical case

A lot of Polars’ speed comes from a stack of sensible engineering choices.

Rust gives the engine tight control over memory and concurrency without garbage collection pauses. Apache Arrow gives it a common columnar format that plays well with SIMD, CPU caches, and zero-copy data movement. The lazy API lets Polars build a logical plan before reading everything eagerly, so it can push filters and column selection into scans, fuse operators, and avoid dragging unnecessary data through memory.

That matters more than benchmark theater. If your code starts from scan_parquet() instead of read_parquet(), Polars can often skip work upfront. That’s why it feels fast on real jobs, not just synthetic ones.

A typical Polars pattern looks like this:

import polars as pl

result = (
    pl.scan_parquet("s3://my-bucket/events/date=2025-09-28/*.parquet")
    .filter(pl.col("country") == "US")  # pushed down into the Parquet scan
    .group_by("user_id")
    .agg(
        pl.len().alias("events"),  # row count per group (pl.count() is deprecated)
        pl.col("amount").sum().alias("spend"),
    )
    .filter(pl.col("events") > 5)
    .sort("spend", descending=True)
    .limit(1000)
    .collect(streaming=True)  # stream execution where the plan allows it
)

The important part is scan_*. It keeps the query as a plan long enough for Polars to prune partitions, skip columns, and stream operators where possible. If you’ve spent years watching Pandas read whole files into memory and filter afterward, this is a real improvement.
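
You don't have to take the optimizer on faith. LazyFrame.explain() prints the optimized plan, so you can confirm that filters and column selections were actually pushed into the scan. A minimal check, reusing the events path from the example above:

import polars as pl

plan = (
    pl.scan_parquet("s3://my-bucket/events/date=2025-09-28/*.parquet")
    .filter(pl.col("country") == "US")
    .select("user_id", "amount")
)

# The optimized plan should show the filter and the two-column projection
# inside the Parquet scan node, not as separate steps after a full read.
print(plan.explain())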

The hard part

Single-node speed gets attention on GitHub. Distributed systems decide whether you become infrastructure.

Polars Distributed is the serious move. Once you leave one machine behind, the problems get uglier fast. You have to partition data across workers, coordinate shuffles for joins and aggregations, deal with skewed keys, recover from failures, and stop network costs from eating the gains. Spark has spent years in that world. That history still matters.

Polars does have some structural advantages. Its execution core is already columnar and Arrow-native, which should cut serialization friction between operators and nodes. Rust should help with per-node efficiency and startup behavior. And Polars already has a lazy logical plan, so extending that into a distributed physical plan makes sense. The appeal is obvious: write one Polars query, run it locally on a laptop, then push it to a cluster without switching programming models.

Still, distributed systems are where promising engines get exposed. Spark’s scheduler, shuffle behavior, retries, and ecosystem were built through a lot of production pain. Teams may complain about Spark, but they trust it because it has seen just about every ugly edge case. Polars Distributed still has to show it can survive shuffle-heavy jobs, stragglers, worker failure, and bad data distributions without falling apart or getting expensive in odd ways.

There’s a big gap between a fast engine and a dependable distributed system.

Why Polars Cloud matters

The managed service may matter more than the distributed runtime in the near term.

Polars Cloud gives the company something teams can buy without taking on cluster operations themselves. The shape is familiar enough: separate compute and storage, ephemeral workers, object store integration across S3, GCS, or Azure, plus some mix of autoscaling, credentials management, job observability, and cost controls.

That model has worked elsewhere. DuckDB built grassroots adoption on laptops, then companies like MotherDuck turned that into a managed analytics product. Polars is going after a similar opening, with a stronger Python and data engineering tilt.

It’s a sensible business. Open source gets developer adoption. Cloud revenue pays for distributed systems work, support, security, and enterprise features.

The harder question is whether buyers want another analytics control plane. Some will, if the economics are good enough. But enterprise adoption won’t turn on benchmark charts. It will turn on IAM integration, network isolation, audit logs, data residency, and how neatly Polars fits into existing metadata, governance, and observability stacks. Databricks got big because organizations wanted a place to centralize messy data work.

Pressure on Spark’s default status

Polars isn’t going to erase Spark in a year.

What it can do is shrink the set of jobs where Spark is the automatic answer. If you can keep a Pythonic DataFrame API, run against Parquet in object storage, and get strong throughput without dragging in a heavy JVM-centric environment, a lot of teams will try that first. Especially smaller platform teams that don’t want a full Databricks or Spark estate for every batch analytics task.

That pressure matters, and it explains why this round matters.

If Polars Distributed works, Spark still keeps the edge in breadth: connectors, catalogs, governance, mature scheduling, and deep enterprise muscle. But for a large chunk of ETL, feature generation, and ad hoc analytics over columnar data, “use Spark because it scales” starts to sound lazy. Plenty of teams don’t need the full Spark ecosystem. They need sane APIs, decent fault tolerance, and lower compute bills.

Polars is going straight at that middle.

What to watch before betting on it

A Polars pilot makes sense if your workloads already fit the profile:

  • mostly columnar data, especially Parquet or Arrow
  • object storage as the main source of truth
  • batch analytics, ETL, aggregations, joins
  • a team that wants to stay in Python without paying the usual performance tax

It makes less sense if your stack depends heavily on Spark-native connectors, deep governance tooling, or a long tail of battle-tested enterprise integrations.

A few technical realities matter immediately.

First, the lazy API is effectively required if you want Polars’ best performance. If engineers keep pulling data eagerly into memory and then applying Python logic row by row, most of the upside disappears.
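
The difference is easy to see side by side; the file name here is just a placeholder:

import polars as pl

# Eager: the whole file is read into memory, then filtered after the fact
df = pl.read_parquet("events.parquet").filter(pl.col("country") == "US")

# Lazy: the filter and projection are part of the plan, so they can be
# pushed into the scan and the rest of the file is never materialized
df = (
    pl.scan_parquet("events.parquet")
    .filter(pl.col("country") == "US")
    .select("user_id", "amount")
    .collect()
)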

Second, data layout still matters. Hive-style partitioning, sensible Parquet row group sizes, and avoiding a small-files mess in object storage matter here just as much as they do in every other modern analytics engine.
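
Concretely, assuming a Hive-style date= layout in object storage, the partition-aware scan and the write-side knobs look roughly like this; the row group size is a workload-dependent guess, not a recommendation:

import polars as pl

# Assumption: files live under .../events/date=YYYY-MM-DD/*.parquet
lf = pl.scan_parquet(
    "s3://my-bucket/events/**/*.parquet",
    hive_partitioning=True,  # surface the date=... directories as a column
)

# A filter on the partition column lets the scan skip whole directories
day = lf.filter(pl.col("date") == "2025-09-28").collect()

# On the write side, row_group_size bounds how many rows each Parquet row
# group holds; too-small groups recreate the small-files problem locally
day.write_parquet("events-2025-09-28.parquet", row_group_size=512_000)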

Third, distributed joins will surface the same old pain points: skew, shuffle volume, and memory pressure. If your workload has one hot key that owns half the table, a cleaner API won’t rescue it.
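
There's no public Polars Distributed API to demonstrate yet, but the classic mitigation for a hot key, salting, translates directly into DataFrame terms. A hypothetical sketch with made-up file and column names: split the hot key across N buckets on the big side and replicate the small side once per bucket, so no single worker owns the whole key.

import polars as pl

N_SALTS = 8  # tuning knob: how many buckets to spread each key across

# Big side: assign every row a round-robin salt bucket
events = pl.scan_parquet("events.parquet").with_columns(
    (pl.int_range(pl.len()) % N_SALTS).alias("salt")
)

# Small side: replicate each row once per salt value via a cross join
users = pl.scan_parquet("users.parquet").join(
    pl.LazyFrame({"salt": list(range(N_SALTS))}), how="cross"
)

# Joining on (user_id, salt) carves one hot user_id into N_SALTS pieces
joined = events.join(users, on=["user_id", "salt"]).collect()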

And for Polars Cloud, security due diligence is mandatory. Private networking, SSO, encryption, tenant isolation, and auditability decide whether serious companies can test it with real data.

The wider shift

Polars is part of a broader move in data infrastructure toward Rust and Arrow as the execution layer. You can see the same pull around Apache Arrow itself, DataFusion, and Meta’s Velox. The split between ergonomic local tools and ugly distributed systems has narrowed.

That doesn’t settle the winners. It does give Polars the money to push beyond the “great open source library” phase and see whether it can become a platform teams standardize on.

That’s a meaningful step. And it puts Polars in a much tougher fight than benchmark threads on X.
