
DuckDB user here. As far as I can tell, DuckDB doesn’t support distributed computation, so you have to set that up yourself, whereas Athena is essentially managed Presto and handles that for you. DuckDB also doesn’t support Avro or ORC yet.

DuckDB excels at single-machine compute where everything fits in memory or is streamable (the data can be local or on S3) — it’s lightweight and vectorized. I use it in Jupyter notebooks and in Python code.

But it may not be the right tool if you need distributed compute over a very large dataset.



> But it may not be the right tool if you need distributed compute over a very large dataset

I’m really interested in what the limits of DuckDB and Parquet are.

Can you give me an idea of what size you mean by “a very large dataset”?


In the distributed computing world, the rule of thumb is that you start to scale horizontally when your compute workload no longer fits in the memory of a single machine. So it depends on your workload and your hardware — there’s no fixed number for what counts as a large dataset.

DuckDB itself doesn’t have any baked-in limits. If the workload fits in memory, single-machine compute is usually faster than distributed compute — and DuckDB is faster than Pandas, and definitely faster than local Spark.



