
DuckDB user here. As far as I can tell, DuckDB doesn’t support distributed computation, so you have to set that up yourself, whereas Athena is essentially managed Presto and handles that for you. DuckDB also doesn’t support Avro or ORC yet.

DuckDB excels at single-machine compute where everything fits in memory or is streamable (the data can be local or on S3) — it’s lightweight and vectorized. I use it in Jupyter notebooks and in Python code.

But it may not be the right tool if you need distributed compute over a very large dataset.



> But it may not be the right tool if you need distributed compute over a very large dataset

I’m really interested in what the limits of DuckDB and Parquet are.

Can you give me an idea of what size you mean by “a very large dataset”?


In the distributed computing world, the rule of thumb is that you start to scale horizontally when your compute workload no longer fits in the memory of a single machine. So it depends on your workload and your hardware — there’s no fixed number for what counts as a large dataset.

DuckDB itself doesn’t have any baked-in limits. If the workload fits in memory, single-machine compute is usually faster than distributed compute — and DuckDB is faster than Pandas, and definitely faster than local Spark.



