* If it's simple transforms, use CLI tools (a couple of one-liners are sketched after this list).

* If it requires aggregation and it's small, use CLI tools as well.

* If it's data you're using over and over again, load it into a database first and do the cleaning there (ELT).

* If it's 2 TB of data or under, still use bzip2: its block-based format gives you splittable streams you can fan out to GNU parallel (see the sketch after this list).

* If it requires massive aggregations or windows, use spark|flink|beam.

* If you need to repeatedly process the same giant dataset, use spark|flink|beam.

* If the data is highly structured and you mainly need aggregations and filtering on a few columns, use columnar DBs.
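
For the first two bullets, a couple of one-liners along these lines usually do the job (just a sketch; orders.csv and the column numbers are placeholders):

    # simple transform: project two columns out of a CSV, skipping the header row
    awk -F, 'NR > 1 {print $2 "," $5}' orders.csv > slim.csv

    # small aggregation: most frequent values in column 3
    cut -d, -f3 orders.csv | sort | uniq -c | sort -rn | head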
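
And a rough sketch of the bzip2 + GNU parallel pattern, assuming the data is kept as many independent .bz2 chunks (chunks/*.bz2 and the column layout are made up):

    # each chunk is decompressed and aggregated on its own core, then the partial sums are merged
    ls chunks/*.bz2 \
      | parallel 'bzcat {} | awk -F, "{s += \$3} END {print s}"' \
      | awk '{total += $1} END {print total}'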

I've been using Dlang with LDC a lot because of its fast compile-time regex and built-in JSON support. Python 3 + pandas is also a good choice if you don't want to use awk.



Before reaching for spark, etc:

Sort is good for aggregations that fit on disk (TBs these days, I guess)
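
E.g. a group-by count where GNU sort spills to temp files instead of holding everything in RAM (events.tsv and the key column are placeholders):

    # cap the in-memory buffer with -S; everything past that goes to temp files on disk
    cut -f1 events.tsv | LC_ALL=C sort -S2G --parallel=4 | uniq -c | sort -rn | head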

Perl does well too if the output fits in a hashtable in DRAM, so 10s (or maybe 100s?) of GBs.
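
Something like this; the working set is one hash entry per distinct key, not per row (file and columns made up):

    # sum column 3 grouped by column 1, entirely in memory
    perl -F'\t' -lane '$sum{$F[0]} += $F[2];
        END { print "$_\t$sum{$_}" for sort keys %sum }' events.tsv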


For bzip2, why not just use pbzip2? Frankly, I wish distros would replace the stock bzip2 with pbzip2 (I think it's drop-in compatible).
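
For what it's worth, it takes the same basic flags, so swapping it into a pipeline usually just looks like this (file names are placeholders):

    pbzip2 -p8 -c data.tsv > data.tsv.bz2     # compress on 8 cores
    pbzip2 -dc data.tsv.bz2 | wc -l           # decompress its own output in parallel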



