* If it requires aggregation and it's small, use CLI tools.
* If it's data you're using over and over again, load it into a database first and do the cleaning there (ELT); there's a minimal sketch of that right after this list.
* If it's 2 TB of data or under, still use bzip2: it's block-compressed, so you can split the compressed stream into chunks and hand them to GNU Parallel.
* If it requires massive aggregations or windows, use Spark/Flink/Beam.
* If you need to repeatedly process the same giant dataset, use Spark/Flink/Beam.
* If the data is highly structured and you mainly need aggregations and filtering on a few columns, use a columnar DB; there's a sketch of that case below as well.
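Here is a minimal sketch of the ELT bullet, using only the standard library. The file name and column names (`events.csv` with `ts`, `user_id`, `amount`) are made up for illustration; the point is load raw rows first, then do the cleaning inside the database with SQL so it's repeatable.

```python
# ELT sketch: load raw data as-is, then transform inside the database.
# events.csv and its columns are hypothetical examples.
import csv
import sqlite3

con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS raw_events (ts TEXT, user_id TEXT, amount TEXT)")

with open("events.csv", newline="") as f:
    rows = ((r["ts"], r["user_id"], r["amount"]) for r in csv.DictReader(f))
    con.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", rows)

# Transform step: cast types and drop junk rows, keeping a clean table for reuse.
con.execute("""
    CREATE TABLE IF NOT EXISTS events AS
    SELECT ts, user_id, CAST(amount AS REAL) AS amount
    FROM raw_events
    WHERE amount != '' AND user_id != ''
""")
con.commit()
con.close()
```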
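And a sketch of the columnar-DB case, assuming DuckDB as the engine and a hypothetical `trips.parquet` file with `pickup_zone` and `fare` columns. The idea is that filtering and aggregating on a couple of columns only touches those columns, so it stays fast even on wide data.

```python
# Columnar-DB sketch: aggregate and filter on a few columns of a Parquet file.
# trips.parquet and its columns are hypothetical examples.
import duckdb

result = duckdb.sql("""
    SELECT pickup_zone, count(*) AS trips, avg(fare) AS avg_fare
    FROM 'trips.parquet'
    WHERE fare > 0
    GROUP BY pickup_zone
    ORDER BY trips DESC
""").df()
print(result.head())
```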
I've been using D (compiled with LDC) a lot because of how fast its compile-time regex is, and because of its built-in JSON support. Python 3 + pandas is also a good choice if you don't want to use awk; a quick sketch of that is below.
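For the pandas-instead-of-awk case, something like this filter-plus-group-by is the typical shape. The file name and columns (`access.log.csv` with `status` and `bytes`) are made-up examples.

```python
# pandas sketch of a filter + group-by you might otherwise write in awk.
import pandas as pd

df = pd.read_csv("access.log.csv")
summary = (
    df[df["status"] >= 500]          # filter rows, like an awk condition on a field
      .groupby("status")["bytes"]
      .agg(["count", "sum"])         # count and total bytes per status code
)
print(summary)
```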