I think much of this issue can be attributed to two very underrated things:

1. Cache line misses.
2. The so-called definition of Big Data. (If the data fits easily into memory, then it's not big, period!)

Many times I have seen simple awk / grep commands outperform Hadoop jobs. I personally feel it's a lot better to spin up a larger instance, run your jobs, and shut it down than to bear the operational overhead of managing a Hadoop cluster.
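
To give a rough idea of the kind of single-machine job meant here, below is a minimal sketch of a one-pass "filter and count" over log lines. It does roughly what `awk '/ERROR/ {c[$1]++} END {for (k in c) print k, c[k]}'` does. The log format, the `ERROR` pattern, and the function name `count_matches` are assumptions made up for illustration, not taken from any real job.

```python
import re
from collections import Counter
from typing import Iterable


def count_matches(lines: Iterable[str], pattern: str, field: int = 0) -> Counter:
    """Count values of one whitespace-separated field over lines matching a regex.

    `lines` can be any iterable of strings, including an open file, so the
    input is streamed one line at a time instead of being loaded whole.
    """
    regex = re.compile(pattern)
    counts: Counter = Counter()
    for line in lines:
        if regex.search(line):
            parts = line.split()
            # Skip lines too short to contain the requested field.
            if field < len(parts):
                counts[parts[field]] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample data standing in for a real log file.
    sample = [
        "host1 ERROR disk full",
        "host2 INFO ok",
        "host1 ERROR timeout",
        "host3 ERROR oom",
    ]
    for key, n in count_matches(sample, "ERROR").most_common():
        print(key, n)
```

For a real file you would pass the open file object, e.g. `with open(path) as f: count_matches(f, "ERROR")`. Only the counts are held in memory, so this kind of job runs fine on one reasonably sized instance.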