I think much of this issue can be attributed to two very underrated things:
1. Cache line misses.
2. The so-called definition of Big Data. (If the data easily fits into memory, then it's not big, period!)
Many times I have seen simple awk / grep commands outperform Hadoop jobs. Personally, I feel it's a lot better to spin up larger instances, run your jobs, and shut them down than to bear the operational overhead of managing a Hadoop cluster.
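To make the awk / grep point concrete, here is a rough sketch of the kind of single-machine job I have in mind: one streaming pass that filters lines by a pattern and counts them per key, roughly what `grep PATTERN file | awk '{c[$1]++} END {for (k in c) print k, c[k]}'` does. The function name, the log format, and the sample lines are all made up for illustration; it is not a benchmark and not a claim about any particular workload.

```python
import re
from collections import Counter
from typing import Iterable


def count_matches_by_key(lines: Iterable[str], pattern: str, key_field: int = 0) -> Counter:
    """Count lines matching `pattern`, grouped by a whitespace-separated field.

    Hypothetical helper: it streams over `lines` once, so memory use grows
    with the number of distinct keys, not with the size of the input.
    """
    regex = re.compile(pattern)
    counts: Counter = Counter()
    for line in lines:
        if regex.search(line):
            fields = line.split()
            # Skip lines that are too short to contain the key field.
            if len(fields) > key_field:
                counts[fields[key_field]] += 1
    return counts


if __name__ == "__main__":
    # Made-up log lines: client address, method, path, status code.
    sample = [
        "10.0.0.1 GET /a 500",
        "10.0.0.2 GET /b 200",
        "10.0.0.1 GET /c 500",
    ]
    for key, n in count_matches_by_key(sample, r"\b500\b").items():
        print(key, n)
```

Since any iterable of lines works, an open file object can be passed in directly, so the data never has to be loaded into memory all at once.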