Any criticisms are welcome. If you have any recommendations on presenting the concepts more clearly, I'm open to those as well, as I'm not sure I did the explanation justice.
For the record, I found the explanation sufficient to make me understand and interest me, but I'm not qualified to assess HashFold w.r.t. MapReduce. Keep writing and keep us (on HN) updated.
When each node is done with its local inputs, the key-value pairs are redistributed so that each key and its values are on a single node. In this case, assume our hash function moved "our_key" to node 2. Node 2 would now have:
node2: our_key : [15, 40, 65]
And we would fold against that and get:
node2: our_key : 120
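The redistribution and fold steps above can be sketched in a few lines of single-process Python. The node count and the addition operator are illustrative assumptions, not the author's actual implementation:

```python
# Sketch: hash-partition key-value pairs across nodes, then fold per key.
# NUM_NODES = 2 and the addition fold are illustrative choices.

NUM_NODES = 2

def partition(pairs, num_nodes=NUM_NODES):
    """Assign every key to exactly one node via hash(key) % num_nodes."""
    nodes = [{} for _ in range(num_nodes)]
    for key, value in pairs:
        nodes[hash(key) % num_nodes].setdefault(key, []).append(value)
    return nodes

def fold_node(table, fold):
    """On one node, fold each key's value list down to a single result."""
    result = {}
    for key, values in table.items():
        acc = values[0]
        for v in values[1:]:
            acc = fold(acc, v)
        result[key] = acc
    return result

pairs = [("our_key", 15), ("our_key", 40), ("our_key", 65)]
nodes = partition(pairs)
folded = [fold_node(table, lambda a, b: a + b) for table in nodes]
# "our_key" lands on exactly one node, which folds [15, 40, 65] to 120
```

Because the hash function sends every copy of a key to the same node, each node can fold its keys independently.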
This is how we achieve massive scalability. The nodes can do all of their processing in parallel without any intercommunication except for the key-redistribution phase.
Two key things to point out. First, HashFold does the folding as the data becomes available: we don't wait for all of a key's values before starting to fold, which reduces memory usage because we only need to store the running result of the fold, not the full list of values. Second, the framework described here is sufficient to perform any computation that MapReduce can, which should make the transition easier, as a lot of major data-processing tasks are currently framed for MapReduce.
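The fold-as-data-arrives behavior can be sketched as follows; this is an illustrative single-node version, not the author's code:

```python
# Sketch: fold each value into the hash table the moment it arrives,
# instead of buffering the full value list per key.

def hashfold(stream, fold):
    """Consume (key, value) pairs, keeping only the running fold per key."""
    table = {}
    for key, value in stream:
        if key in table:
            table[key] = fold(table[key], value)
        else:
            table[key] = value  # first value for a key seeds its entry
    return table

result = hashfold([("our_key", 15), ("our_key", 40), ("our_key", 65)],
                  lambda acc, v: acc + v)
# result == {"our_key": 120}; memory holds one entry per key, never the list
```

At any moment the table holds one folded value per key, which is where the memory saving over buffering all pairs comes from.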
Ah, that would be specific to the job. If you're doing a sum, maybe 0; if you're building a list, an empty list. I suspect a good default would be to use whatever the first value generated by the mapper is, in which case the first value for any key wouldn't be folded, but simply set in the hash table. But that is all job-specific. If you can recommend a better solution, I'm all ears.
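The two seeding options described here can be sketched side by side (illustrative names and signatures, not the prototype's API):

```python
# Option 1: the job supplies an initial value, e.g. 0 for a sum.
def hashfold_with_seed(stream, fold, seed):
    table = {}
    for key, value in stream:
        table[key] = fold(table.get(key, seed), value)
    return table

# Option 2 (the suggested default): the first value seen for a key is
# stored unfolded and acts as the seed for later values.
def hashfold_first_as_seed(stream, fold):
    table = {}
    for key, value in stream:
        table[key] = fold(table[key], value) if key in table else value
    return table
```

For associative folds with an identity seed the two behave identically; the first-value default just spares the job from naming an identity element.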
Furthermore, I should have explained the types better. As described, the HashFold framework requires that the output of the map, the arguments of the fold, and the output of the fold all be the same type. This isn't as limiting as it sounds, and if it does turn out to be limiting for some problem domains, there are solutions for each of those cases.
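The type constraint can be written down with Python type hints; the word-count job below is my own illustrative instance, not from the prototype:

```python
from typing import Callable, Iterable, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")

# As described: the mapper emits (K, V) pairs and the fold has type
# (V, V) -> V, so map output, fold arguments, and fold result share V.
Fold = Callable[[V, V], V]

# A word-count job fits this shape with V = int:
def wc_map(line: str) -> Iterable[Tuple[str, int]]:
    return ((word, 1) for word in line.split())

wc_fold: Fold = lambda a, b: a + b  # (int, int) -> int
```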
It's written in Python. You can run it with './peaktraffic input'. It has 5 different use cases for HashFold: a counter, two different filters, an adjacency list builder, and a cluster finder. The README explains the solution, but does so in terms of MapReduce.
This code wasn't written with the intent of public consumption, so take it as you may.
I'm very excited to see work like this. There's no reason to accept MapReduce as the best type signature for structuring distributed, parallelizable algorithms. Most of the justification is that it "feels right", and "just works".
> So I claimed that HashFold can be more memory efficient than MapReduce. I make this claim because MapReduce needs to store all of the key-value pairs generated by the mapper, whereas HashFold only needs to store one key-value pair at any given time (in addition to the hash-table).
Not so fast. Yes, MapReduce stores all "unapplied" key-value pairs at the reducer. However, HashFold does as well; the big difference is that HashFold starts applying pairs as it sees them. While that's a win for associative functions, it's at best a tie for non-associative ones.
Completely agree. I should have elaborated more on that, thanks for bringing it up.
In many cases you can have a significantly lower memory profile. In the worst case it's the same as MapReduce (as you said). I think the additional flexibility with memory, in addition to the simpler architecture and performance attributes, makes HashFold an attractive alternative.
I've got a small prototype that I used to solve a few problems (most notably: http://www.facebook.com/careers/puzzles.php?puzzle_id=8).