Improving MapReduce with HashFold (stevekrenzel.com)
41 points by sgk284 on June 22, 2009 | hide | past | favorite | 18 comments


Hey guys, I've been playing with this concept for a little while now and thought this might be a good forum for discussing the idea.

I've got a small prototype that I used to solve a few problems (most notably: http://www.facebook.com/careers/puzzles.php?puzzle_id=8).

Any criticisms are welcome. If you have any recommendations on presenting the concepts more clearly, I'm open to those as well, as I'm not sure I did the explanation justice.


> "There is a lot more to this, but I'll stop there."

Go on with your explanation. I'm not grokking it yet.


Does my response to jganetsk help at all?


For the record, I found the explanation sufficient to understand it and pique my interest, but I'm not qualified to assess HashFold w.r.t. MapReduce. Keep writing and keep us (on HN) updated.


Yes it does thank you.


Does the local folding process all inputs, or only those inputs whose keys would end up being local to the node?


Both. First it does a local fold on local inputs, then it does a global fold across all of the local fold results.

e.g., assume we have three nodes and one key, "our_key", whose values are distributed across the three nodes:

node1 : our_key : [ 1, 2, 3, 4, 5]

node2 : our_key : [ 6, 7, 8, 9, 10]

node3 : our_key : [11, 12, 13, 14, 15]

and our fold function (which must be associative) is +. First we fold each node:

node1 : fold(+, [ 1, 2, 3, 4, 5]) = our_key : [15]

node2 : fold(+, [ 6, 7, 8, 9, 10]) = our_key : [40]

node3 : fold(+, [11, 12, 13, 14, 15]) = our_key : [65]

When each node is done with its local inputs, the key-value pairs are redistributed so that each key and its values are on a single node. In this case, assume our hash function moved "our_key" to node 2. Node 2 would now have:

node2 : our_key : [15, 40, 65]

And we would fold against that and get:

node2: our_key : 120

This is how we achieve massive scalability. The nodes can do all of their processing in parallel without any intercommunication except for the key-redistribution phase.

Two key things to point out. First, HashFold folds the data as it becomes available: we don't wait for all of the values before folding, which reduces memory usage because we only need to store the running result of the fold rather than the full list of values. Second, the framework described here is sufficient to perform any computation that MapReduce can, which should make the transition easier, since a lot of major data processing tasks are currently framed in terms of MapReduce.
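The two-phase fold above can be sketched in a few lines of single-process Python. The dict-of-lists representation of nodes and the function name `hashfold` are illustrative stand-ins for the real distributed machinery, not code from the prototype:

```python
from operator import add
from functools import reduce

def hashfold(nodes, fn):
    """Two-phase fold: fold locally on each node, then fold the
    partial results after redistributing each key to one owner."""
    # Phase 1: each node folds its own values independently.
    local = [{k: reduce(fn, vs) for k, vs in node.items()} for node in nodes]

    # Redistribution: gather every key's partial results onto a single
    # owner (here just one dict; in practice a hash function picks the node).
    gathered = {}
    for partial in local:
        for k, v in partial.items():
            gathered.setdefault(k, []).append(v)

    # Phase 2: fold the partial results. This is correct only because
    # fn is associative.
    return {k: reduce(fn, vs) for k, vs in gathered.items()}

nodes = [
    {"our_key": [1, 2, 3, 4, 5]},       # node1
    {"our_key": [6, 7, 8, 9, 10]},      # node2
    {"our_key": [11, 12, 13, 14, 15]},  # node3
]
print(hashfold(nodes, add))  # → {'our_key': 120}
```

The only cross-node communication is the gathering step in the middle, which mirrors the key-redistribution phase described above.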


How do you build the hash table in the first place? "We look up the key in the hash table and get a value (referred to as v2)."

Do you know Haskell a bit? Try to write your examples in real Haskell, a friendly type system helps a lot to clean up your pseudo-code.


Think of the hashtable as a defaultdict or the fold as having an initial value for when the key doesn't have a value yet.


Sure, but what value is it?


Ah, that would be specific to the job. If you're doing a sum, maybe 0; if you're building a list, it'd be an empty list. I suspect a good default would just be the first value generated by the mapper for that key, in which case the first value of any key wouldn't be folded but simply set in the hash table. But that's all job specific. If you can recommend a better solution, I'm all ears.

Furthermore, I should have explained the types better. As described, the HashFold framework would require that the output of the map, the arguments of the fold, and the output of the fold all be the same type. This isn't as limiting as it sounds, and if it turns out to be limiting for some problem domains, there are solutions for each of those cases.
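The "first value is just set, later values are folded" behavior can be sketched as a small incremental fold. This is an illustrative sketch, not code from the prototype; the names are made up:

```python
def incremental_fold(pairs, fn):
    """Fold key-value pairs as they arrive. The first value seen for a
    key is stored as-is; later values are folded into the running result."""
    table = {}
    for key, value in pairs:
        if key in table:
            table[key] = fn(table[key], value)  # fold into running value
        else:
            table[key] = value  # first value for this key: just set it
    return table

# Word count: the mapper emits (word, 1) pairs and the fold is +.
pairs = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
print(incremental_fold(pairs, lambda v1, v2: v1 + v2))  # → {'a': 3, 'b': 1}
```

Note that no per-job initial value is needed with this approach, sidestepping the question above, at the cost of the map output and fold result having to share a type.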


Thanks, that makes sense.


do you have any code to play with?


Here ( http://files.getdropbox.com/u/196807/peaktraffic.tgz ) is a proof of concept of HashFold that I used for my solution for the facebook puzzle ( http://www.facebook.com/careers/puzzles.php?puzzle_id=8 ).

It's written in Python. You can run it with './peaktraffic input'. It has 5 different use cases for HashFold: a counter, two different filters, an adjacency list builder, and a cluster finder. The README explains the solution, but does so in terms of MapReduce.

This code wasn't written with public consumption in mind, so take it for what it's worth.

Any questions, let me know.


I'm very excited to see work like this. There's no reason to accept MapReduce as the best type signature for structuring distributed, parallelizable algorithms. Most of the justification is that it "feels right", and "just works".


Thanks jganetsk, words of encouragement are always appreciated.


> So I claimed that HashFold can be more memory efficient than MapReduce. I make this claim because MapReduce needs to store all of the key-value pairs generated by the mapper, whereas HashFold only needs to store one key-value pair at any given time (in addition to the hash-table).

Not so fast. Yes, MapReduce stores all "unapplied" key-value pairs at the reducer. However, HashFold does as well; the big difference is that HashFold starts applying pairs as it sees them. While that's a win for associative functions, it's at best a tie for non-associative ones.


Completely agree. I should have elaborated more on that, thanks for bringing it up.

In many cases you can have a significantly lower memory profile. In a worst case it's the same as MapReduce (as you said). I think having the additional flexibility with memory, in addition to the simpler architecture and performance attributes, makes HashFold an attractive alternative.
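A toy sketch of the memory tradeoff being discussed (illustrative only; the "median" stand-in is my choice of a non-associative case, not from the article):

```python
stream = [3, 1, 4, 1, 5, 9, 2, 6]

# Associative fold (+): apply each value as it arrives, so the only
# state kept per key is the running result -- the HashFold win.
running = 0
for v in stream:
    running += v

# Non-associative case, e.g. "median of all values": the fold cannot
# combine partial results, so it must accumulate every value before
# answering -- the same memory profile as a MapReduce reducer's input.
acc = []
for v in stream:
    acc.append(v)
median = sorted(acc)[len(acc) // 2]

assert running == 31
assert median == 4
```

In the associative case the per-key state is constant-size; in the non-associative case it grows with the number of values, which is the "at best a tie" point made above.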



