Any criticisms are welcome. If you have any recommendations on presenting the concepts more clearly, I'm open to those as well, as I'm not sure I did the explanation justice.
For the record, I found the explanation sufficient to make me understand and interest me, but I'm not qualified to assess HashFold w.r.t. MapReduce. Keep writing and keep us (on HN) updated.
When each node is done with its local inputs, the key-value pairs are redistributed so that each key and its values are on a single node. In this case, assume our hash function moved "our_key" to node 2. Node 2 would now have:
node2: our_key : [15, 40, 65]
And we would fold against that and get:
node2: our_key : 120
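The redistribution and fold steps above can be sketched in a few lines of single-process Python. The node count and the addition operator are illustrative assumptions, not the author's actual implementation:

```python
# Sketch: hash-partition key-value pairs across nodes, then fold per key.
# NUM_NODES = 2 and the addition fold are illustrative choices.

NUM_NODES = 2

def partition(pairs, num_nodes=NUM_NODES):
    """Assign every key to exactly one node via hash(key) % num_nodes."""
    nodes = [{} for _ in range(num_nodes)]
    for key, value in pairs:
        nodes[hash(key) % num_nodes].setdefault(key, []).append(value)
    return nodes

def fold_node(table, fold):
    """On one node, fold each key's value list down to a single result."""
    result = {}
    for key, values in table.items():
        acc = values[0]
        for v in values[1:]:
            acc = fold(acc, v)
        result[key] = acc
    return result

pairs = [("our_key", 15), ("our_key", 40), ("our_key", 65)]
nodes = partition(pairs)
folded = [fold_node(table, lambda a, b: a + b) for table in nodes]
# "our_key" lands on exactly one node, which folds [15, 40, 65] to 120
```

Because the hash function sends every copy of a key to the same node, each node can fold its keys independently.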
This is how we achieve massive scalability. The nodes can do all of their processing in parallel without any intercommunication except for the key-redistribution phase.
Two key things to point out. First, HashFold does the folding as the data becomes available: we don't wait for all of a key's values before starting to fold, which reduces memory usage because we only need to store the running result of the fold, not the full list of values. Second, the framework described here is sufficient to perform any computation that MapReduce can, which should make the transition easier, as a lot of major data-processing tasks are currently framed for MapReduce.
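The fold-as-data-arrives behavior can be sketched as follows; this is an illustrative single-node version, not the author's code:

```python
# Sketch: fold each value into the hash table the moment it arrives,
# instead of buffering the full value list per key.

def hashfold(stream, fold):
    """Consume (key, value) pairs, keeping only the running fold per key."""
    table = {}
    for key, value in stream:
        if key in table:
            table[key] = fold(table[key], value)
        else:
            table[key] = value  # first value for a key seeds its entry
    return table

result = hashfold([("our_key", 15), ("our_key", 40), ("our_key", 65)],
                  lambda acc, v: acc + v)
# result == {"our_key": 120}; memory holds one entry per key, never the list
```

At any moment the table holds one folded value per key, which is where the memory saving over buffering all pairs comes from.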
Ah, that would be specific to the job. If you're doing a sum, maybe 0; if you're building a list, an empty list. I suspect a good default would be to use whatever the first value generated by the mapper is, in which case the first value for any key wouldn't be folded, but simply set in the hash table. But that is all job-specific. If you can recommend a better solution, I'm all ears.
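The two seeding options described here can be sketched side by side (illustrative names and signatures, not the prototype's API):

```python
# Option 1: the job supplies an initial value, e.g. 0 for a sum.
def hashfold_with_seed(stream, fold, seed):
    table = {}
    for key, value in stream:
        table[key] = fold(table.get(key, seed), value)
    return table

# Option 2 (the suggested default): the first value seen for a key is
# stored unfolded and acts as the seed for later values.
def hashfold_first_as_seed(stream, fold):
    table = {}
    for key, value in stream:
        table[key] = fold(table[key], value) if key in table else value
    return table
```

For associative folds with an identity seed the two behave identically; the first-value default just spares the job from naming an identity element.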
Furthermore, I should have explained the types better. As described, the HashFold framework requires that the output of the map, the arguments of the fold, and the output of the fold all be the same type. This isn't as limiting as it sounds, and if it does turn out to be limiting for some problem domains, there are solutions for each of those cases.
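The type constraint can be written down with Python type hints; the word-count job below is my own illustrative instance, not from the prototype:

```python
from typing import Callable, Iterable, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")

# As described: the mapper emits (K, V) pairs and the fold has type
# (V, V) -> V, so map output, fold arguments, and fold result share V.
Fold = Callable[[V, V], V]

# A word-count job fits this shape with V = int:
def wc_map(line: str) -> Iterable[Tuple[str, int]]:
    return ((word, 1) for word in line.split())

wc_fold: Fold = lambda a, b: a + b  # (int, int) -> int
```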
It's written in Python. You can run it with './peaktraffic input'. It has 5 different use cases for HashFold: a counter, two different filters, an adjacency list builder, and a cluster finder. The README explains the solution, but does so in terms of MapReduce.
This code wasn't written with the intent of public consumption, so take it as you may.
I'm very excited to see work like this. There's no reason to accept MapReduce as the best type signature for structuring distributed, parallelizable algorithms. Most of the justification is that it "feels right", and "just works".
> So I claimed that HashFold can be more memory efficient than MapReduce. I make this claim because MapReduce needs to store all of the key-value pairs generated by the mapper, whereas HashFold only needs to store one key-value pair at any given time (in addition to the hash-table).
Not so fast. Yes, MapReduce stores all "unapplied" key-value pairs at the reducer. However, HashFold does as well; the big difference is that HashFold starts applying pairs as it sees them. While that's a win for associative functions, it's at best a tie for non-associative ones.
Completely agree. I should have elaborated more on that, thanks for bringing it up.
In many cases you can have a significantly lower memory profile. In the worst case it's the same as MapReduce (as you said). I think the additional flexibility with memory, in addition to the simpler architecture and performance attributes, makes HashFold an attractive alternative.
I've got a small prototype that I used to solve a few problems (most notably: http://www.facebook.com/careers/puzzles.php?puzzle_id=8).