Massively Parallel Graph processing on GPUs (mapgraph.io)
87 points by mpweiher on July 29, 2014 | hide | past | favorite | 16 comments


I was looking at this today & yesterday. Encouragingly, these results look right for GPUs because they're on par with Duane Merrill's GPU BFS work.

Less clear is whether MapGraph beats existing single node HPC systems. In particular, instead of comparing to GraphLab, I'd love to see comparisons to GraphChi or TurboGraph!

The important point here is that huge graphs are now tractable on a single node, and once these techniques get combined with distributed ones, most real-world graphs can be processed in real time!

(If you like building this kind of bleeding edge stuff in performance, ML, and/or infoviz, we're hiring at graphistry.com :)


I've only glanced over your paper quickly, so I'm not sure whether I missed something: did you analyze why your 24-thread CPU version did so much worse in a few cases than the single-threaded version? Did you use OpenMP without a defined thread affinity on a multi-socket system?


It's using GraphLab 2.0, not 2.3 (no idea why), and they also forced synchronous mode on GraphLab, hence the less than stellar scaling.

On at least the problems I use GraphLab for, synchronous mode is not only a lot slower, it converges WAY more slowly as well.

It's important when doing comparisons to not rig the game, and it's not clear to me that the authors took pains to keep things fair.


We are big fans of GraphLab. We used the version that was available when we collected the performance data on GraphLab - last summer. The asynchronous engine was seg-faulting which is why we used the synchronous engine. Lots of details about this in the SIGMOD paper:

http://mapgraph.io/papers/MapGraph-SIGMOD-2014.pdf

And of course MapGraph is open-source, so anyone can make their own comparisons. :-)


Not being that au fait with big data stuff, can someone tell me what graph processing is for?


One current gold-rush topic is ASIC-resistant bitcoin mining.

Bitcoin works by generating a particular type of SHA256 hash -- you take some fixed inputs (transactions), a random input you can change (a nonce), and together they need to SHA256-hash into a special hash, eg one with 5 leading zeros.

This is known as a "proof of work" system -- it has the interesting property that it is quick to verify but difficult to produce (you can't predict a SHA256 hash). Mining bitcoin is basically brute-forcing tons of random inputs until you find an input that produces a special SHA256 hash, which you then advertise to the network and the network rewards you with libertarian disney dollars.
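The quick-to-verify / hard-to-produce asymmetry above can be sketched in a few lines of Python. This is a toy illustration, not real Bitcoin consensus code (real mining compares the digest against a numeric target, not a zero prefix):

```python
import hashlib
import os

def mine(transactions: bytes, difficulty: int) -> tuple[bytes, str]:
    """Brute-force random nonces until the SHA-256 digest starts with
    `difficulty` leading zero hex characters -- the expensive side."""
    target = "0" * difficulty
    while True:
        nonce = os.urandom(8)
        digest = hashlib.sha256(transactions + nonce).hexdigest()
        if digest.startswith(target):
            return nonce, digest

def verify(transactions: bytes, nonce: bytes, difficulty: int) -> bool:
    # Verification is a single hash -- cheap compared to mining.
    digest = hashlib.sha256(transactions + nonce).hexdigest()
    return digest.startswith("0" * difficulty)

nonce, digest = mine(b"alice pays bob 1 BTC", 4)
assert verify(b"alice pays bob 1 BTC", nonce, 4)
```

Each extra zero of difficulty multiplies the expected mining work by 16, while verification stays a single hash.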

Now, you can make a custom chip (ASIC) that is optimized to churn out as many SHA256 hashes as possible -- far more and far faster than a CPU can. In fact, ASIC mining is the only cost-effective way to mine bitcoin these days. Thus, bitcoin mining is quickly becoming about who can buy the most chips and power them with the cheapest electricity, encouraging massive centralized data farms -- a Bad Thing if you believe in the Decentralize All The Things! promise of bitcoin.

A better proof-of-work system is one that shares the quick-to-verify-difficult-to-solve property of specially-formed SHA256 hashes, but one that can't be specialized for with an ASIC. Then more people could participate on commodity hardware and you would (theoretically) have less centralization (or at least encourage better commodity hardware instead of better specialized SHA256 chips).

Anyways, back to graph processing... one class of problem that is difficult to create specialized ASICs for, yet is still quick-to-verify-but-difficult-to-solve, is finding cycles of a specific length (say 42) in a large graph. Solving such a problem is memory-hard, so an ASIC can only get you so far (although people have proposed ASIC farm architectures that try, eg with shared central memory accessed by many ASIC cores, which is starting to sound a bit like a GPU...)
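To make the asymmetry concrete: a rough sketch of why cycle-based PoW verification is cheap. The edge construction below is hypothetical (loosely modeled on Cuckoo Cycle's hash-derived bipartite edges, but using SHA-256 rather than the real siphash scheme), and a full verifier would also check that the edges form a single cycle of the right length, not just a union of cycles:

```python
import hashlib
from collections import Counter

def edge(nonce: int, header: bytes, num_nodes: int) -> tuple[int, int]:
    """Derive a pseudo-random bipartite edge from a nonce (hypothetical
    construction for illustration)."""
    h = hashlib.sha256(header + nonce.to_bytes(8, "big")).digest()
    u = int.from_bytes(h[:8], "big") % num_nodes
    v = int.from_bytes(h[8:16], "big") % num_nodes + num_nodes  # other side
    return u, v

def verify_cycle(header: bytes, nonces: list[int], num_nodes: int) -> bool:
    """Checking a claimed solution is cheap: hash each nonce and confirm
    every vertex on the claimed cycle has degree exactly 2. Finding the
    cycle in the first place means sifting a huge edge set in memory."""
    edges = [edge(n, header, num_nodes) for n in nonces]
    degree = Counter(v for e in edges for v in e)
    return all(d == 2 for d in degree.values())
```

The solver has to materialize and prune billions of such edges; the verifier only hashes the 42 nonces it is handed.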


We're interested in day-to-day business needs. Databases and compute engines here handle real-time transaction processing or data analysis for graph data. E.g., logistics, recommendation systems, social graphs (users, employees, companies), payments (fraud or segmentation), IT systems/traffic (performance & security), knowledge management (manufacturing, legal), etc.


...wait, so you mean you use this stuff for like real work instead of intellectual exercises in creative incentives for burning electricity? ;)



"Libertarian Disney Dollars" just about sums it up.


Well for OLTP style processing there are hundreds of use cases. The Neo4j guys have a few on their site to give you some ideas: http://www.neotechnology.com/neo4j-use-cases/

For OLAP style processing there are also many uses. The most common is to look for clusters in large and complex sets of highly related data. (People, for example.) Cluster identification can be used in examining crime networks, disease propagation, finding likely "influencers" for marketing and political purposes, finding many-dimensional patterns of things people tend to buy, and so on. Apache Spark/GraphX is one such tool set.
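In its simplest form, the cluster identification described above is just connected-components labeling. A minimal single-machine sketch (real OLAP systems like GraphX do this distributed, over billions of edges):

```python
from collections import deque

def connected_components(adj: dict[str, list[str]]) -> list[set[str]]:
    """Label clusters in an undirected graph with repeated BFS."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            comp.add(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        components.append(comp)
    return components

graph = {
    "alice": ["bob"], "bob": ["alice"],    # one cluster
    "carol": ["dave"], "dave": ["carol"],  # another
    "erin": [],                            # isolated
}
print(connected_components(graph))  # three components
```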

Twitter uses offline graph processing to suggest who you should follow. Many companies use it (real time or offline) to suggest other items you might want to buy.

And more.


The really cool use case in my mind is combining OLTP+OLAP in real time. Hence the interest in GPU acceleration. We have an OLTP platform that integrates a CPU version of the GAS engine used in MapGraph:

http://bigdata.com/bigdata

http://bigdata.com/mapgraph

Eventually we'd like to see the two platforms converge.
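For readers unfamiliar with the GAS (Gather-Apply-Scatter) engine mentioned above: the idea is to express a graph algorithm as three per-vertex phases. A sketch of PageRank phrased this way -- this is an illustration of the pattern, not MapGraph's actual API:

```python
def pagerank_gas(out_edges: dict, num_iters: int = 20, d: float = 0.85) -> dict:
    """PageRank in Gather-Apply-Scatter style:
      gather  -- sum rank contributions arriving on in-edges
      apply   -- recompute each vertex's rank from the gathered sum
      scatter -- push each vertex's share out along its out-edges
    """
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    for _ in range(num_iters):
        # Scatter + gather: each vertex pushes rank/out-degree to neighbors,
        # which accumulate the contributions.
        acc = {v: 0.0 for v in out_edges}
        for v, nbrs in out_edges.items():
            share = rank[v] / len(nbrs) if nbrs else 0.0
            for u in nbrs:
                acc[u] += share
        # Apply: update every rank from its gathered sum.
        rank = {v: (1 - d) / n + d * acc[v] for v in out_edges}
    return rank
```

Frameworks like MapGraph and GraphLab take these three small per-vertex functions from the user and handle the parallel scheduling, which is what makes the model a good fit for GPUs.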


Chiming in here. This is Bryan Thompson, the PI for the MapGraph project. We have new results for multiple GPUs with up to 29 billion traversed edges per second on 64 K20 GPUs, on a graph of 4.3 billion directed edges. The multi-GPU code base is not yet the full MapGraph / GAS API, but we will be fixing that soon. We will be extending the capabilities of the platform in a number of directions, including allowing multiple partitions per GPU so you can operate on graphs larger than the GPU RAM without having to scale out, and so you can operate on huge graphs on GPU clusters.

GraphLab (a purely memory based system) is much faster than GraphChi (a disk-based system using the parallel sliding window algorithm). Both GraphLab and GraphChi use "vertex cuts", which amount to 2D partitioning, which is what we use in the multi-GPU version of MapGraph. This is also what is used by the world record holder for BFS traversal, which is the Blue Gene/Q.

We used the synchronous engine and that version of GraphLab because that is what existed and worked (without segfaults) when we collected the data for the SIGMOD paper in the summer of 2013. We did cross-check the performance of GraphLab with the GraphLab developers.

One of the main reasons that CPU parallel graph algorithms slow down is because they hit the memory bandwidth of the bus. GPUs have 10x or more memory bandwidth. The K20 has 204 GB/s. The K40 has 288 GB/s. The Pascal GPU (2016) will have over 1 TB/s of memory bandwidth and 24 GB of RAM.
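A quick back-of-envelope on why bandwidth dominates. The bytes-per-edge figure and the CPU bandwidth number below are assumptions for illustration (a BFS frontier expansion that reads a 4-byte neighbor id and writes a 4-byte status word per edge); only the GPU bandwidth figures come from the comment above:

```python
# Rough peak edge-throughput if a traversal touches ~8 bytes per edge.
bytes_per_edge = 8  # assumed: 4-byte neighbor read + 4-byte status write

systems = [
    ("2-socket CPU", 80),     # assumed ballpark figure
    ("K20", 204),
    ("K40", 288),
    ("Pascal (2016)", 1000),
]
for name, gb_per_s in systems:
    edges_per_s = gb_per_s * 1e9 / bytes_per_edge
    print(f"{name:>14}: ~{edges_per_s / 1e9:.1f} billion edges/s peak")
```

Real traversals fall well short of these peaks (random access, atomics), but the ratio between the rows is the point: the memory wall moves an order of magnitude further out on the GPU.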

I've listed the MapGraph Publications below. See http://mapgraph.io for more details.

- MapGraph: A High Level API for Fast Development of High Performance Graph Analytics on GPUs - Zhisong Fu, Bryan Thompson, Michael Personick (SYSTAP, LLC). GRADES 2014, June 22, 2014, Snowbird, Utah, USA. Copyright 2014 ACM 978-1-4503-2982-8. DOI:http://dx.doi.org/10.1145/2621934.2621936

- Parallel Breadth First Search on GPU Clusters - Z. Fu, H.K. Dasari, M. Berzins, B. Thompson. SCI Technical Report, SCI Institute, University of Utah. July 29, 2014. URL:http://www.sci.utah.edu/publications/Fu2014a/UUSCI-2014-002....


"MapGraph is up to two orders of magnitude faster than parallel CPU implementations on up 24 CPU cores and has performance comparable to a state-of-the-art manually optimized GPU implementation."

That doesn't bode well for my $1000 GPU Speed Parity Bounty at https://github.com/tromp/cuckoo based on my belief that the Cuckoo Cycle graph-theoretic proof-of-work algorithm is more suited to CPUs.


I don't know. This is admittedly not something I know much about, but from poking around a bit, it looks like the Gather-Apply-Scatter model - or any other vertex-centric model - is not well suited to finding cycles of specific length. Compare eg GraphChi's triangle-counting code [1], which specifically notes the difficulties arising from having to maintain and work with large adjacency lists. I don't see a way around that for vertex-centric models.

That said, I haven't actually read your miner, so I don't know what it's doing or if something similar can be done for GAS-based frameworks. If someone who knows more wants to chime in, I'd be delighted to read more about this.

[1] https://github.com/GraphChi/graphchi-cpp/blob/master/example...


Most time in the miner is spent identifying which edges have an endpoint of degree 1, and thus cannot be part of any cycle. This degree counting sounds like it should be right up the Gather-Apply-Scatter alley...
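The trimming step described above can be sketched directly. This is a simple in-memory sketch, not the actual miner (which works on hash-generated edge streams), but it shows why the operation maps naturally onto per-vertex degree counting:

```python
from collections import Counter

def trim_degree_one(edges: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Iteratively drop edges with a degree-1 endpoint: such edges can
    never lie on a cycle. Repeat until a fixed point is reached, since
    removing an edge can expose new degree-1 vertices."""
    while True:
        degree = Counter(v for e in edges for v in e)
        kept = [(u, v) for u, v in edges if degree[u] > 1 and degree[v] > 1]
        if len(kept) == len(edges):
            return kept
        edges = kept
```

In GAS terms, each round is roughly a gather (count incident edges per vertex) followed by a scatter (mark edges whose endpoint count fell to 1), which is why it sounds like a good fit.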




