The Graph with a Trillion Edges

After looking at this thread on Hacker News about this paper from a bunch of Facebook researchers, and this excellent article by Frank McSherry, I tried to understand McSherry's approach (since I found the paper kind of impenetrable), and this is my take on it.

The Basic Problem

Let’s suppose you’re Facebook and you have about 1B members who have an arbitrary number of links to each other. How do you scale and parallelize queries (especially iterated queries) on the data?

McSherry basically implements a single-threaded algorithm on his laptop using brute force. (He uses a publicly available database of 128B edges.) As I understand it, his approach is pretty simple:

  • Encode each connection as a 64-bit number (think: a connection from A (32 bits) to B (32 bits)).
  • Store the numbers in RAM as a sequence of variable-length-integer-encoded differences. E.g. a set of numbers beginning with 4, 100, 2003, 2005, … would be encoded as 4, 96, 1903, 2, … Since the average distance between 128B values scattered among 1T is 8, the expected distance between connections (64-bit values) will be around 9, which we can encode in ~4 bits of data (using variable-length encoding) instead of 64 bits, so the storage needed is proportional to the number of values stored [1]. (There's a code sketch of this encoding just after the list.)
  • Look for a connection from A to anyone by running through the list and looking for values starting with A’s 32 bits. (You can make this search arbitrarily efficient — trading memory requirements for performance — by partitioning the list into a lookup table.)
  • Store the numbers on disk as deltas between the numbers encoded as a Hilbert Curve [2], using variable-length integer encoding for efficiency. (The naive encoding averages ~5 bits between values; the optimized encoding gets that down to about 2.)
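
Here's a minimal sketch of that packing and delta encoding, in Python. The function names are mine (not McSherry's), and the varint format is a generic LEB128-style one, just to show the idea:

    # Pack each edge as (src << 32) | dst, sort, and store only the gaps between
    # consecutive values, 7 bits per byte (LEB128-style varints).

    def pack_edge(src: int, dst: int) -> int:
        return (src << 32) | dst

    def encode_varint(n: int) -> bytes:
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)   # high bit set: more bytes follow
            else:
                out.append(byte)
                return bytes(out)

    def encode_edges(edges):
        """Encode an iterable of (src, dst) pairs as delta-coded varints."""
        values = sorted(pack_edge(s, d) for s, d in edges)
        prev = 0
        out = bytearray()
        for v in values:
            out += encode_varint(v - prev)   # store only the gap
            prev = v
        return bytes(out)

    # Four edges take 12 bytes here instead of 4 * 8 = 32.
    print(len(encode_edges([(1, 4), (1, 100), (2, 3), (2, 5)])))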

[1] Each entry will be log(n) bits in size, there'll be c of them, and the average difference (to be encoded) will be log(c/n) bits. If we assume c scales as n^2 (highly dubious: it means that if the world's population increased by a factor of 2, I'd have twice as many friends) then we get n^2 * log(n) for storage. If we assume c scales as n log(n) (probably still generous) then we get n log(n) * log(log(n)). At any given scale that function looks pretty linear, although it is actually faster than linear (the gradient hangs around 5 for n between 1B and 100B), so I don't think it's a cause for concern.
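
To put numbers on that, under the same assumptions: with n = 1B, log(n) ≈ 30, so c = n log(n) ≈ 30B edges, each difference costing about log(log(n)) ≈ 5 bits, or roughly 18 GB in total; at n = 100B the per-edge cost still only creeps up to about 5.2 bits.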

[2] A simple way of looking at Hilbert Curve encoding is that it treats a binary number as a series of 2-bit numbers, with each pair of bits selecting a sector of a 2×2 square (in a somewhat non-obvious way). So all numbers starting with a given pair of bits are found in the same sector. Then, look at the next two bits and subdivide that sector in a similarly non-obvious way until you run out of bits. (In an earlier post, McSherry explains Hilbert Curve implementation a graphically.) Incidentally, simply interleaving the bits of the two numbers has much the same effect.
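
That interleaving variant is easy to sketch. What follows is a Z-order (Morton) code rather than a true Hilbert curve, but it shares the property that matters here: values that agree in their high-order bits end up near each other.

    # Interleave the bits of two coordinates (a Z-order / Morton code, not a
    # true Hilbert curve). Nearby codes share high-order bits, which keeps the
    # deltas between sorted values small.

    def interleave(a: int, b: int, bits: int = 32) -> int:
        """Interleave the bits of a and b into one 2*bits-wide value."""
        out = 0
        for i in range(bits):
            out |= ((a >> i) & 1) << (2 * i + 1)   # bits of a go to odd positions
            out |= ((b >> i) & 1) << (2 * i)       # bits of b go to even positions
        return out

    print(bin(interleave(0b1100, 0b1010, bits=4)))  # 0b11100100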

That’s it in a nutshell. Obviously there are plenty of simple ways to optimize the lookups.

He points out that the database he used comes in 700 files. Each of these could be processed by a different computer (stored slightly less efficiently, because the average delta between values goes up slightly), and any request could easily be split among them.

The interesting thing is that, running in one thread on a laptop, this approach runs faster and scales better than the 128-core systems he benchmarks against.

Optimizing

So let’s suppose I wanted to emulate Facebook and have access to 256 virtual machines. (I could easily afford to do this, as a startup, using AWS or similar.) How would I do this in practice? Remember that Facebook also has to deal with users creating new connections all the time.

First of all (kind of obviously) every connection gets stored twice (i.e. connections are two-way). We need this for worst-case scenarios.

Let's suppose I number each of my virtual machines with an 8-bit value. When I store a connection between A and B, I encode the connection twice (as AB and BA), take the last 8 bits of each value, and record the connection on the machine with the corresponding number. Each machine stores the value in its own representation of its link database.

How would this work in practice?

Well, each value being stored is effectively 56 bits (8 bits are pulled off the 64-bit value to pick the machine and are thus the same for every value on that machine). Let's divide those 56 bits into 24 bits and 32 bits and store (in some efficient manner) 2^24 lists of 32-bit numbers, each list again stored in some efficient manner (we can fall back on McSherry's approach at this point; effectively we're applying McSherry's approach progressively to different portions of the numbers anyway).

So any lookup entails grabbing 24 bits of the remaining 56 bits, leaving us with 1/2^32 of the original data to search. We could keep splitting the remaining data the same way among the cores, but I'm sure you get the idea by now.
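
Here's one way to make that split concrete. The names are made up, and I'm cheating slightly by taking the 8 machine-picking bits from A's half of the value rather than the literal last 8 bits, so that all of A's outgoing edges land on one machine (the "or vice versa" case mentioned below); the bookkeeping is the same either way.

    # 8 bits pick the machine, 24 more bits pick a bucket on that machine, and
    # the remaining 32 bits go into that bucket's (sorted) list. Taking the
    # machine bits from the source ID means all edges *from* A colocate.

    def shard_edge(src: int, dst: int):
        """Split the directed edge src -> dst into (machine, bucket, value)."""
        machine = src & 0xFF        # 8 bits: which of the 256 machines
        bucket = src >> 8           # remaining 24 bits of src: which bucket
        value = dst                 # 32 bits stored in that bucket's list
        return machine, bucket, value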

So in a nutshell:

To record the existence of the edge AB:

  1. Encode it as two 64-bit values (AB and BA).
  2. Reduce those to two 56-bit values and two 8-bit values.
  3. Ask the two (or occasionally one, if the 8-bit values coincide) machines designated by the 8-bit values.
  4. Each machine reduces its 56-bit value to a 32-bit value with a 24-bit lookup.
  5. Then it inserts the remaining 32-bit value into a list of ~E/2^32 values (where E is the total number of edges, so ~232 values for a 1T-edge DB).

Inserting a value into an ordered collection of n values is a solved O(log(n)) problem. Note that all connections Ax will be stored on one machine, but connections xA will be scattered (or vice versa), because we're storing two-way connections. (This is important for the worst case; see below.)
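
Continuing the shard_edge sketch above, the insert path looks something like this. (bisect finds the right slot in a sorted list in O(log n); shifting a Python list to make room isn't free, so a real system would want a structure whose inserts are cheap too.)

    import bisect

    def record_edge(machines, a: int, b: int):
        """Store the connection twice (as AB and BA), per the scheme above."""
        for src, dst in ((a, b), (b, a)):
            machine, bucket, value = shard_edge(src, dst)
            bucket_list = machines[machine].setdefault(bucket, [])
            bisect.insort(bucket_list, value)   # keep each bucket's list sorted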

To find all edges from A:

  1. Grab the same 8 bits that were used to pick a machine when storing A's connections.
  2. Ask the designated machine for the list of ~E/2^32 connections from A.
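
And the lookup side, reusing the same made-up helpers:

    def edges_from(machines, a: int):
        """All of A's outgoing edges live in one bucket on one machine here."""
        machine, bucket, _ = shard_edge(a, 0)   # the destination doesn't affect routing
        return machines[machine].get(bucket, [])

    # Toy usage: 256 machines, one connection stored both ways.
    machines = [dict() for _ in range(256)]
    record_edge(machines, 17, 42)
    print(edges_from(machines, 17))   # [42]
    print(edges_from(machines, 42))   # [17]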

So to sum up: storing a connection AB involves bothering two machines. Determining A's connections involves bothering one machine in most cases, and (see below) all the machines in rare cases. However, the rare cases should never actually exist.

Worst Case Response

One problem here is that the worst case response could be huge. If A has 10M outgoing edges then one machine is going to have to cough up one giant database. (If one node has more connections than can easily be stored on one machine then we can deal with that by getting our machine names from the entropic bits of both user ids, but let’s ignore that for now.)

In sufficiently bad cases we reverse the lookup. And reverse lookups will never be bad! If we hit the machine containing all the connections of the form Ax and there are too many to deal with, the machine tells us to reverse the lookup, and we ask all our machines for connections xA, which we can reasonably expect to be evenly distributed (if 256 bots sign up and follow A simultaneously, there will be 256 new entries of the form Ax on one machine, but only one connection of the form xA on each machine).

So, the worst-case performance of this design comes down to the cost of transmitting the results (which we can evenly distribute across our machines); you can't really do better than that, and the key step is treating each machine as having responsibility for one square in a grid picked using Hilbert Curves.

Note that the worst case isn’t terribly bad if we just want to count connections, so interrogating the database to find out how many outgoing links A has is practical.

Anyway, that never happens…

One can assume in the case of most social networks that while there will often be worst cases of the form A follows B (i.e. B has 10M followers) there will never be cases where A follows 10M users (indeed anyone like that could be banned for abuse long, long before that point is reached).

It follows that when someone posts something (on Twitter, say) it only gets pulled by followers (it doesn't get pushed to them). A posts something: solved problem, just update one list. B asks for updates: we need to check the list of accounts B follows, another solved problem. So we don't even need to store two-way connections or worry about reverse lookups.

Incidentally…

This is apparently very similar to PageRank (Google’s algorithm for evaluating websites), which I’ve never bothered to try to understand.

Consider that each edge represents "A links to B": we rank a page by how many incoming links it has, and we throw out pages with too many outgoing links (i.e. link farms, sitemaps). Then we iterate, weighting each incoming link by the previously calculated value of the page it comes from. Google (et al.) add to that by looking at the text in and around the link (e.g. "publicly available database of 128B edges", above, links to a page about exactly that, so Google might infer that the linked page is "about" things in those words; in the case of "click here" type links, it can look at the words around the link).
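
For what it's worth, the basic iteration is easy to sketch. This is the textbook PageRank power iteration (with the usual 0.85 damping factor), not whatever Google actually runs:

    def pagerank(edges, damping=0.85, iterations=20):
        """edges is a list of (src, dst) pairs meaning 'src links to dst'."""
        nodes = {n for edge in edges for n in edge}
        out_links = {n: [] for n in nodes}
        for src, dst in edges:
            out_links[src].append(dst)
        rank = {n: 1.0 / len(nodes) for n in nodes}
        for _ in range(iterations):
            new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
            for src, targets in out_links.items():
                for dst in targets:
                    # each page shares its current rank among the pages it links to
                    new_rank[dst] += damping * rank[src] / len(targets)
            rank = new_rank
        return rank

    # The page everyone links to ends up with the highest score.
    print(pagerank([("a", "c"), ("b", "c"), ("c", "a")]))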

So if from incoming links we infer that a page is about "scalable algorithms", and from the quality of the pages linking to it we infer the quality of the page itself, and we maintain a database of connections between topics and pages, then when someone searches for "scalable algorithms" we take the ids of "scalable" and "algorithms", find pages about both, sort them by "quality", create a web page, and festoon it with sponsored links and ads. By golly, we've got a search engine!

MacHeist Nanobundle 2

Well, Macheist has come and gone again, and now I have a couple of gigabytes of new software (mostly Monkey Island) on my notebook’s hard drive. The usual rule with Macheist (and similar deals) is that you only buy it if there’s a product you’d cheerfully pay the fee for in the bundle, and on that basis this bundle was a great deal for me: I’m a sucker for Monkey Island (even though I never really cared for the threequel). I’m also glad to see Telltale Games shipping Mac products (I hope they port the Sam & Max titles: I will cheerfully pay retail for any Sam & Max title until I become jaded, but I’ll probably buy the Wii version otherwise…).

I did end up installing all of the other programs, although some got uninstalled pretty darn quickly.

MacJournal is a really huge program for keeping a journal. I have a cloud-based solution for doing this called WordPress and — unlike MacJournal — it is accessible from anywhere (including my iPhone), it’s free (and open source), it lets me make some journal entries public while keeping others private, has a comment system, does version control, automatically backs up to the cloud, and doesn’t take up a metric buttload of hard disk space. (Uninstalled)

As a side note, MacJournal is a fine example of an attractive, functional, easy-to-use useless piece of software the like of which does not exist for Windows. If you found a niche product like this for Windows it would be a horrible piece of crap. MacJournal is quite lovely — it’s just not useful to me. All of the pieces of software in the Macheist bundle that I’ve installed and used have been very polished, stable products. It’s a testament to the quality of Apple’s indie software ecology, and I think it must be quite terrifying for Microsoft which cannot itself produce such polished products let alone attract third parties to do so.

Ripit is a program that does one thing (rip DVDs to hard disk) and does it very, very well. I have not quite reached the point of ripping my entire DVD library but when I do, I’ll be glad I got a license for this. (Installed but not used, yet.)

Clips is an intriguing little hack that monitors your clipboard and automatically keeps the last N clipboards around for use at the touch of a key. I think this is a great idea and pretty well implemented, but it just never occurs to me to use it. I'm running it, though, and maybe, one day, I'll actually use it. It's a lot like multiple undo, I think: one day you'll realize you (a) use it all the time and (b) get enormously annoyed by a program that doesn't have it. (Installed, running, but not used yet.)

CoverScout is an intriguing iTunes add-on. I haven't installed it yet, but I have high hopes that it will actually help sort out my iTunes cover art situation (my wife and I ripped our entire CD collection two house moves ago, and many of the tracks have very odd cover art, having been incorrectly identified by iTunes at some point). As I understand it, CoverScout's sole purpose in life is to fix this kind of thing, so I'm hoping it's good at it. (Not yet installed.)

Flow I’ve already discussed. I think I may be in love. At minimum, Flow makes Little Snapper irrelevant by doing what Little Snapper does for screenshots for — basically — everything. I still use Transmit without thinking, though. (Installed, used, kept.)

Rapidweaver is a program I’ve considered and rejected in the past. It’s a very similar program to Sandvox (which I also own and don’t use), perhaps a little better put together and with generally more attractive (and, as far as I can tell, flexible) themes. Unlike Sandvox, it seems to have built up a fairly solid third-party plugin ecology and might actually be a useful product for someone looking for a template-based web development tool. More attractive and flexible than Sandvox, produces much lighter weight pages than iWeb (although also much less flexible graphically). Rapidweaver has also been sitting at version 4.3.1 for a rather long time (it used to be one of those programs that would get revved every few weeks) — perhaps the developers are losing interest. (Installed, messed around with, probably will be uninstalled.)

Tweetie is one of those non-solutions to non-problems. Indeed, since it’s a desktop Twitter client it’s something of a meta-non-solution to a meta-non-problem. I installed it and played with it for a few minutes — the fact that it was not especially obvious how to make a new tweet was a very discouraging sign (I did figure it out…). But at least it’s small. (Installed, used, kept… for some reason. Oh, that’s right, it’s 2MB.)

The Macheist folks also snuck in three bonus programs for promoting them via Twitter (further alienating me from Twitter). One of the programs — Tracks (installed, used, kept) — is a very well thought out iTunes remote (in particular it offers Spotlight-like access to your iTunes library from a menubar widget) but the other two — Airburst Extreme (Uninstalled) and Burning Monkey Solitaire (Not downloaded) — are wastes of hard disk space as far as I’m concerned.

Twitter vs. Email

Since this is my day to post X vs. Y entries, Twitter has hit 50M tweets per day. According to About.com, as of 2008 there were 183 billion email messages being sent each day by 1.3 billion people, of which around 70% were spam or viruses (which I found surprisingly low; but then I suspect Twitter's signal/noise ratio is even worse, and there are no good filters). Twitter has been growing at a rate of 400% every six months, so at its current rate of expansion it should hit 183 billion tweets per day in three years, give or take. (Email growth is considerably slower; it averaged 14.6% from 2001 to 2007.)
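
(Sanity check on that projection: reading "400% every six months" as quadrupling, 50M x 4^6 = 50M x 4,096 ≈ 205B tweets per day after six half-year periods, which is indeed in the neighbourhood of 183B.)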

Of course, in three years we’ll all be using Google Wave on our Windows Tablet 7 Series Tablets.