Matt Gemmell

Hashing for privacy in social apps


Three days ago, Arun Thampi blogged about his discovery that the Path social networking app uploads the user’s entire iPhone address book to its servers. There’s been extensive industry media coverage of this, and a new version of Path has been released which now asks for permission before uploading your contacts’ information. The CEO of Path, Dave Morin (@DaveMorin), also apologised for the fiasco in a blog post.

Once the story broke, I noted that Morin had posted a comment on Thampi’s piece, and I replied to that comment, asking (amongst other things) why they hadn’t used hashing instead of uploading the raw contact information. Morin’s reply to me was straightforward and honest on that point, if somewhat surprising:

This is a good alternative solution which we'll look into. Thanks for the idea.

Morin is not a “youngster” - an article from 28th December 2011 says that he’s 30 years old - so we’re not talking about a brash teenager. Nor are we talking about a management figure with no technical experience; the same article describes his previous role as a key platform engineer at Facebook.

I’m 32 (and a half) years old, at time of writing in early February 2012, so Morin and I work in the same field and are of approximately the same age. The difference is, I not only immediately thought of hashing as an appropriate measure, but was shocked that Path hadn’t implemented their app and servers that way. Nonetheless, this isn’t an isolated example.

From talking to many developers about this privacy intrusion during the past week, it quickly became disturbingly clear to me that many aren’t familiar with hashing at all. This is also predictably (and entirely forgivably) true for the many journalists who have covered the story, unintentionally distorting the issue due to lack of education in the field.

This article, therefore, aims to introduce the concept of hashing in a clear, straightforward, and no-degree-required way, suitable for journalists and casual readers as well as programmers and software engineers. I’ll also explain why it’s suitable for preserving the privacy of contact information whilst still allowing for social functionality, and I’ll touch on whether or not you really need to store that contact information (hashed or not) in the first place.

The false dilemma of privacy vs social features

Let’s consider how the Path (et al) privacy uproar started. It’s simple enough to understand:

  1. Path is a social network, and wants to make it easy for new users to connect with their friends.
  2. An obvious way to do this is to check whether any existing Path user is also a friend of a person who’s just joined. Let’s call this new user Bob, and let’s say there’s an existing Path user called Jane, who Bob happens to know outside of Path.
  3. It’s reasonable to assume that if Bob has Jane’s email address on his iPhone, then Jane is in some sense his friend, and that Bob might want to be connected with Jane on Path.
  4. Thus, a naive developer of the Path app would copy all the email addresses from Bob’s address book on his iPhone, upload them to Path’s servers, search for any matches amongst Path’s other users, and offer to connect Bob to those people. Simple.

Herein lies the (false) dilemma: you’re getting a handy social feature (automatic connection with your friends), but you’re losing your privacy (by allowing your friends’ email addresses to be uploaded to Path’s servers). As a matter of fact, your friends are losing their privacy too.

What an awful choice to have to make! If only there was a third option!

For fun, let’s have a think about what that third option would be.

Mathematics, not magic

Hypothetically, what we want is something that sounds impossible:

  1. Some way (let’s call it a Magic Spell) to change some personal info (like an email address) into something else, so it no longer looks like an email address and can’t be used as one. Let’s call this new thing Gibberish.
  2. It must be impossible (or at least very time-consuming) to change the Gibberish back into the original email address (i.e. to undo the Magic Spell).
  3. We still need a way to compare two pieces of Gibberish to see if they’re the same.

Clearly, that’s an impossible set of demands.

Except that it’s not impossible at all.

We’ve just described hashing, which is a perfectly real and readily-available thing. Unlike almost all forms of magic, it does actually exist - and like all actually-existing forms of magic, it’s based entirely on mathematics. Science to the rescue!
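
For the programmers in the audience, here's a minimal sketch of the three demands in Python. The name `magic_spell` is made up for illustration, and SHA-256 is only a convenient stand-in from the standard library, not a recommendation of any particular algorithm.

```python
import hashlib

def magic_spell(text: str) -> str:
    """Turn text into fixed-length Gibberish (a hex digest)."""
    # SHA-256 is a placeholder choice for this sketch, not a recommendation.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

gibberish_a = magic_spell("you@domain.com")
gibberish_b = magic_spell("you@domain.com")
gibberish_c = magic_spell("me@domain.com")

print(gibberish_a)                 # 64 hex characters; not usable as an email address
print(gibberish_a == gibberish_b)  # True: same input, same Gibberish
print(gibberish_a == gibberish_c)  # False: different input, different Gibberish
```

There's no `undo_magic_spell` here because the standard library doesn't offer one; the only way back is to guess inputs and hash each guess.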

Now, as mind-blowing as it is to learn that our impossible demands are actually eminently satisfiable, here’s something even more mind-blowing: you’ve already seen hashing at work, and you’ve maybe even seen some Gibberish!

Don’t believe me? I don’t blame you. But just like The Silence from Doctor Who, Gibberish is hiding in plain sight, and you’ll forget about it as soon as you’ve seen it. It might be behind you right now! Allow me to present a (somewhat less terrifying) example.

Have you ever heard of OpenOffice? It’s an open source suite of productivity applications, a bit like Microsoft Office, but the crappy UX doesn’t cost anything. If you wanted to download OpenOffice, you would visit this download page. Go and have a look, if you like. No scary suit-wearing hypno-aliens to be found.

OR ARE THERE?

Take a closer look at the first download box; the green one. It looks like this:

[Image: Part of the OpenOffice download page.]

Normal enough. Most people just click it, the download starts, and they move on to something else. They do not see.

There’s something strange about that big, green down-arrow graphic. Something… sinister. Look at it, sitting there. Malevolently.

Though the placement is coincidental (due to the browser window’s width when I took the screenshot), that arrow is pointing at something. Something barely intelligible, like a foreign language. It says “MD5 checksums”. The hairs on the back of your neck begin to stand on end. Let’s say you clicked the link, unsuspectingly. The horrible truth would be laid bare:

[Image: The Silence from Doctor Who. Caption: BLAARGH!]

OK, that’s not really what you’d see. Instead, you’d see this file, which looks like this:

[Image: Hashes (as file checksums) for OpenOffice. Caption: Gibberish on a server.]

Rows and rows of data, in two columns. The second column quite clearly contains filenames. The first column is of course - wait for it - Gibberish.

Let’s try another example. This time, it’s in your computer. On Mac OS X, try going to this folder: ~/Library/Caches/com.apple.Safari/Webpage Previews. Here’s what mine looks like:

[Image: Hashes (as UUIDs) in the Finder. Caption: Gibberish inside your computer.]

There are over five thousand files in that folder, here on my Mac. Look at the filenames. Inside your own computer, right here within arm’s reach. Gibberish.

Like The Silence, it hides everywhere, and you forget it as soon as you look away. But it’s still always there. YOU WILL NEVER SLEEP AGAIN.

A bit more factually, if you please

The two examples of Gibberish above illustrate two different common uses of hashing:

  1. Creating a checksum of a file.
  2. Creating a unique identifier.

A checksum lets you verify that your copy of a file is identical to the original, for example so that you know it wasn’t corrupted during download. It’s not enough to check that the file size is the same, because you can have two pieces of data that are the same length but are nonetheless different. For example, say a person’s name was “Neil Inglis”, which is 11 characters in length. That’s unquestionably different from another 11-characters-long name, such as, say, “Arse Badger”.

Thus, we need to know that every part of the file is the same (to a reasonably high level of probability). Hashing lets us do that. The person hosting the file runs a program which generates a hash (Gibberish) that represents (but does not contain; it’s not like a zip file or such) the file. After you download the file, you also run the same program to generate a hash. If the two hashes match, your copy of the file is identical to the original, and you don’t need to download OpenOffice all over again.
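
As a rough sketch of that check in Python: the byte strings below are invented stand-ins for a real installer file, and MD5 is used only because that's what the download page happens to list. It shows two copies of identical size, only one of which matches the published checksum.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return the MD5 checksum of some file contents as hex Gibberish."""
    return hashlib.md5(data).hexdigest()

# Hypothetical stand-ins for the hosted file and two downloaded copies.
original = b"pretend these bytes are the OpenOffice installer"
good_copy = b"pretend these bytes are the OpenOffice installer"
bad_copy = original[:-1] + b"R"  # same length, last byte corrupted

published = checksum(original)  # what the host lists next to the filename

print(len(good_copy) == len(bad_copy))   # True: size alone tells us nothing
print(checksum(good_copy) == published)  # True: identical to the original
print(checksum(bad_copy) == published)   # False: one byte differs
```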

You can also use hashes to generate unique identifiers. In our second example, Safari (the web browser, on Mac OS X in this case) has created preview images of web pages that I’ve visited - presumably to show in its Top Sites view, or for some other purpose. Instead of naming those JPEG files with the URL of the web page, which could be exceptionally long and horrible, Safari instead generates some Gibberish (almost certainly based on the URL), giving a consistent length of only-slightly-horrible filename. To find the preview JPEG for a given URL, Safari simply re-generates the Gibberish from a given URL, then checks to see if it has a JPEG with that Gibberish filename. That’s hashing.
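
Safari's real naming scheme isn't documented here, so treat the following Python as a guess at the general idea rather than a description of what Safari does. The function names, the dictionary standing in for the folder, and the choice of a truncated SHA-256 digest are all assumptions.

```python
import hashlib

def preview_filename(url: str) -> str:
    """Derive a fixed-length filename from a URL of any length."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return digest[:32] + ".jpeg"  # 32 hex characters plus an extension

cache = {}  # stands in for the folder of previews: filename -> image data

def store_preview(url: str, image: bytes) -> None:
    cache[preview_filename(url)] = image

def find_preview(url: str):
    # Re-generate the Gibberish from the URL, then look for that filename.
    return cache.get(preview_filename(url))

long_url = "http://example.com/a/very/long/path?with=lots&of=parameters"
store_preview(long_url, b"fake JPEG data")

print(len(preview_filename(long_url)))       # 37, however long the URL is
print(find_preview(long_url))                # b'fake JPEG data'
print(find_preview("http://example.com/"))   # None: no preview stored
```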

There’s a third possible use of hashing, too (and a fourth, and a fifth, and many others), and that’s to anonymise data but still allow matching. If you think about it, that’s obvious from what we’ve already discussed. Here’s the workflow, once again using Path as an example.

  1. Bob signs up to Path, using the iPhone app.
  2. The app asks if it can use Bob’s contact info to find his friends.
  3. The app hashes the email addresses of everyone in Bob’s address book.
  4. Only the hashes are uploaded to Path’s servers. Just the Gibberish.
  5. Path’s servers store the hashed (Gibberish) version of every user’s email address.
  6. Path’s servers search for Bob’s friends’ hashed email addresses in their database of hashed email addresses. Because you always get the same Gibberish whenever you hash a given email address, you can still match them up even though you don’t know what the original email address looks like. “GLORB” = “GLORB”, just as “you@domain.com” = “you@domain.com”.
  7. Bob finds his friends just as easily as if Path had uploaded their actual email addresses.
  8. Path then deletes all the hashed email addresses that were uploaded from Bob’s iPhone.

Everyone is happy. Your social friend-finding features are intact, and every bit as convenient as before. But, none of your friends’ email addresses are ever uploaded (in a readable, usable form) to some company’s server. Privacy is preserved along with convenience. It’s a mathematical miracle. SCIENCE, MOTHERFUCKER. DO YOU SPEAK IT?
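
The workflow above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Path's code: the addresses, the `users_by_hash` table and the function names are invented, SHA-256 is a placeholder, and a real client and server would be separate programs talking over a network.

```python
import hashlib

def hash_email(email: str) -> str:
    # Placeholder hash function; the phone and the server must use the same one.
    return hashlib.sha256(email.encode("utf-8")).hexdigest()

# Server side: only hashed addresses of existing users are stored (step 5).
users_by_hash = {
    hash_email("jane@example.com"): "Jane",
    hash_email("carol@example.com"): "Carol",
}

def find_friends(uploaded_hashes):
    """Match uploaded hashes against existing users (step 6)."""
    matches = [users_by_hash[h] for h in uploaded_hashes if h in users_by_hash]
    uploaded_hashes.clear()  # step 8: discard the uploaded hashes once matched
    return matches

# Client side: Bob's phone hashes his address book before upload (steps 3-4).
bobs_address_book = ["jane@example.com", "dave@example.com"]
upload = [hash_email(address) for address in bobs_address_book]

print(find_friends(upload))  # ['Jane']
print(upload)                # []: nothing of Bob's address book is kept
```

At no point does the server-side half of the sketch see a readable email address from Bob's address book.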

Boring geek stuff, in brief

Your brain may be complaining about the concept of being able to transform an email address into Gibberish easily, but not being able to reverse the process. I can understand that it’s not an intuitive thing - it is nevertheless a mathematical reality, and commonplace. Those wishing to learn more, or simply to cite a reference on the subject, should read the Wikipedia page on cryptographic hash functions, and particularly the concept of a one-way function.

You may raise an objection about using a given standard hashing function as-is. You’ll want to read about the concept of salt (in the cryptographic sense).
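
As a hedged illustration of salting in this particular setting: the sketch below assumes a single application-wide salt known to both app and server, because a different random salt per record (the usual approach for stored passwords) would make two copies of the same address hash differently and so break matching. The salt value and function name are invented, and SHA-256 is again only a placeholder.

```python
import hashlib

# Assumption: one salt shared by every copy of the app and by the server.
APP_SALT = b"hypothetical-app-wide-salt"

def salted_hash(email: str, salt: bytes = APP_SALT) -> str:
    """Hash the salt and the address together."""
    return hashlib.sha256(salt + email.encode("utf-8")).hexdigest()

plain = hashlib.sha256(b"you@domain.com").hexdigest()
salted = salted_hash("you@domain.com")

print(plain == salted)                          # False: the salt changes the Gibberish
print(salted == salted_hash("you@domain.com"))  # True: same salt, still matchable
```

A lookup table precomputed for the unsalted function is useless against the salted output, though anyone who extracts the salt from the app can build a new table.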

You may complain that I’ve not suggested a specific hashing function or algorithm; this is deliberate. I invite you to educate yourself on your choices, and make an informed decision that’s suitable for the time period and technological environment in which you’ve read this article. You’ll want to pay particular attention to contemporary security industry assessments of your proposed hashing function.

Common counterarguments

There follows a list of anticipated counterarguments, shrink-wrapped with rejoinders of varying degrees of flippancy and superciliousness.

Hashing isn’t absolutely perfect in every way!

This in no way invalidates the fact that it’s a hell of a lot better than storing plain contact information, like a crazy person.

(Some specific hash function) is vulnerable to (dictionary attacks, or something)!

Maybe, and maybe not. Salt helps a lot. Security is a percentages game. Never use just one security method. See previous counterargument.

Actually, technically, you’re wrong about (something)!

Quite possibly, but not in a way that meaningfully alters this article. I explained the concept of hashing. I showed how it can be easily used to address the latest flap of privacy concerns in social media apps. I did all of the above with charm, panache, and in a manner that even grunting tech journos have a sporting chance of grasping. I shall now retire to an expensive armchair and bask in the warm glow of a job well done.

Hashing won’t work if the data format varies, like with email aliases or phone numbers!

The argument here is that sometimes, the same data can be formatted in several different ways, and the different versions will of course each produce different hashes. Your copy of my phone number, for example, might include my country code (+44, for the UK), whereas my copy of it probably doesn’t. Similarly, two apparently different email addresses can in fact be one and the same, such as foo@googlemail.com and foo@gmail.com, or foo.bar@gmail.com and foobar@gmail.com.

You’d think that’d be a problem, but it’s not - and it’s actually another example of a straightforward Computing Science principle that’s worryingly poorly-known amongst developers. The answer, of course, is normalisation: the conversion of the various alternate formats of a given piece of data into one definitive (canonical) format. In the case of phone numbers, this would presumably be a full international number with standardised spacing. In the case of email addresses, I defer to the lecturer who taught my Operating Systems, and Distributed Algorithms and Systems Honours courses at university, Dr. Peter Dickman:

There are relatively few such rewriting rules and they are quickly and easily expressed in a compact form using some more magic, called regular expressions. Instead of uploading the original email addresses and doing the matching at the server, it's perfectly possible to download those rewrite rules along with the hashing software and do a tiny magic trick, called normalisation, at the client's device immediately before applying the hashing. This would ensure that the hash values come out the same regardless of which of the equivalent forms was used in the address book.

My thanks to Peter (now at Google in Zurich) for writing to me this evening and suggesting this additional counterargument and response (and indeed for his significant part in my formal Computing Science education).
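
Here's a minimal sketch of client-side normalisation in Python. The rewrite rules cover only the examples mentioned above; the lowercasing step, the default country code, the choice to strip spacing entirely and the function names are all assumptions made for illustration, and a real rule set would need to be longer and more careful.

```python
import hashlib
import re

def normalise_email(email: str) -> str:
    """Rewrite equivalent forms of an address into one canonical form."""
    # Assumption: addresses are treated as case-insensitive.
    local, _, domain = email.strip().lower().rpartition("@")
    domain = re.sub(r"^googlemail\.com$", "gmail.com", domain)
    if domain == "gmail.com":
        local = re.sub(r"\.", "", local)  # foo.bar and foobar are the same mailbox
    return f"{local}@{domain}"

def normalise_phone(number: str, default_country_code: str = "44") -> str:
    """Reduce a phone number to +<country code><digits>, with no spacing."""
    digits = re.sub(r"[^\d+]", "", number)
    if digits.startswith("00"):
        return "+" + digits[2:]
    if digits.startswith("0"):
        return "+" + default_country_code + digits[1:]
    if not digits.startswith("+"):
        return "+" + default_country_code + digits
    return digits

def contact_hash(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

a = contact_hash(normalise_email("Foo.Bar@googlemail.com"))
b = contact_hash(normalise_email("foobar@gmail.com"))
print(a == b)  # True: both forms normalise to foobar@gmail.com

print(normalise_phone("07700 900123"))     # +447700900123
print(normalise_phone("+44 7700 900123"))  # +447700900123
```

Because normalisation runs before hashing, both steps can happen on the device and the server still only ever sees Gibberish.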

On the storing of personal information

The astute reader (cue everyone tensing up imperceptibly) will have noticed that it’s not actually necessary to keep any user data after you’ve found their friends. Take another look at the workflow for a hashing-enabled Path app, above. The last step deletes all the uploaded hashed data, and the whole friend-finding process still works. Instapaper does it this way, according to a tweet from its creator Marco Arment earlier today.

There’s only one situation where you might argue that it’s necessary to store a user’s address book info for later (hashed or otherwise). It’s this:

  1. Jane signs up for Path, a week after Bob has signed up.
  2. Bob has Jane in his address book, but Jane does not have Bob in hers.
  3. When Path searches for Jane’s friends, it will obviously not find Bob (since it uses Jane’s address book). Thus, Bob will never know that Jane has signed up, and will not be automatically connected to her.

The thinking here is that, if you stored a person’s entire address book, when you were searching for friends you could search not only your other users, but also their entire address books. This would deal with the situation above, and would allow Bob to know that Jane has joined Path even though they didn’t join at the same time.

That’s good for Path, because social networks live and die by the number of connections people make on them. I don’t think it’s great for Jane, though. The fact that Bob isn’t in Jane’s address book shows that, for whatever reason, Jane doesn’t consider him a close acquaintance. She probably doesn’t care that she won’t be automatically connected to him, and doesn’t want him to be notified when she joins Path. That seems like common sense to me. So, I think that this particular scenario is a false one, at least from the user’s point of view.

Update: Dr. Peter Dickman (mentioned previously) emailed me to point out that Jane could also be upset even if Path did not store each user’s address book. If Jane (still without Bob in her address book) joined Path before Bob, then when Bob joined and his address book was used to find his friends, Path would naturally connect him with Jane (since she is in his address book). That’s a privacy flaw in the concept of using address books on a temporary basis to find “friends”: it doesn’t ensure that the relationship is symmetrical (or bi-directional).

In this situation, keeping each user’s entire address book on the server would allow checking to see if a given relationship was symmetrical. My own opinion is that, for information-exposing services (like Facebook, Path, iOS’ “Find My Friends”, etc), the only valid and justifiable automated connection is a symmetrical one.
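
A symmetry check of that kind might look something like this in Python. The data layout and function name are hypothetical, and short labels stand in for what would be hashed identifiers in a real system.

```python
def is_symmetrical(address_books, a, b):
    """True only if a lists b AND b lists a."""
    return b in address_books.get(a, set()) and a in address_books.get(b, set())

# Each key is a user; each value is the set of contacts in that user's address book.
# In practice both would be hashes; readable labels keep the example short.
address_books = {
    "bob": {"jane", "dave"},
    "jane": {"carol"},   # Jane does not list Bob
    "carol": {"jane"},
}

print(is_symmetrical(address_books, "bob", "jane"))    # False: one-way only
print(is_symmetrical(address_books, "carol", "jane"))  # True: each lists the other
```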

Peter also points out a third scenario: that of the mutual acquaintance. Consider both Bob and Jane once more, but with a third person whom we’ll call Alice. Bob and Alice both have Jane in their respective address books, but Bob and Alice don’t know each other (i.e. Jane is their mutual acquaintance, but they themselves are not acquainted). In this scenario, let’s imagine that Jane does not sign up for Path, but both Bob and Alice do.

It could be argued that it might be a useful feature if Path could tell Bob and Alice that they have a mutual acquaintance (this would encourage connection-forming, and would also encourage Bob and/or Alice to invite Jane to join too). This raises a grave privacy concern. Perhaps Bob knows Jane from work, but Alice knows Jane in some other capacity that Jane wouldn’t want Bob to know about (such as being a member of a Take That fanclub). If Path were to keep a hashed copy of each person’s address book, it could tell Bob and Alice that they had someone in common, even if it didn’t say who - and Bob and Alice could then potentially work out that the person was Jane, which would cause irreparable damage to Jane’s career when she was outed as a fan of terrible, terrible ‘music’.

In this case, the lesson is that there can be unforeseen repercussions from revealing aspects of a social graph, even if identities are kept anonymous. A conservative and well-considered approach is strongly advised.
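
To see how little it takes, here's a small Python sketch with invented addresses and a placeholder hash function. A server holding two hashed address books can find the overlap with one set intersection, without ever knowing whose address it is, and anyone holding a guess can confirm it by hashing the guess.

```python
import hashlib

def h(email: str) -> str:
    # Placeholder hash function for the sketch.
    return hashlib.sha256(email.encode("utf-8")).hexdigest()

# Hypothetical hashed address books kept on the server.
bobs_book = {h("jane@example.com"), h("boss@example.com")}
alices_book = {h("jane@example.com"), h("drummer@example.com")}

shared = bobs_book & alices_book
print(len(shared))                      # 1: "you have someone in common"
print(h("jane@example.com") in shared)  # True: a correct guess is easily confirmed
```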

Final thoughts

If you’re a developer who’s implementing a social network, please do these things:

  1. Educate yourself about hashing; it’s real, and very useful. Use hashing for personal info. Do the hashing client-side, and only upload hashed data for comparison on the server.
  2. Delete the hashed data after you’ve done your fancy friend-matching stuff, because your users value their privacy, and you probably don’t even need to keep the data anyway.

If you’re a journalist or other non-developer who’s writing about social media and privacy, please do these things:

  1. Know pretty much what hashing is, at least in terms of the Incredible Magic it lets you do.
  2. Realise and understand that privacy and social features are not mutually exclusive. Don’t pull that ignorant false dichotomy bullshit; it’s factually incorrect and laughable.

And regardless of who you are, if you have any taste whatsoever:

  1. Follow me (@mattgemmell) on Twitter.