Dropbox: Deduplication with Privacy

There’s been a bit of a scare about Dropbox: deduplication could be used to determine who has copies of “illegal” files, and warrants could then be used to identify the infringing Dropbox users and basically hose them.

The problem

When you store a file on Dropbox it will be hashed (more-or-less uniquely identified by scanning its content) and then the hash and the file’s size will be used to determine if the file already exists on Dropbox’s server (i.e. if your ripped copy of Avatar matches someone else’s it will have the same hash value and the exact same file size). If so, rather than uploading the file your account will simply get a new file entry pointing at the existing file. “Upload” is instant, Dropbox saves money on storage, everybody wins.
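To make the mechanism concrete, here is a minimal sketch of deduplication keyed on content hash plus size. It is a toy in-memory model, not Dropbox's actual implementation: the class name `DedupStore`, its methods, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


class DedupStore:
    """Toy in-memory model of hash-plus-size deduplication (hypothetical)."""

    def __init__(self):
        self.blobs = {}       # (sha256 hex digest, size) -> file content
        self.user_files = {}  # (user, path) -> key into self.blobs

    @staticmethod
    def _key(data: bytes):
        # The content hash and the size together identify a file.
        return (hashlib.sha256(data).hexdigest(), len(data))

    def upload(self, user: str, path: str, data: bytes) -> bool:
        """Record the file for this user; return True if it was already stored."""
        key = self._key(data)
        already_stored = key in self.blobs
        if not already_stored:
            self.blobs[key] = data  # only the first uploader transfers the bytes
        self.user_files[(user, path)] = key
        return already_stored

    def owners_of(self, data: bytes):
        """The query behind the scare: every user whose entry points at this file."""
        key = self._key(data)
        return sorted({user for (user, _), k in self.user_files.items() if k == key})
```

In this model the second uploader of identical content gets `True` back from `upload` (the "instant upload"), and `owners_of` shows how easily a plain file table answers the question "who has this file?".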

But, suppose James Cameron uploads a ripped copy of Avatar to Dropbox and notices that this 3GB MP4 file uploaded instantly. He now knows someone else has such a file on Dropbox which is reasonable cause to suspect that piracy is happening and, in theory, he can require Dropbox to tell him everyone who has a copy of that file in their account.

Hence the scare.

The obvious solution to this problem is to not knowingly store illegally duplicated files in your Dropbox account or to encrypt them using your own unique key if you do.

But it’s quite possible that any of us might accidentally put an illegal file (or perhaps a file normal people consider “fair use” but the MPAA, say, might not consider legal) in our Dropbox account. E.g. I might rip Avatar using Handbrake so I can watch it on my iPhone, and this might create a file identical to your handbraked copy of Avatar, and according to the MPAA we might both be horrible criminals who deserve the gas chamber, and given that Congress only cares about people who provide large campaign donations…

A possible solution

I’ve proposed this solution on both Hacker News and Dropbox’s forums. It’s not perfect; maybe someone can refine it.

I imagine Dropbox has a list of files with unique ids, sizes, and hash values, and every user has a list of files with their own personal path (where they think it is and what they think it’s called) along with the unique id of the actual underlying file. This is the heart of the problem.

Instead of storing the unique id of the underlying file in the user’s file table, Dropbox needs to store a number offset by a hash value generated client-side from the user’s password and the user’s name for the file (i.e. something that will be different for each user and each file and not replicable with data stored in Dropbox’s own database).
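A rough sketch of the offset idea, under several assumptions of mine: file ids are 64-bit integers, the offset is derived with PBKDF2 using the user's path as the salt, and the arithmetic is done modulo 2^64. The function names are hypothetical, and this is an illustration of the proposal rather than a vetted cryptographic design.

```python
import hashlib

ID_SPACE = 2**64  # assumption: underlying file ids fit in 64 bits


def mask_for(password: str, path: str) -> int:
    """Derive a per-user, per-file offset on the client (hypothetical helper)."""
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode(), path.encode(), 100_000, dklen=8
    )
    return int.from_bytes(digest, "big")


def mask_id(file_id: int, password: str, path: str) -> int:
    """Value the server would store instead of the real file id."""
    return (file_id + mask_for(password, path)) % ID_SPACE


def unmask_id(stored: int, password: str, path: str) -> int:
    """Recover the real file id on the client, which knows the password."""
    return (stored - mask_for(password, path)) % ID_SPACE
```

The server only ever sees the result of `mask_id`, so two users holding the same underlying file have unrelated-looking entries in their file tables, and only a client that knows the password and the path can call `unmask_id` to find the real file.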

Note that if the user’s password is changed then every file id will need to be changed accordingly, which is definitely a downside. (And if you forget your password then your files cannot be recovered.)
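The cost of a password change can be sketched as a pass over the user's whole file table. This assumes the same kind of scheme as above (64-bit ids, a PBKDF2-derived offset per path); `offset` and `rekey` are hypothetical names, and the work has to happen on the client because only it knows both passwords.

```python
import hashlib

ID_SPACE = 2**64  # assumption: underlying file ids fit in 64 bits


def offset(password: str, path: str) -> int:
    """Per-user, per-file offset derived from password and path (hypothetical)."""
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode(), path.encode(), 100_000, dklen=8
    )
    return int.from_bytes(digest, "big")


def rekey(table: dict, old_password: str, new_password: str) -> dict:
    """Re-mask every entry of a {path: stored id} table for a new password."""
    rekeyed = {}
    for path, stored in table.items():
        # Undo the old offset to get the real id, then apply the new one.
        file_id = (stored - offset(old_password, path)) % ID_SPACE
        rekeyed[path] = (file_id + offset(new_password, path)) % ID_SPACE
    return rekeyed
```

Every entry has to be touched, so the work grows with the number of files, and without the old password there is no way to compute `file_id` at all, which is the lost-password case described above.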

Also note that presumably someone like the MPAA could simply obtain a warrant and wait for people to access an “illegal” file, but this is surely going to be a much slower and more difficult process than simply doing a query on the entire database and sending out threatening letters to everyone in the result list.

Thing is, this isn’t technically complex to implement and could be a user preference. Would you prefer privacy with the risk of losing all your files if your password is lost? Given that you will probably have multiple backups of all your Dropbox files, it’s actually not a big problem. (In fact, if you consider the case where you are forced to reset your Dropbox password and thus Dropbox forgets you own all your files, re-uploading them from one of your computers will be instantaneous for all the files you previously had uploaded, owing to deduplication.)

Edit: another problem with my proposed solution is that you can lose track of files (e.g. you can’t maintain an accurate reference count). This is probably not as big an issue as it might seem since Dropbox already retains files for a month after a non-paying user deletes them and forever for paying users. Presumably it retains copies of files left by users who stop using the service.

Final Note: I have no affiliation with Dropbox (although I do use the product) and have no stake in it. If you’d like to try Dropbox and give me more space to store potentially illegal files, please use this link.