Open-sourcing homomorphic hashing to secure update propagation (fb.com)
85 points by ingve on March 2, 2019 | 12 comments


The post offers a brief explanation of homomorphic hashing:

> homomorphic hashing answers the question, “Given the hash of an input, along with a small update to that input, how can we compute the hash of the new input (with the update applied) without having to recompute the entire hash from scratch?” We use LtHash, a specific homomorphic hashing algorithm based on lattice cryptography, to create an efficiently updatable checksum of a database.

Imagine adding and subtracting hashes!

> For any two disjoint sets S and T, LtHash(S) + LtHash(T) = LtHash(S ∪ T).

This is cool because you no longer have to recompute a hash representing a large dataset from scratch; you can just compute the hashes of the parts you update and perform addition and subtraction.
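
Here's a minimal sketch of the idea in Python. It mimics the LtHash construction as the post describes it (hash each element out to a 2048-byte digest, treat it as 1024 lanes of 16 bits, and add lane-wise modulo 2^16), but uses SHAKE-256 as a stand-in XOF where the paper uses BLAKE2x, so the parameters are illustrative rather than the paper's:

    import hashlib

    LANES = 1024            # 1024 lanes x 16 bits = a 2048-byte checksum
    MOD = 1 << 16

    def element_hash(data):
        # Expand the element to 2048 bytes with an XOF (SHAKE-256 here,
        # standing in for the BLAKE2x the paper uses), split into 16-bit lanes.
        d = hashlib.shake_256(data).digest(2 * LANES)
        return [int.from_bytes(d[2*i:2*i + 2], "little") for i in range(LANES)]

    def add(h1, h2):
        return [(a + b) % MOD for a, b in zip(h1, h2)]

    def sub(h1, h2):
        return [(a - b) % MOD for a, b in zip(h1, h2)]

    def set_hash(items):
        h = [0] * LANES
        for item in items:
            h = add(h, element_hash(item))
        return h

    # LtHash(S) + LtHash(T) == LtHash(S U T) for disjoint S and T:
    assert add(set_hash([b"a"]), set_hash([b"b"])) == set_hash([b"a", b"b"])

    # Updating a row: subtract the old row's hash, add the new one.
    h = set_hash([b"row1=old", b"row2"])
    h = add(sub(h, element_hash(b"row1=old")), element_hash(b"row1=new"))
    assert h == set_hash([b"row1=new", b"row2"])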


Years ago I was trying to use Merkle trees for checksumming and synchronizing database replicas, since everyone was doing it that way, and it was immediately obvious how impractical it was. So the first thing that came to mind was to use a fixed-size Merkle tree (half trees or whatever they are called). It was essentially an array of 64k hashes, with each element storing a hash for a set of database keys. That, however, required rescanning multiple keys on updates, and the number of keys depended on the database size, so it wasn't going to work well either. Naturally, the same idea of adding and subtracting hashes came to mind. Adding a key's hash on update and subtracting it on removal completely eliminated reads from disk. Rescanning a particular set of keys was only necessary during background synchronization, and only when a key was missing on a replica, which was pretty rare. Although I don't use this approach anymore, it worked well and was trivial to implement.
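
Presumably something along these lines (a hypothetical sketch of the bucketed scheme described above; the names and parameters are mine):

    import hashlib

    NUM_BUCKETS = 1 << 16   # the fixed-size array of 64k hashes
    MOD = 1 << 64

    def entry_hash(key, value):
        # Hash one key/value entry down to a 64-bit integer.
        digest = hashlib.sha256(key + b"\x00" + value).digest()
        return int.from_bytes(digest[:8], "big")

    def bucket_of(key):
        # Stable mapping from a key to one of the 64k buckets.
        return int.from_bytes(hashlib.sha256(key).digest()[:2], "big")

    class ReplicaChecksum:
        def __init__(self):
            self.buckets = [0] * NUM_BUCKETS

        def on_put(self, key, value):       # add the entry's hash
            b = bucket_of(key)
            self.buckets[b] = (self.buckets[b] + entry_hash(key, value)) % MOD

        def on_delete(self, key, value):    # subtract it on removal
            b = bucket_of(key)
            self.buckets[b] = (self.buckets[b] - entry_hash(key, value)) % MOD

    # During background sync, replicas compare their 64k buckets and rescan
    # only the keys belonging to buckets that differ.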


Saving it. This is what I dreamt of for the ETag header of a list of items in a REST API (when caching each of the items separately).


I somehow fail to see why the example in the article needs the hash to have some special homomorphic property. When you represent the overall hash of your dataset as a sum of individual item hashes, it trivially follows that changing one item means you subtract the hash of the original version and add the hash of the new version.

Or am I missing something? (apart from the somewhat obvious security implications of doing such a thing in the first place)


Disclaimer: I’m one of the authors of the paper/blog post/code.

If you want to use signatures over the hash as proof of data set integrity, you need two things: 1) the hash must combine correctly, i.e. hash({a}) + hash({b}) == hash({a, b}); and 2) hash() must be collision resistant, in other words, it must be computationally infeasible to find sets S != T with hash(S) == hash(T). We prove in the paper (which is linked from the blog post) that LtHash with our choice of parameters has this property.


My reading of the post is that Hash({a, b}) is in fact computed as Hash'(a) + Hash'(b), given that a and b are "rows". And thus my question is why Hash' has to have any special properties.


I can think of hash functions that are homomorphic but are not secure. A simple example is something like "sha256 each element separately and XOR all the resulting hashes together." This would not be collision resistant.
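
For instance, with multisets the XOR construction collides trivially, because any element inserted twice cancels itself out (a quick illustration of one failure mode, not the linear-algebra attack that also breaks it for plain sets):

    import hashlib
    from functools import reduce

    def xor_hash(items):
        # "sha256 each element separately and XOR the results together"
        digests = (hashlib.sha256(x).digest() for x in items)
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                      digests, bytes(32))

    # A duplicated element cancels itself out: {a, a, b} collides with {b}.
    assert xor_hash([b"a", b"a", b"b"]) == xor_hash([b"b"])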

We offer a proof that LtHash with our choice of parameters provides over 200 bits of security. You would have to read the paper for the details.


I can see how you might lose collision resistance with a normal 256-bit hash function, but it's not clear why a "stretched" hash with 2048 bytes of output wouldn't work.

Edit: Ah, there it is, in the paper:

> However, Wagner [Wag02] later showed an attack on the generalized birthday problem which could be used to find collisions for AdHash on an n-bit modulus in time O(2^(2√n)), and that the AdHash modulus needs to be greater than 1600 bits long to provide 80-bit security. Lyubashevsky [Lyu05] and Shallue [Sha08] showed how to solve the Random Modular Subset Sum problem (essentially equivalent to finding collisions in AdHash) in time O(2^(n^ε)) for any ε < 1, which indicates that AdHash requires several more orders of magnitude larger of a modulus just to provide 80-bit security.


Assuming a trusted system, could you go into more detail on the qualities or guarantees a Merkle tree using XOR would or would not give you, and partial mitigations for any of the disadvantages? I.e., how does the entropy of a leaf node's 256-bit signature disperse up the tree to the root? How does the birthday paradox come into play with regard to collisions at the root of the tree?

For synchronization purposes in a trusted system, a Merkle tree based on XOR seems elegant and efficient, but I can't seem to find accessible papers on this.


Just so I have this clear: is the issue that the simple solution above doesn't work (at all) with multisets, and that the birthday problem combined with potentially billions of rows means even sha256 has a significant chance of generating a collision?


Are you guys going to use this in your blockchain?


Say you just define your total hash to be the sum of hashes for some constituent data. You change a constituent datum, so you subtract the old hash, add the new hash.

What will you get when you decrypt the total hash? Probably some gibberish data.

If your hashing had the homomorphic property, then when you do the add/subtract song and dance with some datum's new hash, the resulting total hash has the property that it decrypts to the sum of the (updated) data.

A practical example might be keeping track of the average salary of a group of people without ever learning the salary of any one of them. As they get raises, you learn only about their new encrypted salaries, but you can update the encryption of the average salary directly, knowing that it will decrypt to the correctly updated new average inclusive of raises you never explicitly saw.

(Here I am using the term “hash” loosely to refer to the cipher text of whatever data item is encrypted.)
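
A toy sketch of this kind of additively homomorphic encryption, using the Paillier cryptosystem (a standard example of the property; the tiny hardcoded primes are for illustration only and nowhere near secure):

    import math, random

    # Toy Paillier keypair; real deployments use primes of 1024+ bits.
    p, q = 1789, 2003
    n, n2 = p * q, (p * q) ** 2
    g = n + 1                                     # standard choice of generator
    lam = math.lcm(p - 1, q - 1)
    mu = pow((pow(g, lam, n2) - 1) // n, -1, n)   # inverse of L(g^lam mod n^2)

    def encrypt(m):
        r = random.randrange(2, n)                # random blinding factor
        while math.gcd(r, n) != 1:
            r = random.randrange(2, n)
        return (pow(g, m, n2) * pow(r, n, n2)) % n2

    def decrypt(c):
        return ((pow(c, lam, n2) - 1) // n) * mu % n

    # Multiplying ciphertexts adds the underlying plaintexts:
    salary_sum = encrypt(50_000)
    salary_sum = (salary_sum * encrypt(5_000)) % n2   # a raise we never saw
    assert decrypt(salary_sum) == 55_000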



