String::hash Inconsistencies in Matz Ruby

Last night Ruby stole back a few of the hours it has saved me over the years, and made me feel like I was going a little crazy in the process. So, in the interest of documenting my failures and maybe saving you some time, here’s a braindump of the events.

I’ve been implementing a count sketch data structure for a stream consumption project. The data structure itself is very cool and deserves a post of its own soon, but think of it as a next-gen Bloom filter with some cool additive properties. As with Bloom filters, I need to do quite a bit of hashing. Independent universal hash functions would have been best, but it’s the real world, so I set out to use the hash function built into String (and, as I later found out, Object and its descendants), which is an implementation of MurmurHash. It’s one of the faster hashing functions, and while it is not a cryptographic hash, it’s still somewhat resistant to collisions.

A few hours of coding spread over a month or so and I had an implementation I was happy with, so it was time to add persistence. As we know, long-running Ruby processes don’t tend to behave so well, and I didn’t want to wait for the long-term count sketch structure to refill after each restart. This is where the trouble started. Upon loading the dumped structure into memory, it would appear empty. I searched for known issues with dumping and loading nested multidimensional arrays: no luck. I spent some time prying into the process; when I dumped and loaded manually within one session everything functioned as expected, but it failed when run on its own. After quite a bit of digging, the answer came down to this:

peck@think1:~$ irb
1.9.3-p125 :001 > "foo".hash
 => -2769103393479644582
1.9.3-p125 :002 > "bar".hash
 => -4482464432107200963
peck@think1:~$ irb
1.9.3-p125 :002 > "foo".hash
 => -1670713216442024759
1.9.3-p125 :003 > "bar".hash
 => -2834577361016353891

As it turns out, in the 1.9.x versions of Ruby, seemingly in an effort to avoid some denial-of-service attacks, String#hash mixes a random seed into the hash, and that seed is chosen fresh for each process. So the same string hashes consistently within one session but differently in the next. A lovely feature that the documentation neglected to mention in its description: “Return a hash based on the string’s length and content.”
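To see the effect without opening two irb sessions by hand, something like the following sketch can help. It launches two fresh interpreters and asks each for the hash of the same string. The helper name `hash_in_fresh_process` is made up for this illustration, and it assumes the running interpreter can be re-launched via `RbConfig.ruby`.

```ruby
require "rbconfig"

# Hypothetical helper: start a new Ruby process, compute the hash of `str`
# there, and return the result as an Integer.
def hash_in_fresh_process(str)
  IO.popen([RbConfig.ruby, "-e", "print ARGV[0].hash", str], &:read).to_i
end

first  = hash_in_fresh_process("foo")
second = hash_in_fresh_process("foo")

# Within a single process the value is stable.
stable_in_process = ("foo".hash == "foo".hash)

puts "first process:  #{first}"
puts "second process: #{second}"
puts "stable within one process? #{stable_in_process}"
```

On a Ruby with seeded string hashing, the two process values should differ while the in-process comparison stays true, which is exactly the combination that makes a dumped structure look empty after a reload.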

In the end, using SHA-256 solved my problem of needing consistent hashes, with comparable performance (about 0.24 s versus 0.19 s of CPU time over the same word list):

1.9.3-p125 :016 > timing = Benchmark.measure do book.wgrams.each {|word| sha.digest(word.to_s)} end
 =>   0.240000   0.000000   0.240000 (  0.245710)
1.9.3-p125 :018 > timing = Benchmark.measure do book.wgrams.each {|word| word.hash} end
 =>   0.190000   0.000000   0.190000 (  0.194184)
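For anyone wanting a drop-in integer hash that survives a restart, here is a minimal sketch of the idea, not the exact code from my project. The `stable_hash` name, the `seed` parameter, and the table width are assumptions made for illustration. The seed is simply prepended to the key so that one digest can stand in for several hash functions, which is a convenience and not a true universal hash family.

```ruby
require "digest"

# Hypothetical helper: derive a 64-bit integer from SHA-256 that is the same
# in every process. `seed` selects one of several "virtual" hash functions.
def stable_hash(key, seed = 0)
  digest = Digest::SHA256.digest("#{seed}:#{key}")
  # Interpret the first 8 bytes as a big-endian unsigned 64-bit integer.
  digest.unpack("Q>").first
end

width = 1024 # assumed number of columns per sketch row

# One bucket index per row, each derived from a different seed.
bucket_row0 = stable_hash("foo", 0) % width
bucket_row1 = stable_hash("foo", 1) % width

puts "row 0 bucket: #{bucket_row0}"
puts "row 1 bucket: #{bucket_row1}"
```

Because the output depends only on the seed and the key, bucket positions computed before a dump still line up after a load in a new process.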

Documentation patch submitted and accepted (in modified form), so all good. Still, I don’t exactly buy the argument that this doesn’t matter because Ruby implementations will differ anyway, and I think the DoS protection is a weak justification. Lesson learned: for anything that persists across multiple sessions, use your own hashing... or figure out a way to set the seed yourself, though that’s probably not a great path to go down. Was my assumption that a hash would be the same across sessions too out of line? I’m trying to think of something else where that was the default behavior, and so far I’m coming up empty.