Dumped in Coldfusion

Friday, September 22, 2006

Bad hash function causes problems with ColdFusion Structs

I was looking at the slightly complicated series of structs that are used in the cf_accelerate tag. According to discussions on the speed of structs the speed of ColdFusion structures gets really slow when you put more an more entries in them.

One of the commenters noted that by reversing the keys he could get a significant speedup. It seems that this is all to do with a bad choice of Hash function deep in ColdFusion.

A Hash table is a datastructure that is designed for fast lookups of keys. It uses a hash function to guess where something will be, then goes to check. This allows it to be faster than the average case in a lot of cases as you can find what you are looking for first time, rather than having to search through a list of entries, or use a complicated tree structure. Unfortunately if you look in the same place for two entries it still has to fall back on looking through them one by one, and so looses all of the speed advantages.

Choosing a hash function is a compromise. On one hand, the hash function gets called a lot, often in speed critical code (which is why I was surprised that java.net.URL uses DNS in its .equals method, but that is another topic), however the other aim of hash codes is to differentiate different inputs. At one extreme, a constant hash code fulfills all the requirements of the fast hash code, but it isn't very useful as your hash table becomes slow like a list.

Example of a key with the characters used in the hash code highlighted

The hash function choosen for the keys to structs in ColdFusion uses an evenly distributed selection of 4 of the letters from the key. These start with the first letter, then sample the rest of the key, as shown in the example to the left. The green-underlined letters are used to form the hash code.

Example of a longer key with the characters used in the hash code highlighted

The key to why all of the benchmarks (and many common use cases) show that it is slow is that this sampling breaks down when you have a long, constant prefix on all of your structure keys. In this case (as seen to the right), the algorithm will hash all of your keys to the same place.

Some people use really short keys in their Structs. These are nearly as bad as all keys less than 4 characters long are hashed based only on their first character, so ABB will hash to the same thing as AAA.

So there are two ways that the algorithm chosen breaks down. Unfortunately both of these are commonly used by people choosing keys for structs. Given that the special algorithm that has been implemented in ColdFusion probably took a programmer time to implement, it had better be a lot faster than Java's built-in hashCode method on String's in most cases given that it trivially fails to differentiate keys in common situations.

To show an this, lets take one of the examples quoted by a commenter.


<cfset s = structNew()>
<cfloop from="1" to="100000" index="idx">
<cfset s["xxxx#idx#"] = 0>
</cfloop>


He claims that this takes 6 minutes, compared to 1 second for a similar piece of code using a simple java.util.HashMap. When you look at the hash algorithm you start to see why. Although there are 100,000 items in the hash table, only 91 hash codes are used. So rather than the maybe 1 or 2 lookups that would be expected in a hash table to find where to put/find a key, there are up to 1100, averaging 550. That is, due to this badly designed hash function, lookups into the structs do on average 550 times as much work as they could.

If you reverse the keys then it is a lot faster as it uses 1010 hashcodes, and normally does 50 lookups for each entry. In contrast the standard hashcode used by Java for Strings gives every one of the 100,000 keys a different hashcode, allowing the keys to be found instantly.

In summary, be very careful what you use for the keys of your structs if you are going to create very large structures then use them in speed critical code. It is probably safest not to do this and to use java.util.HashMap objects instead if you want speed. For everyday use with small structs and very different keys this will not be a problem to you, but if you do start seeing slowdown in your struct accesses then consider this a possible reason.