30th June 2013

# Hashing Just Random Numbers

My recent project ran into problems hashing some 10 million numbers. To analyze the matter I wrote a small test program, numberhash.c.

I wanted to know what influence the following factors have:

1. hashing just numbers (no alphabetic characters)
2. ASCII vs. EBCDIC
3. choice of hash function
4. distribution of collisions

The results for random numbers so far go like this:

1. the choice of hash function is of lesser importance
2. the character set is not important

The little program numberhash.c can test four different hash functions (`-f`), vary the size of the hash table (`-h`), and vary the length of the string of digits (`-l`). Collision resolution is by linear probing; linear probing was chosen because the original hashing algorithm was coded in COBOL, where linked lists are not that common. Random numbers are generated by `random()`, seeded from `/dev/urandom`.

Compile numberhash.c with

``````
cc -O3 -Wall -Wno-pointer-sign numberhash.c -o numberhash -lcrypto
``````

The following hash functions are implemented:

1. Kernighan & Ritchie: `h = 31 * h + key[i]`
2. SHA1
3. interpret the digit string as a number and take it modulo the hash size
4. `h = 2 * h + key[i] * key[i]`

Here is example output for 1 million random numbers, each with the default length of 10 digits:

``````
$ time ./numberhash -f0 -n1000001 -h1400093
Number of random numbers: 1000001, hash size = 1400093
Total number of collisions: 1249237 = 124.923575%, load factor: 71.423898
0   400130   28.58
1   417348   29.81
2   267712   19.12
3   148032   10.57
4    79356   5.67
5    41479   2.96
6    21747   1.55
7    11471   0.82
8     6067   0.43
9     3222   0.23
10     1645   0.12
11      874   0.06
12      478   0.03
13      270   0.02
14      149   0.01
15       54   0.00
16       25   0.00
17       23   0.00
18        7   0.00
19        3   0.00
20        1   0.00
21        0   0.00
22        0   0.00
23        0   0.00
24        0   0.00
25        0   0.00
26        0   0.00
27        0   0.00
28        0   0.00
29        0   0.00

real    0m0.296s
user    0m0.224s
sys     0m0.068s
``````

Changing the load factor to above 99% reduces wasted space, but multiplies the running time by a factor of about 20.

``````
$ time ./numberhash -f0 -n1000001 -h1001093
Number of random numbers: 1000001, hash size = 1001093
Total number of collisions: 260986225 = 26098.596401%, load factor: 99.890919
0     1137   0.11
1     1921   0.19
2     2183   0.22
3     2064   0.21
4     2055   0.21
5     1987   0.20
6     2008   0.20
7     2008   0.20
8     1996   0.20
9     2019   0.20
10     1985   0.20
11     2164   0.22
12     2176   0.22
13     2179   0.22
14     2097   0.21
15     2246   0.22
16     2276   0.23
17     2351   0.23
18     2312   0.23
19     2307   0.23
20     2433   0.24
21     2472   0.25
22     2664   0.27
23     2506   0.25
24     2505   0.25
25     2378   0.24
26     2473   0.25
27     2433   0.24
28     2409   0.24
>29  937349   93.63

real    0m6.244s
user    0m6.196s
sys     0m0.044s
``````

And here are the results for a load factor of 95%.

``````
$ time ./numberhash -f0 -n1000001 -h1053991
Number of random numbers: 1000001, hash size = 1053991
Total number of collisions: 9398528 = 939.851860%, load factor: 94.877565
0    54031   5.13
1    85178   8.08
2    88536   8.40
3    81647   7.75
4    73594   6.98
5    66787   6.34
6    60495   5.74
7    54382   5.16
8    48231   4.58
9    43684   4.14
10    38778   3.68
11    34399   3.26
12    30569   2.90
13    27584   2.62
14    25040   2.38
15    22961   2.18
16    20714   1.97
17    18643   1.77
18    17117   1.62
19    15577   1.48
20    13834   1.31
21    12723   1.21
22    11651   1.11
23    10546   1.00
24     9839   0.93
25     8697   0.83
26     7726   0.73
27     6962   0.66
28     6404   0.61
>29   57662   5.47

real    0m0.439s
user    0m0.392s
sys     0m0.044s
``````

Plots of the first and last runs, and of the "catastrophic" run, were generated with gnuplot:

``````set terminal png