C++ murmur hash
Table of contents
Introduction
The name comes from two basic operations, MUltiply and Rotate.
MurmurHash3 is a non-cryptographic hash function that produces a 32-bit hash value for a given input data. It was designed to be fast and produce good quality hash values for a wide variety of input data.
The basic MurmurHash3 algorithm works as follows:
- Initialize a 32-bit hash value to a seed value (usually 0).
- Process the input data in 4-byte blocks.
- For each block, perform a series of operations that mix the bits of the block into the hash value.
- Process any remaining bytes (i.e., the “tail” of the input data).
- Combine the remaining bytes with the hash value using a different mixing function.
- Finalize the hash value by performing a series of operations that
further mix the bits of the hash value.
- This step is optional, but it can improve the quality of the hash value.
The MurmurHash3 algorithm uses a combination of multiplication, XOR, bit shifting, and other bitwise operations to mix the bits of the input data and produce the final hash value. The specific operations used are designed to avoid “avalanche” effects (i.e., small changes in the input data should produce large changes in the hash value) and to be efficient on modern processors.
The constants c1
, c2
, rotation_1
, rotation_2
, m
, and
n
used in the algorithm are carefully chosen to produce good
quality hash values. These constants are not secret, and can be
freely shared.
Note that the MurmurHash3 algorithm is designed to be used with byte-oriented data, but it can be used with other data types by first converting them to byte arrays. Also note that while MurmurHash3 is a good general-purpose hash function, there are other hash functions that may be better suited for specific applications or data types.
Versions
MurmurHash1
The original MurmurHash was created as an attempt to make a faster function than Lookup3. Although successful, it hadn’t been tested thoroughly and wasn’t capable of providing 64-bit hashes as in Lookup3. It had a rather elegant design, that would be later built upon in MurmurHash2, combining a multiplicative hash (similar to Fowler–Noll–Vo hash function) with a Xorshift.
MurmurHash2
MurmurHash2 yields a 32-bit or 64-bit value. It came in multiple variants, including some that allowed incremental hashing and aligned or neutral versions.
- MurmurHash2 (32-bit, x86)
- The original version; contains a flaw that weakens collision in some
- MurmurHash2A (32-bit, x86)
- A fixed variant using Merkle–Damgård construction. Slightly slower.
- CMurmurHash2A (32-bit, x86)
- MurmurHash2A but works incrementally.
- MurmurHashNeutral2 (32-bit, x86)
- Slower, but endian and alignment neutral.
- MurmurHashAligned2 (32-bit, x86)
- Slower, but does aligned reads (safer on some platforms).
- MurmurHash64A (64-bit, x64)
- The original 64-bit version. Optimized for 64-bit arithmetic.
- MurmurHash64B (64-bit, x86)
- A 64-bit version optimized for 32-bit platforms. It is not a true 64-bit hash due to insufficient mixing of the stripes.
The person who originally found the flaw[clarification needed] in MurmurHash2 created an unofficial 160-bit version of MurmurHash2 called MurmurHash2_160.
MurmurHash3
The current version is MurmurHash3, which yields a 32-bit or 128-bit hash value.
When using 128-bits, the x86 and x64 versions do not produce the same values, as the algorithms are optimized for their respective platforms.
MurmurHash3 was released alongside SMHasher—a hash function test suite.
Debian package
There is a package called libmurmurhash-dev
in the standard Debian
distro.
sudo apt install libmurmurhash-dev
Use the version at https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp for production.
Below is presented a 32 bits version for explain the process.
// MurmurHash3 constants
constexpr uint32_t c1 = 0xcc9e2d51;
constexpr uint32_t c2 = 0x1b873593;
constexpr uint32_t rotation_1 = 15;
constexpr uint32_t rotation_2 = 13;
constexpr uint32_t m = 5;
constexpr uint32_t n = 0xe6546b64;
// This function performs the core MurmurHash3 algorithm
uint32_t murmurhash3_core(const uint8_t* data, size_t len, uint32_t h) {
const uint32_t* blocks = reinterpret_cast<const uint32_t*>(data);
const size_t n_blocks = len / sizeof(uint32_t); // four bytes block
for (size_t i = 0; i < n_blocks; ++i) {
uint32_t block = blocks[i];
block *= c1;
// block rotation 1
block = (block << rotation_1) | (block >> (32 - rotation_1));
// 01011001110001111000011111000000 original block
// --------------------------------
// 11000011111000000000000000000000 block << 15
// 00000000000000001011001110001111 block >> (32-15)
// -------------------------------- or
// 11000011111000001011001110001111 rotated block
block *= c2;
h ^= block;
// block rotation 2
h = (h << rotation_2) | (h >> (32 - rotation_2));
h = h * m + n;
}
const uint8_t* tail = data + n_blocks * sizeof(uint32_t);
uint32_t block = 0;
switch (len & 3) {
case 3:
block ^= tail[2] << 16;
[[fallthrough]];
case 2:
block ^= tail[1] << 8;
[[fallthrough]];
case 1:
block ^= tail[0];
block *= c1;
block = (block << rotation_1) | (block >> (32 - rotation_1));
block *= c2;
h ^= block;
}
h ^= len;
h ^= h >> 16;
h *= 0x85ebca6b;
h ^= h >> 13;
h *= 0xc2b2ae35;
h ^= h >> 16;
return h;
}
// This is the main MurmurHash3 function that clients will use
uint32_t murmurhash3(const void* key, size_t len, uint32_t seed = 0) {
const uint8_t* data = reinterpret_cast<const uint8_t*>(key);
const size_t n_blocks = len / sizeof(uint32_t);
uint32_t h = seed;
// Process the data in 4-byte blocks
h = murmurhash3_core(data, n_blocks * sizeof(uint32_t), h);
// Process any remaining bytes
const size_t n_tail = len % sizeof(uint32_t);
if (n_tail != 0) {
uint32_t k = 0;
memcpy(&k, data + n_blocks * sizeof(uint32_t), n_tail);
h ^= k;
h *= c1;
h = (h << rotation_1) | (h >> (32 - rotation_1));
h *= c2;
}
// Finalize the hash
h ^= len;
h ^= h >> 16;
h *= 0x85ebca6b;
h ^= h >> 13;
h *= 0xc2b2ae35;
h ^= h >> 16;
return h;
}
int main() {
// Avalanche test on seed
// Small change on key result on dramatically change on computed hash
constexpr uint32_t seed1 = 1234;
constexpr uint32_t seed2 = 1235;
// Avalanche test on input
// Small change on input result on dramatically change on computed hash
std::list<std::string> texts{"Hello, world!",
"hello, world!",
"Hello, World!",
"hello, world!",
"",
"h",
"he",
"hello"};
for (const auto& text : texts) {
uint32_t hash1 = murmurhash3(text.data(), text.size(), seed1);
uint32_t hash2 = murmurhash3(text.data(), text.size(), seed2);
fmt::print("hash: {:10} seed: {} text: {}\n", hash1, seed1, text);
fmt::print("hash: {:10} seed: {} text: {}\n", hash2, seed2, text);
}
// Tests with the package libmurmurhash-dev
// clang-format off
// void lmmh_x86_32( const void *addr, unsigned int len, uint32_t seed, uint32_t out[1]);
// void lmmh_x86_128(const void *addr, unsigned int len, uint32_t seed, uint32_t out[4]);
// void lmmh_x64_128(const void *addr, unsigned int len, uint32_t seed, uint64_t out[2]);
// clang-format on
for (const auto& text : texts) {
uint32_t hash32;
lmmh_x86_32(text.data(), text.size(), seed1, &hash32);
fmt::print("hash: {:10} seed: {} text: {}\n", hash32, seed1, text);
}
for (const auto& text : texts) {
uint32_t hash128A[4];
uint64_t hash128B[2];
lmmh_x86_128(text.data(), text.size(), seed1, hash128A);
lmmh_x64_128(text.data(), text.size(), seed1, hash128B);
fmt::print("hash128A: {:10} {:10} {:10} {:10} seed: {} text: {}\n",
hash128A[0], hash128A[1], hash128A[2], hash128A[3], seed1, text);
fmt::print("hash128B: {:21} {:21} seed: {} text: {}\n", hash128B[0],
hash128B[1], seed1, text);
}
return 0;
}
Possible output
hash: 1880165001 seed: 1234 text: Hello, world!
hash: 37900214 seed: 1235 text: Hello, world!
hash: 1773352101 seed: 1234 text: hello, world!
hash: 3125240818 seed: 1235 text: hello, world!
hash: 2650466895 seed: 1234 text: Hello, World!
hash: 3693886096 seed: 1235 text: Hello, World!
hash: 1773352101 seed: 1234 text: hello, world!
hash: 3125240818 seed: 1235 text: hello, world!
hash: 3613156711 seed: 1234 text:
hash: 2319787446 seed: 1235 text:
hash: 1968416228 seed: 1234 text: h
hash: 3135554948 seed: 1235 text: h
hash: 1299136751 seed: 1234 text: he
hash: 3700568032 seed: 1235 text: he
hash: 616337965 seed: 1234 text: hello
hash: 3371755843 seed: 1235 text: hello
hash: 4210478515 seed: 1234 text: Hello, world!
hash: 1215213111 seed: 1234 text: hello, world!
hash: 3644279836 seed: 1234 text: Hello, World!
hash: 1215213111 seed: 1234 text: hello, world!
hash: 254590987 seed: 1234 text:
hash: 1073392072 seed: 1234 text: h
hash: 19595036 seed: 1234 text: he
hash: 2251423591 seed: 1234 text: hello
hash128A: 4192683273 3344351611 905885657 131714559 seed: 1234 text: Hello, world!
hash128B: 6994950471748863742 5906757252613544790 seed: 1234 text: Hello, world!
hash128A: 3379794421 1391467063 204088760 2201735466 seed: 1234 text: hello, world!
hash128B: 3334729735983292266 15246033631058457288 seed: 1234 text: hello, world!
hash128A: 2645690248 1320752661 2676918588 3486440893 seed: 1234 text: Hello, World!
hash128B: 13342170012096846388 10422801084110055398 seed: 1234 text: Hello, World!
hash128A: 3379794421 1391467063 204088760 2201735466 seed: 1234 text: hello, world!
hash128B: 3334729735983292266 15246033631058457288 seed: 1234 text: hello, world!
hash128A: 396337949 2466738178 2466738178 2466738178 seed: 1234 text:
hash128B: 5006475794136178589 13573877494810213620 seed: 1234 text:
hash128A: 3741828134 1966168643 1966168643 1966168643 seed: 1234 text: h
hash128B: 11851864647889073320 18017523628106187849 seed: 1234 text: h
hash128A: 740872880 1097768591 1097768591 1097768591 seed: 1234 text: he
hash128B: 12027140842659985391 5619874163494401635 seed: 1234 text: he
hash128A: 1597004003 2034712666 2930991220 2930991220 seed: 1234 text: hello
hash128B: 10403193130508565092 11308957242644105945 seed: 1234 text: hello