Back to TILs

C++ murmur hash

Date: 2023-03-30Last modified: 2024-11-02

Table of contents

Introduction

The name comes from two basic operations, MUltiply and Rotate.

MurmurHash3 is a non-cryptographic hash function that produces a 32-bit hash value for a given input data. It was designed to be fast and produce good quality hash values for a wide variety of input data.

The basic MurmurHash3 algorithm works as follows:

The MurmurHash3 algorithm uses a combination of multiplication, XOR, bit shifting, and other bitwise operations to mix the bits of the input data and produce the final hash value. The specific operations used are designed to avoid “avalanche” effects (i.e., small changes in the input data should produce large changes in the hash value) and to be efficient on modern processors.

The constants c1, c2, rotation_1, rotation_2, m, and n used in the algorithm are carefully chosen to produce good quality hash values. These constants are not secret, and can be freely shared.

Note that the MurmurHash3 algorithm is designed to be used with byte-oriented data, but it can be used with other data types by first converting them to byte arrays. Also note that while MurmurHash3 is a good general-purpose hash function, there are other hash functions that may be better suited for specific applications or data types.

Versions

MurmurHash1

The original MurmurHash was created as an attempt to make a faster function than Lookup3. Although successful, it hadn’t been tested thoroughly and wasn’t capable of providing 64-bit hashes as in Lookup3. It had a rather elegant design, that would be later built upon in MurmurHash2, combining a multiplicative hash (similar to Fowler–Noll–Vo hash function) with a Xorshift.

MurmurHash2

MurmurHash2 yields a 32-bit or 64-bit value. It came in multiple variants, including some that allowed incremental hashing and aligned or neutral versions.

The person who originally found the flaw[clarification needed] in MurmurHash2 created an unofficial 160-bit version of MurmurHash2 called MurmurHash2_160.

MurmurHash3

The current version is MurmurHash3, which yields a 32-bit or 128-bit hash value.

MurmurHash3 was released alongside SMHasher—a hash function test suite.

Debian package

There is a package called libmurmurhash-dev in the standard Debian distro.

sudo apt install libmurmurhash-dev

Use the version at https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp for production.

Below is presented a 32 bits version for explain the process.

// MurmurHash3 constants
constexpr uint32_t c1 = 0xcc9e2d51;
constexpr uint32_t c2 = 0x1b873593;
constexpr uint32_t rotation_1 = 15;
constexpr uint32_t rotation_2 = 13;
constexpr uint32_t m = 5;
constexpr uint32_t n = 0xe6546b64;
// This function performs the core MurmurHash3 algorithm
uint32_t murmurhash3_core(const uint8_t* data, size_t len, uint32_t h) {
  const uint32_t* blocks = reinterpret_cast<const uint32_t*>(data);
  const size_t n_blocks = len / sizeof(uint32_t);  // four bytes block

  for (size_t i = 0; i < n_blocks; ++i) {
    uint32_t block = blocks[i];

    block *= c1;
    // block rotation 1
    block = (block << rotation_1) | (block >> (32 - rotation_1));
    // 01011001110001111000011111000000 original block
    // --------------------------------
    // 11000011111000000000000000000000 block << 15
    // 00000000000000001011001110001111 block >> (32-15)
    // -------------------------------- or
    // 11000011111000001011001110001111 rotated block

    block *= c2;

    h ^= block;

    // block rotation 2
    h = (h << rotation_2) | (h >> (32 - rotation_2));
    h = h * m + n;
  }

  const uint8_t* tail = data + n_blocks * sizeof(uint32_t);
  uint32_t block = 0;

  switch (len & 3) {
    case 3:
      block ^= tail[2] << 16;
      [[fallthrough]];
    case 2:
      block ^= tail[1] << 8;
      [[fallthrough]];
    case 1:
      block ^= tail[0];
      block *= c1;
      block = (block << rotation_1) | (block >> (32 - rotation_1));
      block *= c2;
      h ^= block;
  }

  h ^= len;
  h ^= h >> 16;
  h *= 0x85ebca6b;
  h ^= h >> 13;
  h *= 0xc2b2ae35;
  h ^= h >> 16;

  return h;
}
// This is the main MurmurHash3 function that clients will use
uint32_t murmurhash3(const void* key, size_t len, uint32_t seed = 0) {
  const uint8_t* data = reinterpret_cast<const uint8_t*>(key);
  const size_t n_blocks = len / sizeof(uint32_t);

  uint32_t h = seed;

  // Process the data in 4-byte blocks
  h = murmurhash3_core(data, n_blocks * sizeof(uint32_t), h);

  // Process any remaining bytes
  const size_t n_tail = len % sizeof(uint32_t);
  if (n_tail != 0) {
    uint32_t k = 0;
    memcpy(&k, data + n_blocks * sizeof(uint32_t), n_tail);
    h ^= k;
    h *= c1;
    h = (h << rotation_1) | (h >> (32 - rotation_1));
    h *= c2;
  }

  // Finalize the hash
  h ^= len;
  h ^= h >> 16;
  h *= 0x85ebca6b;
  h ^= h >> 13;
  h *= 0xc2b2ae35;
  h ^= h >> 16;

  return h;
}
int main() {
  // Avalanche test on seed
  // Small change on key result on dramatically change on computed hash
  constexpr uint32_t seed1 = 1234;
  constexpr uint32_t seed2 = 1235;

  // Avalanche test on input
  // Small change on input result on dramatically change on computed hash
  std::list<std::string> texts{"Hello, world!",
                               "hello, world!",
                               "Hello, World!",
                               "hello, world!",
                               "",
                               "h",
                               "he",
                               "hello"};

  for (const auto& text : texts) {
    uint32_t hash1 = murmurhash3(text.data(), text.size(), seed1);
    uint32_t hash2 = murmurhash3(text.data(), text.size(), seed2);

    fmt::print("hash: {:10}   seed: {}   text: {}\n", hash1, seed1, text);
    fmt::print("hash: {:10}   seed: {}   text: {}\n", hash2, seed2, text);
  }

  // Tests with the package libmurmurhash-dev
  // clang-format off
  // void lmmh_x86_32( const void *addr, unsigned int len, uint32_t seed, uint32_t out[1]);
  // void lmmh_x86_128(const void *addr, unsigned int len, uint32_t seed, uint32_t out[4]);
  // void lmmh_x64_128(const void *addr, unsigned int len, uint32_t seed, uint64_t out[2]);
  // clang-format on
  for (const auto& text : texts) {
    uint32_t hash32;
    lmmh_x86_32(text.data(), text.size(), seed1, &hash32);
    fmt::print("hash: {:10}   seed: {}   text: {}\n", hash32, seed1, text);
  }

  for (const auto& text : texts) {
    uint32_t hash128A[4];
    uint64_t hash128B[2];

    lmmh_x86_128(text.data(), text.size(), seed1, hash128A);
    lmmh_x64_128(text.data(), text.size(), seed1, hash128B);

    fmt::print("hash128A: {:10} {:10} {:10} {:10}   seed: {}   text: {}\n",
               hash128A[0], hash128A[1], hash128A[2], hash128A[3], seed1, text);

    fmt::print("hash128B: {:21} {:21}   seed: {}   text: {}\n", hash128B[0],
               hash128B[1], seed1, text);
  }

  return 0;
}

Possible output

hash: 1880165001   seed: 1234   text: Hello, world!
hash:   37900214   seed: 1235   text: Hello, world!
hash: 1773352101   seed: 1234   text: hello, world!
hash: 3125240818   seed: 1235   text: hello, world!
hash: 2650466895   seed: 1234   text: Hello, World!
hash: 3693886096   seed: 1235   text: Hello, World!
hash: 1773352101   seed: 1234   text: hello, world!
hash: 3125240818   seed: 1235   text: hello, world!
hash: 3613156711   seed: 1234   text: 
hash: 2319787446   seed: 1235   text: 
hash: 1968416228   seed: 1234   text: h
hash: 3135554948   seed: 1235   text: h
hash: 1299136751   seed: 1234   text: he
hash: 3700568032   seed: 1235   text: he
hash:  616337965   seed: 1234   text: hello
hash: 3371755843   seed: 1235   text: hello
hash: 4210478515   seed: 1234   text: Hello, world!
hash: 1215213111   seed: 1234   text: hello, world!
hash: 3644279836   seed: 1234   text: Hello, World!
hash: 1215213111   seed: 1234   text: hello, world!
hash:  254590987   seed: 1234   text: 
hash: 1073392072   seed: 1234   text: h
hash:   19595036   seed: 1234   text: he
hash: 2251423591   seed: 1234   text: hello
hash128A: 4192683273 3344351611  905885657  131714559   seed: 1234   text: Hello, world!
hash128B:   6994950471748863742   5906757252613544790   seed: 1234   text: Hello, world!
hash128A: 3379794421 1391467063  204088760 2201735466   seed: 1234   text: hello, world!
hash128B:   3334729735983292266  15246033631058457288   seed: 1234   text: hello, world!
hash128A: 2645690248 1320752661 2676918588 3486440893   seed: 1234   text: Hello, World!
hash128B:  13342170012096846388  10422801084110055398   seed: 1234   text: Hello, World!
hash128A: 3379794421 1391467063  204088760 2201735466   seed: 1234   text: hello, world!
hash128B:   3334729735983292266  15246033631058457288   seed: 1234   text: hello, world!
hash128A:  396337949 2466738178 2466738178 2466738178   seed: 1234   text: 
hash128B:   5006475794136178589  13573877494810213620   seed: 1234   text: 
hash128A: 3741828134 1966168643 1966168643 1966168643   seed: 1234   text: h
hash128B:  11851864647889073320  18017523628106187849   seed: 1234   text: h
hash128A:  740872880 1097768591 1097768591 1097768591   seed: 1234   text: he
hash128B:  12027140842659985391   5619874163494401635   seed: 1234   text: he
hash128A: 1597004003 2034712666 2930991220 2930991220   seed: 1234   text: hello
hash128B:  10403193130508565092  11308957242644105945   seed: 1234   text: hello

References