C++ — HyperLogLog
Introduction
The main idea behind the HyperLogLog algorithm is to estimate the number of distinct elements (the cardinality) of a large set using a small, fixed amount of memory. It achieves this with a probabilistic data structure that trades exact counting for an approximation with a well-bounded error.
The algorithm works by hashing each element; the top bits of the hash determine the index of a register in a table of fixed size. The registers are initialized to zero, and for each element the algorithm computes a "rank" from the remaining hash bits, defined as the 1-based position of the leftmost set bit in their binary representation.
Each register stores the maximum rank seen so far among the elements hashed to it. After processing all elements in the set, the algorithm combines the register values to estimate the number of distinct elements.
The key insight of HyperLogLog is that the rank of a uniformly random hash follows a geometric distribution: a rank of k or greater occurs with probability 2^(1-k). The maximum rank observed therefore grows with the logarithm of the number of distinct elements and does not depend on which specific elements appear. This property enables the algorithm to estimate the cardinality with high accuracy while storing only a small number of register values.
To improve accuracy, HyperLogLog applies a "bias correction": a multiplicative constant that cancels the systematic overestimate introduced by taking the harmonic mean of the registers. The constant depends only on the number of registers, i.e. on the precision, and additional corrections are applied at very small and very large cardinalities.
Overall, the HyperLogLog algorithm provides a highly efficient and scalable method for estimating the number of distinct elements in large datasets, making it a popular tool in big data analytics and database systems.
Precision
In HyperLogLog, the precision is the number of hash bits used to select a register, so the number of registers is 2^precision. The higher the precision, the more accurate the estimate of the number of distinct elements in the set will be, but also the more memory the algorithm will consume.
In the C++ implementation below, the precision is passed as an argument to the constructor of the HyperLogLog class. For example, if you create a HyperLogLog object with precision 10, it will use 1024 registers (2^10) to estimate the number of distinct elements in the set.
The choice of precision depends on the trade-off between accuracy and memory usage. In practice, a precision of 10-12 is often used, which provides a reasonable estimate with a moderate memory overhead. However, the optimal precision may vary depending on the specific application and the size of the set being estimated.
Memory usage
The table below shows the approximate memory usage in bytes for a HyperLogLog data structure at different precisions, assuming each register occupies a 4-byte word. The uint8_t-based implementation below stores one byte per register, so it needs only a quarter of these figures.

| Precision (bits) | Number of registers | Memory usage (bytes) |
|---|---|---|
| 4 | 16 | 64 |
| 5 | 32 | 128 |
| 6 | 64 | 256 |
| 7 | 128 | 512 |
| 8 | 256 | 1024 |
| 9 | 512 | 2048 |
| 10 | 1024 | 4096 |
| 11 | 2048 | 8192 |
| 12 | 4096 | 16384 |
| 13 | 8192 | 32768 |
| 14 | 16384 | 65536 |
| 15 | 32768 | 131072 |
Note that these figures cover only the register array; the actual footprint also includes allocator overhead and any per-object bookkeeping. Unlike an exact set, however, the memory usage is fixed once the precision is chosen and does not grow with the size of the dataset being analyzed.
```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

class HyperLogLog {
public:
  HyperLogLog(uint8_t precision) : precision(precision), size(1u << precision) {
    M = new uint8_t[size];
    std::memset(M, 0, size * sizeof(uint8_t));
    alpha = get_alpha(precision);
  }
  ~HyperLogLog() { delete[] M; }

  void add(uint64_t value) {
    uint32_t hash = murmurhash(value);
    // The top `precision` bits of the hash select the register.
    uint32_t index = hash >> (32 - precision);
    // The remaining bits determine the rank: the 1-based position of the
    // leftmost set bit. If they are all zero, the rank is capped at
    // 32 - precision + 1 (count_leading_zeros assumes a nonzero input).
    uint32_t rest = hash << precision;
    uint8_t rho = (rest == 0)
                      ? static_cast<uint8_t>(32 - precision + 1)
                      : static_cast<uint8_t>(count_leading_zeros(rest) + 1);
    if (rho > M[index]) {
      M[index] = rho;
    }
  }

  uint64_t count() const {
    // Harmonic mean of 2^M[i] over all registers.
    double Z = 0;
    for (uint32_t i = 0; i < size; i++) {
      Z += std::ldexp(1.0, -M[i]); // 2^(-M[i]) without shift overflow
    }
    double estimate = alpha * size * size / Z;
    if (estimate <= 2.5 * size) {
      // Small-range correction: fall back to linear counting while some
      // registers are still zero.
      uint32_t zeros = 0;
      for (uint32_t i = 0; i < size; i++) {
        if (M[i] == 0) {
          zeros++;
        }
      }
      if (zeros != 0) {
        estimate = size * std::log(static_cast<double>(size) / zeros);
      }
    } else if (estimate > (1.0 / 30.0) * std::pow(2.0, 32)) {
      // Large-range correction for collisions in the 32-bit hash space.
      estimate = -std::pow(2.0, 32) * std::log(1.0 - estimate / std::pow(2.0, 32));
    }
    return static_cast<uint64_t>(estimate);
  }

private:
  uint8_t precision;
  uint32_t size;
  uint8_t *M;
  double alpha;

  // 32-bit MurmurHash2 over the 8 bytes of `value`. Both 32-bit halves are
  // mixed in, so values that differ only in their upper bits hash differently.
  uint32_t murmurhash(uint64_t value) const {
    const uint32_t seed = 0x9747b28c;
    const uint32_t m = 0x5bd1e995;
    uint32_t h = seed ^ static_cast<uint32_t>(sizeof(value));
    uint32_t chunks[2] = {static_cast<uint32_t>(value),
                          static_cast<uint32_t>(value >> 32)};
    for (uint32_t k : chunks) {
      k *= m;
      k ^= k >> 24;
      k *= m;
      h *= m;
      h ^= k;
    }
    h ^= h >> 13;
    h *= m;
    h ^= h >> 15;
    return h;
  }

  // Count of leading zeros (Hacker's Delight style); assumes x != 0.
  uint8_t count_leading_zeros(uint32_t x) const {
    uint8_t n = 1;
    if ((x >> 16) == 0) { n += 16; x <<= 16; }
    if ((x >> 24) == 0) { n += 8;  x <<= 8; }
    if ((x >> 28) == 0) { n += 4;  x <<= 4; }
    if ((x >> 30) == 0) { n += 2;  x <<= 2; }
    n -= x >> 31;
    return n;
  }

  // Bias-correction constant; 0.721 approximates 0.7213 / (1 + 1.079/m)
  // for large register counts.
  double get_alpha(uint8_t precision) const {
    switch (precision) {
    case 4: return 0.673;
    case 5: return 0.697;
    case 6: return 0.709;
    default: return 0.721;
    }
  }
};
```
In this example, we create a HyperLogLog object with precision 10, which means it has 1024 registers. We then generate a set of 1 million random integers and add them to both the std::unordered_set and the HyperLogLog. Finally, we print out the actual size of the set and the estimated size of the set using the count method of the HyperLogLog object.
Because every value is also inserted into the std::unordered_set, the true number of distinct elements is known, and we can compare the estimate against it. In practice, HyperLogLog produces estimates within a few percent of the true cardinality at a tiny fraction of the memory cost of an exact set.
```cpp
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <unordered_set>

int main() {
  HyperLogLog hll(10); // precision 10 -> 1024 registers
  std::unordered_set<uint64_t> set;
  for (int i = 0; i < 1000000; i++) {
    uint64_t value = rand();
    set.insert(value);
    hll.add(value);
  }
  std::cout << "Actual set size: " << set.size() << std::endl;
  std::cout << "Estimated set size: " << hll.count() << std::endl;
  // Cast to double before subtracting: set.size() is unsigned, so a larger
  // estimate would otherwise wrap around to a huge positive value.
  double error = std::abs(static_cast<double>(set.size()) -
                          static_cast<double>(hll.count())) /
                 set.size() * 100.0;
  std::cout << "Error: " << error << "%" << std::endl;
  return 0;
}
```
Possible output
Actual set size: 999752
Estimated set size: 1013140
Error: 1.33913%