I gave an overview of Bloom filters in the previous blog post. To summarize the properties of a Bloom filter as a data structure:
- Fast inserts
- Fast lookups
- No false negatives -> every item that has been inserted returns true on a membership lookup
- You can trade space against the false positive rate: if a higher false positive rate is acceptable, you can save quite a bit of space
- However, you can only look up whether a key exists; reverse lookups (recovering the inserted items) are not possible, since hash collisions can occur
- The usual set operations (union, intersection) are available
At this point we know that Bloom filters are space-efficient data structures that accept a small false positive rate in exchange for using far less space than storing the inserted items themselves. As we increase the size of the Bloom filter, the false positive rate decreases, because hash collisions between different inserted items become rarer. In the limit, as we keep growing the Bloom filter, we approximate a vanilla hashmap: there are effectively no hash collisions, and every item maps to a unique bit in the filter.
In this note, I am going to go a little deeper: I will detail how the internals of a Bloom filter work, and formulate the tradeoffs more precisely.
Formulation
A Bloom filter is nothing but an array of $n$ bits. We generally use $k$ hash functions so that a single inserted item sets multiple bits in the array.
- $n$ bits: an array $A$ of $n$ bits
- $k$ hash functions: $h_1, h_2, \ldots, h_k$
- $b = \frac{n}{s}$: the number of bits per object, where $s$ is the number of objects in the dataset
We use multiple hash functions because they make full hash collisions less frequent and lookups more robust, which keeps the false positive rate of the Bloom filter very small. As you can imagine, with multiple hash functions any single position in the array may collide across items; but a collision at one position is not enough to produce a false positive, because all $k$ hashed positions must be set to 1 before a lookup incorrectly reports membership.
The reason we define bits per object is that the error formula below is stated in terms of the number of bits per object.
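To make the formulation concrete, here is a minimal sketch of the structure in Python. The class name, the use of salted SHA-256 digests to simulate $k$ independent hash functions, and a plain Python list as the bit array are all illustrative choices, not the only (or most efficient) way to do this:

```python
import hashlib

class BloomFilter:
    """A minimal sketch: an array of n bits plus k hash functions."""

    def __init__(self, n, k):
        self.n = n           # number of bits in the array
        self.k = k           # number of hash functions
        self.bits = [0] * n  # the bit array A, initially all zeros

    def _positions(self, item):
        # Simulate k hash functions h_1(x), ..., h_k(x) by salting a
        # single base hash with the index i, then reducing mod n.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.n
```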
Insertion
We insert an element into the Bloom filter by setting the bits at the positions produced by the $k$ hash functions. Formally:
- Insert $x$: for $i = 1, 2, \ldots, k$, set $A[h_i(x)] = 1$
That is, every position returned by the hash functions is set to 1.
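Continuing the sketch from the Formulation section, insertion is just a loop over the hashed positions:

```python
    def insert(self, item):
        # Insert(x): set A[h_i(x)] = 1 for i = 1, 2, ..., k.
        # Note that bits are only ever turned on, never off.
        for pos in self._positions(item):
            self.bits[pos] = 1
```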
Lookup
Lookup is very similar to insertion: we check the bits at the positions produced by the $k$ hash functions. If every one of those bits is set to 1, we say the item is “probably” there.
- Lookup $x$: return True if and only if $A[h_i(x)] = 1$ for every $i = 1, 2, \ldots, k$
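And the corresponding lookup method for the sketch above:

```python
    def lookup(self, item):
        # Lookup(x): True iff A[h_i(x)] = 1 for every i = 1, 2, ..., k.
        # True means "probably present"; False is always correct,
        # since a Bloom filter has no false negatives.
        return all(self.bits[pos] == 1 for pos in self._positions(item))
```

Putting it together:

```python
bf = BloomFilter(n=10_000, k=7)
bf.insert("alice")
print(bf.lookup("alice"))    # True: inserted items are always found
print(bf.lookup("mallory"))  # almost certainly False at this size
```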
Error Rate
We can approximate the false positive rate (error rate) with the following formula. It comes from the observation that after inserting $s$ objects, the probability that any particular bit is still 0 is roughly $e^{-ks/n} = e^{-k/b}$, and a false positive requires all $k$ looked-up bits to be 1. Note that increasing the Bloom filter size drives the error rate down, since it increases the number of bits per object; adding hash functions also drives it down, but only up to a point, as we will see below.
- $\approx \left(1 - e^{-k/b}\right)^k$
For a given $b$, we can find the optimal $k$ by taking the derivative of the error rate with respect to $k$ and setting it to zero.
How to set $k$?
- Taking the derivative of the error rate and setting it to zero yields:
- $k = b \times \ln 2$
By doing so, for a given bits-per-object budget $b$, we can compute the number of hash functions $k$ very easily. Plugging $k = b \ln 2$ back into the error formula gives an error rate of about $(1/2)^{b \ln 2} \approx 0.618^b$.
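As a small worked example under the formula above (the helper names here are illustrative):

```python
import math

def optimal_k(b):
    # k = b * ln(2), rounded to a whole number of hash functions
    return max(1, round(b * math.log(2)))

def error_rate(b, k):
    # Approximate false positive rate: (1 - e^{-k/b})^k
    return (1 - math.exp(-k / b)) ** k

b = 8                       # budget of 8 bits per object
k = optimal_k(b)            # -> 6
print(k, error_rate(b, k))  # ~0.0216, i.e. about a 2% false positive rate
```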