Bloom filters are probabilistic data structure for determining whether an element is in a set. Such a data structure offers only two methods - add
and mightContain
. Google’s guava library offers a nice implementation, but unfortunately this implementation (like every implementation I’ve found) is not concurrent. Concurrent reads are no problem, but writes are trickier - and reading while writing is also not straightforward.
The best I’ve been able to come up with is the following version, which combines a BloomFilter
with a ConcurrentHashMap
. Unfortunately, the space behavior of this data structure is also not as good as a BloomFilter
- the space complexity varies inversely with the concurrency. Code for this is available on github.
The basic idea is the following. We store an AtomicReference
to the BloomFilter
, and we also store a ConcurrentHashMap
:
When we add
an element to the filter, we simply insert it into newValues
. We also increment the insertionsSinceCompact
value.
Once numInserted
reaches insertionsBeforeCompact
(or some multiple)`, we attempt to compact the data structure.
I’ll come back to the compaction process in a moment. The mightContain
operation is then defined as first checking if a value is in the ConcurrentHashMap
, and then checking if it’s in the BloomFilter
:
The compaction process consists of moving values from the ConcurrentHashMap
into the BloomFilter
. However, we can’t simply do this willy nilly - there is a very real possibility that multiple compactions will happen simultaneously. What we must do is copy the Bloom filter, add the elements in newValues
to it, and remember which values we added:
So far this is safely concurrent - we have merely read the ConcurrentHashMap
, and then written these values to the BloomFilter
. So at this time, the elements are contained in both the ConcurrentHashMap
and also the BloomFilter
. So at this time, the mightContain
operation will surely return the correct result. We make sure to limit the number of elements we move to size
elements because we don’t want this operation to run forever if a separate thread keeps calling add
.
Then, in the event we successfully replaced the old BloomFilter
with the new one, we can now safely remove those elements from the ConcurrentHashMap
:
During and after this process, the mightContain
operation will also return the correct result. These values might be absent from newValues
, but they will be present in the BloomFilter
.
This data structure also has no deadlocks. Suppose multiple threads begin a compaction simultaneously. At least one of these threads will successfully call mainFilter.compareAndSet(oldBf, newBf)
- the rest will fail. This means that elements will be moved from the ConcurrentHashMap
into the BloomFilter
- i.e. the process never gets stuck.
All told, this means we have a valid concurrent data structure.
In principle, I believe that the space complexity of this an amortized constant provided the rate of insertion is not too large. Suppose it takes a time c*size
for a compaction event to occur, and suppose new elements are added at a rate a
per unit time. When time == insertionsSinceCompact / a
, a compaction will be triggered. This will take time c*insertionsSinceCompact
. So as of time insertionsSinceCompact/a + c*insertionsSinceCompact = (1/a+c)insertionsSinceCompact
, a compaction will be complete. This means that the largest the ConcurrentHashMap
will ever get is a*(1/a+c)insertionsSinceCompact = (1+ac)insertionsSinceCompact
.
So although the constant factor is larger, this data structure has similar complexity to a BloomFilter.
So now, dear reader, I ask you - is this correct? Does anyone know of a better version of this? What I’ve come up with here seems right to me, but I’d love to know what the internet thinks of this. And since I know a lot of smart folks read my blog, I’m reaching out to you.