Searching with bloom filter

Problem statement

Our platform is sending 4 million emails per day, and many of them contains a lot user generated content which has potential risk of spam. A very important action we need to take is: if we know the sender is already marked as a bad email, we should stop sending emails for it right away.

We now have more than ten thousands of blacklisted emails stored in database, when sending an email, we need to check against this big list to see if we can send the current email out. retrieving all emails from db is not realistic since it has significant latency.

Solutions

This is simple to implement and fit the current situation well since all emails are stored in database, we just need a stored procedure to take an email as input and return True if the email is in the table, otherwise false.

Bloom filter

Thinking out of the box, store with database itself may not be a good solution considering latency of DB read and write. Bloom filter is an excellent algorithm for this kind of large scale search.

The algorithm is very ideal, but when comes to engineering, we still need to consider the trade-offs based on the current service architecture.

  • The current black listed emails are stored in database, and the write operation is done by a different service, we need to consider do we need to change the storage for bloom filter
  • bloom filter needs tuning to get low false positive rates. considering the growth rate of black listed emails, how do we adjust the size of hash functions and bit array lengths

    Here is a useful tool to do the math: https://hur.st/bloomfilter/

My take aways

  • Thinking out of the box can bring you more brilliant ideas
  • Sometimes, excellent algorithm is not always fit