Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Redis PFADD to check a exists-in-set query

Tags:

redis

I have a requirement to process multiple records from a queue. But due to some external issues the items may sporadically occur multiple times. I need to process items only once

What I planned to use is PFADD into redis every record ( as a md5sum) and then see if that returns success. If that shows no increment then the record is a duplicate else process the record.

This seems pretty straightforward , but I am getting too many false positives while using PFADD

Is there a better way to do this ?

like image 563
Ram Avatar asked Mar 16 '16 08:03

Ram


1 Answers

Being the probabilistic data structure that it is, Redis' HyperLogLog exhibits 0.81% standard error. You can reduce (but never get rid of) the probability for false positives by using multiple HLLs, each counting a the value of a different hash function on your record.

Also note that if you're using a single HLL there's no real need to hash the record - just PFADD as is.

Alternatively, use a Redis Set to keep all the identifiers/hashes/records and have 100%-accurate membership tests with SISMEMBER. This approach requires more (RAM) resources as you're storing each processed element, but unless your queue is really huge that shouldn't be a problem for a modest Redis instance. To keep memory consumption under control, switch between Sets according to the date and set an expiry on the Set keys (another approach is to use a single Sorted Set and manually remove old items from it by keeping their timestamp in the score).

like image 142
Itamar Haber Avatar answered Oct 10 '22 21:10

Itamar Haber