
Logging infinite data on periodic intervals

I have an infinite number of entries being fed through a web interface. On a per-minute basis, I'd like to dump elements that were received in the last hour into a file named appropriately (datetime.now().strftime('%Y_%m_%d_%H_%M')).
Here's my design so far:

Thread-1

Keeps receiving input and adding to a data_dict of structure:
{datetime.now().strftime('%Y_%m_%d_%H_%M'): []}

Thread-2

Sleeps for a minute and writes contents of data_dict[(datetime.now() - timedelta(minutes=1)).strftime('%Y_%m_%d_%H_%M')]

Question

  1. Is using dict in this manner thread-safe?
  2. Is this a good design? :)
Lelouch Lamperouge asked Mar 20 '23 13:03


2 Answers

1) This is (almost) thread safe. The individual operations on dict are thread safe, and your reading thread should never read from a key which is still being written to. The exception is the following race condition, which relies on a context switch occurring close to a minute boundary.

Thread 1: receives a message at 2014-05-20 13:37:59.999 then is pre-empted

Thread 2: checks the time (it is now 2014-05-20 13:38:00.000) so it reads from 2014_05_20_13_37

Thread 1: appends its message to the end of the 2014_05_20_13_37 queue
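
The interleaving above can be replayed deterministically in a single thread. This is only a sketch: `key_for`, `pending_key` and the other names are hypothetical, and the three steps are run by hand in the order listed rather than by real threads.

```python
from datetime import datetime, timedelta

FMT = '%Y_%m_%d_%H_%M'
data_dict = {}

def key_for(moment):
    """Bucket name for the minute that contains `moment`."""
    return moment.strftime(FMT)

# Step 1: thread 1 timestamps a message just before the boundary, then is pre-empted.
received_at = datetime(2014, 5, 20, 13, 37, 59, 999000)
pending_key = key_for(received_at)

# Step 2: thread 2 wakes at the boundary and writes out the previous minute.
woke_at = datetime(2014, 5, 20, 13, 38, 0)
flushed_key = key_for(woke_at - timedelta(minutes=1))
written = list(data_dict.get(flushed_key, []))

# Step 3: thread 1 resumes and appends to the bucket that was already written.
data_dict.setdefault(pending_key, []).append('late message')

# Anything in the bucket that was not in the written snapshot never reaches the file.
lost = [m for m in data_dict[flushed_key] if m not in written]
print(flushed_key, lost)
```

The message lands in the 2014_05_20_13_37 bucket after that bucket has been written, so it never reaches the file.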

2) No, this is not good design, and not just because there is an edge case in the thread safety condition. If you need to guarantee something for every minute, sleeping is a very error-prone way to do this. First of all, the sleep operation does not sleep for EXACTLY the amount of time given. It sleeps for at least that amount of time. Second, even if it were exact, the rest of your operation still takes some time, which means there would be milliseconds of drift between your sleep calls. These two factors will likely result in you missing a minute every 6000-60000 minutes or so.
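
One way to read that range is to assume each loop iteration overshoots by somewhere between 1 and 10 milliseconds (an assumed figure, not a measurement). The drift then adds up to a full minute after 60 seconds divided by the overshoot. The helper `minutes_until_skip` is hypothetical and only does that division:

```python
def minutes_until_skip(overhead_seconds, period_seconds=60.0):
    """Iterations before accumulated drift adds up to one full period."""
    return round(period_seconds / overhead_seconds)

# Assumed per-iteration overshoot of 10 ms and 1 ms.
for overhead in (0.010, 0.001):
    print(overhead, minutes_until_skip(overhead))
```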

Ignoring your race condition in part 1, I would do the following:

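import datetime
import time

def generate_times():
    # Start at the current minute, truncated to zero seconds.
    now = datetime.datetime.now()
    next_time = datetime.datetime(now.year, now.month, now.day, now.hour, now.minute)
    while True:
        yield next_time
        next_time += datetime.timedelta(minutes=1)

def past_times():
    # Yield each minute only once its start is at least one minute in the past.
    # The loop variable is not called "time" so it does not shadow the time module.
    for minute in generate_times():
        while minute > datetime.datetime.now() - datetime.timedelta(minutes=1):
            time.sleep(1.0)
        yield minute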

The first function creates a generator which yields every on-the-minute time, and the second function waits until each of those minutes has fully passed before yielding it.
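
Thread 2 then only needs something to call for each minute that past_times yields. A minimal sketch, assuming the entries are strings written one per line and that `flush_minute` and `out_dir` are hypothetical names of my own:

```python
import datetime
import os
import tempfile

FMT = '%Y_%m_%d_%H_%M'

def flush_minute(data_dict, minute, out_dir):
    """Pop the bucket for `minute` and write it to out_dir/<key>; return the path."""
    key = minute.strftime(FMT)
    # pop() removes the bucket, so the dict does not grow without bound.
    entries = data_dict.pop(key, [])
    path = os.path.join(out_dir, key)
    with open(path, 'w') as handle:
        for entry in entries:
            handle.write(f'{entry}\n')
    return path

# Example with a hand-built bucket instead of a live feed.
out_dir = tempfile.mkdtemp()
data_dict = {'2014_05_20_13_37': ['a', 'b']}
path = flush_minute(data_dict, datetime.datetime(2014, 5, 20, 13, 37), out_dir)
```

The body of thread 2 would then be roughly `for minute in past_times(): flush_minute(data_dict, minute, out_dir)`.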

Probably the easiest way to handle the race condition from part one would be to make thread 2 lag 2 minutes behind instead of just one (or a minute and 7 seconds or 16 minutes, whatever you want). This still isn't failproof: if you have something which stalls your kernel for a long time then these race conditions could still occur, but that is a perfect storm scenario.
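
A sketch of past_times with the lag pulled out as a parameter. The `lag`, `now` and `sleep` arguments are my own additions; the last two exist only so the clock can be replaced when testing.

```python
import datetime
import time

def past_times(lag=datetime.timedelta(minutes=2),
               now=datetime.datetime.now, sleep=time.sleep):
    """Yield each on-the-minute time once it is at least `lag` old."""
    # Start at the current minute, truncated to zero seconds.
    minute = now().replace(second=0, microsecond=0)
    while True:
        # Poll once a second until the minute is old enough.
        while minute > now() - lag:
            sleep(1.0)
        yield minute
        minute += datetime.timedelta(minutes=1)
```

With the default lag, the 13:37 bucket is not touched before 13:39:00, so a writer pre-empted at the boundary has a full extra minute to finish its append.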

If you want to be 100% correct then thread 1 needs to keep a timestamp tracking the latest time at which it has written to the log buffer, but then you are going to want to look into non-blocking IO to make sure your last log doesn't stall while thread 1 is stuck waiting for something to log. I, myself, would just go for using 2 minutes inside of the past_times function instead of 1, unless this is mission-critical logging for something that lives depend on.
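
One possible shape for the timestamp idea, as a sketch only. The class `MinuteBuffer`, its method names and the use of a `threading.Lock` are my own assumptions, not something the design above already has. Thread 1 records the minute of its latest append, and thread 2 only takes buckets strictly older than that minute.

```python
import threading
from datetime import datetime

FMT = '%Y_%m_%d_%H_%M'

class MinuteBuffer:
    def __init__(self):
        self._lock = threading.Lock()
        self._buckets = {}
        self._watermark = None  # minute of the most recent append

    def append(self, entry, now=None):
        """Called by thread 1; `now` is only for injecting a clock in tests."""
        with self._lock:
            # The timestamp is taken while holding the lock, so it cannot go stale.
            moment = datetime.now() if now is None else now
            minute = moment.replace(second=0, microsecond=0)
            self._buckets.setdefault(minute, []).append(entry)
            self._watermark = minute

    def pop_closed(self):
        """Called by thread 2; remove and return buckets older than the watermark."""
        with self._lock:
            closed = sorted(m for m in self._buckets if m < self._watermark)
            return [(m.strftime(FMT), self._buckets.pop(m)) for m in closed]
```

This also shows the stall: the newest bucket is not released until a later entry moves the watermark forward, which is why thread 1 must not block indefinitely on input.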

Brendan F answered Mar 24 '23 11:03


Consider using an external store for this. Redis would be my choice, and since it runs independently of your application, you avoid any threading issues. Plus, Redis is fast for this sort of stuff.

Burhan Khalid answered Mar 24 '23 10:03