We have a table logging data. It is logging at say 15K rows per second.
Question: How would we limit the table size to the 1bn newest rows?
i.e. once 1bn rows is reached, it becomes a ring buffer, deleting the oldest row when adding the newest.
Triggers might load the system too much. Here's a trigger example on SO. We are already using a bunch of tweaks to keep the speed up (such as stored procedures, Table Parameters etc).
Edit (8 years on) :
My recent question/answer here addresses a similar issue using a time series database.
Circular Buffers can be implemented in two ways, using an array or a linked list. An empty object array along with its capacity is initialized inside the constructor as the type of elements added is unknown. Two pointers namely head and tail are maintained for insertion and deletion of elements.
Circular buffers (also known as ring buffers) are fixed-size buffers that work as if the memory is contiguous & circular in nature. As memory is generated and consumed, data does not need to be reshuffled – rather, the head/tail pointers are adjusted. When data is added, the head pointer advances.
A ring buffer is a buffer with a fixed size. When it fills up, adding another element overwrites the oldest one that was still being kept. It's particularly useful for the storage of log information and history. There is no direct support in Python for this kind of structure, but it's easy to construct one.
A ring buffer (also known as a circular buffer or a circular queue) is a buffer data structure that behaves as if it had a circular shape, in which the last element in the buffer is connected to the first element. Like standard buffers, ring buffers typically have a fixed size.
Unless there is something magic about 1 billion, I think you should consider other approaches.
The first that comes to mind is partitioning the data. Say, put one hour's worth of data into each partition. This will result in about 15,000*60*60 = 54 million records in a partition. About every 20 hours, you can remove a partition.
One big advantage of partitioning is that the insert performance should work well and you don't have to delete individual records. There can be additional overheads depending on the query load, indexes, and other factors. But, with no additional indexes and a query load that is primarily inserts, it should solve your problem better than trying to delete 15,000 records each second along with the inserts.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With