
How should I auto-expire entries in an ETS table, while also limiting its total size?

Tags:

erlang

ets

I have a lot of analytics data which I'm looking to aggregate periodically (let's say once a minute). The data is sent to a process which stores it in an ETS table, and a timer periodically sends that process a message telling it to process the table and remove old data.

The problem is that the amount of data that comes in varies wildly, and I basically need to do two things to it:

  • If the amount of data coming in is too big, drop the oldest data and push the new data in. This could be viewed as a fixed size queue, where if the amount of data hits the limit, the queue would start dropping things from the front as new data comes to the back.
  • If the queue isn't full, but the data has been sitting there for a while, automatically discard it (after a fixed timeout.)

If these two conditions are kept, I could basically assume the table has a constant size, and everything in it is newer than X.

The problem is that I haven't found an efficient way to do these two things together. I know I could use match specs to delete all entries older than X, which should be pretty fast if the index is the timestamp, though I'm not sure this is the best way to periodically trim the table.
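To make the match spec idea concrete, here is a minimal sketch, not a tested design. The module name `expire_sketch`, the table name `analytics`, and the `{{Timestamp, UniqueInt}, Value}` row layout are all assumptions made for illustration; the unique integer is only there so that two entries with the same timestamp don't overwrite each other.

```erlang
-module(expire_sketch).
-export([new/0, insert/3, expire_older_than/2]).

%% Hypothetical layout: {{Timestamp, UniqueInt}, Value}.
%% An ordered_set keeps rows sorted by key, i.e. oldest first.
new() ->
    ets:new(analytics, [ordered_set, public]).

%% Timestamp is supplied by the caller (e.g. milliseconds).
insert(Tab, Timestamp, Value) ->
    Key = {Timestamp, erlang:unique_integer([monotonic])},
    true = ets:insert(Tab, {Key, Value}),
    Key.

%% Delete every row whose timestamp is strictly less than Cutoff.
%% Returns the number of rows deleted.
expire_older_than(Tab, Cutoff) ->
    MatchSpec = [{{{'$1', '_'}, '_'},        % bind the timestamp to '$1'
                  [{'<', '$1', Cutoff}],     % guard: older than Cutoff
                  [true]}],                  % true means "delete this row"
    ets:select_delete(Tab, MatchSpec).
```

The timer handler would then call something like `expire_sketch:expire_older_than(Tab, Now - MaxAge)`. Whether the guard on the key lets ETS avoid scanning the whole table is something I would measure rather than assume.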

The second problem is keeping the total table size under a certain limit, which I'm not really sure how to do. One solution that comes to mind is to use an auto-increment field with each insert, and when the table is being trimmed, look at the first and the last index, calculate the difference, and again use match specs to delete everything below the threshold.
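A simpler variant of the same idea, sketched under the assumption that the table is an `ordered_set` whose keys sort oldest-first: instead of computing an index difference, compare `ets:info(Tab, size)` with the limit and delete from the front until the table fits. The module and function names below are hypothetical.

```erlang
-module(trim_sketch).
-export([trim_to/2]).

%% Remove the smallest (oldest) keys until at most Max rows remain.
%% Assumes Tab is an ordered_set whose keys sort oldest-first.
%% Returns the number of rows dropped.
trim_to(Tab, Max) when is_integer(Max), Max >= 0 ->
    trim(Tab, ets:info(Tab, size) - Max, 0).

trim(_Tab, Excess, Dropped) when Excess =< 0 ->
    Dropped;
trim(Tab, Excess, Dropped) ->
    case ets:first(Tab) of
        '$end_of_table' ->
            Dropped;
        Key ->
            true = ets:delete(Tab, Key),
            trim(Tab, Excess - 1, Dropped + 1)
    end.
```

This does one `ets:first/1` and one `ets:delete/2` per dropped row, so the cost is proportional to the overflow rather than to the table size. It is only safe as written if a single process does the trimming.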

Having said all this, it feels that I might be using the ETS table for something it wasn't designed to do. Is there a better way to store data like this, or am I approaching the problem correctly?

Jakub Arnold asked May 30 '15 18:05


2 Answers

You can find out how much memory the table occupies with ets:info(Tab, memory). The result is a number of words, so to get the size in bytes just multiply it by erlang:system_info(wordsize). But there is a catch: if you are storing binaries, only heap binaries are included, so large binaries stored outside the table are not counted. If you are storing mostly normal Erlang terms, you can rely on this figure, and combined with a timestamp key as you described, it is a reasonable way to go.
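As a minimal sketch of that calculation (the module name `ets_mem_sketch` and the `MaxBytes` limit are assumptions for illustration):

```erlang
-module(ets_mem_sketch).
-export([bytes/1, over_limit/2]).

%% Memory allocated for the table, converted from words to bytes.
%% Large off-heap binaries referenced by the table are not included.
bytes(Tab) ->
    ets:info(Tab, memory) * erlang:system_info(wordsize).

%% True when the table uses more than MaxBytes bytes.
over_limit(Tab, MaxBytes) ->
    bytes(Tab) > MaxBytes.
```

A periodic trim could check `ets_mem_sketch:over_limit(Tab, MaxBytes)` and keep deleting the oldest entries while it returns true.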

Hynek -Pichi- Vychodil answered Sep 22 '22 13:09


I haven't used ETS for anything like this, but in other NoSQL DBs (DynamoDB) an easy solution is to use multiple tables: If you're keeping 24 hours of data, then keep 24 tables, one for each hour of the day. When you want to drop data, drop one whole table.
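Translated to ETS, that might look roughly like the sketch below. It is untested in production, and the module name `bucket_sketch`, the table name `bucket`, and the `{MaxBuckets, Tables}` state shape are all made up for illustration. New data always goes into the newest table; a timer would call `rotate/1` once per period, and the oldest table is dropped whole.

```erlang
-module(bucket_sketch).
-export([new/1, insert/2, rotate/1, tables/1]).

%% State is {MaxBuckets, [Newest | Older]}.
new(MaxBuckets) when is_integer(MaxBuckets), MaxBuckets >= 1 ->
    {MaxBuckets, [new_table()]}.

%% Writes always go to the newest bucket.
insert({_Max, [Current | _]}, Object) ->
    ets:insert(Current, Object).

%% Start a new bucket and delete any tables beyond the limit.
rotate({Max, Tables}) ->
    {Keep, Drop} = split([new_table() | Tables], Max),
    lists:foreach(fun ets:delete/1, Drop),   % drops each whole table
    {Max, Keep}.

tables({_Max, Tables}) ->
    Tables.

new_table() ->
    ets:new(bucket, [set, public]).

%% Like lists:split/2, but tolerates lists shorter than N.
split(List, N) when length(List) =< N -> {List, []};
split(List, N) -> lists:split(N, List).
```

Reads then have to look through every table in `tables/1`, which is the price paid for cheap expiry. This bounds the age of the data, not the total size, so a size limit would still need separate handling.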

Nathaniel Waisbrot answered Sep 19 '22 13:09