 

What are my options to store and query huge amounts of data where a lot of it is repeating?

I am evaluating options for efficient data storage in Java. The data set is time stamped data values with a named primary key. e.g.

Name: A|B|C:D
Value: 124
TimeStamp: 01/06/2009 08:24:39,223

Could be a stock price at a given point in time, so it is, I suppose, a classic time-series data pattern. However, I really need a generic RDBMS solution which will work with any reasonable JDBC-compatible database, as I would like to use Hibernate. Consequently, time-series extensions to databases like Oracle are not really an option, as I would like the implementor to be able to use their own JDBC/Hibernate-capable database.

The challenge here is simply the massive volume of data that can accumulate in a short period of time. So far, my implementations have focused on defining periodic rollup-and-purge schedules where raw data is aggregated into DAY, WEEK, MONTH etc. tables, but the downsides are the early loss of granularity and the slight inconvenience of period boundaries not lining up across the different aggregates.
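For illustration, a rough sketch of that kind of rollup-and-purge job via plain JDBC (the raw_values and rollup_day tables and their columns are placeholder names, not my real schema):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class DailyRollup {
        // Aggregate yesterday-and-older raw rows into one row per key per
        // day, then purge the raw rows -- granularity is lost at this point.
        public static void rollUp(Connection conn) throws SQLException {
            String aggregate =
                "INSERT INTO rollup_day (name, day, min_val, max_val, avg_val, samples) " +
                "SELECT name, CAST(ts AS DATE), MIN(value), MAX(value), AVG(value), COUNT(*) " +
                "FROM raw_values WHERE ts < CURRENT_DATE " +
                "GROUP BY name, CAST(ts AS DATE)";
            String purge = "DELETE FROM raw_values WHERE ts < CURRENT_DATE";
            conn.setAutoCommit(false);
            try (PreparedStatement agg = conn.prepareStatement(aggregate);
                 PreparedStatement del = conn.prepareStatement(purge)) {
                agg.executeUpdate();
                del.executeUpdate();
                conn.commit();  // aggregate and purge succeed or fail together
            }
        }
    }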

The options here are limited, since there is an absolute limit to how much data can be physically compressed while retaining the original granularity, and this limit is exacerbated by the directive of using a relational database, and a generic JDBC-capable one at that.

Borrowing a notional concept from classic data compression algorithms, and leveraging the fact that many consecutive values for the same named key can be expected to be identical, I am wondering if there is a way I can seamlessly reduce the number of stored records by conflating repeating values into one logical row that also stores a counter indicating, effectively, "the next n records have the same value". Implementing just that seems simple enough, but the trade-off is that the data model becomes hideously complicated to query against using standard SQL, especially with any sort of aggregate SQL function. This significantly reduces the usefulness of the data store, since only complex custom code can restore the data to a "decompressed" state, resulting in an impedance mismatch with hundreds of tools that will not be able to render this data properly.
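The write side of that idea is straightforward; a minimal sketch of the run-length folding I have in mind (class and field names are invented for illustration):

    import java.time.Instant;
    import java.util.ArrayList;
    import java.util.List;

    /** One "compressed" row: a value plus how many consecutive samples shared it. */
    class RunLengthRow {
        final String name;
        final double value;
        final Instant firstSeen;
        int repeatCount = 1;   // "the next n records have the same value"

        RunLengthRow(String name, double value, Instant firstSeen) {
            this.name = name;
            this.value = value;
            this.firstSeen = firstSeen;
        }
    }

    class RunLengthWriter {
        private final List<RunLengthRow> rows = new ArrayList<>();

        /** Append a sample, folding it into the previous row when the value repeats. */
        void append(String name, double value, Instant ts) {
            RunLengthRow last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (last != null && last.name.equals(name) && last.value == value) {
                last.repeatCount++;   // conflate into one logical row
            } else {
                rows.add(new RunLengthRow(name, value, ts));
            }
        }
    }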

I considered the possibility of defining custom Hibernate types that would basically "understand" the compressed data set, blow it back up, and return query results with dynamically created synthetic rows. (The database will be read-only to all clients except the tightly controlled input stream.) Several of the tools I have in mind will integrate with Hibernate/POJOs in addition to raw JDBC (e.g. JasperReports). But this does not really address the aggregate functions issue and probably has a bunch of other issues as well.
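The decompression half would look something like the iterator below; note it assumes a fixed sampling interval in order to reconstruct the timestamps, which is a simplification (names again invented for illustration):

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Iterator;
    import java.util.NoSuchElementException;

    /** A decompressed sample, as client code would expect to see it. */
    class Sample {
        final Instant ts;
        final double value;
        Sample(Instant ts, double value) { this.ts = ts; this.value = value; }
    }

    /** Expands a run-length row back into n synthetic samples. */
    class SyntheticRows implements Iterator<Sample> {
        private final RunLengthRow row;     // from the sketch above
        private final Duration interval;    // assumed fixed sampling interval
        private int emitted = 0;

        SyntheticRows(RunLengthRow row, Duration interval) {
            this.row = row;
            this.interval = interval;
        }

        public boolean hasNext() { return emitted < row.repeatCount; }

        public Sample next() {
            if (!hasNext()) throw new NoSuchElementException();
            Instant ts = row.firstSeen.plus(interval.multipliedBy(emitted++));
            return new Sample(ts, row.value);
        }
    }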

So I am partway to resigning myself to possibly having to use a more proprietary [possibly non-SQL] data store (any suggestions appreciated) and then focusing on the possibly less complex task of writing a pseudo-JDBC driver to at least ease integration with external tools.

I heard reference to something called a "bit packed file" as a mechanism to achieve this data compression, but I do not know of any databases that supply this and the last thing I want to do (or can do, really....) is write my own database.

Any suggestions or insight?

asked by Nicholas


1 Answer

Hibernate (or any JPA solution) is the wrong tool for this job.

JPA/Hibernate isn't a lightweight solution. In high-volume applications, the overhead is not only significant but prohibitive. You really need to look into grid and cluster solutions. I won't repeat the overview of the various technologies here.

I've got a lot of experience in financial market information systems. A few of the things you said stuck out to me:

  • You have a lot of raw data;
  • You want to apply various aggregations to that data (e.g. open/high/low/close daily summaries);
  • High availability is probably an issue (it always is in these kinds of systems); and
  • Low latency is probably an issue (ditto).

Now for grid/cluster type solutions I divide them loosely into two categories:

  1. Map-based solutions like Coherence or Terracotta; and
  2. JavaSpaces-based solutions like GigaSpaces.

I've used Coherence a lot, and the Map solution can be nice, but it can be problematic too. Coherence maps can have listeners on them, and you can use this sort of thing to do things like:

  • Market price alerts (users may want a notification when a price reaches a certain level);
  • Derivative pricing (e.g. an exchange-traded option pricing system will want to reprice when an underlying security changes its last traded price);
  • A trade-matching/booking system may want to match received trade notifications for reconciliation purposes;
  • etc.

All of these can be done with listeners, but in Coherence, for example, listeners have to be cheap, which leads to things like a Map having a listener that writes something to another Map, and this can chain on for a while. Also, modifying the cache entry can be problematic (although there are mechanisms for dealing with that kind of problem too; I'm talking about situations like turning off a market price alert so it doesn't trigger a second time).
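To give a feel for it, here is a minimal price-alert listener sketch against the classic Coherence API (the cache name and threshold are invented; the hand-off comment marks where you'd chain to another Map or queue rather than do real work inline):

    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.NamedCache;
    import com.tangosol.util.MapEvent;
    import com.tangosol.util.MapListener;

    public class PriceAlertListener implements MapListener {
        private static final double ALERT_LEVEL = 100.0;   // hypothetical threshold

        public void entryInserted(MapEvent evt) { check(evt); }
        public void entryUpdated(MapEvent evt)  { check(evt); }
        public void entryDeleted(MapEvent evt)  { /* not interesting here */ }

        private void check(MapEvent evt) {
            double price = ((Number) evt.getNewValue()).doubleValue();
            if (price >= ALERT_LEVEL) {
                // Keep listener work cheap: in practice, hand off to another
                // cache or queue instead of notifying users inline.
                System.out.println("Alert: " + evt.getKey() + " hit " + price);
            }
        }

        public static void main(String[] args) {
            NamedCache prices = CacheFactory.getCache("market-prices");
            prices.addMapListener(new PriceAlertListener());
        }
    }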

I found GigaSpaces-type grid solutions to be far more compelling for this kind of application. The read and take (destructive read) operations are highly elegant and scalable, and you can get transactional grid updates with sub-millisecond performance.

Consider the two classic queueing architectures:

  • Request/Response: a bad message can block the queue, and while you can have many senders and receivers (for scalability), scaling up the number of pipes isn't always straightforward; and
  • Publish/Subscribe: this decouples the sender and receiver but lacks scalability, in that if you have multiple subscribers they will each receive the message (not necessarily what you want with, say, a booking system).

In GigaSpaces, a take (destructive read) is like a publish/subscribe system that actually scales, because exactly one consumer takes each entry, while a plain read works like the traditional publish/subscribe model. There are Map and JMS implementations built on top of the grid, and it can do FIFO ordering.
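GigaSpaces implements the JavaSpaces API, so a pool of workers built on take() looks roughly like the sketch below (standard net.jini interfaces; space lookup/bootstrap omitted, and the entry class is invented):

    import net.jini.core.entry.Entry;
    import net.jini.space.JavaSpace;

    /** A JavaSpaces entry: public fields and a public no-arg constructor. */
    class TradeNotification implements Entry {
        public String symbol;
        public Double price;
        public TradeNotification() {}
    }

    class TradeWorker {
        private final JavaSpace space;   // obtained via Jini lookup (omitted)

        TradeWorker(JavaSpace space) { this.space = space; }

        void run() throws Exception {
            TradeNotification template = new TradeNotification(); // null fields match anything
            // take() removes the entry, so exactly one worker gets each trade
            // no matter how many workers you start -- the "scalable" part.
            // space.read(template, null, timeout) would be the non-destructive,
            // pub/sub-like variant where every reader sees the same entry.
            while (true) {
                TradeNotification trade =
                    (TradeNotification) space.take(template, null, Long.MAX_VALUE);
                handle(trade);
            }
        }

        private void handle(TradeNotification t) { /* match/book/reconcile */ }
    }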

Now, what about persistence, I hear you ask? Persistence is a consequence of deciding all the other stuff. For this kind of application, I like the Persistence as a Service model (ironically written about Hibernate, but it applies to anything).

Basically this means your data store hits are asynchronous, which works nicely for summary data. For example, you can have a service listening for trade notifications that persists just the ones it's interested in (aggregating in memory if required). You can do open/high/low/close prices this way.

For high-volume data you don't really want to write it all to the database. Not synchronously, anyway. A persistent store plus a data warehouse is probably more the route you want to go, but again this depends on requirements, volumes, etc.
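As a sketch of that asynchronous, aggregate-then-persist idea (all names invented; persist() is a stub for whatever write-behind mechanism you choose):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    /** Rolls ticks up into open/high/low/close in memory; a background thread
     *  flushes the summaries so the hot path never waits on the database. */
    class OhlcAggregator {
        static final class Ohlc {
            double open, high, low, close;
            Ohlc(double first) { open = high = low = close = first; }
            synchronized void update(double px) {
                high = Math.max(high, px);
                low = Math.min(low, px);
                close = px;
            }
        }

        private final Map<String, Ohlc> bars = new ConcurrentHashMap<>();
        private final ScheduledExecutorService flusher =
            Executors.newSingleThreadScheduledExecutor();

        OhlcAggregator() {
            // Persist on a schedule instead of per tick.
            flusher.scheduleAtFixedRate(this::flush, 1, 1, TimeUnit.MINUTES);
        }

        /** Hot path: in-memory only, no database access. */
        void onTick(String symbol, double price) {
            bars.compute(symbol, (s, bar) -> {
                if (bar == null) return new Ohlc(price);
                bar.update(price);
                return bar;
            });
        }

        private void flush() {
            bars.forEach(this::persist);
        }

        private void persist(String symbol, Ohlc bar) {
            /* JDBC batch, grid write-behind, etc. goes here */
        }
    }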

It's a complicated topic and I've only really touched on it. Hope that helps you.

answered by cletus