 

How to tackle a BIG DATA Data Mart / Fact Table? (240 million rows)

We have a BI customer whose sales database tables grow by about 40 million rows every month, generated from their sales transactions. They want to build a Sales Data Mart covering five years of historic data, meaning that this fact table will potentially have about 240 million rows (40 x 12 months x 5 years).

This is well structured data.

This is the first time I have been confronted with this amount of data, and it led me to look at columnar (vertical) database tools like Infobright, among others. But even with this kind of software, a simple query could take a very, very long time to run.

That led me to take a look at Hadoop, but after reading some articles I concluded that Hadoop is not the best option for building a fact table (even with Hive), since my understanding is that it is meant to work with unstructured data.

So, my question is: what would be the best way to approach this challenge? Am I not looking at the right technology? What would be the best query response times I could get on such a big fact table? Or am I hitting a real wall here, and is the only option to build aggregated tables?

asked Jun 07 '12 by Eduardo Williams


3 Answers

Have you checked out Google BigQuery (a paid premium service)? It should suit your needs. It is as simple as:

  1. Load the data as CSV (newline-delimited records, with a configurable field delimiter). The file can be gzip-compressed. You can also append to an existing table.

  2. Start querying using SQL statements (a limited SQL dialect, though); results over multi-million-row tables come back in seconds (see the sketch after this list).

  3. Extract the results into a CSV file or another table (similar to an aggregation layer).
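As a rough illustration of step 2, here is a minimal SQL sketch. The dataset, table and column names (mydataset.sales_fact, product_id, sale_year, sale_month, amount) are invented for this example, and BigQuery's supported SQL is limited, so check the documentation for the exact dialect:

    -- Hypothetical rollup: total sales per product per month over the full fact table.
    -- All table and column names below are made up for illustration.
    SELECT product_id,
           sale_year,
           sale_month,
           SUM(amount) AS total_amount,
           COUNT(*)    AS num_transactions
    FROM mydataset.sales_fact
    GROUP BY product_id, sale_year, sale_month
    ORDER BY sale_year, sale_month;

The result set can then be exported or saved into another table (step 3) to act as an aggregation layer.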

Check it out here: https://developers.google.com/bigquery/

The first 100 GB of data processing is free, so you can get started now. It also integrates with Google Spreadsheets, which allows you to create visualizations such as charts and graphs for management. You can export the Google spreadsheet as Microsoft Excel / PDF.

Google states that it can scale to multiple terabytes and provides near-real-time querying (responses in a few seconds).

answered by Sathish Senathi


First up, I'll assume it's 240m rows, not 2400m.

Firstly, take a look at ssd.analytical-labs.com.

The FCC demo has a 150m-record fact table running on Infobright; I would suspect it would be even faster on VW.

The key is keeping it simple; there will be queries that make it slow down, but largely it is pretty responsive.

I would suggest you think about aggregates, the way you are querying, and, importantly, what you are querying.

For example, split it down into marts for performance: by product, by brand, by year, etc. If the user just wants to query less than one year's worth of data (which is the case more often than most people would like to think), they can then use a much smaller fact table.

Storage is cheap, so it doesn't particularly matter if you duplicate data, as long as it keeps things responsive.
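For instance, carving a one-year mini-mart out of the full fact table might look something like this (a sketch in generic SQL with invented table and column names; adapt the syntax to your engine):

    -- Hypothetical: build a 2011-only mart so queries limited to one year
    -- never have to scan the full 240m-row fact table.
    CREATE TABLE sales_fact_2011 AS
    SELECT *
    FROM   sales_fact
    WHERE  sale_date >= '2011-01-01'
      AND  sale_date <  '2012-01-01';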

Of course, if you are doing OLAP you can also make use of aggregate tables to make sure most queries run at a far more acceptable speed, assuming they hit the rolled-up levels.
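A hedged sketch of such an aggregate table, again with invented names (if you use Mondrian, the table also has to be declared in the cube schema so the engine knows to use it):

    -- Hypothetical rollup to month x product grain; monthly OLAP queries
    -- can be answered from this table instead of the detail-level facts.
    CREATE TABLE agg_sales_month_product AS
    SELECT product_id,
           YEAR(sale_date)  AS sale_year,
           MONTH(sale_date) AS sale_month,
           SUM(amount)      AS amount_sum,
           COUNT(*)         AS fact_count   -- Mondrian aggregate tables expect a fact count column
    FROM   sales_fact
    GROUP  BY product_id, YEAR(sale_date), MONTH(sale_date);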

Hardware is also very important: make sure you have fast disks, as that is nearly always the bottleneck. The faster you can get the data off the disks, the faster it will generally be displayed to the end user.

Schema design is also important. Modern column-store databases much prefer a denormalised table with zero joins where possible. In the past I have found that having one denormalised table serving 90% of queries, plus a few joining tables (a date dimension, for example) for special cases, covers most use cases.
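To make that concrete, a hypothetical denormalised layout (all names invented, generic SQL types) might look like:

    -- One wide, denormalised fact table: product, brand and region attributes
    -- are flattened in so ~90% of queries need no joins at all.
    CREATE TABLE sales_fact_denorm (
        sale_date    DATE,
        product_name VARCHAR(100),
        brand_name   VARCHAR(100),
        store_region VARCHAR(50),
        quantity     INT,
        amount       DECIMAL(12,2)
    );

    -- A small date dimension kept separate for the special cases
    -- (fiscal calendars, holiday flags, etc.).
    CREATE TABLE date_dim (
        sale_date      DATE,
        fiscal_quarter VARCHAR(6),
        is_holiday     SMALLINT
    );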

Anyway, that's my 2 cents. Ping me on Twitter if you want a Skype about it or something.

Tom

Edit:

Also, here's a non-scientific benchmark to back up what JVD was saying:

  • SSD on physical box: 175.67 MB/sec
  • SATA on physical box: 113.52 MB/sec
  • EC2: 75.65 MB/sec
  • EC2 EBS RAID: 89.36 MB/sec

As you can see there is a large difference in read speed.

answered by bugg_tb


I think there are a couple of approaches here:

1) You could try aggregate tables with Mondrian. The downside of aggregate tables is that you need to know beforehand the use cases for the most recurrent queries; if you don't, they are not easy to tune and you will end up with long response times for the queries you didn't build an aggregate table for.

2) Another option is to partition the data in the fact table, maybe by year: create a different schema for every year and a virtual cube for the whole history. If you have the right software you could also create a materialized view (if you have Oracle) or an indexed view (if you have MS SQL Server).

The latter approach has worked very well for me, with noticeable improvements in query times. Besides, my ETL process wasn't affected (with option 1 you need an extra process to build and maintain the aggregate tables), since the RDBMS takes care of updating the data in every partition.
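As an illustration of the partition-by-year idea, here is a rough sketch using MySQL-style range partitioning (table and column names are invented; Oracle materialized views and SQL Server indexed views have their own DDL):

    -- Hypothetical fact table range-partitioned by year, so the optimizer
    -- only scans the partitions a query actually touches (partition pruning).
    CREATE TABLE sales_fact (
        sale_date  DATE NOT NULL,
        product_id INT,
        amount     DECIMAL(12,2)
    )
    PARTITION BY RANGE (YEAR(sale_date)) (
        PARTITION p2008 VALUES LESS THAN (2009),
        PARTITION p2009 VALUES LESS THAN (2010),
        PARTITION p2010 VALUES LESS THAN (2011),
        PARTITION p2011 VALUES LESS THAN (2012),
        PARTITION p2012 VALUES LESS THAN (2013),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );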

answered by crorella