
Database selection for a web-scale analytics application

I want to build a web-application similar to Google-Analytics, in which I collect statistics on my customers' end-users, and show my customers analysis based on that data.

Characteristics:

  • High scalability, handle very large volume
  • Compartmentalized - Queries always run on a single customer's data
  • Support analytical queries (drill-down, slices, etc.)

Due to the analytical need, I'm considering to use an OLAP/BI suite, but I'm not sure it's meant for this scale. NoSQL database? Simple RDBMS would do?

asked Dec 16 '10 by Yasei No Umi



2 Answers

This is what I use at work in a production environment, and it works like a charm.

I coupled three things:

PostgreSQL + LucidDB + Mondrian (more generally, components of the whole Pentaho BI suite)

  • PostgreSQL: I am not going to describe PostgreSQL at length; it is a really strong open source RDBMS that will almost certainly let you do everything you need. I use it to store my operational data.

  • LucidDB: an open source column-store database. It is highly scalable and offers a real gain in processing time over PostgreSQL when retrieving large amounts of data. It is optimized for intensive reads rather than transaction processing. This is my data warehouse database.

  • Mondrian: an open source R-OLAP engine. LucidDB makes it easy to connect the two together.

I would recommend you look at the whole Pentaho BI Suite; it is worth it, and you might want to use some of its components.

Hope this helps,
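The operational-store / warehouse split above implies a periodic ETL step that rolls raw events up into fact rows before bulk loading. A minimal sketch of that transform stage in Python (the event shape and column names here are hypothetical, not taken from the answer; a real pipeline would read from PostgreSQL and bulk-load into the warehouse):

```python
from collections import defaultdict
from datetime import date

def rollup_pageviews(events):
    """Aggregate raw page-view events into per-customer daily counts,
    ready for bulk loading into the warehouse (e.g. LucidDB)."""
    counts = defaultdict(int)
    for customer_id, page, day in events:
        counts[(customer_id, page, day)] += 1
    # One row per (customer, page, day) -- a typical warehouse fact row.
    return [
        {"customer_id": c, "page": p, "day": d, "views": n}
        for (c, p, d), n in sorted(counts.items())
    ]

# Hypothetical raw events as they might come out of the operational store:
events = [
    (1, "/home", date(2010, 12, 16)),
    (1, "/home", date(2010, 12, 16)),
    (1, "/pricing", date(2010, 12, 16)),
    (2, "/home", date(2010, 12, 16)),
]
rows = rollup_pageviews(events)
```

Because queries only ever touch one customer's data, `customer_id` can lead every fact row, which also keeps the warehouse partitionable per customer.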

answered Sep 18 '22 by Spredzy


There are two main architectures you could opt for at true web scale:

1. "BI" architecture

  • Event journaller (e.g. LWES Journaller) or immutable event store (e.g. HDFS) feeds
  • Analytics/column-store database (e.g. Greenplum, InfiniDB, LucidDB, Infobright) feeds
  • Business intelligence reporting tool (e.g. Microstrategy, Pentaho Business Analytics)

2. "NoSQL" architecture

  • (Optional) Event journaller or immutable event store feeds
  • NoSQL database (e.g. Cassandra, Riak, HBase) feeds
  • A custom analytics UI (e.g. using D3.js)

The immutable event store or journaller is there because in most cases you want to batch your analytics events and do bulk updates to your database (even with something like HDFS), rather than doing an atomic write for every single page view.
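The batching idea can be sketched in a few lines of Python (the flush callback and batch size are illustrative assumptions; in practice the flush would be a bulk load into the store):

```python
class EventBuffer:
    """Buffer analytics events in memory and flush them in bulk,
    instead of doing an atomic write per page view."""

    def __init__(self, flush_fn, batch_size=1000):
        self.flush_fn = flush_fn      # e.g. a bulk COPY or HDFS append
        self.batch_size = batch_size
        self.buffer = []

    def record(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []

# Collect flushed batches in a list to stand in for the real sink:
batches = []
buf = EventBuffer(batches.append, batch_size=3)
for i in range(7):
    buf.record({"page_view": i})
buf.flush()  # drain the remainder at shutdown / end of window
```

Seven events with a batch size of three produce two full batches plus a final partial one; the database sees three bulk writes instead of seven row-level writes.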

For SnowPlow, our open-source analytics platform built on Hadoop and Hive, the event logs are all collected on S3 first before being batch loaded into Hive.

Note that the "NoSQL architecture" will involve a fair bit more development work. Remember that with either architecture, you can always shard by customer if the volumes grow truly epic (billions of rows per customer) - because there's no need (I'm guessing) for cross-customer analytics.
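Sharding by customer is straightforward precisely because queries never cross customers; a deterministic hash of the customer ID picks the shard. A minimal sketch (the hashing scheme is an assumption, not something the answer prescribes):

```python
import hashlib

def shard_for_customer(customer_id, n_shards):
    """Deterministically map a customer to one of n_shards databases.
    All of a customer's rows land on a single shard, which is safe
    because no query spans customers."""
    digest = hashlib.sha256(str(customer_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards

# Every read and write for a given customer is routed to the same shard:
shard = shard_for_customer("acme-corp", 8)
```

Note that a plain modulo scheme reshuffles customers when `n_shards` changes; consistent hashing or a lookup table avoids that if you expect to add shards later.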

answered Sep 19 '22 by Alex Dean