
Database selection for a web-scale analytics application

I want to build a web-application similar to Google-Analytics, in which I collect statistics on my customers' end-users, and show my customers analysis based on that data.

Characteristics:

  • High scalability, handle very large volume
  • Compartmentalized - Queries always run on a single customer's data
  • Support analytical queries (drill-down, slices, etc.)

Due to the analytical need, I'm considering to use an OLAP/BI suite, but I'm not sure it's meant for this scale. NoSQL database? Simple RDBMS would do?

asked Dec 16 '10 by Yasei No Umi



2 Answers

This is what I use at work in a production environment, and it works like a charm.

I coupled three things:

PostgreSQL + LucidDB + Mondrian (more generally, components of the whole Pentaho BI suite)

  • PostgreSQL: I am not going to describe PostgreSQL at length; it is a really strong open source RDBMS that will almost certainly let you do everything you need. I use it to store my operational data.

  • LucidDB: an open source column-store database. It is highly scalable and offers a real gain in processing time over PostgreSQL when retrieving large amounts of data. It is optimized for intensive reads rather than transaction processing. This is my data warehouse database.

  • Mondrian: an open source R-OLAP engine. LucidDB makes it easy to connect the two together.

I would recommend you look at the whole Pentaho BI Suite; it is worth it, and you might want to use some of its components.

Hope this helps,
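The operational-store / warehouse split above implies a periodic ETL step that rolls raw events up into fact rows before bulk loading. A minimal sketch of that transform stage in Python (the event shape and column names here are hypothetical, not taken from the answer; a real pipeline would read from PostgreSQL and bulk-load into the warehouse):

```python
from collections import defaultdict
from datetime import date

def rollup_pageviews(events):
    """Aggregate raw page-view events into per-customer daily counts,
    ready for bulk loading into the warehouse (e.g. LucidDB)."""
    counts = defaultdict(int)
    for customer_id, page, day in events:
        counts[(customer_id, page, day)] += 1
    # One row per (customer, page, day) -- a typical warehouse fact row.
    return [
        {"customer_id": c, "page": p, "day": d, "views": n}
        for (c, p, d), n in sorted(counts.items())
    ]

# Hypothetical raw events as they might come out of the operational store:
events = [
    (1, "/home", date(2010, 12, 16)),
    (1, "/home", date(2010, 12, 16)),
    (1, "/pricing", date(2010, 12, 16)),
    (2, "/home", date(2010, 12, 16)),
]
rows = rollup_pageviews(events)
```

Because queries only ever touch one customer's data, `customer_id` can lead every fact row, which also keeps the warehouse partitionable per customer.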

answered Sep 18 '22 by Spredzy


There are two main architectures you could opt for at true web scale:

1. "BI" architecture

  • Event journaller (e.g. LWES Journaller) or immutable event store (e.g. HDFS) feeds
  • Analytics/column-store database (e.g. Greenplum, InfiniDB, LucidDB, Infobright) feeds
  • Business intelligence reporting tool (e.g. Microstrategy, Pentaho Business Analytics)

2. "NoSQL" architecture

  • (Optional) Event journaller or immutable event store feeds
  • NoSQL database (e.g. Cassandra, Riak, HBase) feeds
  • A custom analytics UI (e.g. using D3.js)

The immutable event store or journaller is there because in most cases you want to batch your analytics events and do bulk updates to your database (even with something like HDFS), rather than doing an atomic write for every single page view.
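The batching idea can be sketched in a few lines of Python (the flush callback and batch size are illustrative assumptions; in practice the flush would be a bulk load into the store):

```python
class EventBuffer:
    """Buffer analytics events in memory and flush them in bulk,
    instead of doing an atomic write per page view."""

    def __init__(self, flush_fn, batch_size=1000):
        self.flush_fn = flush_fn      # e.g. a bulk COPY or HDFS append
        self.batch_size = batch_size
        self.buffer = []

    def record(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []

# Collect flushed batches in a list to stand in for the real sink:
batches = []
buf = EventBuffer(batches.append, batch_size=3)
for i in range(7):
    buf.record({"page_view": i})
buf.flush()  # drain the remainder at shutdown / end of window
```

Seven events with a batch size of three produce two full batches plus a final partial one; the database sees three bulk writes instead of seven row-level writes.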

For SnowPlow, our open-source analytics platform built on Hadoop and Hive, the event logs are all collected on S3 first before being batch loaded into Hive.

Note that the "NoSQL architecture" will involve a fair bit more development work. Remember that with either architecture, you can always shard by customer if the volumes grow truly epic (billions of rows per customer) - because there's no need (I'm guessing) for cross-customer analytics.
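Sharding by customer is straightforward precisely because queries never cross customers; a deterministic hash of the customer ID picks the shard. A minimal sketch (the hashing scheme is an assumption, not something the answer prescribes):

```python
import hashlib

def shard_for_customer(customer_id, n_shards):
    """Deterministically map a customer to one of n_shards databases.
    All of a customer's rows land on a single shard, which is safe
    because no query spans customers."""
    digest = hashlib.sha256(str(customer_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards

# Every read and write for a given customer is routed to the same shard:
shard = shard_for_customer("acme-corp", 8)
```

Note that a plain modulo scheme reshuffles customers when `n_shards` changes; consistent hashing or a lookup table avoids that if you expect to add shards later.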

answered Sep 19 '22 by Alex Dean