What are the open source tools and techniques to build a complete data warehouse platform? [closed]

I'm looking for open source tools, ideally free or with a free trial version, to set up a complete data warehouse stack.

I know about a few, such as Pentaho's open source Mondrian server, but I couldn't find any Google results on setting up a complete platform. I'm not sure whether these components are compatible with each other. Could someone please list them along with their position in the chain?

understack asked Jul 22 '10 11:07


2 Answers

The Open Source Data Warehousing presentation does a great job of identifying OSS components that could be used to build a data warehouse stack: Infrastructure (servers, OS, databases), Integration Management (ETL, EAI, etc.), Information Management (DW/Mart/ODS, OLAP servers, etc.), and Information Delivery (portal, dashboard, analytics/OLAP client, etc.). Here is a summary:

Open Source BI/DW Projects

BI and Analytics

  • BEE - http://bee.insightstrategy.cz/en/index.html
  • BIRT - http://www.eclipse.org/birt
  • JasperSoft – http://www.jaspersoft.com
  • MarvelIT - http://www.marvelit.com/dash.html
  • OpenI – http://openi.sourceforge.net
  • OpenReports – http://oreports.com
  • Orange - http://www.ailab.si/orange
  • Palo – http://www.palo.net
  • Pentaho - http://www.pentaho.com
  • R - http://www.r-project.org
  • SpagoBI – http://spagobi.eng.it
  • Weka - http://www.cs.waikato.ac.nz/~ml/index.html
  • VitalSigns - http://vitalsigns.sourceforge.net/

Databases

  • http://greenplum.org (bizgres)
  • http://www.ingres.com
  • http://www.mysql.com
  • http://www.postgresql.org
  • http://www.enterprisedb.com

Integration

  • Apatar - http://www.apatar.com
  • CloverETL - http://cloveretl.berlios.de/
  • JitterBit - http://www.jitterbit.com/
  • KETL - http://www.ketl.org
  • Octopus - http://www.enhydra.org/tech/octopus/index.html
  • OSDQ - http://sourceforge.net/projects/dataquality
  • Pentaho - http://www.pentaho.com
  • Red Hat – http://www.redhat.com
  • Saga.M31 Galaxy - http://galaxy.sagadc.com
  • Talend - http://www.talend.com
  • SnapLogic – http://www.snaplogic.com

I recommend browsing the presentation. Good stuff.

Pascal Thivent answered Sep 19 '22 11:09


A data warehouse stack (or suite) usually consists of three layers: ETL (loading), database, and reporting (interface). Beyond these, there are more advanced tools for performance and expert needs, namely cubes and statistical analysis tools.

As far as interoperability goes, the ETL tools and the reporting tools need to support whatever database you are using. However, since there are only two big open source databases, there is usually no problem mixing different solutions.

As for specifics -

1 - ETL

Data loading can be handled by open-source tools such as Pentaho Data Integration or Talend (Eclipse-based). I would suggest searching for "open source etl" to tailor the solution to your specific needs.
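To make the ETL layer concrete, here is a minimal hand-rolled sketch of the three steps that tools like these automate. It is only an illustration: the CSV data, the `fact_orders` table and the function names are made up, and SQLite (from the Python standard library) stands in for PostgreSQL or MySQL so the example runs anywhere.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would come from a file or a source system.
RAW_CSV = """order_id,customer,amount,order_date
1,alice, 120.50 ,2010-07-01
2,BOB,80,2010-07-02
3,alice,,2010-07-03
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalise customer names, parse amounts, drop rows without an amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append(
            (
                int(row["order_id"]),
                row["customer"].strip().lower(),
                float(amount),
                row["order_date"],
            )
        )
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table (name is an assumption)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


# SQLite in memory stands in for the real warehouse database.
conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute(
    "SELECT customer, amount FROM fact_orders ORDER BY order_id"
).fetchall()
print(loaded)
```

A dedicated ETL tool adds scheduling, connectors, error handling and a visual designer on top of this basic pattern, which is why it is usually preferable to scripts once the number of sources grows.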

2 - DB

You'll need a relational database (RDBMS). The two most prominent open-source players are PostgreSQL and MySQL. While MySQL has a larger user base, Postgres has been gaining more and more popularity ever since it implemented several crucial features that were missing in earlier versions.

3 - Reporting

Pentaho offers a reporting platform, as does BIRT (another Eclipse-based tool). Again, Google is your friend for specific comparisons. Note that if you choose Pentaho for both the ETL and reporting tools, you are likely to enjoy better integration. You've also mentioned Mondrian, which is an OLAP server that runs MDX queries over an RDBMS. MDX is the standard language for querying cubes.
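As a rough illustration of what querying a cube over an RDBMS amounts to, the sketch below rolls a measure up along different dimensions with plain SQL `GROUP BY`. This is not Mondrian's implementation and it does not use MDX; the `sales` table, its columns and the `aggregate` helper are all hypothetical, and SQLite stands in for the warehouse database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (region TEXT, product TEXT, year INTEGER, amount REAL);
    INSERT INTO sales VALUES
      ('EU', 'widget', 2009, 100.0),
      ('EU', 'widget', 2010, 150.0),
      ('EU', 'gadget', 2010, 50.0),
      ('US', 'widget', 2010, 200.0);
    """
)

# Dimensions the hypothetical cube exposes; used to whitelist column names.
DIMENSIONS = {"region", "product", "year"}


def aggregate(conn, dimensions):
    """Sum the 'amount' measure grouped by the requested dimensions."""
    if not dimensions or not set(dimensions) <= DIMENSIONS:
        raise ValueError(f"dimensions must be a non-empty subset of {sorted(DIMENSIONS)}")
    cols = ", ".join(dimensions)
    sql = f"SELECT {cols}, SUM(amount) FROM sales GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(sql).fetchall()


# Roll up to one dimension, then drill down by adding a second one.
by_region = aggregate(conn, ["region"])
by_region_year = aggregate(conn, ["region", "year"])
print(by_region)
print(by_region_year)
```

A reporting tool or OLAP client sits on top of queries like these, letting users pick dimensions interactively instead of writing SQL.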

At this point, assuming you are starting from scratch, I would recommend setting up the first two layers of the data warehouse: ETL and DB. You can later add any number of reporting tools on top.

shmichael answered Sep 21 '22 11:09