 

Social-networking: Hadoop, HBase, Spark over MongoDB or Postgres?

I am architecting a social network with various features, many of them powered by big-data-intensive workloads such as machine learning: recommender systems, search engines, and time-series sequence matchers, for example.

Given that I currently have fewer than 5 users—but foresee significant growth—what metrics should I use to decide between:

  • Spark (with/without HBase over Hadoop)
  • MongoDB or Postgres

I'm looking at Postgres as a way to reduce porting pressure between it and Spark (using a SQL abstraction layer that works on both). Spark seems quite interesting; I can imagine various ML, SQL, and graph questions it could be made to answer speedily. MongoDB is what I usually use, but I've found its scaling and map-reduce features quite limiting.
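To illustrate the portability idea from the question: if queries stick to ANSI SQL, the same query string can be submitted to Postgres (e.g. via a driver like psycopg2) or to Spark SQL (`spark.sql(...)`) unchanged. A minimal sketch, using Python's built-in sqlite3 as a stand-in engine so it stays self-contained; the `posts` table and the query are hypothetical:

```python
import sqlite3

# Hypothetical ANSI-SQL query; the same string could be passed
# unchanged to psycopg2 (Postgres) or spark.sql(...) (Spark SQL).
PORTABLE_QUERY = """
SELECT user_id, COUNT(*) AS n_posts
FROM posts
GROUP BY user_id
ORDER BY n_posts DESC
"""

# sqlite3 is only a stand-in engine to keep this sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (user_id INTEGER, body TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?)",
                 [(1, "hi"), (1, "again"), (2, "hello")])

rows = conn.execute(PORTABLE_QUERY).fetchall()
print(rows)  # [(1, 2), (2, 1)]
```

The point is the query text, not the engine: keeping to the shared SQL subset is what makes the Postgres-or-Spark decision reversible later.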

A T asked Jan 01 '15 12:01



1 Answer

I think you are on the right track in searching for a software stack/architecture that can:

  • handle different types of load: batch, real-time computing, etc.
  • scale in size and speed along with business growth
  • be a live software stack that is well maintained and supported
  • have common library support for domain-specific computing such as machine learning

On those criteria, Hadoop + Spark can give you the edge you need. Hadoop is by now relatively mature at handling large-scale data in batch, with reliable, scalable storage (HDFS) and computation (MapReduce/YARN). Adding Spark on top lets you keep HDFS for storage while gaining fast, near-real-time (in-memory) computation.
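The batch model Hadoop popularized can be sketched in miniature: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. A toy word count in plain Python (no Hadoop required) just to show the shape of the computation:

```python
from collections import defaultdict
from functools import reduce

docs = ["spark and hadoop", "hadoop batch jobs", "spark streaming"]

# Map: emit (word, 1) for every word in every document.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group values by key, as the framework would between phases.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word.
counts = {word: reduce(lambda a, b: a + b, vals)
          for word, vals in groups.items()}
print(counts)  # {'spark': 2, 'and': 1, 'hadoop': 2, ...}
```

Hadoop runs exactly this shape over terabytes spread across a cluster; Spark expresses the same pipeline but keeps intermediate data in memory, which is where its speed advantage comes from.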

In terms of development, both systems natively support Java/Scala. Advice on libraries and performance tuning is abundant here on Stack Overflow and elsewhere. There are at least a few machine-learning libraries (Mahout, MLlib) that work with Hadoop and Spark.
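To make the recommender use case from the question concrete: libraries like MLlib provide scalable collaborative filtering (e.g. ALS). A deliberately tiny pure-Python sketch of the underlying idea—recommend items liked by the most similar user—with a hypothetical ratings matrix; MLlib does a far more robust, distributed version of this:

```python
import math

# Toy user -> {item: rating} data; in the real system this would
# live in HDFS / a DataFrame and be fed to MLlib's ALS.
ratings = {
    "alice": {"post1": 5.0, "post2": 3.0, "post3": 4.0},
    "bob":   {"post1": 4.0, "post2": 3.0},
    "carol": {"post2": 5.0, "post3": 1.0},
}

def cosine(u, v):
    """Cosine similarity over the items two users rated in common."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    nu = math.sqrt(sum(u[i] ** 2 for i in shared))
    nv = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (nu * nv)

def recommend(user, k=1):
    """Suggest items the most similar user rated but `user` has not."""
    others = [(cosine(ratings[user], ratings[o]), o)
              for o in ratings if o != user]
    _, nearest = max(others)
    unseen = set(ratings[nearest]) - set(ratings[user])
    return sorted(unseen,
                  key=lambda i: ratings[nearest][i], reverse=True)[:k]

print(recommend("bob"))  # ['post3']
```

This brute-force neighbor search is O(users²) and falls over quickly; that scaling wall is precisely why you would reach for MLlib's distributed ALS on Spark as the user base grows.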

For deployment, AWS and other cloud providers offer hosted Hadoop/Spark solutions, so that is not an issue either.

Paul H. answered Oct 21 '22 04:10