 

Social-networking: Hadoop, HBase, Spark over MongoDB or Postgres?

I am architecting a social network with various features, many of them powered by big-data-intensive workloads such as machine learning: recommender systems, search engines, and time-series sequence matchers, for example.

Given that I currently have fewer than 5 users—but foresee significant growth—what metrics should I use to decide between:

  • Spark (with/without HBase over Hadoop)
  • MongoDB or Postgres

I'm looking at Postgres as a way to reduce porting pressure between it and Spark (using a SQL abstraction layer that works on both). Spark seems quite interesting; I can imagine various ML, SQL, and graph questions it could be made to answer speedily. MongoDB is what I usually use, but I've found its scaling and map-reduce features quite limiting.
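To illustrate the portability idea from the question: if queries stick to ANSI SQL, the same query string can be submitted to Postgres (e.g. via a driver like psycopg2) or to Spark SQL (`spark.sql(...)`) unchanged. A minimal sketch, using Python's built-in sqlite3 as a stand-in engine so it stays self-contained; the `posts` table and the query are hypothetical:

```python
import sqlite3

# Hypothetical ANSI-SQL query; the same string could be passed
# unchanged to psycopg2 (Postgres) or spark.sql(...) (Spark SQL).
PORTABLE_QUERY = """
SELECT user_id, COUNT(*) AS n_posts
FROM posts
GROUP BY user_id
ORDER BY n_posts DESC
"""

# sqlite3 is only a stand-in engine to keep this sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (user_id INTEGER, body TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?)",
                 [(1, "hi"), (1, "again"), (2, "hello")])

rows = conn.execute(PORTABLE_QUERY).fetchall()
print(rows)  # [(1, 2), (2, 1)]
```

The point is the query text, not the engine: keeping to the shared SQL subset is what makes the Postgres-or-Spark decision reversible later.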

A T asked Jan 01 '15 12:01



1 Answer

I think you are on the right track in searching for a software stack/architecture that can:

  • handle different types of load: batch, real-time computing, etc.
  • scale in size and speed along with business growth
  • be a live software stack that is well maintained and supported
  • have common library support for domain-specific computing such as machine learning

On those criteria, Hadoop + Spark can give you the edge you need. Hadoop is by now relatively mature at handling large-scale data in batch, with reliable, scalable storage (HDFS) and computation (MapReduce/YARN). Adding Spark on top lets you keep HDFS for storage while gaining fast, near-real-time (in-memory) computation.
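The batch model Hadoop popularized can be sketched in miniature: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. A toy word count in plain Python (no Hadoop required) just to show the shape of the computation:

```python
from collections import defaultdict
from functools import reduce

docs = ["spark and hadoop", "hadoop batch jobs", "spark streaming"]

# Map: emit (word, 1) for every word in every document.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group values by key, as the framework would between phases.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word.
counts = {word: reduce(lambda a, b: a + b, vals)
          for word, vals in groups.items()}
print(counts)  # {'spark': 2, 'and': 1, 'hadoop': 2, ...}
```

Hadoop runs exactly this shape over terabytes spread across a cluster; Spark expresses the same pipeline but keeps intermediate data in memory, which is where its speed advantage comes from.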

In terms of development, both systems natively support Java/Scala. Advice on libraries and performance tuning is abundant here on Stack Overflow and elsewhere. There are at least a few machine-learning libraries (Mahout, MLlib) that work with Hadoop and Spark.
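To make the recommender use case from the question concrete: libraries like MLlib provide scalable collaborative filtering (e.g. ALS). A deliberately tiny pure-Python sketch of the underlying idea—recommend items liked by the most similar user—with a hypothetical ratings matrix; MLlib does a far more robust, distributed version of this:

```python
import math

# Toy user -> {item: rating} data; in the real system this would
# live in HDFS / a DataFrame and be fed to MLlib's ALS.
ratings = {
    "alice": {"post1": 5.0, "post2": 3.0, "post3": 4.0},
    "bob":   {"post1": 4.0, "post2": 3.0},
    "carol": {"post2": 5.0, "post3": 1.0},
}

def cosine(u, v):
    """Cosine similarity over the items two users rated in common."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    nu = math.sqrt(sum(u[i] ** 2 for i in shared))
    nv = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (nu * nv)

def recommend(user, k=1):
    """Suggest items the most similar user rated but `user` has not."""
    others = [(cosine(ratings[user], ratings[o]), o)
              for o in ratings if o != user]
    _, nearest = max(others)
    unseen = set(ratings[nearest]) - set(ratings[user])
    return sorted(unseen,
                  key=lambda i: ratings[nearest][i], reverse=True)[:k]

print(recommend("bob"))  # ['post3']
```

This brute-force neighbor search is O(users²) and falls over quickly; that scaling wall is precisely why you would reach for MLlib's distributed ALS on Spark as the user base grows.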

For deployment, AWS and other cloud providers offer hosted Hadoop/Spark solutions, so that is not an issue either.

Paul H. answered Oct 21 '22 04:10