I am architecting a social network that incorporates various features, many powered by data-intensive workloads such as machine learning: recommender systems, search engines, and time-series sequence matchers.
Given that I currently have fewer than five users, but foresee significant growth, what metrics should I use to decide between:
I'm looking at Postgres as a way of reducing porting pressure between it and Spark (using a SQL abstraction layer that works on both). Spark seems quite interesting; I can imagine various ML, SQL, and graph questions it could be made to answer speedily. MongoDB is what I usually use, but I've found its scaling and map-reduce features quite limiting.
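To illustrate the porting point: if you keep queries as plain ANSI SQL strings behind a thin helper, the same statement can be run against Postgres (via a psycopg2 connection), Spark SQL (via `spark.sql`), or, in this sketch, the stdlib `sqlite3` module standing in for the engine. The `run_query` helper and the `follows` table are made up for the example.

```python
import sqlite3

# Thin abstraction: the query is plain ANSI SQL, so only the
# connection/execution mechanism changes per engine. With Postgres this
# would be a psycopg2 connection; with Spark, spark.sql(query).
def run_query(conn, query, params=()):
    cur = conn.execute(query, params)
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE follows (follower TEXT, followee TEXT)")
conn.executemany(
    "INSERT INTO follows VALUES (?, ?)",
    [("alice", "bob"), ("carol", "bob"), ("alice", "carol")],
)

# "Who has the most followers?" -- plain SQL that Postgres and Spark SQL
# would both accept (parameter placeholder style aside).
rows = run_query(
    conn,
    "SELECT followee, COUNT(*) AS n FROM follows "
    "GROUP BY followee ORDER BY n DESC",
)
print(rows)  # [('bob', 2), ('carol', 1)]
```

The design point is that the SQL itself carries no engine-specific syntax, so porting pressure is confined to the connection layer rather than every query.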
When Hadoop and MongoDB are used together, each addresses the other's weaknesses. Both platforms can serve as a big-data solution, but it is important to know whether they can be combined within your business environment.
HBase was developed by the Apache Software Foundation and runs on the Hadoop Distributed File System (HDFS). It began at the company Powerset, which needed to process large amounts of data. It is modeled on Google's Bigtable and provides access to huge amounts of data. As part of the Hadoop ecosystem, it lets data consumers read and access data stored in HDFS.
MongoDB’s broader adoption and skills availability reduce risk and cost for new projects. HBase is well suited to key-value workloads with high-volume random read and write access patterns, especially for organizations already heavily invested in HDFS as a common storage layer.
Since Hadoop is known for handling huge volumes of data and providing large-scale solutions, it can be considered for flexibility and scalability. That said, MongoDB also scales well for analyzing large volumes of complex data, and it can be more efficient than an RDBMS for such workloads.
I think you are on the right track searching for a software stack/architecture that can:
On those merits, Hadoop + Spark can give you the edge you need. Hadoop is relatively mature at handling large-scale data in batch, with reliable and scalable storage (HDFS) and computation (MapReduce/YARN). With the addition of Spark, you keep HDFS for storage while gaining the faster, near-real-time computation that Spark adds.
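To make the batch model concrete, here is the canonical MapReduce word count, run locally in pure Python so the shape of the computation is visible without a cluster. On Hadoop the same map and reduce functions would run distributed across HDFS blocks, and Spark expresses the identical pipeline as `flatMap(...).reduceByKey(...)`; the function names here are just illustrative.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit a (word, 1) pair for each word in an input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between
    # the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the partial counts for one word.
    return (key, sum(values))

lines = ["to be or not to be", "to see or not to see"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["to"])  # 4
```

Because map and reduce are pure functions over key-value pairs, the framework can parallelize them freely, which is exactly what makes the model scale to HDFS-sized inputs.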
In terms of development, both systems are natively supported by Java/Scala. Library support and performance-tuning advice for them are abundant here on Stack Overflow and elsewhere. There are at least a few machine learning libraries (Mahout, MLlib) that work with Hadoop and Spark.
For deployment, AWS and other cloud providers offer hosted solutions for Hadoop/Spark, so that is not an issue either.