
Where does Big Data go and how is it stored?

I'm trying to get to grips with Big Data, and mainly with how Big Data is managed.

I'm familiar with the traditional form of data management and data life cycle; e.g.:

  1. Structured data collected (e.g. web form)
  2. Data stored in tables in an RDBMS on a database server
  3. Data cleaned and then ETL'd into a Data Warehouse
  4. Data is analysed using OLAP cubes and various other BI tools/techniques

However, in the case of Big Data, I'm confused about the equivalent of points 2 and 3, mainly because I'm unsure whether every Big Data "solution" always involves a NoSQL database to handle and store unstructured data, and also what the Big Data equivalent of a Data Warehouse is.

From what I've seen, NoSQL isn't always used and in some cases can be omitted entirely - is this true?

To me, the Big Data life cycle goes something along these lines:

  1. Data collected (structured/unstructured/semi)
  2. Data stored in a NoSQL database on a Big Data platform; e.g. HBase on a MapR Hadoop distribution.
  3. Big Data analytic/data mining tools clean and analyse data

But I have a feeling that this isn't always the case, and point 3 may be totally wrong altogether. Can anyone shed some light on this?

Asked by RoyalSwish, Apr 20 '17


1 Answer

When we talk about Big Data, we usually talk about huge amounts of data that are, in many cases, written constantly. The data can also have a lot of variety. Think of a typical Big Data source as a machine on a production line that continuously produces sensor readings for temperature, humidity, etc. - not the typical kind of data you would find in your DWH.

What would happen if you transformed all this data to fit into a relational database? If you have worked with ETL a lot, you know that extracting from the source, transforming the data to fit a schema and then storing it takes time and becomes a bottleneck. Forcing everything through a schema is simply too slow. This solution is also usually too costly, as you need expensive appliances to run your DWH - you would not want to fill them with raw sensor data.

You need fast writes on cheap hardware. With Big Data you first store the data schemaless (often referred to as unstructured data) on a distributed file system. The file system splits the huge files into blocks (typically around 128 MB) and distributes them across the cluster nodes. Because the blocks are replicated, individual nodes can also go down without losing data.
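To make the block layout concrete, here is a tiny back-of-the-envelope sketch in Python; the 128 MB block size and a replication factor of 3 are assumed HDFS-style defaults, and the 1 TB file is purely illustrative:

    # Back-of-the-envelope view of how a distributed file system such as HDFS
    # splits a large file into blocks and replicates them across nodes.
    import math

    file_size_gb = 1024          # a 1 TB file, purely illustrative
    block_size_mb = 128          # typical HDFS default block size
    replication_factor = 3       # typical HDFS default replication

    file_size_mb = file_size_gb * 1024
    num_blocks = math.ceil(file_size_mb / block_size_mb)   # blocks the file is split into
    total_copies = num_blocks * replication_factor         # block copies spread over the cluster
    raw_storage_gb = file_size_gb * replication_factor     # raw disk space actually consumed

    print(f"{num_blocks} blocks, {total_copies} replicated block copies, "
          f"{raw_storage_gb} GB of raw storage")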

If you are coming from the traditional DWH world, you are used to technologies that work well with data that is already well prepared and structured. Hadoop and co. are good for searching for insights - the needle in the haystack. You gain the power to generate insights by parallelising data processing across huge amounts of data.

Imagine you have collected terabytes of data and want to run some analysis on it (e.g. a clustering). If you had to run it on a single machine it would take hours. The key idea of big data systems is to parallelise execution in a shared-nothing architecture. If you want to increase performance, you add hardware to scale out horizontally, which speeds up processing over a huge amount of data.
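As a rough sketch of what that parallelism looks like in practice, here is a simple aggregation in PySpark; the file path and column names (machine_id, temperature, humidity) are made up for the example:

    # A minimal PySpark sketch: the same aggregation is executed in parallel on
    # every partition of the data, then the partial results are combined.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("parallel-aggregation").getOrCreate()

    # Hypothetical dataset; swap in your own source.
    readings = spark.read.parquet("hdfs:///data/sensor_readings.parquet")

    # Each executor aggregates its own partitions; Spark merges the partial results.
    stats = (readings
             .groupBy("machine_id")
             .agg(F.avg("temperature").alias("avg_temp"),
                  F.max("humidity").alias("max_humidity")))

    stats.show()

Adding more nodes gives Spark more partitions to work on at the same time, which is exactly the horizontal scale-out described above.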

Looking at a modern Big Data stack, you have the data storage layer. This can be Hadoop with a distributed file system such as HDFS, or a similar file system. On top of that sits a resource manager (such as YARN) that manages access to the cluster's resources. On top of that again, you have a data processing engine such as Apache Spark that orchestrates the execution on the storage layer.
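A minimal sketch of how those layers fit together might look like this in PySpark; the paths are hypothetical, and in practice the master is usually passed to spark-submit rather than hard-coded:

    # Sketch of the layered stack: HDFS as storage, YARN as resource manager,
    # Spark as the processing engine on top.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("stack-example")
             .master("yarn")                 # ask YARN for executors on the cluster
             .getOrCreate())

    # The data itself lives as replicated blocks in HDFS (the storage layer).
    events = spark.read.json("hdfs:///data/raw/events/")

    # The processing engine pushes the work out to the nodes holding the blocks.
    print(events.count())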

On top of the core data processing engine, you have applications and frameworks such as machine learning APIs that let you find patterns in your data. You can run either unsupervised learning algorithms to detect structure (such as a clustering algorithm), or supervised machine learning algorithms that give meaning to patterns in the data and allow you to predict outcomes (e.g. linear regression or random forests).
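For example, a clustering run with Spark's ML API could look roughly like this; the dataset, the column names and the choice of k=5 are assumptions made only for illustration:

    # Sketch of unsupervised learning with Spark's ML API: clustering sensor
    # readings with KMeans.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("clustering-example").getOrCreate()

    readings = spark.read.parquet("hdfs:///data/sensor_readings.parquet")

    # MLlib expects the input features in a single vector column.
    assembler = VectorAssembler(inputCols=["temperature", "humidity"],
                                outputCol="features")
    features = assembler.transform(readings)

    # Fit KMeans with an arbitrary number of clusters; the training is
    # distributed across the cluster like any other Spark job.
    model = KMeans(k=5, featuresCol="features").fit(features)
    clustered = model.transform(features)     # adds a "prediction" column
    clustered.select("machine_id", "prediction").show()

A supervised model such as a random forest would follow the same pattern: assemble a feature vector, fit the model, then transform to get predictions.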

This is my Big Data in a nutshell for people who are experienced with traditional database systems.

Answered by Stefan Papp, Oct 15 '22