
Collection Framework, Big Data and Best Practice

I have the following class:

public class BdFileContent {
    String filecontent;
}

E.g. file1.txt has the following content:

This is test
  • "This" represents single instance of file content object.
  • "is" represents another file content object
  • "test" represents another file content object

Suppose the following is the folder structure:

lineage
|
+-folder1
|    |
|    +-file1.txt
|    +-file2.txt
|
+-folder2
|    |
|    +-file3.txt
|    +-file4.txt
+-...
|
+-...+-fileN.txt

Here N > 1000 files; N will be a very large value.

The BdFileContent class represents each string (token) in each file in the directory.

I have to do a lot of data manipulation and need to work with a complex data structure. I have to perform computation both in memory and on disk.

ArrayList<ArrayList<ArrayList<BdFileContent>>> filecontentallFolderFileAsSingleStringToken = new ArrayList<>(); 

For example, the above object represents all the file contents of the directory. I have to add this object as a tree node in a BdTree.

I am writing my own tree and adding filecontentallFolderFileAsSingleStringToken as a node.
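For illustration, here is a minimal sketch (not the asker's actual code) of how such a nested list might be populated by walking the folder tree; the root path "lineage", the whitespace tokenisation and the direct access to the package-private field are assumptions:

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;

public class LoadFolderTokens {
    public static void main(String[] args) throws IOException {
        // Hypothetical root folder from the question; assumed to exist.
        File root = new File("lineage");
        ArrayList<ArrayList<ArrayList<BdFileContent>>> allFolders = new ArrayList<>();
        for (File folder : root.listFiles(File::isDirectory)) {
            // Middle list: one entry per file in this folder.
            ArrayList<ArrayList<BdFileContent>> filesInFolder = new ArrayList<>();
            for (File file : folder.listFiles(File::isFile)) {
                // Inner list: one BdFileContent per whitespace-separated token.
                ArrayList<BdFileContent> tokens = new ArrayList<>();
                for (String line : Files.readAllLines(file.toPath())) {
                    for (String word : line.split("\\s+")) {
                        if (word.isEmpty()) continue;
                        BdFileContent token = new BdFileContent();
                        token.filecontent = word; // package-private field, same package assumed
                        tokens.add(token);
                    }
                }
                filesInFolder.add(tokens);
            }
            allFolders.add(filesInFolder);
        }
    }
}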

To what extent are the Collection Framework data structures appropriate for huge data sets?

At this point I want to get some insight into how big companies use data structures to manipulate the huge sets of data generated every day.

Are they using the Collection Framework?

Do they use their own custom data structures?

Are they using multi-node data structures, with each node running on a separate JVM?

Till now, a collection object runs on a single JVM and cannot dynamically use another JVM when memory overflows or processing resources run short.

Normally, what data structures do other developers use for big data?

How are other developers handling it?

I want to get some hints from real use cases and experience.

asked Aug 06 '15 by abishkar bhattarai



3 Answers

When you're dealing with big data you must change your approach. First of all, you have to assume that all your data will not fit into the memory of a single machine, so you need to split the data among several machines, let them compute what you need, and then re-assemble the results. So you can use Collections, but only for part of the whole job.

I suggest you take a look at:

  • Hadoop: the first framework for dealing with big data
  • Spark: another framework for big data, often faster than Hadoop
  • Akka: a framework for writing distributed applications

While Hadoop and Spark are the de facto standards in the big data world, Akka is a general-purpose framework used in a lot of contexts, not only with big data: that means you'll have to write a lot of the stuff that Hadoop and Spark already provide. I put it in the list just for the sake of completeness.

You can read about the WordCount example, which is the "Hello World" equivalent of the big data world, to get an idea of how the MapReduce programming paradigm works in Hadoop, or you can take a look at the quick start guide to see the equivalent transformation in Spark.
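For reference, a minimal Spark WordCount sketch in Java, roughly what the quick start guide walks through; the input and output paths are placeholders:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // Launched with spark-submit, which supplies the master URL.
        SparkConf conf = new SparkConf().setAppName("WordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Placeholder input path.
        JavaRDD<String> lines = sc.textFile("hdfs:///lineage/folder1/file1.txt");

        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // split lines into tokens
                .mapToPair(word -> new Tuple2<>(word, 1))                      // (word, 1) pairs
                .reduceByKey(Integer::sum);                                    // sum counts per word

        // Placeholder output path.
        counts.saveAsTextFile("hdfs:///output/wordcount");
        sc.stop();
    }
}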

answered Oct 28 '22 by Andrea Iacono


When it comes to Big Data, the leading technologies available are the Hadoop Distributed File System aka HDFS (a variant of the Google File System), Hadoop, Spark/MapReduce and Hive (originally developed by Facebook). Since you are asking mainly about the data structures used in Big Data processing, you need to understand the role of these systems.

Hadoop Distributed File System - HDFS

In very simple words, this is a file storage system which uses a cluster of cheap machines to store files in a 'highly available' and 'fault tolerant' manner. This becomes the data input source for Big Data processing. The data can be structured (say, comma-delimited records) or unstructured (the content of all the books in the world).
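As a small illustration (not from the original answer), reading a file back from HDFS with the standard Hadoop FileSystem API might look like the sketch below; the namenode URI and file path are placeholders:

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        // Placeholder namenode address and file path - adjust for your cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        try (InputStream in = fs.open(new Path("/lineage/folder1/file1.txt"))) {
            // Stream the file content to stdout; the file itself is stored in
            // replicated blocks across cheap machines, which is what gives HDFS
            // its availability and fault tolerance.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}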

How to deal with structured data

One prominent technology used for structured data is Hive. It gives a relational-database-like view of the data. Note that it is not a relational database itself. The source of this view is again the files stored on disk (or HDFS, which big companies use). When you process the data with Hive, the logic is applied to the files (internally via one or more MapReduce programs) and the result is returned. If you wish to store this result, it will land on disk (or HDFS) again in the form of structured files.

Thus a sequence of Hive queries helps you refine a big data set into the desired data set via step-wise transformations. Think of it like extracting data from a traditional DB system using joins and then storing the data into a temp table.
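As a hedged sketch of that step-wise refinement, a Java client could drive Hive over JDBC roughly like this; the HiveServer2 URL and the table/column names are made up for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveStepwiseRefinement {
    public static void main(String[] args) throws Exception {
        // Older setups may need the driver loaded explicitly.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder HiveServer2 endpoint and credentials.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver:10000/default", "user", "");
             Statement stmt = conn.createStatement()) {
            // Each statement reads files on HDFS and writes a new structured
            // file set back to HDFS - step-wise refinement of the data set.
            stmt.execute("CREATE TABLE clean_tx AS "
                    + "SELECT user_id, amount FROM raw_tx WHERE amount IS NOT NULL");
            stmt.execute("CREATE TABLE user_totals AS "
                    + "SELECT user_id, SUM(amount) AS total FROM clean_tx GROUP BY user_id");
        }
    }
}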

How to deal with unstructured data

When it comes to dealing with unstructured data, the MapReduce approach is one of the most popular, along with Apache Pig (which is ideal for semi-structured data). The MapReduce paradigm mainly reads data from disk (or HDFS), processes it on multiple machines and writes the result back to disk.

If you read the popular O'Reilly book on Hadoop - Hadoop: The Definitive Guide - you will find that a MapReduce program fundamentally works on a key-value type of data structure (like a Map), but it never keeps all the values in memory at one point in time. It is more like:

  1. Get the key-value data
  2. Do some processing
  3. Spit the data out to disk via the context
  4. Do this for all the key-values, thus processing one logical unit at a time from the Big Data source.

At the end, the output of one MapReduce program is written to disk, and now you have a new set of data for the next level of processing (which again might be another MapReduce program).
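A minimal Mapper sketch showing that per-record pattern - receive one key-value pair, process it, and write the result out through the context rather than holding everything in memory. The choice of word tokens as keys and a count of 1 as the value is an assumption for illustration:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // One call per input record (here, one line of a file block).
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            // "Spit the data out to disk via the context" - the framework
            // spills and shuffles it; nothing accumulates in this JVM.
            context.write(word, ONE);
        }
    }
}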

Now, to answer your specific queries:

At this point I want to get some insight into how big companies use data structures to manipulate the huge sets of data generated every day.

They use HDFS (or a similar distributed file system) to store Big Data. If the data is structured, Hive is a popular tool to process it. Because Hive queries for transforming the data are close to SQL (syntax-wise), the learning curve is really low.

Are they using the Collection Framework?

While processing Big Data, the whole content is never kept in memory (not even on the cluster nodes). It's more like a chunk of data is processed at a time. This chunk might be represented as a collection (in memory) while it is being processed, but in the end the whole set of output data is dumped back to disk in structured form.

Do they use their own custom data structures?

Since not all the data is stored in memory, there is no strong case for a custom in-memory data structure. However, data movement within MapReduce or across the network does happen in the form of data structures, so yes, there are data structures, but that is not such an important consideration from an application developer's perspective. The logic inside MapReduce or other Big Data processing is written by the developer, and you can always use any API (or custom collection) to process the data; but the data has to be written back to disk in the data structure expected by the framework.
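For example, if the asker's BdFileContent had to travel through a MapReduce job, one hedged sketch would be to wrap it in Hadoop's Writable interface so the framework can serialize it to disk and across the network in the form it expects (the class name here is made up):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class BdFileContentWritable implements Writable {
    private String filecontent = "";

    // Called by the framework when the value is spilled to disk or shuffled.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(filecontent);
    }

    // Called by the framework when the value is read back.
    @Override
    public void readFields(DataInput in) throws IOException {
        filecontent = in.readUTF();
    }
}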

Are they using multi-node data structures, with each node running on a separate JVM?

The big data in files is processed across multiple machines in blocks, e.g. 10 TB of data is processed in 64 MB blocks across the cluster by multiple nodes (separate JVMs, and sometimes multiple JVMs on one machine as well). But again, it is not a data structure shared across JVMs; rather, it is data input distributed (in the form of file blocks) across JVMs.

Till now, a collection object runs on a single JVM and cannot dynamically use another JVM when memory overflows or processing resources run short.

You are right.

Normally, what data structures do other developers use for big data?

From the data input/output perspective, it is always a file on HDFS. For processing the data (application logic), you can use any normal Java API that can be run in the JVM. Since the JVMs in the cluster run in the Big Data environment, they also have resource constraints, so you must devise your application logic to work within those resource limits (as we do for a normal Java program).

How are other developers handling it?

I would suggest reading The Definitive Guide (mentioned in the section above) to understand the building blocks of Big Data processing. The book is awesome and touches on many aspects/problems and their solution approaches in Big Data.

I want to get some hints from real use cases and experience.

There are numerous use cases of Big Data processing, especially with financial institutions. Google Analytics is one of the prominent use cases: it captures users' behavior on a website in order to determine the best position on a webpage to place the Google ad block. I am working with a leading financial institution which loads users' transaction data into Hive in order to do fraud detection based on user behavior.

answered Oct 28 '22 by Gyanendra Dwivedi


These are the answers to your queries (the queries are addressed with Hadoop in mind):

Are they using the Collection Framework?

No. The HDFS file system is used in the case of Hadoop.

Do they use their own custom data structures?

You have to understand HDFS - the Hadoop Distributed File System. Refer to the O'Reilly book Hadoop: The Definitive Guide, 3rd Edition. If you want to know the fundamentals without buying the book, try these links: HDFS Basics or Apache Hadoop. The HDFS file system is a reliable and fault-tolerant system.

Are they using multi-node data structures, with each node running on a separate JVM?

Yes. Refer to the Hadoop 2.0 YARN architecture.

Normally, what data structures do other developers use for big data?

There are many. Refer to: Hadoop Alternatives.

How are other developers handling it?

Through the framework provided by the respective technology - the MapReduce framework in the case of Hadoop.

I want to get some hints from real use cases and experience.

Big Data technologies are useful where an RDBMS fails - data analytics, data warehousing (a system used for reporting and data analysis). Some of the use cases: recommendation engines (LinkedIn), ad targeting (YouTube), processing large volumes of data - finding the hottest/coldest day of a place over 100+ years of weather records, share price analysis, market trending, etc.

Refer to Big Data Use Cases for many real-life use cases.

answered Oct 28 '22 by Ravindra babu