Efficient solution for grouping same values in a large dataset

At my job I was asked to develop and implement a solution for the following problem:

Given a dataset of 30M records, extract (key, value) tuples from a particular dataset field, group them by key and value, and store the number of identical values for each key. Write the top 5000 most frequent values for each key to a database. Each dataset row contains up to 100 (key, value) tuples in the form of serialized XML.

I came up with the following solution (using Spring-Batch):

Batch job steps:

Step 1. Iterate over the dataset rows and extract (key, value) tuples. Upon collecting some fixed number of tuples, dump them to disk. Each tuple goes to a chunk file in a directory named after its key, so all values for a given key are stored in one directory. Within one file, values are stored sorted.

Step 2. Iterate over all key directories and merge the chunk files of each into one file, grouping identical values. Since the values are stored sorted, merging them is trivial in O(n * log k), where 'n' is the number of values in a chunk file and 'k' is the initial number of chunks.

Step 3. For each merged file (in other words, for each key), read its values sequentially, using a PriorityQueue to maintain the top 5000 values without loading all values into memory. Write the queue content to the database. (A sketch of steps 2 and 3 follows this list.)
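To make the steps concrete, here is a minimal sketch of how steps 2 and 3 can look for a single key, independent of Spring-Batch. It assumes a recent JDK (records), and the class and method names (KeyAggregator, topValues, ValueCount) are purely illustrative. A heap of chunk-file cursors performs the O(n * log k) merge, and a second, bounded heap keeps only the top-N most frequent values in memory:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class KeyAggregator {

    /** One open chunk file participating in the k-way merge. */
    private static final class ChunkCursor {
        final BufferedReader reader;
        String current;

        ChunkCursor(Path chunk) throws IOException {
            reader = Files.newBufferedReader(chunk);
            current = reader.readLine();
        }

        void advance() throws IOException {
            current = reader.readLine();
        }
    }

    /** (value, count) pair produced by the merge. */
    record ValueCount(String value, long count) {}

    /** Merges the sorted chunk files of one key and returns its top-N most frequent values. */
    public static List<ValueCount> topValues(List<Path> sortedChunks, int topN) throws IOException {
        // Step 2: k-way merge -- a heap of cursors ordered by their current line
        // yields all values of the key in globally sorted order, O(n * log k).
        Comparator<ChunkCursor> byCurrentLine = Comparator.comparing(c -> c.current);
        PriorityQueue<ChunkCursor> merge = new PriorityQueue<>(byCurrentLine);
        for (Path chunk : sortedChunks) {
            ChunkCursor cursor = new ChunkCursor(chunk);
            if (cursor.current != null) {
                merge.add(cursor);
            }
        }

        // Step 3: a bounded min-heap keeps only the topN most frequent values seen so far.
        Comparator<ValueCount> byCount = Comparator.comparingLong(ValueCount::count);
        PriorityQueue<ValueCount> top = new PriorityQueue<>(byCount);

        String runValue = null;
        long runCount = 0;
        while (!merge.isEmpty()) {
            ChunkCursor cursor = merge.poll();
            String value = cursor.current;
            if (value.equals(runValue)) {
                runCount++;                   // same value as the previous line: extend the run
            } else {
                offer(top, runValue, runCount, topN);
                runValue = value;
                runCount = 1;
            }
            cursor.advance();
            if (cursor.current != null) {
                merge.add(cursor);            // re-insert the cursor with its next line
            } else {
                cursor.reader.close();        // chunk exhausted
            }
        }
        offer(top, runValue, runCount, topN); // flush the final run

        List<ValueCount> result = new ArrayList<>(top);
        result.sort(byCount.reversed());      // most frequent first; write these rows to the DB
        return result;
    }

    private static void offer(PriorityQueue<ValueCount> top, String value, long count, int topN) {
        if (value == null) {
            return;                           // nothing accumulated yet
        }
        top.add(new ValueCount(value, count));
        if (top.size() > topN) {
            top.poll();                       // evict the currently least frequent entry
        }
    }
}
```

The point of this structure is that at no time is more than one line per open chunk plus the top-N counters held in memory, which is what keeps the heap footprint small.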

I spent about a week on this task, mainly because I had not worked with Spring-Batch before and because I tried to emphasize scalability, which required a careful implementation of the multi-threading part.

The problem is that my manager considers this task too easy to justify spending that much time on it.

And the question is: do you know of a more efficient solution, or perhaps a less efficient one that would be easier to implement? And how much time would you need to implement my solution?

I am aware of MapReduce-like frameworks, but I can't use them because the application is supposed to run on a simple PC with 3 cores and 1 GB of Java heap.

Thank you in advance!

UPD: I think I did not state my question clearly. Let me ask it another way:

Given the problem, if you were the project manager or at least the task reviewer, would you accept my solution? And how much time would you allocate for this task?

Asked Oct 15 '12 by Alexander Solovets



1 Answer

Are you sure this approach is faster than doing a pre-scan of the XML file to extract all keys, and then parsing the XML file over and over for each key? You are doing a lot of file-management work in this solution, which is definitely not free.

As you have three cores, you could parse three keys at the same time (as long as the file system can handle the load).
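A rough sketch of that idea, assuming the full key list has already been collected by the pre-scan; countValuesForKey and writeTopValues are placeholders for the per-key XML parsing and the database write, not part of any existing API:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PerKeyRunner {

    /** Processes each key independently, three at a time (one worker per core). */
    public static void processKeys(List<String> keys) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        for (String key : keys) {
            pool.submit(() -> {
                // Re-parse the dataset, counting occurrences of each value for this key.
                Map<String, Long> counts = countValuesForKey(key); // placeholder
                writeTopValues(key, counts, 5000);                 // placeholder
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }

    private static Map<String, Long> countValuesForKey(String key) {
        throw new UnsupportedOperationException("XML scan not shown");
    }

    private static void writeTopValues(String key, Map<String, Long> counts, int topN) {
        throw new UnsupportedOperationException("DB write not shown");
    }
}
```

Whether this fits in 1 GB of heap depends on how many distinct values a single key has, since each worker's count map must fit in memory while its key is being processed.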

Answered Nov 12 '22 by Storstamp