
How to order a huge (GB-sized) CSV file?

Background

I have a huge CSV file that has several million rows. Each row has a timestamp I can use to order it.

Naive approach

So, my first approach was obviously to read the entire thing into memory and then sort it. That didn't work very well, as you may guess...

Naive approach v2

My second try was to loosely follow the idea behind MapReduce.

So, I would slice this huge file into several parts and sort each one. Then I would combine all the parts into the final file.

The issue here is that part B may contain a row that should actually be in part A. So in the end, even though each part is ordered, I cannot guarantee the order of the final file...

Objective

My objective is to create a function that, when given this huge unordered CSV file, can create an ordered CSV file with the same information.

Question

What are the popular solutions/algorithms for sorting data sets this big?

Asked May 22 '18 by Flame_Phoenix


2 Answers

What are the popular solutions/algorithms for sorting data sets this big?

Since you've already concluded that the data is too large to sort/manipulate in the memory you have available, the popular solution is a database, which will build disk-based structures for managing and sorting more data than can fit in memory.

You can either build your own disk-based scheme or you can grab one that is already fully developed, optimized and maintained (e.g. a popular database). The "popular" solution that you asked about would be to use a database for managing/sorting large data sets. That's exactly what they're built for.

Database

You could set up a table indexed by your sort key, insert all the records into the database, then create a cursor sorted by your key and iterate it, writing the now-sorted records to your new file one at a time. Then delete the database when done.
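
For example, here's a minimal sketch of that idea using Python's built-in `sqlite3` module. The file names, the assumption that the timestamp is the first CSV column, and the assumption that it sorts correctly as text (e.g. ISO-8601) are all placeholders to adapt to your data:

```python
import csv
import sqlite3

INPUT_CSV = "huge_unsorted.csv"   # assumed file names
OUTPUT_CSV = "sorted.csv"

con = sqlite3.connect("sort_scratch.db")
con.execute("CREATE TABLE rows (ts TEXT, line TEXT)")

# Insert every record, keyed by its timestamp, keeping the original line intact.
with open(INPUT_CSV, newline="") as f:
    for line in f:
        if not line.strip():
            continue
        key = next(csv.reader([line]))[0]   # assumed: timestamp in column 0
        con.execute("INSERT INTO rows VALUES (?, ?)", (key, line.rstrip("\r\n")))
con.commit()

# Index the sort key so the ORDER BY below can scan the data in key order.
con.execute("CREATE INDEX idx_ts ON rows (ts)")

# Iterate a cursor in key order and stream the rows to the output file.
with open(OUTPUT_CSV, "w", newline="") as out:
    for (line,) in con.execute("SELECT line FROM rows ORDER BY ts"):
        out.write(line + "\n")

con.close()
# sort_scratch.db can be deleted once the output file is written.
```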


Chunked Memory Sort, Manual Merge

Alternatively, you could do your own chunked sort: break the data into smaller pieces that fit in memory, sort each piece, and write each sorted block to disk. Then merge all the blocks: read the next record from each block into memory, find the lowest one across all the blocks, write it out to your final output file, read the next record from that block, and repeat. Using this scheme, the merge only ever has to hold N records in memory at a time, where N is the number of sorted chunks you have (far fewer than during the initial chunk sort).

As juvian mentioned, here's an overview of how an "external sort" like this could work: https://en.wikipedia.org/wiki/External_sorting.

One key aspect of the chunked memory sort is deciding how big to make the chunks. There are a number of strategies. The simplest may be to just decide how many records you can reliably fit and sort in memory, based on a few simple tests or even just a guess you're sure is safe (picking a smaller number per chunk just means you will split the data across more files). Then read that many records into memory, sort them, and write them out to a known filename. Repeat that process until you have read all the records and they are all in temp files with known filenames on disk.

Then, open each file and read the first record from each one. Find the lowest of the records you read in, write it out to your final file, read the next record from that file, and repeat the process. When you reach the end of a file, just remove it from the set you're comparing, since it's done. When there is no more data, you're done.
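
Putting the chunking and merging phases together, here's a minimal Python sketch of such an external sort. The chunk size, file names, and the assumption that the timestamp is the first column and sorts correctly as text are all placeholders:

```python
import csv
import heapq
import os
import tempfile

CHUNK_SIZE = 1_000_000            # records per chunk; pick a size you know fits in memory
INPUT_CSV = "huge_unsorted.csv"   # assumed file names
OUTPUT_CSV = "sorted.csv"

def sort_key(row):
    return row[0]                 # assumed: timestamp in column 0, sortable as text

def sorted_chunk_files(path):
    """Phase 1: split the input into sorted temp files of CHUNK_SIZE records each."""
    chunk_paths = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        while True:
            chunk = [row for _, row in zip(range(CHUNK_SIZE), reader)]
            if not chunk:
                break
            chunk.sort(key=sort_key)
            fd, tmp_path = tempfile.mkstemp(suffix=".csv")
            with os.fdopen(fd, "w", newline="") as tmp:
                csv.writer(tmp).writerows(chunk)
            chunk_paths.append(tmp_path)
    return chunk_paths

def merge_chunks(chunk_paths, out_path):
    """Phase 2: k-way merge; only one record per chunk is held in memory at a time."""
    files = [open(p, newline="") for p in chunk_paths]
    readers = [csv.reader(f) for f in files]
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for row in heapq.merge(*readers, key=sort_key):
            writer.writerow(row)
    for f in files:
        f.close()
    for p in chunk_paths:
        os.remove(p)

merge_chunks(sorted_chunk_files(INPUT_CSV), OUTPUT_CSV)
```

`heapq.merge` does exactly the "keep the lowest of N front records" bookkeeping described above, pulling the next record from whichever chunk supplied the one just written.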


Sort Keys only in Memory

If all the sort keys themselves would fit in memory, but not the associated data, then you could make and sort your own index. There are many different ways to do that, but here's one scheme.

Read through the entire original data set, capturing two things in memory for every record: the sort key and the file offset in the original file where that record is stored. Then, once you have all the sort keys in memory, sort them. Finally, iterate through the sorted keys one by one, seek to the right spot in the file, read that record, write it to the output file, advance to the next key, and repeat until the data for every key has been written in order.
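
A sketch of that index-in-memory scheme, again assuming the timestamp is the first column, the file is UTF-8, and the file/column names are placeholders:

```python
import csv

INPUT_CSV = "huge_unsorted.csv"   # assumed file names
OUTPUT_CSV = "sorted.csv"

# Pass 1: record (sort key, byte offset) for every line; only the keys live in memory.
index = []
with open(INPUT_CSV, "rb") as f:
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        if not line.strip():
            continue
        key = next(csv.reader([line.decode("utf-8")]))[0]
        index.append((key, offset))

index.sort()  # timestamps assumed to sort correctly as text

# Pass 2: seek to each record in key order and copy it to the output file.
with open(INPUT_CSV, "rb") as src, open(OUTPUT_CSV, "wb") as dst:
    for _key, offset in index:
        src.seek(offset)
        dst.write(src.readline())
```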


BTree Key Sort

If all the sort keys won't fit in memory, then you can get a disk-based BTree library that will let you sort more data than fits in memory. You'd use the same scheme as above, but you'd be putting the sort key and file offset into a BTree.

Of course, it's only one step further to put the actual data itself from the file into the BTree and then you have a database.
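
Since SQLite indexes are themselves disk-based B-trees, one way to sketch this without pulling in a dedicated B-tree library is to keep only the (key, offset) pairs in an indexed SQLite table and leave the row data in the original file. File names and the first-column timestamp are, as before, assumptions:

```python
import csv
import sqlite3

INPUT_CSV = "huge_unsorted.csv"   # assumed file names
OUTPUT_CSV = "sorted.csv"

con = sqlite3.connect("key_index.db")
con.execute("CREATE TABLE keys (ts TEXT, byte_offset INTEGER)")

# Store only (sort key, byte offset); the row data stays in the original file.
with open(INPUT_CSV, "rb") as f:
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        if not line.strip():
            continue
        key = next(csv.reader([line.decode("utf-8")]))[0]
        con.execute("INSERT INTO keys VALUES (?, ?)", (key, offset))
con.commit()
con.execute("CREATE INDEX idx_keys ON keys (ts)")  # the index is a disk-based B-tree

# Walk the keys in sorted order and copy each record from the original file.
with open(INPUT_CSV, "rb") as src, open(OUTPUT_CSV, "wb") as dst:
    for (offset,) in con.execute("SELECT byte_offset FROM keys ORDER BY ts"):
        src.seek(offset)
        dst.write(src.readline())

con.close()
```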

Answered Sep 30 '22 by jfriend00


I would read the entire file row by row and output each line into a temporary folder, grouping lines into files by a reasonable time interval (whether the interval should be a year, a day, an hour, etc. is for you to decide based on your data). So the temporary folder would contain an individual file for each interval (for example, for a per-day split that would be 2018-05-20.tmp, 2018-05-21.tmp, 2018-05-22.tmp, etc.). Now we can read the files in order, sort each one in memory, and output it into the target sorted file.
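
A Python sketch of that bucketing approach, assuming the timestamp is the first column, starts with an ISO date (e.g. 2018-05-20T...), and that each day's bucket fits in memory; file and folder names are placeholders:

```python
import csv
import os
from glob import glob

INPUT_CSV = "huge_unsorted.csv"   # assumed file names
OUTPUT_CSV = "sorted.csv"
TMP_DIR = "buckets"
os.makedirs(TMP_DIR, exist_ok=True)

# Pass 1: append every line to a per-day bucket file, e.g. buckets/2018-05-20.tmp.
handles = {}
with open(INPUT_CSV, newline="") as f:
    for line in f:
        if not line.strip():
            continue
        day = next(csv.reader([line]))[0][:10]   # "YYYY-MM-DD" prefix of the timestamp
        if day not in handles:
            handles[day] = open(os.path.join(TMP_DIR, day + ".tmp"), "w", newline="")
        handles[day].write(line)
for h in handles.values():
    h.close()

# Pass 2: process buckets in chronological order, sorting each one in memory.
with open(OUTPUT_CSV, "w", newline="") as out:
    writer = csv.writer(out)
    for path in sorted(glob(os.path.join(TMP_DIR, "*.tmp"))):
        with open(path, newline="") as bucket:
            rows = sorted(csv.reader(bucket), key=lambda row: row[0])
        writer.writerows(rows)
        os.remove(path)
```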

Answered Sep 30 '22 by Y.B.