C: Sorting Big Data; Not in Memory

Tags: c, sorting, bigdata

I'm learning to work with large amounts of data.

I've generated a file of 10,000,000 ints. I want to perform a number of sorts on the data and time them (maybe plot the values for performance analysis?), but I've never worked with big data before and don't know how to sort (say, even bubble sort!) data that isn't in memory! I want to invoke the program like so:

./mySort < myDataFile > myOutFile

How would I go about sorting data that can't fit into a linked list or array?

asked Sep 12 '14 by wadda_wadda

People also ask

How do you sort data larger than memory?

This is how any large file can be sorted when the size of primary memory (RAM) is limited. The basic idea is to divide the large file into smaller temporary files, sort the temporary files, and then create the output file by merging them.
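A minimal sketch of that first phase in C, assuming the ints are stored as whitespace-separated text (a binary file would use fread instead); the chunk size and run-file names are illustrative choices, not anything from the original post:

```c
/* Phase 1 of an external merge sort: read the input in chunks that fit
 * in memory, sort each chunk with qsort, and write each sorted run to
 * its own temporary file. Error handling is omitted for brevity. */
#include <stdio.h>
#include <stdlib.h>

#define CHUNK 1000000  /* ints per in-memory run; tune to available RAM */

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Reads ints from stdin; returns the number of sorted run files
 * written (run0.tmp, run1.tmp, ...), or -1 on allocation failure. */
int make_runs(void)
{
    int *buf = malloc(CHUNK * sizeof *buf);
    int nruns = 0;

    if (!buf)
        return -1;

    for (;;) {
        int n = 0;
        while (n < CHUNK && scanf("%d", &buf[n]) == 1)
            n++;
        if (n == 0)
            break;                      /* input exhausted */

        qsort(buf, n, sizeof *buf, cmp_int);

        char name[32];
        snprintf(name, sizeof name, "run%d.tmp", nruns++);
        FILE *f = fopen(name, "w");
        for (int i = 0; i < n; i++)
            fprintf(f, "%d\n", buf[i]);
        fclose(f);
    }
    free(buf);
    return nruns;
}
```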

Which sorting technique takes less memory?

Some basic algorithms like insertion or bubble sort require no additional memory and sort the data in place. More efficient algorithms like quick sort and merge sort, on the other hand, require O(log N) and O(N) additional space respectively.
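The in-place behaviour is easy to see in a short C sketch of insertion sort, where the array is reordered using only a couple of scalar temporaries:

```c
/* In-place insertion sort: no memory beyond a few scalars is used,
 * which is what "no additional memory" means above. */
void insertion_sort(int a[], int n)
{
    for (int i = 1; i < n; i++) {
        int key = a[i];
        int j = i - 1;
        while (j >= 0 && a[j] > key) {
            a[j + 1] = a[j];   /* shift larger elements right */
            j--;
        }
        a[j + 1] = key;        /* drop key into its final slot */
    }
}
```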

Which sorting algorithm is worst for large data?

Insertion sort is not efficient for large inputs: it performs many comparisons and assignments and takes the most CPU time among the common simple sorts. Its worst case is hard to avoid because even the average running time is Θ(n²).


2 Answers

There are a number of algorithms for performing this type of operation. They all fall under the general heading of External Sorting.

One of the best references on this, though rather technical and dense, is Donald Knuth's treatment of tape sorting algorithms. Back in the day when data was stored on tape and could only be read sequentially and then written out to other tapes, this kind of sorting was often done by repeatedly shuffling data back and forth between different tape drives.

Depending upon the size and type of dataset you are working with, it may be worthwhile to load the data into a dedicated database, or to use a cloud-based service like Google's BigQuery. BigQuery charges nothing to upload and download your dataset; you pay only for processing, the first TB of data processed each month is free, and you have less than even one GB of data.

Edit: Here's a very nice set of undergraduate lecture notes on external sorting algorithms. http://www.math-cs.gordon.edu/courses/cs321/lectures/external_sorting.html

answered Sep 28 '22 by caskey


You need to use external sorting.

Bring in part of the data at a time, sort it in memory, and then merge the sorted runs (a sketch of the merge step follows below).

More details here:

http://en.m.wikipedia.org/wiki/External_sorting
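A rough sketch of that merge step in C, assuming the sorted runs were written as text files named run0.tmp, run1.tmp, ... (as in the chunking sketch in the snippets above); a real implementation would typically use a min-heap to pick the smallest head rather than this linear scan:

```c
/* Phase 2 of an external merge sort: k-way merge of the sorted run
 * files onto stdout. A linear scan finds the minimum each round;
 * that is fine for a handful of runs, but a min-heap is the usual
 * choice once the number of runs grows. */
#include <stdio.h>

#define MAX_RUNS 64   /* sketch-level limit on the number of runs */

void merge_runs(int nruns)
{
    FILE *f[MAX_RUNS];
    int head[MAX_RUNS];   /* current front value of each run */
    int live[MAX_RUNS];   /* 1 while a run still has data */

    for (int i = 0; i < nruns; i++) {
        char name[32];
        snprintf(name, sizeof name, "run%d.tmp", i);
        f[i] = fopen(name, "r");
        live[i] = (fscanf(f[i], "%d", &head[i]) == 1);
    }

    for (;;) {
        int min = -1;                    /* index of smallest live head */
        for (int i = 0; i < nruns; i++)
            if (live[i] && (min < 0 || head[i] < head[min]))
                min = i;
        if (min < 0)
            break;                       /* all runs drained */
        printf("%d\n", head[min]);
        live[min] = (fscanf(f[min], "%d", &head[min]) == 1);
    }

    for (int i = 0; i < nruns; i++)
        fclose(f[i]);
}
```

Combined with a run-generation function like the one sketched earlier, main() reduces to writing the runs and then merging them, which matches the ./mySort < myDataFile > myOutFile invocation from the question.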

answered Sep 28 '22 by radar