I'm learning to work with large amounts of data.
I've generated a file of 10,000,000 ints. I want to perform a number of sorts on the data and time the sorts (maybe plot the values for performance analysis?) but I've never worked with big data before and don't know how to sort (say, even bubble sort!) data that isn't in memory! I want to invoke the program like so:
./mySort < myDataFile > myOutFile
How would I go about sorting data that can't fit into a linked list, or array?
Any file larger than primary memory (RAM) can be sorted this way. The basic idea is to divide the large file into smaller temporary files, sort each temporary file in memory, and then merge the sorted temporary files into the final output file.
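A minimal sketch of that idea in Python: read ints from stdin in chunks, sort each chunk in memory and write it to a temporary "run" file, then k-way merge the runs to stdout. The chunk size and file handling here are illustrative choices, not requirements.

```python
import heapq
import os
import sys
import tempfile

CHUNK_SIZE = 1_000_000  # ints per in-memory run; tune to available RAM


def write_run(chunk):
    """Sort one chunk in memory and write it to a temporary run file."""
    chunk.sort()
    fd, name = tempfile.mkstemp(suffix=".run")
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{v}\n" for v in chunk)
    return name


def sort_external(in_stream, out_stream, chunk_size=CHUNK_SIZE):
    # Phase 1: split the input into sorted runs on disk.
    run_files = []
    chunk = []
    for line in in_stream:
        chunk.append(int(line))
        if len(chunk) >= chunk_size:
            run_files.append(write_run(chunk))
            chunk = []
    if chunk:
        run_files.append(write_run(chunk))

    # Phase 2: k-way merge the runs; heapq.merge streams them lazily,
    # so only one element per run is held in memory at a time.
    runs = [open(name) for name in run_files]
    try:
        for value in heapq.merge(*(map(int, r) for r in runs)):
            out_stream.write(f"{value}\n")
    finally:
        for r in runs:
            r.close()
        for name in run_files:
            os.remove(name)


if __name__ == "__main__":
    sort_external(sys.stdin, sys.stdout)
```

Saved as `mySort.py`, this can be invoked just like the question asks: `python mySort.py < myDataFile > myOutFile`.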
Some basic algorithms like insertion sort or bubble sort require no additional memory and can sort the data in place. On the other hand, more efficient algorithms like quicksort and merge sort require O(log N) and O(N) extra space respectively (for the recursion stack and the auxiliary merge array), meaning extra memory is needed to complete the sorting.
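To illustrate what "in place" means here, a standard insertion sort only shifts elements within the array it was given, so its extra memory use is O(1) no matter how large the input is:

```python
def insertion_sort(a):
    """Sort list a in place; uses O(1) extra memory."""
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        # Shift elements larger than key one slot to the right.
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a
```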
Insertion sort is not efficient for large inputs. Among the three, it performs the most comparisons and assignments, and it takes the most CPU time. Its worst case is hard to avoid, because even its average running time is Θ(n²).
There are a number of algorithms for performing this type of operation. They all fall under the general heading of External Sorting.
One of the best references on this, though rather technical and dense, is Donald Knuth's treatment of tape sorting algorithms. Back in the day when data was stored on tape and could only be read sequentially and then written out to other tapes, this kind of sorting was often done by repeatedly shuffling data back and forth between different tape drives.
Depending upon the size and type of dataset you are working with, it may be worthwhile to load the data into a dedicated database, or to use a cloud-based service like Google's BigQuery. BigQuery charges nothing to upload or download your dataset; you pay only for processing. The first TB of processed data each month is free, and your dataset is less than a GB.
Edit: Here's a very nice set of undergraduate lecture notes on external sorting algorithms. http://www.math-cs.gordon.edu/courses/cs321/lectures/external_sorting.html
You need to use external sorting: bring in part of the data at a time, sort it in memory, write it back out, and then merge the sorted pieces.
More details here:
http://en.m.wikipedia.org/wiki/External_sorting
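The merge step this answer describes can be done in isolation with Python's standard library: `heapq.merge` combines any number of already-sorted sequences (the sorted pieces) while holding only one element per piece in memory.

```python
import heapq

# Three already-sorted "pieces", e.g. runs read back from disk.
run_a = [1, 4, 7]
run_b = [2, 5, 8]
run_c = [3, 6, 9]

# Lazily merge them into one sorted stream.
merged = list(heapq.merge(run_a, run_b, run_c))
```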