Does someone really sort terabytes of data?

2 Answers

But in reality, do companies like Amazon/eBay sort terabytes of data? I know they store tons of info, but sorting it???

Yes. Last time I checked, Google processed over 20 petabytes of data daily.

Why wouldn't they keep it sorted in the first place instead of sorting terabytes of data? That is my question in a nutshell.

EDIT: relet makes a very good point; you only need to keep indexes and have those sorted. You can easily and efficiently retrieve sorted data that way. You don't have to sort the entire dataset.
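A minimal sketch of the index idea in Python. The sample records and the `build_index` / `iter_sorted` helpers are hypothetical names made up for illustration; the point is only that the small (key, position) index gets sorted while the records stay in arrival order:

```python
def build_index(records, key):
    """Return a sorted list of (key_value, position) pairs.

    Only this index is sorted; the records list itself is left untouched.
    """
    return sorted((key(rec), pos) for pos, rec in enumerate(records))


def iter_sorted(records, index):
    """Yield records in key order by following the index positions."""
    for _, pos in index:
        yield records[pos]


# Hypothetical sample data: (timestamp, product) log entries in arrival order.
records = [(1, "pear"), (2, "apple"), (3, "pear"), (4, "fig")]

# Index the records by product name instead of re-sorting them.
index = build_index(records, key=lambda rec: rec[1])

for rec in iter_sorted(records, index):
    print(rec)
```

On a real system the index would normally live in a database or on disk rather than in a Python list, but the access pattern is the same.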

102

answered Oct 30 '22 18:10

NullUserException

Consider log data from servers; Amazon must have a huge amount of it. Log data is generally stored as it is received, that is, sorted by time. Thus, if you want it sorted by product, you need to sort the whole data set.

Another issue is that the data often needs to be sorted according to the processing requirement, which might not be known beforehand.

For example: though not a terabyte, I recently sorted around 24 GB of Twitter follower network data using merge sort. The implementation I used was by Prof. Dan Lemire.

http://www.daniel-lemire.com/blog/archives/2010/04/06/external-memory-sorting-in-java-the-first-release/

The data was sorted by userid, and each line contained a userid followed by the userid of a person following that user. However, in my case I wanted data about who follows whom, so I had to sort it again by the second userid in each line.
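A rough sketch of how such a re-sort can be done when the file does not fit in memory. This is not the Java implementation linked above, just a generic external merge sort in Python; the function names, the `max_lines` chunk size, and the assumption that each line holds two whitespace-separated integer userids are all mine:

```python
import heapq
import itertools
import os
import tempfile


def second_field(line):
    """Sort key: the second whitespace-separated userid, as an integer."""
    return int(line.split()[1])


def external_sort(in_path, out_path, key, max_lines=100_000):
    """Sort a text file line by line without loading it all into memory.

    Phase 1: read at most max_lines lines at a time, sort each chunk in
    memory, and write it to a temporary run file.
    Phase 2: k-way merge the sorted runs with heapq.merge.
    """
    run_paths = []
    with open(in_path) as src:
        while True:
            chunk = list(itertools.islice(src, max_lines))
            if not chunk:
                break
            # Make sure every line ends with a newline so merged lines never fuse.
            chunk = [ln if ln.endswith("\n") else ln + "\n" for ln in chunk]
            chunk.sort(key=key)
            fd, path = tempfile.mkstemp(suffix=".run", text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)
            run_paths.append(path)

    runs = [open(path) for path in run_paths]
    try:
        with open(out_path, "w") as dst:
            dst.writelines(heapq.merge(*runs, key=key))
    finally:
        for run in runs:
            run.close()
        for path in run_paths:
            os.remove(path)
```

Usage would look like `external_sort("followers.txt", "by_follower.txt", key=second_field)`, where both file names are placeholders. The merge phase keeps one open file per run, so for very large inputs the chunk size has to be big enough to keep the number of runs reasonable.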

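However, for sorting 1 TB I would use MapReduce on Hadoop. Sorting is the default step after the map function. Thus I would choose the identity function for both map and reduce and set up streaming jobs. (Setting the reducer to NONE would not work here: with zero reduce tasks Hadoop writes the map output directly and skips the sort.)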

Hadoop uses HDFS, which stores data in large blocks of 64 MB (this value can be changed). By default it runs a single map task per block. After the map function runs, its output is sorted, I guess by an algorithm similar to merge sort.
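To get a feel for the numbers, here is a back-of-the-envelope estimate of the map task count. It assumes one map task per block, a single input file, and binary units (1 TB taken as 1024^4 bytes, 64 MB as 64 * 1024^2 bytes); `num_map_tasks` is just an illustrative helper:

```python
def num_map_tasks(input_bytes, block_bytes=64 * 1024**2):
    """Estimate map tasks as ceil(input size / block size), using integers only."""
    return -(-input_bytes // block_bytes)


print(num_map_tasks(1024**4))  # 1 TB of input at 64 MB per block -> 16384
```

So a 1 TB sort at the default block size would be split across roughly sixteen thousand map tasks.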

Here is the link to the identity mapper: http://hadoop.apache.org/common/docs/r0.16.4/api/org/apache/hadoop/mapred/lib/IdentityMapper.html

If you want to sort by some element in that data, then I would make that element the key in XXX and the line the value in the output of the map.
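A sketch of what such a map step could look like as a Hadoop Streaming mapper in Python. It relies on the streaming convention that each output line is split into key and value at the first tab. `KEY_FIELD` and `PAD_WIDTH` are hypothetical settings; the zero-padding is my assumption, added because streaming keys are compared as text by default, so unpadded numbers would not come out in numeric order:

```python
import sys

KEY_FIELD = 1   # hypothetical: sort by the second whitespace-separated field
PAD_WIDTH = 12  # hypothetical: zero-pad so text order matches numeric order


def to_key_value(line, key_field=KEY_FIELD, pad_width=PAD_WIDTH):
    """Turn one input line into 'key<TAB>original line'."""
    line = line.rstrip("\n")
    key = line.split()[key_field].zfill(pad_width)
    return f"{key}\t{line}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read lines, skip blank ones, and emit key/value pairs for the sort phase."""
    for line in stdin:
        if line.strip():
            stdout.write(to_key_value(line) + "\n")
```

A real mapper script would end by calling `main()`. The framework then sorts the emitted pairs by key before they reach the reducer, and the reducer can strip the key again to recover the original lines.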

answered Oct 30 '22 18:10

5 revs, 2 users 70%

Tags:

sorting

nsivakr
