 

Sorting a text file using Python

I have a text file with more than 10 million lines. The lines look like this:

37024469;196672001;255.0000000000
37024469;196665001;396.0000000000
37024469;196664001;396.0000000000
37024469;196399002;85.0000000000
37024469;160507001;264.0000000000
37024469;160506001;264.0000000000

As you can see, the delimiter is ";". I would like to sort this text file by the second element using Python. I couldn't use the split function because it causes a MemoryError. How can I manage this?

user1907576 asked Jan 22 '13 18:01




1 Answer

Don't sort 10 million lines in memory. Split the work into batches instead:

  • Run 100 sorts of 100k lines each (use the file as an iterator, combined with islice() or similar, to pick each batch). Write each sorted batch to a separate file.

  • Merge the sorted files. Here is a merge generator; pass it the 100 open files and it will yield lines in sorted order. Write them to a new file line by line:

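    import operator
    
    def mergeiter(*iterables, **kwargs):
        """Given a set of sorted iterables, yield the next value in merged order
    
        Takes an optional `key` callable to compare values by.
        """
        # Map index -> [current value, index, iterator], skipping empty iterables.
        heads = {}
        for i, it in enumerate(iterables):
            it = iter(it)
            try:
                heads[i] = [next(it), i, it]
            except StopIteration:
                pass
        if 'key' not in kwargs:
            key = operator.itemgetter(0)
        else:
            # Apply the caller's key to the current value of each entry.
            key = lambda item, key=kwargs['key']: key(item[0])
    
        while heads:
            # Yield the smallest current value, then advance its iterator.
            value, i, it = min(heads.values(), key=key)
            yield value
            try:
                heads[i][0] = next(it)
            except StopIteration:
                # This iterable is exhausted; the loop ends once all are.
                # (A bare `raise` here becomes a RuntimeError on Python 3.7+.)
                del heads[i]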
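
For completeness, the batch-and-merge approach described above could be sketched end to end roughly as follows. This is only a sketch under some assumptions: the second field is an integer, every line has at least two ";"-separated fields, and the names `sort_key`, `write_sorted_chunks` and `external_sort` are made up for illustration. It also swaps in the standard library's `heapq.merge` (which accepts a `key` argument on Python 3.5+) for the merge step instead of the generator above.

```python
import heapq
import os
import tempfile
from itertools import islice


def sort_key(line):
    # Second ';'-separated field, compared as an integer (assumed numeric).
    return int(line.split(';', 2)[1])


def write_sorted_chunks(path, chunk_size=100000):
    """Sort the file in batches of chunk_size lines; return the chunk file paths."""
    chunk_paths = []
    with open(path) as source:
        while True:
            # islice pulls at most chunk_size lines from the file iterator.
            batch = list(islice(source, chunk_size))
            if not batch:
                break
            # Ensure the last line ends with a newline so merged lines stay separate.
            if not batch[-1].endswith('\n'):
                batch[-1] += '\n'
            batch.sort(key=sort_key)
            fd, chunk_path = tempfile.mkstemp(suffix='.chunk', text=True)
            with os.fdopen(fd, 'w') as chunk:
                chunk.writelines(batch)
            chunk_paths.append(chunk_path)
    return chunk_paths


def external_sort(src, dst, chunk_size=100000):
    """Sort src by the second field into dst, holding one batch in memory at a time."""
    chunk_paths = write_sorted_chunks(src, chunk_size)
    files = [open(p) for p in chunk_paths]
    try:
        with open(dst, 'w') as out:
            # heapq.merge lazily yields lines from the sorted chunks in key order.
            out.writelines(heapq.merge(*files, key=sort_key))
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)
```

Calling `external_sort('input.txt', 'sorted.txt')` (both file names are placeholders) would then keep only one 100k-line batch in memory during the sort phase, and one line per chunk file during the merge.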
    
Martijn Pieters answered Nov 15 '22 06:11