Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

sort and get uniq lines of file in python

i always use this commmand line to sort and get uniq lines only and it works as a charm even with large files (over 500,000 lines)

sort filename.txt | uniq | sponge filename.txt

shortest equivalent python code would be

f = open("filename.txt", "r")
lines = [line for line in f]
lines = lines.sort()
lines = set(lines)

but of course this is not scalable because of memory constrains and writing scalable code in python would take time , so i wonder what is the shortest equivalent code (package) in python

like image 575
Hady Elsahar Avatar asked Nov 04 '13 09:11

Hady Elsahar


1 Answers

You don't need to do a sort in python since set would take care of uniqueness even without sorting.

f = open("filename.txt", "r")
lines = set(f.readlines())

The shell sort command would also load the lines into memory, so using that would not get you any memory savings. If you have really large files or you are adamant on not using additional memory, you can try some crazy tricks like the one shown here: http://neopythonic.blogspot.in/2008/10/sorting-million-32-bit-integers-in-2mb.html

like image 197
Hari Menon Avatar answered Sep 23 '22 20:09

Hari Menon