I have a large file (100 million lines of tab separated values - about 1.5GB in size). What is the fastest known way to sort this based on one of the fields?
I have tried hive. I would like to see if this can be done faster using python.
For sorting a very large file, we can use an external sorting technique. External sorting is designed for data sets that do not fit into main memory and must instead reside in slower external storage (typically disk). It uses a hybrid sort-merge strategy.
Suppose we have to sort a 1 GB file of random integers and the available RAM is 200 MB. How would that be done? The easiest way is external sorting: we divide the source file into temporary files no larger than the available RAM and sort each of them first.
In other words, we first divide the file into runs small enough to fit into main memory, then sort each run in memory (merge sort works well here), and finally merge the resulting runs into successively bigger runs until the whole file is sorted.
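A minimal sketch of this split-sort-merge approach in Python might look like the following. The field index, chunk size, and file names are illustrative assumptions, not part of the original question; keys are compared as strings (lexicographically), which you would adjust for numeric fields.

import heapq
import itertools
import os
import tempfile

def external_sort(input_path, output_path, key_field=3, chunk_lines=1_000_000):
    """Sort a large tab-separated file on one field without loading it all into RAM.

    key_field=3 (i.e. the 4th column) and chunk_lines are placeholder values.
    """
    sort_key = lambda line: line.split("\t")[key_field]

    # Phase 1: split the input into runs that fit in memory, sort each run,
    # and write it to its own temporary file.
    temp_paths = []
    with open(input_path) as src:
        while True:
            chunk = list(itertools.islice(src, chunk_lines))
            if not chunk:
                break
            chunk.sort(key=sort_key)
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(chunk)
            temp_paths.append(path)

    # Phase 2: merge the sorted runs; heapq.merge streams the files line by
    # line, so memory use stays bounded regardless of total file size.
    run_files = [open(p) for p in temp_paths]
    try:
        with open(output_path, "w") as out:
            out.writelines(heapq.merge(*run_files, key=sort_key))
    finally:
        for f in run_files:
            f.close()
        for p in temp_paths:
            os.remove(p)

external_sort("input.txt", "sorted.txt")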
Have you considered using the *nix sort program? In raw terms, it'll probably be faster than most Python scripts.
Use -t $'\t' to specify that the input is tab-separated, -k n to specify the field (where n is the field number), and -o outputfile if you want to write the result to a new file.
Example:
sort -t $'\t' -k 4 -o sorted.txt input.txt
This will sort input.txt on its 4th field and write the result to sorted.txt.
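Since the question asks about Python, here is a small sketch (my own addition, not part of the original answer) of driving the same command from a Python script; the file names are placeholders, and forcing the C locale is a common trick to speed up byte-wise comparisons.

import os
import subprocess

# Invoke the *nix sort from Python; file names below are placeholders.
# LC_ALL=C makes sort compare raw bytes instead of doing locale-aware
# collation, which is usually much faster on large files.
subprocess.run(
    ["sort", "-t", "\t", "-k", "4", "-o", "sorted.txt", "input.txt"],
    check=True,
    env={**os.environ, "LC_ALL": "C"},
)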