
Sorting in pandas for large datasets


I would like to sort my data by a given column, specifically p-values. However, the issue is that I am not able to load my entire dataset into memory. Thus, the following doesn't work, or rather works only for small datasets.

data = data.sort(columns=["P_VALUE"], ascending=True, axis=0)

Is there a quick way to sort my data by a given column that only takes chunks into account and doesn't require loading entire datasets in memory?

asked Jan 22 '14 by user1867185

People also ask

How do you sort a large dataset in Python?

If it's transaction data, you could use AWK or pandas to parse each 1-million-row chunk out into a per-year_quarter directory/file, and then you can sort these aggregated files individually. If you need the data in one file, then at the end you can just stack them back together in order.
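A rough sketch of that chunk-and-partition idea in pandas. The file name transactions.csv and its date column are assumptions, not part of the original answer:

```python
import os

import pandas as pd

def partition_by_quarter(src, outdir, chunksize=1_000_000):
    """Stream a large CSV in chunks, appending each row to a
    per-quarter file such as 2014Q1.csv."""
    os.makedirs(outdir, exist_ok=True)
    for chunk in pd.read_csv(src, parse_dates=['date'], chunksize=chunksize):
        # label every row with its year/quarter, e.g. 2014Q1
        quarters = chunk['date'].dt.to_period('Q')
        for period, part in chunk.groupby(quarters):
            path = os.path.join(outdir, f'{period}.csv')
            # append; write the header only when the file is first created
            part.to_csv(path, mode='a', header=not os.path.exists(path), index=False)
```

Each per-quarter file then fits in memory (or at least is far smaller) and can be sorted on its own.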

Is pandas efficient for large data sets?

The default pandas data types are not the most memory efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as “low-cardinality” data). By using more efficient data types, you can store larger datasets in memory.

How do you sort pandas from largest to smallest?

In order to sort a DataFrame in pandas, the function sort_values() is used. sort_values() can sort the data frame in ascending or descending order.
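For illustration, a minimal sort_values() example on a toy DataFrame (the column name is just an example):

```python
import pandas as pd

df = pd.DataFrame({'P_VALUE': [0.04, 0.001, 0.72]})

asc = df.sort_values('P_VALUE')                    # smallest first (the default)
desc = df.sort_values('P_VALUE', ascending=False)  # largest first
```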


2 Answers

In the past, I've used Linux's pair of venerable sort and split utilities, to sort massive files that choked pandas.

I don't want to disparage the other answer on this page. However, since your data is in text format (as you indicated in the comments), I think it is a needless complication to start converting it into other formats (HDF, SQL, etc.) for something that GNU/Linux utilities have been solving very efficiently for the last 30-40 years.


Say your file is called stuff.csv, and looks like this:

4.9,3.0,1.4,0.6
4.8,2.8,1.3,1.2

Then the following command will sort it by the 3rd column (-t, sets the comma as the field separator, -n compares numerically, -r reverses for descending order, and -k3 selects the third field):

sort --parallel=8 -t, -nrk3 stuff.csv

Note that --parallel=8 lets sort run up to 8 concurrent threads; drop the r flag if you want ascending order, as for your p-values.


The above will work with files that fit into the main memory. When your file is too large, you would first split it into a number of parts. So

split -l 100000 stuff.csv stuff

would split the file into files of at most 100000 lines each.

Now you would sort each file individually, as above. Finally, you would merge the sorted parts, again through (wait for it...) sort, passing the same key options so the merge preserves the ordering:

sort -m -t, -nrk3 sorted_stuff_* > final_sorted_stuff.csv
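If you would rather stay in Python, the same split/sort/merge idea can be sketched with pandas and the standard library's heapq.merge. This is not the answer's method, just a minimal illustration; the function name and file layout are invented:

```python
import heapq
import os
import tempfile

import pandas as pd

def external_sort_csv(src, dest, column, chunksize=100_000):
    """Sort a CSV that doesn't fit in memory: sort each chunk, spill it
    to a temporary 'run' file, then merge the sorted runs lazily."""
    tmpdir = tempfile.mkdtemp()
    runs = []
    for i, chunk in enumerate(pd.read_csv(src, chunksize=chunksize)):
        path = os.path.join(tmpdir, f'run_{i}.csv')
        chunk.sort_values(column).to_csv(path, index=False)
        runs.append(path)

    def stream(path):
        # yield (key, row) pairs from one sorted run, a little at a time
        for piece in pd.read_csv(path, chunksize=10_000):
            for _, row in piece.iterrows():
                yield row[column], row

    with open(dest, 'w') as out:
        header_written = False
        # heapq.merge only ever buffers a handful of rows per run
        for _, row in heapq.merge(*(stream(p) for p in runs),
                                  key=lambda pair: pair[0]):
            if not header_written:
                out.write(','.join(row.index) + '\n')
                header_written = True
            out.write(','.join(map(str, row.values)) + '\n')
```

Only one chunk is ever sorted in memory at a time; the merge phase streams the runs, which is exactly what GNU sort -m does for you.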

Finally, if your file is not in CSV (say it is a tgz file), then you should find a way to pipe a CSV version of it into split.

answered Oct 10 '22 by Ami Tavory


As I mentioned in the comments, this answer already provides a possible solution. It is based on the HDF format.

About the sorting problem, there are at least three possible ways to solve it with that approach.

First, you can try to use pandas directly, querying the HDF-stored-DataFrame.

Second, you can use PyTables, which pandas uses under the hood.

Francesc Alted gives a hint in the PyTables mailing list:

The simplest way is by setting the sortby parameter of the Table.copy() method to the column you want to sort by. This triggers an on-disk sorting operation, so you don't have to be afraid of your available memory. You will need the Pro version for getting this capability.

In the docs, it says:

sortby : If specified, and sortby corresponds to a column with an index, then the copy will be sorted by this index. If you want to ensure a fully sorted order, the index must be a CSI one. A reverse sorted copy can be achieved by specifying a negative value for the step keyword. If sortby is omitted or None, the original table order is used
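A minimal sketch of that on-disk sort. Note that in PyTables 3.x indexing no longer requires the Pro version; the file name, table name, and column name below are invented for the demo:

```python
import tables

class Result(tables.IsDescription):
    p_value = tables.Float64Col()

# build a tiny demo table (a stand-in for a real on-disk dataset)
with tables.open_file('results.h5', mode='w') as h5:
    table = h5.create_table('/', 'data', Result)
    table.append([(0.4,), (0.01,), (0.9,)])
    table.flush()
    # sortby requires a completely sorted (CSI) index on the column
    table.cols.p_value.create_csindex()
    # on-disk sort: the copy is written to disk in p_value order
    table.copy(newname='data_sorted', sortby='p_value', checkCSI=True)
```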

Third, still with PyTables, you can use the method Table.itersorted().

From the docs:

Table.itersorted(sortby, checkCSI=False, start=None, stop=None, step=None)

Iterate table data following the order of the index of sortby column. The sortby column must have associated a full index.
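A small sketch of Table.itersorted(), again with invented file and column names; as the docs say, the sortby column must carry a full (CSI) index:

```python
import tables

class Result(tables.IsDescription):
    p_value = tables.Float64Col()

# tiny demo table; a real dataset would already live on disk
with tables.open_file('iter_demo.h5', mode='w') as h5:
    table = h5.create_table('/', 'data', Result)
    table.append([(0.4,), (0.01,), (0.9,)])
    table.flush()
    table.cols.p_value.create_csindex()  # itersorted needs a full index
    # rows come back in p_value order straight from the on-disk index,
    # without materializing the whole table in memory
    ordered = [row['p_value'] for row in table.itersorted('p_value')]
```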


Another approach consists of using a database in between. The detailed workflow can be seen in this IPython Notebook published at plot.ly.

This makes it possible to solve the sorting problem, along with the other data analyses that pandas supports. It looks like it was created by the user chris, so all the credit goes to him. I am copying here the relevant parts.

Introduction

This notebook explores a 3.9 GB CSV file.

This notebook is a primer on out-of-memory data analysis with

  • pandas: A library with easy-to-use data structures and data analysis tools. Also, interfaces to out-of-memory databases like SQLite.
  • IPython notebook: An interface for writing and sharing python code, text, and plots.
  • SQLite: A self-contained, serverless database that's easy to set up and query from pandas.
  • Plotly: A platform for publishing beautiful, interactive graphs from Python to the web.

Requirements

import pandas as pd
from sqlalchemy import create_engine # database connection 

Import the CSV data into SQLite

  1. Load the CSV, chunk-by-chunk, into a DataFrame
  2. Process the data a bit, strip out uninteresting columns
  3. Append it to the SQLite database

disk_engine = create_engine('sqlite:///311_8M.db') # initializes the database with filename 311_8M.db in the current directory

chunksize = 20000
index_start = 1

for df in pd.read_csv('311_100M.csv', chunksize=chunksize, iterator=True, encoding='utf-8'):

    # do stuff   

    df.index += index_start

    df.to_sql('data', disk_engine, if_exists='append')
    index_start = df.index[-1] + 1

Query value counts and order the results

Housing and Development Dept receives the most complaints

df = pd.read_sql_query('SELECT Agency, COUNT(*) as `num_complaints` '
                       'FROM data '
                       'GROUP BY Agency '
                       'ORDER BY -num_complaints', disk_engine)

Limiting the number of sorted entries

Which 10 cities report the most complaints?

df = pd.read_sql_query('SELECT City, COUNT(*) as `num_complaints` '
                       'FROM data '
                       'GROUP BY `City` '
                       'ORDER BY -num_complaints '
                       'LIMIT 10 ', disk_engine)

Possibly related and useful links

  • Pandas: in memory sorting hdf5 files
  • ptrepack sortby needs 'full' index
  • http://pandas.pydata.org/pandas-docs/stable/cookbook.html#hdfstore
  • http://www.pytables.org/usersguide/optimization.html
answered Oct 10 '22 by iled