 

pandas data frame - select rows and clear memory?

I have a large pandas dataframe (size = 3 GB):

import pandas as pd

x = pd.read_table('big_table.txt', sep='\t', header=0, index_col=0)

Because I'm working under memory constraints, I subset the dataframe:

rows = calculate_rows() # a function that calculates what rows I need
cols = calculate_cols() # a function that calculates what cols I need
x = x.iloc[rows, cols]

The functions that calculate the rows and columns are not important, but they are DEFINITELY a smaller subset of the original rows and columns. However, when I do this operation, memory usage increases by a lot! The original goal was to shrink the memory footprint to less than 3GB, but instead, memory usage goes well over 6GB.

I'm guessing this is because Python creates a local copy of the dataframe in memory, but doesn't clean it up. There may also be other things that are happening... So my question is how do I subset a large dataframe and clean up the space? I can't find a function that selects rows/cols in place.

I have read a lot of Stack Overflow, but can't find much on this topic. It could be I'm not using the right keywords, so if you have suggestions, that could also help. Thanks!

asked Oct 30 '13 by a b

People also ask

How do you clear the memory of a DataFrame in Python?

Python keeps memory at a high watermark, but you can reduce the total number of dataframes you create. When modifying your dataframe, prefer inplace=True so you don't create copies. In IPython, you can also clear the output history by typing %reset out.
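A minimal sketch of that advice, assuming a throwaway DataFrame df and an IPython session:

import pandas as pd

df = pd.DataFrame({'a': range(10), 'b': range(10)})

# Modify in place instead of binding a filtered copy to a new name
df.drop('b', axis=1, inplace=True)

# In IPython, the Out[] cache also keeps old results alive; clear it with:
#   %reset out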

How do you get rid of unwanted rows in Pandas?

You can delete a list of rows from a pandas DataFrame by passing the list of index labels to the drop() method. For example, passing [5, 6] deletes the rows with those index labels, and axis=0 denotes that rows (not columns) should be deleted from the dataframe.
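A small illustration, with an assumed DataFrame df that has a default integer index:

import pandas as pd

df = pd.DataFrame({'a': range(10)})

# Drop the rows whose index labels are 5 and 6; axis=0 targets rows
df = df.drop([5, 6], axis=0)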

How do I delete 10 rows in Pandas?

Delete the top N rows of a DataFrame using drop(). By default axis=0, meaning rows are deleted; use axis=1 (or the columns param) to delete columns instead. Use inplace=True to delete the rows/columns on the existing DataFrame without creating a copy.
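A minimal sketch, again with an assumed DataFrame df:

import pandas as pd

df = pd.DataFrame({'a': range(100)})

# Drop the first 10 rows by label; axis=0 (the default) targets rows
df.drop(df.index[:10], axis=0, inplace=True)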


2 Answers

You are much better off doing something like this:

Pass usecols to read_csv to sub-select which columns you want in the first place, see here.

Then read the file in chunks (see here), select only the rows you want from each chunk, and finally concatenate the results.

Pseudo-code ish:

reader = pd.read_csv('big_table.txt', sep='\t', header=0, 
                     index_col=0, usecols=the_columns_i_want_to_use, 
                     chunksize=10000)

df = pd.concat([ chunk.iloc[rows_that_I_want_] for chunk in reader ])

This will have constant memory usage (the size of a chunk), plus roughly 2x the memory of the selected rows while the concat is happening; after the concat, usage drops back down to just the selected rows.
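For concreteness, a runnable sketch of the same idea; keep_rows (a set of index labels) and the_columns_i_want_to_use (a list of column names) are hypothetical stand-ins for whatever calculate_rows()/calculate_cols() compute:

import pandas as pd

# Hypothetical inputs: the index labels to keep and the columns to read.
# Note that usecols must also include the index column itself.
keep_rows = {'row_a', 'row_b', 'row_c'}
the_columns_i_want_to_use = ['id', 'col_1', 'col_2']

reader = pd.read_csv('big_table.txt', sep='\t', header=0,
                     index_col=0, usecols=the_columns_i_want_to_use,
                     chunksize=10000)

# Filter each chunk as it is read, so only the kept rows ever accumulate in memory
df = pd.concat([chunk[chunk.index.isin(keep_rows)] for chunk in reader])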

answered Oct 08 '22 by Jeff


I've had a similar problem; I solved it by filtering the data before loading. When you read the file with read_table you load the whole thing into a DataFrame, and possibly also hold the whole file in memory, or duplicated data because of the different dtypes involved, which is where the 6 GB comes from.

You could write a generator that loads the contents of the file line by line. I assume the data is row based, with one record per line in big_table.txt, so:

def big_table_generator(filename):
    with open(filename, 'rt') as f:
        for line in f:
            if is_needed_row(line):     # check if you want this row
                # cut_columns() returns a list with only the selected columns
                record = cut_columns(line)
                yield record


import pandas

gen = big_table_generator('big_table.txt')
df = pandas.DataFrame.from_records(list(gen))

Note the list(gen): pandas 0.12 and earlier don't accept generators, so you have to convert to a list first, which means all the data produced by the generator ends up in memory; 0.13 will do the same thing internally. You also temporarily need about twice the memory of the data you keep: once for the loaded records and once more when they are put into the pandas NDFrame structure.

You could also make the generator read from a compressed file; with Python 3.3 the gzip library decompresses only the needed chunks.
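A minimal sketch of that variant, reusing the same hypothetical is_needed_row()/cut_columns() helpers as above:

import gzip

def big_table_generator_gz(filename):
    # gzip.open in text mode ('rt', available since Python 3.3) decompresses
    # the file incrementally, so only a small buffer is in memory at a time
    with gzip.open(filename, 'rt') as f:
        for line in f:
            if is_needed_row(line):
                yield cut_columns(line)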

answered Oct 08 '22 by tinproject