I have a large pandas dataframe (size = 3 GB):
x = pd.read_table('big_table.txt', sep='\t', header=0, index_col=0)
Because I'm working under memory constraints, I subset the dataframe:
rows = calculate_rows() # a function that calculates what rows I need
cols = calculate_cols() # a function that calculates what cols I need
x = x.iloc[rows, cols]
The functions that calculate the rows and columns are not important, but they are DEFINITELY a smaller subset of the original rows and columns. However, when I do this operation, memory usage increases by a lot! The original goal was to shrink the memory footprint to less than 3GB, but instead, memory usage goes well over 6GB.
I'm guessing this is because Python creates a local copy of the dataframe in memory, but doesn't clean it up. There may also be other things that are happening... So my question is how do I subset a large dataframe and clean up the space? I can't find a function that selects rows/cols in place.
I have read a lot of Stack Overflow, but can't find much on this topic. It could be I'm not using the right keywords, so if you have suggestions, that could also help. Thanks!
Python keeps memory at a high watermark, but you can reduce the total number of dataframes you create. When modifying your dataframe, prefer inplace=True so you don't create copies. If you work in IPython, the output history also keeps references to old results; you can clear it with %reset out.
You can also delete a list of rows from a DataFrame by passing their index labels to the drop() method, e.g. df.drop([5, 6]) drops the rows labelled 5 and 6. By default axis=0, meaning rows are deleted; use axis=1 or the columns parameter to delete columns instead. Pass inplace=True to delete the rows/columns on the existing DataFrame without creating a copy.
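For example, a minimal sketch of dropping rows and columns in place (the labels here are made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': range(10), 'b': range(10), 'c': range(10)})

# drop the rows labelled 5 and 6 in place instead of creating a new DataFrame
df.drop([5, 6], axis=0, inplace=True)

# drop a column in place; columns=['c'] is equivalent to labels=['c'], axis=1
df.drop(columns=['c'], inplace=True)

Keep in mind that inplace=True mainly saves you from binding a second name to a new object; depending on the pandas version it may still build an intermediate copy internally, so it is not a guaranteed way to cut peak memory.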
You are much better off doing something like this:
Specify usecols to sub-select which columns you want in the first place when calling read_csv (see the read_csv docs).
Then read the file in chunks (see the docs on iterating through files chunk by chunk): keep only the rows you want from each chunk, and finally concatenate the results.
Pseudo-code ish:
reader = pd.read_csv('big_table.txt', sep='\t', header=0,
                     index_col=0, usecols=the_columns_i_want_to_use,
                     chunksize=10000)

# keep only the wanted rows of each chunk (by index label), then glue the pieces together
df = pd.concat([chunk[chunk.index.isin(rows_that_I_want)] for chunk in reader])
This will have a roughly constant memory usage (the size of one chunk), plus up to twice the memory of the selected rows while the concat runs; once the concat finishes, usage drops back down to just the selected rows.
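If the rows you want are defined by a condition on the data rather than a pre-computed list of index labels, the same pattern works with a boolean mask per chunk. A minimal sketch, assuming hypothetical column names ('id', 'value_a', 'value_b') and a hypothetical threshold condition:

import pandas as pd

# hypothetical column names; use whatever calculate_cols() gives you
wanted_cols = ['id', 'value_a', 'value_b']

reader = pd.read_csv('big_table.txt', sep='\t', header=0,
                     index_col='id', usecols=wanted_cols,
                     chunksize=10000)

# filter each chunk as it is read; only one chunk plus the kept rows
# are in memory at any point before the final concat
pieces = [chunk[chunk['value_a'] > 0] for chunk in reader]
df = pd.concat(pieces)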
I've had a similar problem; I solved it by filtering the data before loading. When you read the file with read_table you load the whole thing into a DataFrame, and possibly also hold the whole file in memory or duplicate parts of it because of the different dtypes involved, which is where the 6GB comes from.
You could write a generator that loads the contents of the file line by line. I assume the data is row-based, with one record per row and one row per line in big_table.txt, so:
def big_table_generator(filename):
    with open(filename, 'rt') as f:
        for line in f:
            if is_needed_row(line):  # check whether you want this row
                # cut_columns() returns a list with only the selected columns
                record = cut_columns(line)
                yield record
import pandas

gen = big_table_generator('big_table.txt')
df = pandas.DataFrame.from_records(list(gen))
Note the list(gen): pandas 0.12 and earlier don't accept generators, so you have to convert to a list, which puts all the data yielded by the generator into memory (0.13 will do the same thing internally). You also need roughly twice the memory of the data you keep: once for the loaded records and once again to put them into the pandas NDFrame structure.
You could also make the generator read from a compressed file; with Python 3.3 the gzip library only decompresses the chunks it needs.
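A minimal sketch of that variant, assuming a gzip-compressed copy of the table and the same hypothetical is_needed_row()/cut_columns() helpers as above:

import gzip

def big_gzip_table_generator(filename):
    # gzip.open in text mode decompresses the stream incrementally,
    # so only a small buffer is held in memory at any one time
    with gzip.open(filename, 'rt') as f:
        for line in f:
            if is_needed_row(line):
                yield cut_columns(line)

gen = big_gzip_table_generator('big_table.txt.gz')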