Splitting a large pandas DataFrame with minimal memory footprint

Tags: python, pandas, dataframe

I have a large DataFrame, which I would like to split into a test set and a train set for model building. However, I do not want to duplicate the DataFrame because I am reaching a memory limit.

Is there an operation, similar to pop but for a large segment, that will simultaneously remove a portion of the DataFrame and allow me to assign it to a new DataFrame? Something like this:

# Assume I have initialized a DataFrame (called "all") which contains my large dataset,
# with a boolean column called "test" which indicates whether a record should be used for
# testing.
print(len(all))
# 10000000
test = all.pop_large_segment(all["test"])  # not a real command, just a placeholder
print(len(all))
# 8000000
print(len(test))
# 2000000

asked Jun 26 '16 14:06

mgoldwasser

1 Answer

If you have room for one more column, you can add a column of random values and filter on it to pick your test rows. Here I used random integers that are either 0 or 1, which gives a roughly 50/50 split, but you could draw from a different distribution (for example, uniform values compared against a threshold) if you wanted a different proportion.

import numpy as np
import pandas as pd
df = pd.DataFrame({'one':[1,2,3,4,5,4,3,2,1], 'two':[6,7,8,9,10,9,8,7,6], 'three':[11,12,13,14,15,14,13,12,11]})
df['split'] = np.random.randint(0, 2, size=len(df))  # each row gets 0 or 1

Of course, that requires space for an entirely new column. If your data is very long, you may not have it.
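As a rough sketch of the "different proportion" variant and of the filtering step itself, something like the following might work. The `test_fraction` name, the 0.2 value, and the seed are assumptions chosen for illustration. Note that boolean indexing returns new DataFrames, so both halves exist alongside the original until `df` is deleted.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one': [1, 2, 3, 4, 5, 4, 3, 2, 1],
    'two': [6, 7, 8, 9, 10, 9, 8, 7, 6],
    'three': [11, 12, 13, 14, 15, 14, 13, 12, 11],
})

test_fraction = 0.2  # hypothetical proportion, chosen for illustration
rng = np.random.default_rng(0)  # seeded so the split is reproducible

# Uniform draws in [0, 1); rows whose draw falls below the threshold are test rows.
df['split'] = rng.uniform(0, 1, size=len(df)) < test_fraction

# Filter on the helper column, then drop it from each half.
test = df[df['split']].drop(columns='split')
train = df[~df['split']].drop(columns='split')
```

With a threshold like this, the test share is only approximately `test_fraction`, since each row is assigned independently.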

Another option works if, for example, your data is in CSV format and you know the number of rows. Pick a random set of row numbers, similar to the above, and pass it to the skiprows argument of pandas read_csv():

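import numpy as np
import pandas as pd

num_rows = 100000  # number of data rows, not counting the header line
# Line 0 of the file is the header, so the data rows are lines 1..num_rows.
all_rows = list(range(1, num_rows + 1))

# Lines to skip on the first read; the rows left over form the training set.
some = np.random.choice(all_rows, replace=False, size=num_rows // 2)
some.sort()
trainer_df = pd.read_csv(path, skiprows=some)  # path: location of the CSV file

# Skip the complementary lines on the second read to load the other half.
some_set = set(some.tolist())  # set membership keeps the loop fast
rest = [i for i in all_rows if i not in some_set]
rest.sort()
df = pd.read_csv(path, skiprows=rest)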

It's a little clunky up front, especially with the loop in the list comprehension, and building those lists in memory is unfortunate, but it should still be better memory-wise than creating an entire copy of half the data.

To make it even more memory friendly you could load the trainer subset, train the model, then overwrite the training dataframe with the rest of the data, then apply the model. You'll be stuck carrying some and rest around, but you'll never have to load both halves of the data at the same time.
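The load, train, overwrite, apply sequence could be sketched roughly as below. This is an illustration under stated assumptions, not a definitive implementation: the in-memory CSV, the `held_out` name, and the straight-line fit standing in for a model are all made up for the example. It also passes a callable to `skiprows`, which `read_csv()` accepts, so a single set of line numbers serves both reads.

```python
import io

import numpy as np
import pandas as pd

# Tiny in-memory CSV standing in for the real file (illustrative data, y = 2x).
num_rows = 10
csv_text = "x,y\n" + "\n".join(f"{i},{2 * i}" for i in range(num_rows))

# File line numbers held out for testing. Line 0 is the header,
# so the data rows are lines 1..num_rows.
rng = np.random.default_rng(0)
picked = rng.choice(np.arange(1, num_rows + 1), size=num_rows // 2, replace=False)
held_out = set(picked.tolist())

# Pass 1: skip the held-out lines, so only the training half is loaded.
data = pd.read_csv(io.StringIO(csv_text), skiprows=lambda i: i in held_out)
train_len = len(data)

# Stand-in "model": a straight-line fit (an assumption for illustration).
slope, intercept = np.polyfit(data["x"].to_numpy(), data["y"].to_numpy(), deg=1)

# Pass 2: overwrite the same variable with the held-out half.
# The header (line 0) is kept; every other line not held out is skipped.
data = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=lambda i: i > 0 and i not in held_out,
)
predictions = slope * data["x"].to_numpy() + intercept
```

Because `data` is reassigned, the training half becomes eligible for garbage collection before the second read finishes building its result, and only the set of held-out line numbers has to stay in memory between the two passes.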

answered Sep 29 '22 13:09

Jeff
