I have a large DataFrame, which I would like to split into a test set and a train set for model building. However, I do not want to duplicate the DataFrame because I am reaching a memory limit.
Is there an operation, similar to pop but for a large segment, that will simultaneously remove a portion of the DataFrame and allow me to assign it to a new DataFrame? Something like this:
# Assume I have initialized a DataFrame (called "all") which contains my large dataset,
# with a boolean column called "test" which indicates whether a record should be used for
# testing.
print len(all)
# 10000000
test = all.pop_large_segment(all[test]) # not a real command, just a place holder
print len(all)
# 8000000
print len(test)
# 2000000
Ways to optimize memory in Pandas Instead, we can downcast the data types. Simply Convert the int64 values as int8 and float64 as float8. This will reduce memory usage. By converting the data types without any compromises we can directly cut the memory usage to near half.
Using the iloc() function to split DataFrame in Python We can use the iloc() function to slice DataFrames into smaller DataFrames. The iloc() function allows us to access elements based on the index of rows and columns. Using this function, we can split a DataFrame based on rows or columns.
Use efficient datatypesThe default pandas data types are not the most memory efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as “low-cardinality” data). By using more efficient data types, you can store larger datasets in memory.
If you have the space to add one more column, you could add one with a random value that you could then filter on for your testing. Here I used uniform between 0 and 1, but you could use anything if you wanted a different proportion.
df = pd.DataFrame({'one':[1,2,3,4,5,4,3,2,1], 'two':[6,7,8,9,10,9,8,7,6], 'three':[11,12,13,14,15,14,13,12,11]})
df['split'] = np.random.randint(0, 2, size=len(df))
Of course that requires you have space to add an entirely new column - especially if your data is very long, maybe you don't.
Another option would work, for example, if your data was in csv format and you knew the number of rows. Do similar to the above with the randomint
, but pass that list into the skiprows
argument of Pandas read_csv()
:
num_rows = 100000
all = range(num_rows)
some = np.random.choice(all, replace=False, size=num_rows/2)
some.sort()
trainer_df = pd.read_csv(path, skiprows=some)
rest = [i for i in all if i not in some]
rest.sort()
df = pd.read_csv(path, skiprows=rest)
It's a little clunky up front, especially with the loop in the list comprehension, and creating those lists in memory is unfortunate, but it should still be better memory-wide than just creating an entire copy of half the data.
To make it even more memory friendly you could load the trainer subset, train the model, then overwrite the training dataframe with the rest of the data, then apply the model. You'll be stuck carrying some
and rest
around, but you'll never have to load both halves of the data at the same time.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With