To understand my question, I should first point out that R data.tables aren't just R data.frames with syntactic sugar; there are important behavioral differences: column assignment/modification by reference in data.tables avoids copying the whole object in memory (see the example in this Quora answer), which is what happens with data.frames.
I've found on multiple occasions that the speed and memory differences arising from data.table's behavior are a crucial element that makes it possible to work with some big datasets that would be out of reach with data.frame's behavior.
Therefore, what I'm wondering is: in Python, how do Pandas' dataframes behave in this regard?
Bonus question: if Pandas' dataframes are closer to R dataframes than to R datatables, and have the same downside (a full copy of the object when assigning/modifying a column), is there a Python equivalent to R's data.table package?
EDIT per comment request: code examples:
R dataframes:
# renaming a column
colnames(mydataframe)[1] <- "new_column_name"
R datatables:
# renaming a column
library(data.table)
setnames(mydatatable, 'old_column_name', 'new_column_name')
In Pandas:
mydataframe.rename(columns = {'old_column_name': 'new_column_name'}, inplace=True)
Pandas operates more like data.frame in this regard. You can check this using the memory_profiler package; here's an example of its use in the Jupyter notebook:
First define a program that will test this:
%%file df_memprofile.py
import numpy as np
import pandas as pd

def foo():
    x = np.random.rand(1000000, 5)
    y = pd.DataFrame(x, columns=list('abcde'))
    y.rename(columns = {'e': 'f'}, inplace=True)
    return y
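(Aside: if you're not working in Jupyter, memory_profiler can produce the same line-by-line report from a plain script via its profile decorator. A minimal sketch, where df_memprofile_cli.py is just an illustrative filename:)

# df_memprofile_cli.py - sketch of profiling the same function outside the notebook
import numpy as np
import pandas as pd
from memory_profiler import profile

@profile  # prints a line-by-line memory report when foo() runs
def foo():
    x = np.random.rand(1000000, 5)              # ~38 MiB of float64 data
    y = pd.DataFrame(x, columns=list('abcde'))  # wrap the array in a DataFrame
    y.rename(columns={'e': 'f'}, inplace=True)  # rename one column
    return y

if __name__ == '__main__':
    foo()

Running python df_memprofile_cli.py then prints the report directly to the terminal.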
Then load the memory profiler and run + profile the function:
%load_ext memory_profiler
from df_memprofile import foo
%mprun -f foo foo()
I get the following output:
Filename: /Users/jakevdp/df_memprofile.py
Line #    Mem usage    Increment   Line Contents
================================================
     4     66.1 MiB     66.1 MiB   def foo():
     5    104.2 MiB     38.2 MiB       x = np.random.rand(1000000, 5)
     6    104.4 MiB      0.2 MiB       y = pd.DataFrame(x, columns=list('abcde'))
     7    142.6 MiB     38.2 MiB       y.rename(columns = {'e': 'f'}, inplace=True)
     8    142.6 MiB      0.0 MiB       return y
You can see a couple of things:

When y is created, it is just a light wrapper around the original array: no data is copied.

When the column in y is renamed, it results in duplication of the entire data array in memory (the same ~38 MiB increment as when x was created in the first place).
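To double-check that interpretation, here is a small sketch of my own (assuming a pandas version without copy-on-write, i.e. the behavior profiled above) that uses np.shares_memory to test whether the DataFrame still views the original array:

import numpy as np
import pandas as pd

x = np.random.rand(1000000, 5)
y = pd.DataFrame(x, columns=list('abcde'))
print(np.shares_memory(x, y.values))   # True: the DataFrame views x's buffer

y.rename(columns={'e': 'f'}, inplace=True)
print(np.shares_memory(x, y.values))   # False: rename duplicated the data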
So, unless I'm missing something, it appears that Pandas operates more like R's data.frames than R's data.tables.
Edit: Note that rename() has a copy argument that controls this behavior, and it defaults to True. For example, using this:
y.rename(columns = {'e': 'f'}, inplace=True, copy=False)
... results in an in-place operation without copying data.
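If you want to confirm that no copy happened, the same np.shares_memory check as above (again assuming a pre-copy-on-write pandas) should now report that the buffer is still shared:

y = pd.DataFrame(x, columns=list('abcde'))
y.rename(columns={'e': 'f'}, inplace=True, copy=False)
print(np.shares_memory(x, y.values))   # expected True: no data was copied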
Alternatively, you can modify the columns attribute directly:
y.columns = ['a', 'b', 'c', 'd', 'f']
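As far as I can tell, this just swaps out the column Index object and leaves the underlying data blocks untouched; the same check should confirm it:

y = pd.DataFrame(x, columns=list('abcde'))
y.columns = ['a', 'b', 'c', 'd', 'f']   # replaces only the Index, not the data
print(np.shares_memory(x, y.values))    # expected True: data buffer unchanged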