
Are Pandas' dataframes (Python) closer to R's dataframes or datatables? [closed]

To understand my question, I should first point out that R datatables aren't just R dataframes with syntactic sugar; there are important behavioral differences: column assignment/modification by reference in datatables avoids copying the whole object in memory (see the example in this Quora answer), which is what happens with dataframes.

I've found on multiple occasions that the speed and memory differences that arise from data.table's behavior are a crucial element that makes it possible to work with some big datasets, where it wouldn't be possible with data.frame's behavior.

Therefore, what I'm wondering is: in Python, how do Pandas' dataframes behave in this regard?

Bonus question: if Pandas' dataframes are closer to R dataframes than to R datatables, and have the same downside (a full copy of the object when assigning/modifying a column), is there a Python equivalent to R's data.table package?


EDIT (per comment request): code examples:

R dataframes:

# renaming a column
colnames(mydataframe)[1] <- "new_column_name"

R datatables:

# renaming a column
library(data.table)
setnames(mydatatable, 'old_column_name', 'new_column_name')

In Pandas:

mydataframe.rename(columns = {'old_column_name': 'new_column_name'}, inplace=True)
asked Dec 14 '17 17:12 by François M.


1 Answer

Pandas operates more like data.frame in this regard. You can check this using the memory_profiler package; here's an example of its use in the Jupyter notebook:

First define a program that will test this:

%%file df_memprofile.py
import numpy as np
import pandas as pd

def foo():
    x = np.random.rand(1000000, 5)
    y = pd.DataFrame(x, columns=list('abcde'))
    y.rename(columns = {'e': 'f'}, inplace=True)
    return y

Then load the memory profiler, then run and profile the function:

%load_ext memory_profiler
from df_memprofile import foo
%mprun -f foo foo()

I get the following output:

Filename: /Users/jakevdp/df_memprofile.py

Line #    Mem usage    Increment   Line Contents
================================================
     4     66.1 MiB     66.1 MiB   def foo():
     5    104.2 MiB     38.2 MiB       x = np.random.rand(1000000, 5)
     6    104.4 MiB      0.2 MiB       y = pd.DataFrame(x, columns=list('abcde'))
     7    142.6 MiB     38.2 MiB       y.rename(columns = {'e': 'f'}, inplace=True)
     8    142.6 MiB      0.0 MiB       return y

You can see a couple of things:

  1. When y is created, it is just a light wrapper around the original array, i.e. no data is copied (the increment is only 0.2 MiB).

  2. When the column in y is renamed, the entire data array is duplicated in memory (it's the same 38.2 MiB increment as when x is created in the first place).
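
The two observations above can also be checked directly instead of being inferred from memory totals. The following is a minimal sketch, assuming a single-dtype frame so that to_numpy() can return a view of the underlying data; shares_buffer is a hypothetical helper, not part of pandas. The answers may differ between pandas versions (for example when Copy-on-Write is enabled), so no expected output is shown.

```python
import numpy as np
import pandas as pd


def shares_buffer(frame, array):
    """Return True if the frame's values overlap the given array in memory."""
    # For a single-dtype frame, to_numpy() normally returns a view, not a copy.
    return bool(np.shares_memory(frame.to_numpy(), array))


x = np.random.rand(1000, 5)
y = pd.DataFrame(x, columns=list("abcde"))

# Observation 1: is y only a wrapper around x?
wrapped_without_copy = shares_buffer(y, x)

# Observation 2: does rename() hand back data that still lives in y's buffer?
renamed = y.rename(columns={"e": "f"})
rename_shares_data = shares_buffer(renamed, y.to_numpy())

print("DataFrame shares memory with the array:", wrapped_without_copy)
print("Renamed frame shares memory with the original:", rename_shares_data)
```

A result of False on the second line would correspond to the full copy seen in the profile above.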

So, unless I'm missing something, it appears that Pandas operates more like R's dataframes than R's data tables.
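
If memory_profiler is not installed, the standard library's tracemalloc module can serve as a rough substitute. Note that this is a different technique from %mprun: it reports the peak of traced Python-level allocations for a whole call, not process memory line by line. The sketch below assumes NumPy reports its buffers to tracemalloc (recent versions do); peak_allocation_mib and build_and_rename are hypothetical names, and the array is smaller than in the example above to keep it quick.

```python
import tracemalloc

import numpy as np
import pandas as pd


def peak_allocation_mib(func):
    """Run func() and return the peak traced allocation in MiB."""
    tracemalloc.start()
    try:
        func()
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak_bytes / 2**20


def build_and_rename():
    # Same steps as foo() above, on a smaller array.
    x = np.random.rand(100_000, 5)
    y = pd.DataFrame(x, columns=list("abcde"))
    y.rename(columns={"e": "f"}, inplace=True)
    return y


peak = peak_allocation_mib(build_and_rename)
print(f"Peak traced allocation: {peak:.1f} MiB")
```

Comparing the peak with and without the rename() line gives a coarse indication of whether the rename duplicated the data.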


Edit: Note that rename() has an argument copy that controls this behavior, and defaults to True. For example, using this:

y.rename(columns = {'e': 'f'}, inplace=True, copy=False)

... results in an in-place operation without copying data.

Alternatively, you can modify the columns attribute directly:

y.columns = ['a', 'b', 'c', 'd', 'f']
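
One way to convince yourself that direct assignment only swaps the labels is to compare the underlying values before and after. This is a minimal sketch, assuming a single-dtype frame so that to_numpy() can return a view; the variable names are illustrative only, and the memory-sharing result is printed without a stated expectation because it can depend on the pandas version.

```python
import numpy as np
import pandas as pd

y = pd.DataFrame(np.random.rand(1000, 5), columns=list("abcde"))
before = y.to_numpy()

# Replace the column labels only; the data itself is not reassigned.
y.columns = ["a", "b", "c", "d", "f"]
after = y.to_numpy()

same_buffer = bool(np.shares_memory(before, after))
values_unchanged = bool(np.array_equal(before, after))

print("Columns:", list(y.columns))
print("Values unchanged:", values_unchanged)
print("Same underlying buffer:", same_buffer)
```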
answered Oct 17 '22 09:10 by jakevdp