
Does pandas groupby pass by reference or value?

Let's say I have a pandas DataFrame data and I'd like to split it by a certain column, col, using the following function:

def split_by_column(data, column):
    chunk_list = [(k, g) for k, g in data.groupby(column)]
    return dict(chunk_list)


collection = split_by_column(data, 'col')

This way I can easily access and apply functions to this collection later.
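As a rough illustration of that later use, the sketch below builds the dictionary from a small made-up frame and applies a function to each chunk. The frame contents and the column name value are assumptions for the example and are not taken from the actual data.

```python
import pandas as pd

# Hypothetical toy data; the real frame and column names will differ.
data = pd.DataFrame({"col": ["x", "y", "x", "y"], "value": [1, 2, 3, 4]})


def split_by_column(data, column):
    # Iterating over a GroupBy object yields (key, sub-frame) pairs.
    return {key: group for key, group in data.groupby(column)}


collection = split_by_column(data, "col")

# Look up one chunk by its key, or apply a function to every chunk.
chunk_x = collection["x"]
totals = {key: int(chunk["value"].sum()) for key, chunk in collection.items()}
print(totals)
```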

If, for instance, I have an object that holds both data and collection as instance variables, do I have two separate copies of the data in memory, or does the dictionary contain references to the corresponding chunks in data?

asked Nov 15 '22 19:11 by signalfel


1 Answer

I tried this:

import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [6, 9, 8, 9]})
print('data initial:', data)

def split_by_column(data, column):
    # Collect each (key, sub-frame) pair produced by the groupby.
    chunk_list = [(k, g) for k, g in data.groupby(column)]
    return dict(chunk_list)

collection = split_by_column(data, 'b')
print('collection initial:', collection)

Output is:

data initial:    a  b
0  1  6
1  2  9
2  3  8
3  4  9
collection initial: {6:    a  b
0  1  6, 8:    a  b
2  3  8, 9:    a  b
1  2  9
3  4  9}

If I change data now by

data.at[3,'a']=5

and print data and collection again, the output is this:

data new:    a  b
0  1  6
1  2  9
2  3  8
3  5  9
collection new: {6:    a  b
0  1  6, 8:    a  b
2  3  8, 9:    a  b
1  2  9
3  4  9}
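
The same comparison can be written as assertions instead of being read off the printed output. This is only a sketch of the experiment in this answer: it repeats the same frame and the same edit, and it checks one direction only (changing data after the split).

```python
import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [6, 9, 8, 9]})

# Same split as in the answer: one sub-frame per distinct value of 'b'.
collection = {k: g for k, g in data.groupby('b')}

# Edit the original frame after the split.
data.at[3, 'a'] = 5

# The original frame sees the new value ...
assert data.at[3, 'a'] == 5
# ... while the chunk stored in the dict still holds the old one.
assert collection[9].at[3, 'a'] == 4
print(collection[9])
```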

Since I am also just starting to explore pandas, I cannot tell you what the underlying mechanisms are. But since the value 5 appears only in the DataFrame and not in the dict, I conclude that you have two separate copies of your data.
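
If the second copy is a concern, one possible workaround (an assumption on my part, not something tested against the original use case) is to keep only the row labels of each group and select the rows from data when they are needed. The sketch below uses the groups attribute of the GroupBy object, which maps each key to the index labels of its rows. The helper name get_chunk is made up for the example.

```python
import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [6, 9, 8, 9]})

# Maps each group key to the index labels of its rows; no row data is stored.
labels_by_key = data.groupby('b').groups


def get_chunk(frame, labels_by_key, key):
    # Hypothetical helper: select the rows of one group on demand.
    return frame.loc[labels_by_key[key]]


data.at[3, 'a'] = 5

# The chunk is built from the current state of data, so it sees the edit.
chunk = get_chunk(data, labels_by_key, 9)
assert chunk['a'].tolist() == [2, 5]
```

Each lookup still produces a new frame, but it is created on demand rather than kept alongside data. The stored labels become stale if the index of data or the values in the grouping column change afterwards.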

I hope this is helpful. Best, Lepakk

answered Dec 19 '22 11:12 by Lepakk