Let's say I have a pandas DataFrame data and I'd like to split it by a certain column, col, according to:
def split_by_column(data, column):
    chunk_list = [(k, g) for k, g in data.groupby(column)]
    return dict(chunk_list)

collection = split_by_column(data, 'col')
This way I can easily access and apply functions to this collection later.
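As a small illustration of that kind of later access, here is a sketch that looks up one chunk by its key and applies a function to every chunk. The sample frame and the choice of a per-chunk mean are assumptions made only for the example.

```python
import pandas as pd


def split_by_column(data, column):
    # Map each distinct value of `column` to the sub-frame of matching rows.
    return {key: group for key, group in data.groupby(column)}


# Hypothetical sample data, used only for illustration.
data = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [6, 9, 8, 9]})
collection = split_by_column(data, 'b')

# Access a single chunk by its key.
chunk_for_9 = collection[9]

# Apply a function to every chunk, here the mean of column 'a'.
means = {key: float(group['a'].mean()) for key, group in collection.items()}
print(means)
```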
If, for instance, I have an object which has both data and collection as instance variables, do I have two separate copies of the data in memory, or does the dictionary contain references to the appropriate chunks in data?
I tried this:
import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [6, 9, 8, 9]})
print('data initial:', data)

def split_by_column(data, column):
    # Build a dict mapping each group key to its sub-DataFrame.
    chunk_list = [(k, g) for k, g in data.groupby(column)]
    return dict(chunk_list)

collection = split_by_column(data, 'b')
print('collection initial:', collection)
Output is:
data initial:    a  b
0  1  6
1  2  9
2  3  8
3  4  9
collection initial: {6:    a  b
0  1  6, 8:    a  b
2  3  8, 9:    a  b
1  2  9
3  4  9}
If I change data now by
data.at[3,'a']=5
and print data and collection again, the output is this:
data new:    a  b
0  1  6
1  2  9
2  3  8
3  5  9
collection new: {6:    a  b
0  1  6, 8:    a  b
2  3  8, 9:    a  b
1  2  9
3  4  9}
Since I am also just starting to explore pandas, I cannot tell you what the underlying mechanisms are. But since the value 5 appears only in the DataFrame and not in the dict, I conclude that the chunks in the dict are separate copies, so you have two copies of your data in memory.
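If you want to check this more directly, here is a minimal sketch under a few assumptions. It uses the same sample frame as before and np.shares_memory from NumPy to ask whether a chunk's column overlaps the original column's buffer. The result may depend on your pandas version and on how the data happens to be ordered, so treat it as a quick probe rather than a guarantee.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [6, 9, 8, 9]})
collection = {key: group for key, group in data.groupby('b')}

# Check 1: does any chunk's column 'a' overlap the original column's buffer?
shares = {
    key: np.shares_memory(data['a'].to_numpy(), group['a'].to_numpy())
    for key, group in collection.items()
}
print(shares)

# Check 2: change the original and see whether the chunk follows.
data.at[3, 'a'] = 5
print(data.loc[3, 'a'])           # the modified value in the original
print(collection[9].loc[3, 'a'])  # still 4 in the run shown above
```

If the chunks were views on data, the second check would print 5 twice.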
I hope this is helpful for you. Best, lepakk