
Get the same hash value for a Pandas DataFrame each time

Tags: python, pandas

My goal is to get a unique hash value for a DataFrame that I load from a .csv file. The whole point is to get the same hash each time I call hash() on it.

My idea was to create the function

    def _get_array_hash(arr):
        arr_hashable = arr.values
        arr_hashable.flags.writeable = False
        hash_ = hash(arr_hashable.data)
        return hash_

which takes the underlying numpy array, marks it as read-only, and hashes its buffer.

INLINE UPD.

As of 08.11.2016, this version of the function no longer works. Instead, use

    hash(df.values.tobytes())

See the comments on "Most efficient property to hash for numpy array".

END OF INLINE UPD.

It works for a regular pandas DataFrame:

    In [12]: data = pd.DataFrame({'A': [0], 'B': [1]})

    In [13]: _get_array_hash(data)
    Out[13]: -5522125492475424165

    In [14]: _get_array_hash(data)
    Out[14]: -5522125492475424165

But then I try to apply it to a DataFrame obtained from a .csv file:

    In [15]: fpath = 'foo/bar.csv'

    In [16]: data_from_file = pd.read_csv(fpath)

    In [17]: _get_array_hash(data_from_file)
    Out[17]: 6997017925422497085

    In [18]: _get_array_hash(data_from_file)
    Out[18]: -7524466731745902730

Can somebody explain how that is possible?

I can create a new DataFrame out of it, like

    new_data = pd.DataFrame(data=data_from_file.values,
                            columns=data_from_file.columns,
                            index=data_from_file.index)

and it works again:

    In [25]: _get_array_hash(new_data)
    Out[25]: -3546154109803008241

    In [26]: _get_array_hash(new_data)
    Out[26]: -3546154109803008241

But my goal is to preserve the same hash value for a dataframe across application launches in order to retrieve some value from cache.

mkurnikov asked Jul 22 '15 15:07



1 Answer

As of Pandas 0.20.1, you can use the little-known (and poorly documented) hash_pandas_object, which was recently made public in pandas.util. It returns one hash value for each row of the DataFrame (and works on Series etc. too).

    import pandas as pd
    import numpy as np

    np.random.seed(42)
    arr = np.random.choice(['foo', 'bar', 42], size=(3,4))
    df = pd.DataFrame(arr)

    print(df)
    #      0    1   2    3
    # 0   42  foo  42   42
    # 1  foo  foo  42  bar
    # 2   42   42  42   42

    from pandas.util import hash_pandas_object
    h = hash_pandas_object(df)

    print(h)
    # 0     5559921529589760079
    # 1    16825627446701693880
    # 2     7171023939017372657
    # dtype: uint64

You can always do hash_pandas_object(df).sum() if you want an overall hash of all rows.
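One caveat with .sum(): a sum does not depend on row order, so two frames holding the same rows (with the same index labels) in a different order would get the same overall value. If order should matter, a possible variant is to feed the per-row hashes into hashlib instead. The following is only a minimal sketch: frame_digest is a hypothetical helper name, not part of pandas, and the choice of SHA-256 is an assumption. As far as I can tell, the row hashes cover values and index but not column labels.

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object


def frame_digest(df):
    """Hypothetical helper: one order-sensitive hex digest for a whole frame."""
    # One uint64 hash per row; index=True also mixes the index labels in.
    row_hashes = hash_pandas_object(df, index=True)
    # Hash the raw bytes of the row hashes, so their order affects the result.
    return hashlib.sha256(row_hashes.values.tobytes()).hexdigest()


df = pd.DataFrame({'A': [0, 1], 'B': ['x', 'y']})

# A copy with identical content gives the same digest.
same = frame_digest(df) == frame_digest(df.copy())
# Reversing the row order changes the digest.
reordered = frame_digest(df) == frame_digest(df.iloc[::-1])
print(same, reordered)
```

The hex string can then be used as a cache key. It avoids the built-in hash(), whose results for str and bytes are salted per process in Python 3 unless PYTHONHASHSEED is set.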

Jonathan Stray answered Sep 19 '22 04:09