Get the same hash value for a Pandas DataFrame each time

My goal is to get a unique hash value for a DataFrame that I load from a .csv file. The whole point is to get the same hash each time I call hash() on it.

My idea was to write the function

    def _get_array_hash(arr):
        arr_hashable = arr.values
        arr_hashable.flags.writeable = False
        hash_ = hash(arr_hashable.data)
        return hash_

which takes the underlying NumPy array, marks it as immutable, and hashes its buffer.

INLINE UPD.

As of 08.11.2016, this version of the function doesn't work anymore. Instead, you should use

hash(df.values.tobytes())

See comments for the Most efficient property to hash for numpy array.
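One caveat about the one-liner above: in Python 3 the built-in hash() of a bytes object is salted per process (unless PYTHONHASHSEED is fixed), so its value generally changes between application launches. A minimal sketch of a launch-independent alternative, assuming a purely numeric DataFrame, is to feed the same bytes to hashlib instead. The helper name stable_array_digest is hypothetical, and the digest ignores the index and column labels.

```python
import hashlib

import numpy as np
import pandas as pd


def stable_array_digest(df: pd.DataFrame) -> str:
    """Hex digest of a numeric DataFrame's values (hypothetical helper)."""
    # Make sure the buffer is contiguous before dumping it to bytes.
    arr = np.ascontiguousarray(df.to_numpy())
    if arr.dtype == object:
        # For object arrays the raw buffer holds object references,
        # not the values themselves, so hashing it is not meaningful.
        raise TypeError("object-dtype data is not supported by this sketch")
    # sha256 is not salted, so the digest does not depend on the process.
    return hashlib.sha256(arr.tobytes()).hexdigest()


data = pd.DataFrame({'A': [0], 'B': [1]})
print(stable_array_digest(data) == stable_array_digest(data.copy()))
```

Only the raw values go into the digest, so two frames with the same bytes but a different shape or dtype interpretation could collide; mixing df.shape and the dtype string into the hash would be one way to tighten that.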

END OF INLINE UPD.

It works for a regular pandas DataFrame:

    In [12]: data = pd.DataFrame({'A': [0], 'B': [1]})

    In [13]: _get_array_hash(data)
    Out[13]: -5522125492475424165

    In [14]: _get_array_hash(data)
    Out[14]: -5522125492475424165

But then I try to apply it to a DataFrame obtained from a .csv file:

    In [15]: fpath = 'foo/bar.csv'

    In [16]: data_from_file = pd.read_csv(fpath)

    In [17]: _get_array_hash(data_from_file)
    Out[17]: 6997017925422497085

    In [18]: _get_array_hash(data_from_file)
    Out[18]: -7524466731745902730

Can somebody explain to me how that's possible?

I can create a new DataFrame out of it, like

    new_data = pd.DataFrame(data=data_from_file.values,
                            columns=data_from_file.columns,
                            index=data_from_file.index)

and it works again:

    In [25]: _get_array_hash(new_data)
    Out[25]: -3546154109803008241

    In [26]: _get_array_hash(new_data)
    Out[26]: -3546154109803008241

But my goal is to preserve the same hash value for a dataframe across application launches in order to retrieve some value from cache.

As of pandas 0.20.1, you can use the little-known (and poorly documented) hash_pandas_object, which was recently made public in pandas.util. It returns one hash value for each row of the DataFrame (and works on Series etc. too).

    import pandas as pd
    import numpy as np

    np.random.seed(42)
    arr = np.random.choice(['foo', 'bar', 42], size=(3,4))
    df = pd.DataFrame(arr)

    print(df)
    #      0    1   2    3
    # 0   42  foo  42   42
    # 1  foo  foo  42  bar
    # 2   42   42  42   42

    from pandas.util import hash_pandas_object
    h = hash_pandas_object(df)

    print(h)
    # 0     5559921529589760079
    # 1    16825627446701693880
    # 2     7171023939017372657
    # dtype: uint64

You can always do hash_pandas_object(df).sum() if you want an overall hash of all rows.
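Note that a plain sum does not depend on row order, so two frames with the same rows in a different order get the same total. If a single order-sensitive digest is wanted as a cache key, one possible sketch is to pass the per-row hashes through hashlib. The helper name dataframe_digest is hypothetical, and column labels are not part of the result because hash_pandas_object only hashes the values and, optionally, the index.

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object


def dataframe_digest(df: pd.DataFrame) -> str:
    """Single hex digest for a DataFrame (hypothetical helper)."""
    # One uint64 per row; index=True also mixes the index into each row hash.
    row_hashes = hash_pandas_object(df, index=True)
    # Hash the row hashes in order, so reordering rows changes the digest.
    return hashlib.sha256(row_hashes.values.tobytes()).hexdigest()


df = pd.DataFrame({'A': [0, 1], 'B': ['x', 'y']})
print(dataframe_digest(df) == dataframe_digest(df.copy()))
```

hash_pandas_object uses a fixed default hash_key, so the row hashes, and therefore this digest, should stay the same across application launches as long as the data and the pandas hashing implementation do not change.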

Tags:

python

pandas

mkurnikov

1 Answers

Jonathan Stray
