My goal is to get a unique hash value for a DataFrame, which I obtain from a .csv file. The whole point is to get the same hash each time I call hash() on it.
My idea was to create the function

def _get_array_hash(arr):
    arr_hashable = arr.values
    arr_hashable.flags.writeable = False
    hash_ = hash(arr_hashable.data)
    return hash_
which gets the underlying numpy array, sets it to an immutable state, and takes the hash of its buffer.
INLINE UPD.
As of 08.11.2016, this version of the function doesn't work anymore. Instead, you should use
hash(df.values.tobytes())
See the comments under Most efficient property to hash for numpy array.
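Note that in Python 3 the built-in hash() is salted per interpreter run for bytes, so hash(df.values.tobytes()) will not survive an application restart. A minimal sketch of a restart-proof variant, assuming a hashlib digest is acceptable (the helper name _get_df_digest is mine):

import hashlib
import pandas as pd

def _get_df_digest(df):
    # tobytes() copies the raw buffer, so no writeable-flag tricks are needed
    raw = df.values.tobytes()
    # hashlib digests, unlike the built-in hash(), are stable across interpreter runs
    return hashlib.sha256(raw).hexdigest()

df = pd.DataFrame({'A': [0], 'B': [1]})
print(_get_df_digest(df))  # same digest on every run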
END OF INLINE UPD.
It works for a regular pandas DataFrame:
In [12]: data = pd.DataFrame({'A': [0], 'B': [1]})

In [13]: _get_array_hash(data)
Out[13]: -5522125492475424165

In [14]: _get_array_hash(data)
Out[14]: -5522125492475424165
But when I apply it to a DataFrame obtained from a .csv file:
In [15]: fpath = 'foo/bar.csv'

In [16]: data_from_file = pd.read_csv(fpath)

In [17]: _get_array_hash(data_from_file)
Out[17]: 6997017925422497085

In [18]: _get_array_hash(data_from_file)
Out[18]: -7524466731745902730
Can somebody explain to me how that's possible?
I can create a new DataFrame out of it, like
new_data = pd.DataFrame(data=data_from_file.values, columns=data_from_file.columns, index=data_from_file.index)
and it works again:

In [25]: _get_array_hash(new_data)
Out[25]: -3546154109803008241

In [26]: _get_array_hash(new_data)
Out[26]: -3546154109803008241
But my goal is to preserve the same hash value for a DataFrame across application launches, in order to retrieve some value from a cache.
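Roughly, the intended cache usage looks like this (just a sketch; the digest argument stands for whatever stable hash I end up with, and the on-disk layout is made up for illustration):

import os
import pickle

CACHE_DIR = 'cache'

def get_or_compute(df, digest, compute):
    # look up a previously computed result by the DataFrame's digest;
    # a digest that is stable across launches is what makes the cache hit possible
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, digest + '.pkl')
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = compute(df)  # expensive work, done once per unique input
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result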
As of pandas 0.20.1, you can use the little-known (and poorly documented) hash_pandas_object (source code), which was recently made public in pandas.util. It returns one hash value for each row of the DataFrame (and works on Series etc. too):
import pandas as pd
import numpy as np

np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3, 4))
df = pd.DataFrame(arr)
print(df)
#      0    1   2    3
# 0   42  foo  42   42
# 1  foo  foo  42  bar
# 2   42   42  42   42

from pandas.util import hash_pandas_object
h = hash_pandas_object(df)
print(h)
# 0     5559921529589760079
# 1    16825627446701693880
# 2     7171023939017372657
# dtype: uint64
You can always do hash_pandas_object(df).sum() if you want an overall hash of all rows.
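Unlike the built-in hash() on bytes, hash_pandas_object is deterministic across interpreter runs, which makes it suitable for the cross-launch caching scenario in the question. A sketch of deriving a single stable key (df_cache_key is my own helper, and folding the row hashes through hashlib instead of sum() is my choice, purely to get a printable key):

import hashlib
import pandas as pd
from pandas.util import hash_pandas_object

def df_cache_key(df):
    # one uint64 per row, deterministic across runs
    row_hashes = hash_pandas_object(df)
    # condense the row hashes into a single hex string
    return hashlib.sha256(row_hashes.values.tobytes()).hexdigest()

df = pd.DataFrame({'A': [0], 'B': [1]})
print(df_cache_key(df))  # identical across application launches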