 

How to generate a Hash or checksum value on Python Dataframe (created from a fixed width file)?

I have two fixed-width files like the ones below (the only difference is the date value starting at position 14).

sample_hash1.txt

GOKULKRISHNA 04/17/2018
ABCDEFGHIJKL 04/17/2018
111111111111 04/17/2018

sample_hash2.txt

GOKULKRISHNA 04/16/2018
ABCDEFGHIJKL 04/16/2018
111111111111 04/16/2018

Using pandas read_fwf, I am reading each file and creating a DataFrame that excludes the date value by loading only the first 13 characters. So my DataFrames look like this.

import pandas as pd
df1 = pd.read_fwf("sample_hash1.txt", colspecs=[(0,13)])
df2 = pd.read_fwf("sample_hash2.txt", colspecs=[(0,13)])

df1

   GOKULKRISHNA
0  ABCDEFGHIJKL
1  111111111111

df2

   GOKULKRISHNA
0  ABCDEFGHIJKL
1  111111111111

Now I am trying to generate a hash value for each DataFrame, but the hashes are different even though the data is the same. I am not sure what is wrong here. Can someone throw some light on this, please? I have to identify whether there is any change in the data in the file (excluding the date column).

print(hash(df1.values.tostring()))
-3571422965125408226

print(hash(df2.values.tostring()))
5039867957859242153

I am loading these files (each file is around 2 GB) into a table. Every time, we receive full files from the source, and sometimes there is no change in the data except the date in the last column. My idea is to reject such files: if I can generate a hash of the file and store it somewhere (in a table), then next time I can compare the new file's hash with the stored hash. I thought this was the right approach, but I am stuck on the hash generation.

I checked the post "Most efficient property to hash for numpy array", but that is not what I am looking for.

goks asked Apr 17 '18 16:04



1 Answer

You can now use pd.util.hash_pandas_object:

hashlib.sha1(pd.util.hash_pandas_object(df).values).hexdigest() 
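A fuller version of that one-liner, as a minimal sketch rather than a definitive implementation. Two things probably explain the mismatch in the question: Python's built-in hash() of bytes is salted per interpreter process by default, and the raw bytes of an object-dtype array are memory pointers, not the strings themselves. A hashlib digest over the per-row hashes avoids both. The helper name frame_digest, the in-memory stand-ins for the two sample files, and the use of header=None and dtype=str are assumptions added here. header=None differs from the question's call; it is used because hash_pandas_object hashes values and the index but not column names, so a first line promoted to the header would not affect the digest.

```python
import hashlib
import io

import pandas as pd


def frame_digest(df: pd.DataFrame) -> str:
    """Return a hex SHA-1 digest of a DataFrame's values and index."""
    # One uint64 hash per row; index=True folds the index into each row hash.
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    # hashlib reads the uint64 array through the buffer protocol.
    return hashlib.sha1(row_hashes.values).hexdigest()


# In-memory stand-ins for sample_hash1.txt and sample_hash2.txt (assumption).
text1 = (
    "GOKULKRISHNA 04/17/2018\n"
    "ABCDEFGHIJKL 04/17/2018\n"
    "111111111111 04/17/2018\n"
)
text2 = text1.replace("04/17/2018", "04/16/2018")

# header=None keeps the first line as data; dtype=str keeps digits as text.
df1 = pd.read_fwf(io.StringIO(text1), colspecs=[(0, 13)], header=None, dtype=str)
df2 = pd.read_fwf(io.StringIO(text2), colspecs=[(0, 13)], header=None, dtype=str)

print(frame_digest(df1) == frame_digest(df2))
```

The digest is a plain hex string, so it can be stored in a table and compared against the digest of the next incoming file.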

For a dataframe with 50 million rows, this method took me 10 seconds versus over a minute for the to_json() method.
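For files of around 2 GB, an alternative that skips pandas entirely is to stream the file and feed only the non-date prefix of each line into a hashlib digest. This is a different technique from hash_pandas_object, sketched here under assumptions: the function name prefix_digest is hypothetical, the file is assumed to be single-byte encoded (so 13 bytes equals 13 characters), and SHA-256 is an arbitrary choice of digest.

```python
import hashlib


def prefix_digest(path: str, width: int = 13) -> str:
    """Hash the first `width` bytes of every line in a fixed-width file."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Iterating the file object reads one line at a time,
        # so memory use stays small regardless of file size.
        for line in fh:
            # Drop the line ending, keep only the leading columns.
            digest.update(line.rstrip(b"\r\n")[:width])
            # Explicit separator so line boundaries affect the digest.
            digest.update(b"\n")
    return digest.hexdigest()
```

Calling prefix_digest("sample_hash1.txt") and prefix_digest("sample_hash2.txt") would be expected to return the same string when the two files differ only in the date column, and different strings when any of the first 13 bytes of any line change.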

Roko Mijic answered Sep 29 '22 02:09