I have 2 fixed-width files like the ones below (the only change is the date value starting at position 14).
sample_hash1.txt
GOKULKRISHNA 04/17/2018
ABCDEFGHIJKL 04/17/2018
111111111111 04/17/2018
sample_hash2.txt
GOKULKRISHNA 04/16/2018
ABCDEFGHIJKL 04/16/2018
111111111111 04/16/2018
Using pandas read_fwf I am reading each file and creating a DataFrame, loading only the first 13 characters so the date value is excluded. So my dataframes look like this:
import pandas as pd
df1 = pd.read_fwf("sample_hash1.txt", colspecs=[(0,13)])
df2 = pd.read_fwf("sample_hash2.txt", colspecs=[(0,13)])
df1
GOKULKRISHNA
0 ABCDEFGHIJKL
1 111111111111
df2
GOKULKRISHNA
0 ABCDEFGHIJKL
1 111111111111
Now I am trying to generate a hash value for each dataframe, but the hashes are different. I am not sure what is wrong here. Can someone throw some light on this please? I have to identify whether there is any change in the data in the file (excluding the date column).
print(hash(df1.values.tostring()))
-3571422965125408226
print(hash(df2.values.tostring()))
5039867957859242153
I am loading these files (each around 2 GB) into a table. We receive full files from the source every time, and sometimes there is no change in the data (except the date in the last column). My idea is to reject such files. If I can generate a hash for the file and store it somewhere (in a table), next time I can compare the new file's hash value with the stored hash. I thought this was the right approach, but I am stuck on the hash generation.
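To make the intended check concrete, this is a rough sketch of what I am aiming for; frame_digest is only a placeholder for whatever stable hash I end up with, and the stored value would really come from a table:

import hashlib

def frame_digest(df):
    # placeholder digest; any stable, content-based hash of the frame would do
    return hashlib.sha256(df.to_csv(index=False).encode()).hexdigest()

stored_digest = frame_digest(df1)    # imagine this was saved in the table on the last load

if frame_digest(df2) == stored_digest:
    print("no change in data (ignoring the date) -> reject the file")
else:
    print("data changed -> load the file and store the new digest")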
I checked the post Most efficient property to hash for numpy array, but that is not what I am looking for.
You can now use pd.util.hash_pandas_object
hashlib.sha1(pd.util.hash_pandas_object(df).values).hexdigest()
For a dataframe with 50 million rows, this method took me 10 seconds versus over a minute for the to_json() method.
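Applied to the two dataframes from the question, a minimal sketch of the comparison could look like this (sha1 is just one choice of digest; any stable one works):

import hashlib
import pandas as pd

df1 = pd.read_fwf("sample_hash1.txt", colspecs=[(0, 13)])
df2 = pd.read_fwf("sample_hash2.txt", colspecs=[(0, 13)])

def frame_digest(df):
    # hash_pandas_object returns one uint64 per row; hashing those bytes
    # gives a single content-based digest for the whole frame
    return hashlib.sha1(pd.util.hash_pandas_object(df).values).hexdigest()

print(frame_digest(df1) == frame_digest(df2))   # True when the non-date data is identical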