I have 2 fixed-width files like the ones below (the only change is the date value starting at position 14).
sample_hash1.txt
GOKULKRISHNA 04/17/2018
ABCDEFGHIJKL 04/17/2018
111111111111 04/17/2018
sample_hash2.txt
GOKULKRISHNA 04/16/2018
ABCDEFGHIJKL 04/16/2018
111111111111 04/16/2018
Using pandas read_fwf I am reading each file and creating a DataFrame, loading only the first 13 characters so the date value is excluded. So my dataframes look like this:
import pandas as pd
df1 = pd.read_fwf("sample_hash1.txt", colspecs=[(0,13)])
df2 = pd.read_fwf("sample_hash2.txt", colspecs=[(0,13)])
df1
GOKULKRISHNA
0 ABCDEFGHIJKL
1 111111111111
df2
GOKULKRISHNA
0 ABCDEFGHIJKL
1 111111111111
Now I am trying to generate a hash value for each dataframe, but the hashes are different. I am not sure what is wrong here. Can someone throw some light on this please? I have to identify whether there is any change in the data in the file (excluding the date column).
print(hash(df1.values.tostring()))
-3571422965125408226
print(hash(df2.values.tostring()))
5039867957859242153
I am loading these files (each around 2 GB) into a table. We receive full files from the source every time, and sometimes there is no change in the data (except the date in the last column). My idea is to reject such files. If I can generate a hash for the file and store it somewhere (in a table), next time I can compare the new file's hash value with the stored hash. I thought this was the right approach, but I am stuck on the hash generation.
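To make the intended check concrete, this is a rough sketch of what I am aiming for; frame_digest is only a placeholder for whatever stable hash I end up with, and the stored value would really come from a table:

import hashlib

def frame_digest(df):
    # placeholder digest; any stable, content-based hash of the frame would do
    return hashlib.sha256(df.to_csv(index=False).encode()).hexdigest()

stored_digest = frame_digest(df1)    # imagine this was saved in the table on the last load

if frame_digest(df2) == stored_digest:
    print("no change in data (ignoring the date) -> reject the file")
else:
    print("data changed -> load the file and store the new digest")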
I checked the post Most efficient property to hash for numpy array, but that is not what I am looking for.
You can now use pd.util.hash_pandas_object
hashlib.sha1(pd.util.hash_pandas_object(df).values).hexdigest()
For a dataframe with 50 million rows, this method took me 10 seconds versus over a minute for the to_json() method.
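Applied to the two dataframes from the question, a minimal sketch of the comparison could look like this (sha1 is just one choice of digest; any stable one works):

import hashlib
import pandas as pd

df1 = pd.read_fwf("sample_hash1.txt", colspecs=[(0, 13)])
df2 = pd.read_fwf("sample_hash2.txt", colspecs=[(0, 13)])

def frame_digest(df):
    # hash_pandas_object returns one uint64 per row; hashing those bytes
    # gives a single content-based digest for the whole frame
    return hashlib.sha1(pd.util.hash_pandas_object(df).values).hexdigest()

print(frame_digest(df1) == frame_digest(df2))   # True when the non-date data is identical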