Overview of what I'm trying to do. I have encrypted versions of files that I need to read into pandas. For a couple of reasons it is much better to decrypt into a stream rather than a file, so that's my interest below although I also attempt to decrypt to a file just as an intermediate step (but this also isn't working).
I'm able to get this working for a csv, but not for either hdf or stata (I'd accept an answer that works for either hdf or stata, though the answer might be the same for both, which is why I'm combining in one question).
The code for encrypting/decrypting files is taken from another stackoverflow question (which I can't find at the moment).
import pandas as pd
import io
from Crypto import Random
from Crypto.Cipher import AES
def pad(s):
return s + b"\0" * (AES.block_size - len(s) % AES.block_size)
def encrypt(message, key, key_size=256):
message = pad(message)
iv = Random.new().read(AES.block_size)
cipher = AES.new(key, AES.MODE_CBC, iv)
return iv + cipher.encrypt(message)
def decrypt(ciphertext, key):
iv = ciphertext[:AES.block_size]
cipher = AES.new(key, AES.MODE_CBC, iv)
plaintext = cipher.decrypt(ciphertext[AES.block_size:])
return plaintext.rstrip(b"\0")
def encrypt_file(file_name, key):
with open(file_name, 'rb') as fo:
plaintext = fo.read()
enc = encrypt(plaintext, key)
with open(file_name + ".enc", 'wb') as fo:
fo.write(enc)
def decrypt_file(file_name, key):
with open(file_name, 'rb') as fo:
ciphertext = fo.read()
dec = decrypt(ciphertext, key)
with open(file_name[:-4], 'wb') as fo:
fo.write(dec)
And here's my attempt to extend the code to decrypt to a stream rather than a file.
def decrypt_stream(file_name, key):
with open(file_name, 'rb') as fo:
ciphertext = fo.read()
dec = decrypt(ciphertext, key)
cipherbyte = io.BytesIO()
cipherbyte.write(dec)
cipherbyte.seek(0)
return cipherbyte
Finally, here's the sample program with sample data attempting to make this work:
key = 'this is an example key'[:16]
df = pd.DataFrame({ 'x':[1,2], 'y':[3,4] })
df.to_csv('test.csv',index=False)
df.to_hdf('test.h5','test',mode='w')
df.to_stata('test.dta')
encrypt_file('test.csv',key)
encrypt_file('test.h5',key)
encrypt_file('test.dta',key)
decrypt_file('test.csv.enc',key)
decrypt_file('test.h5.enc',key)
decrypt_file('test.dta.enc',key)
# csv works here but hdf and stata don't
# I'm less interested in this part but include it for completeness
df_from_file = pd.read_csv('test.csv')
df_from_file = pd.read_hdf('test.h5','test')
df_from_file = pd.read_stata('test.dta')
# csv works here but hdf and stata don't
# the hdf and stata lines below are what I really need to get working
df_from_stream = pd.read_csv( decrypt_stream('test.csv.enc',key) )
df_from_stream = pd.read_hdf( decrypt_stream('test.h5.enc',key), 'test' )
df_from_stream = pd.read_stata( decrypt_stream('test.dta.enc',key) )
Unfortunately I don't think I can shrink this code anymore and still have a complete example.
Again, my hope would be to have all 4 non-working lines above working (file and stream for hdf and stata) but I'm happy to accept an answer that works for either the hdf stream alone or the stata stream alone.
Also, I'm open to other encryption alternatives, I just used some existing pycrypto-based code that I found here on SO. My work explicitly requires 256-bit AES but beyond that I'm open so this solution needn't be based specifically on the pycrypto library or the specific code example above.
Info on my setup:
python: 3.4.3
pandas: 0.17.0 (anaconda 2.3.0 distribution)
mac os: 10.11.3
The biggest issue is the padding/unpadding method. It assumes that the null character can't be part of the actual content. Since stata/hdf
files are binary, it's safer to pad using the number of extra bytes we use, encoded as a character. This number will be used during unpadding.
Also for this time being, read_hdf
doesn't support reading from a file like object, even if the API documentation claims so. If we restrict ourselves to the stata
format, the following code will perform what you need:
import pandas as pd
import io
from Crypto import Random
from Crypto.Cipher import AES
def pad(s):
n = AES.block_size - len(s) % AES.block_size
return s + n * chr(n)
def unpad(s):
return s[:-ord(s[-1])]
def encrypt(message, key, key_size=256):
message = pad(message)
iv = Random.new().read(AES.block_size)
cipher = AES.new(key, AES.MODE_CBC, iv)
return iv + cipher.encrypt(message)
def decrypt(ciphertext, key):
iv = ciphertext[:AES.block_size]
cipher = AES.new(key, AES.MODE_CBC, iv)
plaintext = cipher.decrypt(ciphertext[AES.block_size:])
return unpad(plaintext)
def encrypt_file(file_name, key):
with open(file_name, 'rb') as fo:
plaintext = fo.read()
enc = encrypt(plaintext, key)
with open(file_name + ".enc", 'wb') as fo:
fo.write(enc)
def decrypt_stream(file_name, key):
with open(file_name, 'rb') as fo:
ciphertext = fo.read()
dec = decrypt(ciphertext, key)
cipherbyte = io.BytesIO()
cipherbyte.write(dec)
cipherbyte.seek(0)
return cipherbyte
key = 'this is an example key'[:16]
df = pd.DataFrame({
'x': [1,2],
'y': [3,4]
})
df.to_stata('test.dta')
encrypt_file('test.dta', key)
print pd.read_stata(decrypt_stream('test.dta.enc', key))
Output:
index x y
0 0 1 3
1 1 2 4
In python 3 you can use the following pad
, unpad
versions:
def pad(s):
n = AES.block_size - len(s) % AES.block_size
return s + bytearray([n] * n)
def unpad(s):
return s[:-s[-1]]
What worked for me in the case of .h5
format and the cryptography
library was:
from cryptography.fernet import Fernet
def read_h5_file(new_file:str, decrypted: bytes, verbose=False):
with open(new_file, 'wb') as f:
f.write(decrypted)
print(f'Created {new_file}') if verbose else ''
df = pd.read_hdf(new_file)
os.remove(new_file)
print(f'Deleted {new_file}') if verbose else ''
return df
with open(path_to_file, 'rb') as f:
data = f.read()
fernet = Fernet(key)
decrypted = fernet.decrypt(data)
new_file = './example_path/example.h5'
df = read_h5_file(new_file, decrypted, verbose=verbose)
So I created a .h5
file. Read its content. Return it with the function. Delete the decrypted file again.
Maybe this approach helps, as I didn't find any other or similar solution on this online.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With