
Decrypting a file to a stream and reading the stream into pandas (hdf or stata)

Overview of what I'm trying to do: I have encrypted files that I need to read into pandas. For a couple of reasons it is much better to decrypt into a stream than to a file, so the stream is my main interest below. I also attempt to decrypt to a file as an intermediate step, but that isn't working either.

I can get this working for csv, but not for hdf or stata. I'd accept an answer that works for either hdf or stata; the answer might well be the same for both, which is why I'm combining them in one question.

The code for encrypting/decrypting files is taken from another Stack Overflow question (which I can't find at the moment).

import pandas as pd
import io
from Crypto import Random
from Crypto.Cipher import AES

def pad(s):
    return s + b"\0" * (AES.block_size - len(s) % AES.block_size)

def encrypt(message, key, key_size=256):
    message = pad(message)
    iv = Random.new().read(AES.block_size)
    cipher = AES.new(key, AES.MODE_CBC, iv)
    return iv + cipher.encrypt(message)

def decrypt(ciphertext, key):
    iv = ciphertext[:AES.block_size]
    cipher = AES.new(key, AES.MODE_CBC, iv)
    plaintext = cipher.decrypt(ciphertext[AES.block_size:])
    return plaintext.rstrip(b"\0")

def encrypt_file(file_name, key):
    with open(file_name, 'rb') as fo:
        plaintext = fo.read()
    enc = encrypt(plaintext, key)
    with open(file_name + ".enc", 'wb') as fo:
        fo.write(enc)

def decrypt_file(file_name, key):
    with open(file_name, 'rb') as fo:
        ciphertext = fo.read()
    dec = decrypt(ciphertext, key)
    with open(file_name[:-4], 'wb') as fo:
        fo.write(dec)

And here's my attempt to extend the code to decrypt to a stream rather than a file.

def decrypt_stream(file_name, key):
    with open(file_name, 'rb') as fo:
        ciphertext = fo.read()
    dec = decrypt(ciphertext, key)
    cipherbyte = io.BytesIO()
    cipherbyte.write(dec)
    cipherbyte.seek(0)
    return cipherbyte 

Finally, here's the sample program with sample data attempting to make this work:

key = 'this is an example key'[:16]
df = pd.DataFrame({ 'x':[1,2], 'y':[3,4] })

df.to_csv('test.csv',index=False)
df.to_hdf('test.h5','test',mode='w')
df.to_stata('test.dta')

encrypt_file('test.csv',key)
encrypt_file('test.h5',key)
encrypt_file('test.dta',key)

decrypt_file('test.csv.enc',key)
decrypt_file('test.h5.enc',key)
decrypt_file('test.dta.enc',key)

# csv works here but hdf and stata don't
# I'm less interested in this part but include it for completeness
df_from_file = pd.read_csv('test.csv')
df_from_file = pd.read_hdf('test.h5','test')
df_from_file = pd.read_stata('test.dta')

# csv works here but hdf and stata don't
# the hdf and stata lines below are what I really need to get working
df_from_stream = pd.read_csv( decrypt_stream('test.csv.enc',key) )
df_from_stream = pd.read_hdf( decrypt_stream('test.h5.enc',key), 'test' )
df_from_stream = pd.read_stata( decrypt_stream('test.dta.enc',key) )

Unfortunately I don't think I can shrink this code anymore and still have a complete example.

Again, my hope would be to have all 4 non-working lines above working (file and stream for hdf and stata) but I'm happy to accept an answer that works for either the hdf stream alone or the stata stream alone.

Also, I'm open to other encryption alternatives; I just used some existing pycrypto-based code that I found here on SO. My work explicitly requires 256-bit AES, but beyond that I'm flexible, so the solution needn't be based on the pycrypto library or the specific code example above.

Info on my setup:

python: 3.4.3
pandas: 0.17.0 (anaconda 2.3.0 distribution)
mac os: 10.11.3
JohnE asked Sep 25 '22 14:09



2 Answers

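The biggest issue is the padding/unpadding method. Padding with null bytes and stripping them afterwards assumes that the real content never ends in a null byte. Stata and hdf files are binary, so that assumption does not hold, and rstrip can remove bytes that belong to the file. It's safer to pad with n bytes that each hold the value n, where n is the number of padding bytes added (this is the PKCS#7 scheme). The last byte then tells unpad exactly how many bytes to remove.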
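The failure is easy to reproduce without any encryption. The sketch below (Python 3; the helper names and the hard-coded 16-byte block size are chosen here for illustration and are not part of the question's code) pads and unpads a short binary payload both ways:

```python
BLOCK_SIZE = 16  # AES block size in bytes (hard-coded for this illustration)

def pad_null(data):
    # Scheme from the question: fill up to the block boundary with null bytes
    return data + b"\0" * (BLOCK_SIZE - len(data) % BLOCK_SIZE)

def unpad_null(data):
    # Strips ALL trailing null bytes, including ones that belong to the data
    return data.rstrip(b"\0")

def pad_count(data):
    # Append n bytes, each with value n
    n = BLOCK_SIZE - len(data) % BLOCK_SIZE
    return data + bytes([n]) * n

def unpad_count(data):
    # The last byte says how many padding bytes to remove
    return data[:-data[-1]]

payload = b"\x01\x02\x00\x00"  # binary content that legitimately ends in null bytes

null_round_trip = unpad_null(pad_null(payload))
count_round_trip = unpad_count(pad_count(payload))

print(null_round_trip == payload)   # False
print(count_round_trip == payload)  # True
```

The null-stripping version silently drops the payload's own trailing zero bytes, which is presumably what corrupts the decrypted hdf and stata files, while csv text survives because it does not end in null bytes.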

Also, for the time being, read_hdf doesn't support reading from a file-like object, even though the API documentation claims it does. If we restrict ourselves to the stata format, the following code does what you need:

import pandas as pd
import io
from Crypto import Random
from Crypto.Cipher import AES

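def pad(s):
    # Append n bytes, each with value n (Python 2: str is a byte string)
    n = AES.block_size - len(s) % AES.block_size
    return s + n * chr(n)

def unpad(s):
    # The last byte says how many padding bytes to strip
    return s[:-ord(s[-1])]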

def encrypt(message, key, key_size=256):
    message = pad(message)
    iv = Random.new().read(AES.block_size)
    cipher = AES.new(key, AES.MODE_CBC, iv)
    return iv + cipher.encrypt(message)

def decrypt(ciphertext, key):
    iv = ciphertext[:AES.block_size]
    cipher = AES.new(key, AES.MODE_CBC, iv)
    plaintext = cipher.decrypt(ciphertext[AES.block_size:])
    return unpad(plaintext)

def encrypt_file(file_name, key):
    with open(file_name, 'rb') as fo:
        plaintext = fo.read()
    enc = encrypt(plaintext, key)
    with open(file_name + ".enc", 'wb') as fo:
        fo.write(enc)

def decrypt_stream(file_name, key):
    with open(file_name, 'rb') as fo:
        ciphertext = fo.read()
    dec = decrypt(ciphertext, key)
    # A BytesIO created from initial data is already positioned at the start
    return io.BytesIO(dec)

key = 'this is an example key'[:16]

df = pd.DataFrame({
    'x': [1,2],
    'y': [3,4]
})

df.to_stata('test.dta')

encrypt_file('test.dta', key)

print(pd.read_stata(decrypt_stream('test.dta.enc', key)))

Output:

   index  x  y
0      0  1  3
1      1  2  4

In Python 3, where the file contents are bytes rather than str, use the following versions of pad and unpad instead:

def pad(s):
    n = AES.block_size - len(s) % AES.block_size
    return s + bytearray([n] * n)

def unpad(s):
    return s[:-s[-1]]
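One detail carried over from the question: the key is sliced to 16 characters, which selects AES-128, while the question asks for 256-bit AES. AES picks its variant from the key length (16, 24 or 32 bytes), so a 32-byte key is needed. A minimal sketch of deriving one from a passphrase with the standard library follows; the function name, the salt handling and the iteration count are assumptions for illustration, not part of the original code:

```python
import hashlib
import os

def derive_key(passphrase, salt, iterations=200000):
    """Derive a 32-byte (256-bit) key with PBKDF2-HMAC-SHA256.

    The iteration count is an illustrative value, not a recommendation.
    """
    return hashlib.pbkdf2_hmac(
        "sha256", passphrase.encode("utf-8"), salt, iterations, dklen=32
    )

salt = os.urandom(16)  # random salt; needed again to re-derive the same key
key = derive_key("this is an example key", salt)
print(len(key) * 8)  # 256
```

The resulting bytes object can be passed as key to the encrypt and decrypt functions above. The salt is not secret, but it has to be stored somewhere (for example next to the IV) so that the same key can be derived again for decryption.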
JuniorCompressor answered Oct 23 '22 23:10



What worked for me for the .h5 format, using the cryptography library, was:

import os

import pandas as pd
from cryptography.fernet import Fernet

def read_h5_file(new_file: str, decrypted: bytes, verbose=False):
    # Write the decrypted bytes to disk, since read_hdf needs a real file path
    with open(new_file, 'wb') as f:
        f.write(decrypted)
    if verbose:
        print(f'Created {new_file}')
    try:
        df = pd.read_hdf(new_file)
    finally:
        # Remove the decrypted copy even if reading fails
        os.remove(new_file)
        if verbose:
            print(f'Deleted {new_file}')
    return df

# path_to_file (the encrypted file) and key (the Fernet key) must already be defined
with open(path_to_file, 'rb') as f:
    data = f.read()

fernet = Fernet(key)
decrypted = fernet.decrypt(data)
new_file = './example_path/example.h5'

df = read_h5_file(new_file, decrypted, verbose=True)

In other words: write the decrypted bytes to a temporary .h5 file, read that file with pandas, delete the decrypted file again, and return the DataFrame.

Maybe this approach helps, as I didn't find any other or similar solution for this online.
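A variation on the same idea, sketched with the standard library's tempfile module so that no fixed output path is needed and the decrypted copy is removed even if the reader raises. The function name read_via_tempfile and the reader callback are hypothetical names introduced here for illustration:

```python
import os
import tempfile

def read_via_tempfile(decrypted, reader, suffix=".h5"):
    """Write decrypted bytes to a private temp file, call reader(path), then delete it."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(decrypted)
        # The file is closed here, so the reader can open it by path
        return reader(path)
    finally:
        # Always remove the decrypted copy, even if the reader fails
        os.remove(path)

def read_bytes(path):
    # Stand-in reader for the demo; any function taking a path works
    with open(path, "rb") as f:
        return f.read()

result = read_via_tempfile(b"decrypted payload", read_bytes)
print(result)  # b'decrypted payload'
```

With pandas, the reader would be pd.read_hdf, as in read_via_tempfile(decrypted, pd.read_hdf). The decrypted data still touches the disk briefly, so this is a workaround rather than a true stream.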

Createdd answered Oct 23 '22 22:10
