
Reading encrypted files into pandas

Tags: python, pandas

Update: I have asked a new question that gives a full code example: Decrypting a file to a stream and reading the stream into pandas (hdf or stata)

My basic problem is that I need to keep data encrypted on disk and then read it into pandas. I'm open to a variety of solutions, but the encryption needs to be AES-256. As of now I'm using PyCrypto, but that's not a requirement.

My current solution is:

  1. Decrypt into a temporary file (CSV, HDF, etc.)
  2. Read the temp file into pandas
  3. Delete the temp file

That's far from ideal because an unencrypted file temporarily sits on the hard drive, and with user error it could stay there longer than temporarily. Equally bad, the IO is essentially tripled, since an unencrypted file is written out and then read back into pandas.

Ideally, encryption would be built into HDF or some other binary format that pandas can read, but it doesn't seem to be as far as I can tell.

(Note: this is on a Linux box, so perhaps there is a shell script solution, although I'd prefer to avoid that if it can all be done inside Python.)

Second best, and still a big improvement, would be to decrypt the file into memory and read it directly into pandas without ever creating a new (unencrypted) file. So far I haven't been able to do that, though.

Here's some pseudocode to hopefully illustrate.

# this works, but less safe and IO intensive
decrypt_to_file('encrypted_csv', 'decrypted_csv')    # outputs decrypted file to disk
pd.read_csv('decrypted_csv')

# this is what I want, but don't know how to make it work
# no decrypted file is ever created
pd.read_csv(decrypt_to_memory('encrypted_csv'))

So that's what I'm trying to do, but I'm also interested in other alternatives that accomplish the same thing (that is, are efficient and don't create a temp file).

Update: Probably there is not going to be a direct answer to this question -- not too surprising, but I thought I would check. I think the answer will involve something like BytesIO (mentioned by DSM) or mmap (mentioned by Mad Physicist), so I'm exploring those. Thanks to all who made a sincere attempt to help here.
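For illustration, the BytesIO idea could look roughly like the sketch below. It assumes the whole file fits in memory and that decryption is available as a bytes-to-bytes callable; `decrypt_bytes`, `read_encrypted_csv` and `toy_xor` are hypothetical names introduced here, not part of pandas or PyCrypto. The XOR function is only a stand-in so the sketch runs on its own; a real setup would pass a wrapper around an AES-256 cipher instead.

```python
import io


def decrypt_to_memory(path, decrypt_bytes):
    """Read an encrypted file and return its plaintext as a file-like object.

    `decrypt_bytes` is a hypothetical callable (bytes -> bytes) wrapping
    whatever cipher is in use; it is an assumption of this sketch.
    """
    with open(path, "rb") as fh:
        ciphertext = fh.read()
    # The plaintext only lives in this in-memory buffer, never in a file.
    return io.BytesIO(decrypt_bytes(ciphertext))


def read_encrypted_csv(path, decrypt_bytes, **read_csv_kwargs):
    """Decrypt `path` in memory and parse the result with pandas.read_csv."""
    # Imported here so decrypt_to_memory itself has no pandas dependency.
    import pandas as pd

    return pd.read_csv(decrypt_to_memory(path, decrypt_bytes), **read_csv_kwargs)


def toy_xor(data, key=0x5A):
    """Stand-in cipher for demonstration only. NOT AES and NOT secure."""
    return bytes(b ^ key for b in data)
```

Usage would be something like `df = read_encrypted_csv('encrypted_csv', my_aes_decrypt)`, where `my_aes_decrypt` is whatever function performs the real decryption. Note that ciphertext and plaintext are both held in memory at once, so peak memory use is roughly twice the file size before pandas starts parsing.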

asked Oct 31 '22 at 12:10 by JohnE


1 Answer

If you are already using Linux and are looking for a "simple" alternative that does not involve encrypting/decrypting at the Python level, you could use native file system encryption with ext4.

This approach might make your installation complicated, but it has the following advantages:

  • Zero risk of leakage via temporary file.
  • Fast, since the native encryption is implemented in C. (PyCrypto is also written in C, but I am guessing encryption will be faster at the kernel level.)

Disadvantages:

  • You need to learn the specific file system commands.
  • Your current Linux kernel may be too old (native ext4 encryption needs kernel 4.1 or later).
  • You may not know how to upgrade your Linux kernel, or may not be able to.

As for writing the decrypted file to memory, you can use /dev/shm as your write location, which spares you from doing complicated streaming or overriding pandas methods. In short, /dev/shm is backed by memory (on some systems /tmp is a tmpfs mount and behaves the same way), and it is much faster than your normal hard drive (see info on /dev/shm).
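A minimal sketch of the /dev/shm route, assuming the plaintext is already available as bytes and that /dev/shm exists and is memory-backed on the machine in question. `plaintext_in_ram` is a hypothetical helper name introduced here; it relies only on the standard library.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def plaintext_in_ram(plaintext, directory="/dev/shm"):
    """Write `plaintext` to a private temp file in `directory`, yield its
    path, and remove the file on exit, even if the reader raises.

    Assumption: `directory` is memory-backed (as /dev/shm usually is on Linux).
    """
    # mkstemp creates the file readable and writable only by the current user.
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(plaintext)
        yield path
    finally:
        os.remove(path)
```

It could then be used as `with plaintext_in_ram(decrypted_bytes) as path: df = pd.read_stata(path)`. The point of going through a path rather than a buffer is that it should also work for readers that want a real file name. The file is still visible to other processes running as the same user while the block is open, so this narrows the exposure rather than removing it.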

I hope this helps you in a way.

answered Nov 15 '22 at 04:11 by oz123