Update: I have asked a new question that gives a full code example: Decrypting a file to a stream and reading the stream into pandas (hdf or stata)
My basic problem is that I need to keep data encrypted on disk and then read it into pandas. I'm open to a variety of solutions, but the encryption needs to be AES-256. As of now I'm using PyCrypto, but that's not a requirement.
My current solution is:
That's far from ideal because an unencrypted file temporarily sits on the hard drive, and with user error it could stay there longer than temporarily. Equally bad, the IO is essentially tripled, since an unencrypted file is written out and then read into pandas.
Ideally, encryption would be built into HDF or some other binary format that pandas can read, but it doesn't seem to be as far as I can tell.
(Note: this is on a Linux box, so perhaps there is a shell script solution, although I'd prefer to avoid that if it can all be done inside Python.)
Second best, and still a big improvement, would be to decrypt the file into memory and read it directly into pandas without ever creating a new (unencrypted) file. So far I haven't been able to do that, though.
Here's some pseudo code to hopefully illustrate.
# this works, but less safe and IO intensive
decrypt_to_file('encrypted_csv', 'decrypted_csv') # outputs decrypted file to disk
pd.read_csv('decrypted_csv')
# this is what I want, but don't know how to make it work
# no decrypted file is ever created
pd.read_csv(decrypt_to_memory('encrypted_csv'))
So that's what I'm trying to do, but I'm also interested in other alternatives that accomplish the same thing (that is, are efficient and don't create a temp file).
Update: Probably there is not going to be a direct answer to this question -- not too surprising, but I thought I would check. I think the answer will involve something like BytesIO (mentioned by DSM) or mmap (mentioned by Mad Physicist), so I'm exploring those. Thanks to all who made a sincere attempt to help here.
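The BytesIO idea mentioned above can be sketched roughly as follows. This is only a minimal sketch, not a tested solution: `decrypt_to_memory` is the hypothetical name from the pseudo code, and `decrypt_bytes` is an assumed callable that wraps whatever AES-256 routine is already in use (it takes the ciphertext and returns the plaintext). It also assumes the whole file fits in RAM.

```python
import io
from typing import Callable


def decrypt_to_memory(path: str, decrypt_bytes: Callable[[bytes], bytes]) -> io.BytesIO:
    """Return the plaintext of an encrypted file as an in-memory buffer.

    `decrypt_bytes` is a hypothetical, caller-supplied function: it receives
    the raw ciphertext and must return the decrypted bytes.
    """
    with open(path, "rb") as f:
        ciphertext = f.read()           # the only disk read is the encrypted file
    plaintext = decrypt_bytes(ciphertext)
    return io.BytesIO(plaintext)        # file-like object; no decrypted file is created
```

Since `pd.read_csv` accepts a file-like object as well as a path, the second half of the pseudo code would then look like `pd.read_csv(decrypt_to_memory('encrypted_csv', my_decrypt))`, where `my_decrypt` is whatever decryption wrapper you already have. Readers that insist on a real file path would not work with this approach.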
If you are already using Linux and are looking for a "simple" alternative that does not involve encrypting/decrypting at the Python level, you could use native file system encryption with ext4.
This approach might make your installation complicated, but it has the following advantages:
Disadvantage:
As for writing the decrypted file to memory, you can use /dev/shm as your write location, which spares you complicated streaming or overriding pandas methods.
In short, /dev/shm is backed by memory (in some cases your tmpfs is too), and it is much faster than your normal hard drive (info /dev/shm/).
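A rough sketch of the /dev/shm idea in Python, under a few assumptions: the decrypted bytes are already in hand, `reader` is any function that takes a file path (for example `pd.read_csv`), and `read_via_ram_file` is just an illustrative name. The temporary file is created with owner-only permissions and removed as soon as the reader returns.

```python
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def read_via_ram_file(plaintext: bytes,
                      reader: Callable[[str], T],
                      ram_dir: str = "/dev/shm") -> T:
    """Write plaintext to a temporary file in a RAM-backed directory,
    pass its path to `reader`, and delete the file afterwards."""
    # NamedTemporaryFile creates the file readable/writable by the owner only
    # and deletes it when the `with` block exits, even if `reader` raises.
    with tempfile.NamedTemporaryFile(dir=ram_dir) as tmp:
        tmp.write(plaintext)
        tmp.flush()                 # make sure the reader sees all the bytes
        return reader(tmp.name)     # reopening by name works on Linux
```

Usage would be something like `df = read_via_ram_file(decrypted_bytes, pd.read_csv)`. Keep in mind that tmpfs pages can still be swapped out, so this keeps the plaintext off the regular file system but does not strictly guarantee it never touches a disk unless swap is disabled or encrypted.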
I hope this helps you in a way.