
Read a zipped CSV from S3 into a Python DataFrame

I have a bucket in S3 with a CSV in it.
There are no non-ASCII characters in it.
When I try to read it using Python, it fails.
I used: df = self.s3_input_bucket.get_file_contents_from_s3(path)
as I have on many occasions recently in the same script, and got: UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 14: invalid start byte.
To make sure it was looking at the right path, I put another plain-text file in the same folder and was able to read it without a problem.

I tried many solutions I found in other questions. As just one example, someone suggested trying this:

str = unicode(str, errors='replace')

or

str = unicode(str, errors='ignore')
(from this question: UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c)
But how can I use them in this case?
This did not work:

str = unicode(self.s3_input_bucket.get_file_contents_from_s3(path), errors='replace')
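For reference, a failure of this kind can be reproduced locally without S3. The sketch below is only an illustration under stated assumptions: it builds a small zip archive in memory (the member name `scores.csv` and its contents are made up, and `looks_like_zip` is a hypothetical helper) and shows that the archive's bytes are not valid UTF-8 and start with the zip signature `PK\x03\x04`. In that situation `errors='replace'` or `errors='ignore'` would only suppress the exception; the result would still be compressed data, not CSV text.

```python
import io
import zipfile


def looks_like_zip(raw: bytes) -> bool:
    """Return True if raw starts with the zip local-file-header signature."""
    return raw[:4] == b"PK\x03\x04"


# Stand-in for the S3 object: a zip archive built in memory (made-up contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("scores.csv", "id,score\n1,0.5\n2,0.7\n")
raw = buffer.getvalue()

# Strict decoding fails because the archive is binary, not text.
try:
    raw.decode("utf-8")
    decoded_cleanly = True
except UnicodeDecodeError:
    decoded_cleanly = False

# errors="replace" avoids the exception but still yields compressed data, not CSV.
replaced = raw.decode("utf-8", errors="replace")

print("starts with zip signature:", looks_like_zip(raw))
print("decoded cleanly as UTF-8:", decoded_cleanly)
```

Checking the first four bytes of the downloaded object is a cheap way to tell a zip archive from a plain-text file before attempting to decode it.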

Zusman asked Oct 11 '25 17:10


1 Answer

Apparently, I was trying to open a zipped file.
After much research, I was able to read it into a DataFrame using this code:

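import os
import zipfile

import pandas as pd
import s3fs

# S3 filesystem handle; the extra kwargs request AES256 server-side encryption
s3_fs = s3fs.S3FileSystem(s3_additional_kwargs={'ServerSideEncryption': 'AES256'})

# Called from another method of the same class; my_bucket and path_in_bucket are placeholders
market_score = self._zipped_csv_from_s3_to_df(os.path.join(my_bucket, path_in_bucket), s3_fs)

def _zipped_csv_from_s3_to_df(self, path, s3_fs):
    # Open the zip object on S3, then read the first file inside it as a CSV
    with s3_fs.open(path) as zipped_dir:
        with zipfile.ZipFile(zipped_dir, mode='r') as zipped_content:
            for score_file in zipped_content.namelist():
                with zipped_content.open(score_file) as scores:
                    return pd.read_csv(scores)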

I will always have only one CSV file inside the zip, which is why I know I can return on the first iteration.
Note, however, that the function is written as a loop over the files in the zip.
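If the one-file assumption matters, it can be checked explicitly instead of relying on the early return. The following is a minimal sketch, not the code from my script: `read_only_member` is a hypothetical helper, and the in-memory archive with a made-up `scores.csv` stands in for the S3 object so the example runs without S3 or pandas.

```python
import io
import zipfile


def read_only_member(zipped_file):
    """Return (name, bytes) of the single file inside a zip archive.

    Raises ValueError if the archive does not contain exactly one file.
    """
    with zipfile.ZipFile(zipped_file, mode="r") as archive:
        # Directory entries are listed with a trailing slash; skip them.
        names = [n for n in archive.namelist() if not n.endswith("/")]
        if len(names) != 1:
            raise ValueError(
                f"expected exactly one file in the archive, found {len(names)}"
            )
        return names[0], archive.read(names[0])


# Stand-in for the S3 object: a zip built in memory with one made-up CSV.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr("scores.csv", "id,score\n1,0.5\n2,0.7\n")
buffer.seek(0)

name, payload = read_only_member(buffer)
print(name, len(payload))
```

With s3fs, the file object returned by `s3_fs.open(path)` could be passed in place of `buffer`, and the returned bytes could be loaded with `pd.read_csv(io.BytesIO(payload))`. Failing loudly on an unexpected archive layout seems safer than silently reading whichever file happens to be listed first.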