I have a bucket in S3 with a CSV file in it.
There are no non-ASCII characters in it.
When I try to read it using Python, it fails.
I used: df = self.s3_input_bucket.get_file_contents_from_s3(path)
as I have done many times recently in the same script, and I get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 14: invalid start byte.
To make sure it goes to the right path, I put another plain-text file in the same folder and was able to read it without a problem.
I tried many solutions I found in other questions. As one example, I saw a suggestion to try this:
str = unicode(str, errors='replace')
or
str = unicode(str, errors='ignore')
from this question: UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c 
But how can I use them in this case?
This did not work:
str = unicode(self.s3_input_bucket.get_file_contents_from_s3(path), errors='replace')
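For reference, `unicode(...)` is the Python 2 built-in; in Python 3 the closest equivalent is `bytes.decode`. The sketch below uses made-up bytes (not the real file) and assumes the S3 helper hands back raw bytes, which I have not verified. It shows that the `replace` and `ignore` handlers only change how undecodable bytes are treated; they do not recover the original content.

```python
# Hypothetical sample: valid ASCII followed by one byte that is not valid UTF-8.
raw = b"score\x84"

# Strict decoding (the default) raises the same kind of error as in the question.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error_message = str(exc)

# 'replace' swaps each undecodable byte for U+FFFD, 'ignore' silently drops it.
replaced = raw.decode("utf-8", errors="replace")
ignored = raw.decode("utf-8", errors="ignore")

print(error_message)
print(repr(replaced))
print(repr(ignored))
```

If the bytes are not text at all, both handlers will still "succeed" but return garbage, so they are better suited to slightly damaged text than to binary data.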
Apparently, I was trying to open a zipped file.
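A quick way to confirm this kind of problem is to look at the raw bytes before decoding them. The sketch below is only an illustration: it builds a tiny archive in memory instead of downloading from S3, and `looks_like_zip` is a hypothetical helper name, not something from my script. It relies on the standard library's `zipfile.is_zipfile`, which accepts a file-like object.

```python
import io
import zipfile


def looks_like_zip(data: bytes) -> bool:
    """Return True if the given bytes appear to be a zip archive."""
    return zipfile.is_zipfile(io.BytesIO(data))


# Build a small zip in memory to stand in for the object stored in S3.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr("scores.csv", "id,score\n1,0.5\n")
zip_bytes = buffer.getvalue()

plain_bytes = b"id,score\n1,0.5\n"

print(zip_bytes[:4])                # a non-empty zip starts with the local file header signature
print(looks_like_zip(zip_bytes))    # archive bytes
print(looks_like_zip(plain_bytes))  # ordinary CSV text
```

A non-empty zip normally starts with the bytes `PK\x03\x04`, followed by binary header fields, which is why decoding it as UTF-8 fails within the first few bytes.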
After much research, I was able to read it into a data frame using this code:
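import os
import zipfile

import pandas as pd
import s3fs

# Call site, inside a method of the same class.
# my_bucket and path_in_bucket are placeholders for the bucket name and the object key.
s3_fs = s3fs.S3FileSystem(s3_additional_kwargs={'ServerSideEncryption': 'AES256'})
market_score = self._zipped_csv_from_s3_to_df(os.path.join(my_bucket, path_in_bucket), s3_fs)

def _zipped_csv_from_s3_to_df(self, path, s3_fs):
    # Open the zip archive on S3 as a binary file object.
    with s3_fs.open(path) as zipped_dir:
        with zipfile.ZipFile(zipped_dir, mode='r') as zipped_content:
            # Parse the first member of the archive as a CSV and return it.
            for score_file in zipped_content.namelist():
                with zipped_content.open(score_file) as scores:
                    return pd.read_csv(scores)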
There will always be exactly one CSV file inside the zip, which is why it is safe to return on the first iteration.
Note, however, that the function is written as a loop over the files in the zip.
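If the "exactly one file" assumption ever breaks, the loop above would silently return whichever member comes first. One possible variation, sketched here with only the standard library and an in-memory archive standing in for the S3 object, is to check the member count explicitly. `read_single_member` is a hypothetical name, and skipping entries that end in "/" is my own assumption about how directory entries should be handled.

```python
import io
import zipfile


def read_single_member(zip_file_obj):
    """Return the raw bytes of the only file inside a zip archive."""
    with zipfile.ZipFile(zip_file_obj, mode="r") as archive:
        # Directory entries also show up in namelist(); skip them.
        names = [name for name in archive.namelist() if not name.endswith("/")]
        if len(names) != 1:
            raise ValueError(
                f"expected exactly one file in the archive, found {len(names)}"
            )
        return archive.read(names[0])


# In-memory archive for demonstration; in the real script the file object
# would come from s3_fs.open(path).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr("scores.csv", "id,score\n1,0.5\n")
buffer.seek(0)

csv_bytes = read_single_member(buffer)
print(csv_bytes.decode("utf-8"))
```

The returned bytes can then be wrapped in `io.BytesIO` and passed to `pd.read_csv`, so the rest of the code stays the same.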