I have used tweepy to store the text of tweets in a csv file using Python csv.writer(
), but I had to encode the text in utf-8 before storing, otherwise tweepy throws a weird error.
Now, the text data is stored like this:
"b'Lorem Ipsum\xc2\xa0Assignment '"
I tried to decode this using this code (there is more data in other columns, text is in 3rd column):
with open('data.csv','rt',encoding='utf-8') as f:
reader = csv.reader(f,delimiter=',')
for row in reader:
print(row[3])
But, it doesn't decode the text. I cannot use .decode('utf-8')
as the csv reader reads data as strings i.e. type(row[3])
is 'str'
and I can't seem to convert it into bytes
, the data gets encoded once more!
How can I decode the text data?
Edit: Here's a sample line from the csv file:
67783591545656656999,3415844,1450443669.0,b'Virginia School District Closes After Backlash Over Arabic Assignment: The Augusta County school district in\xe2\x80\xa6 | @abcde',52,18
Note: If the solution is in the encoding process, please note that I cannot afford to download the entire data again.
The easiest way is as below. Try it out.
import csv
from io import StringIO
byte_content = b"iam byte content"
content = byte_content.decode()
file = StringIO(content)
csv_data = csv.reader(file, delimiter=",")
If your input file really contains strings with Python syntax b
prefixes on them, one way to workaround it (even though it's not really a valid format for csv data to contain) would be to use Python's ast.literal_eval()
function as @Ry suggested — although I would use it in a slightly different manner, as shown below.
This will provide a safe way to parse strings in the file which are prefixed with a b
indicating they are byte-strings. The rest will be passed through unchanged.
Note that this doesn't require reading the entire CSV file into memory.
import ast
import csv
def _parse_bytes(field):
"""Convert string represented in Python byte-string literal b'' syntax into
a decoded character string - otherwise return it unchanged.
"""
result = field
try:
result = ast.literal_eval(field)
finally:
return result.decode() if isinstance(result, bytes) else result
def my_csv_reader(filename, /, **kwargs):
with open(filename, 'r', newline='') as file:
for row in csv.reader(file, **kwargs):
yield [_parse_bytes(field) for field in row]
reader = my_csv_reader('bytes_data.csv', delimiter=',')
for row in reader:
print(row)
You can use ast.literal_eval
to convert the incorrect fields back to bytes safely:
import ast
def _parse_bytes(bytes_repr):
result = ast.literal_eval(bytes_repr)
if not isinstance(result, bytes):
raise ValueError("Malformed bytes repr")
return result
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With