The Python CSV writer is adding letters to the beginning of each element and issues with encode

Tags: python, csv, unicode

So I'm trying to parse JSON files into a tab-delimited file. The parsing seems to work fine and all the data is coming through, but the oddest thing is happening in the output file. I told it to use a tab delimiter, and the output does use tabs, but each value is still wrapped in single quotes and, for some reason, has the letter b added to the beginning. I manually typed in the header, and that works fine, but the data itself is acting weird. Here's an example of the output I'm getting:

id  created text    screen name name    latitude    longitude   place name  place type
b'1234567890'   b'Thu Mar 14 19:39:07 +0000 2013'   "b""I'm at Bank Of America (Wayne, MI) http://t.co/asdf"""  b'userid'   b'username' 42.28286837 -83.38487864    b'Bank Of America, Wayne'   b'poi'
b'1234567891'   b'Thu Mar 14 19:39:16 +0000 2013'   b'here is a sample tweet \xf0\x9f\x8f\x80 #notingoodhands'  b'userid2'  b'username2'

Here is the code that I'm using to write the data out.

out = open(filename, 'w')
out.write('id\tcreated\ttext\tscreen name\tname\tlatitude\tlongitude\tplace name\tplace type')
out.write('\n')
rows = zip(ids, times, texts, screen_names, names, lats, lons, place_names, place_types)
from csv import writer
csv = writer(out, dialect='excel', delimiter = '\t')
for row in rows:
    values = [(value.encode('utf-8') if hasattr(value, 'encode') else value) for value in row]
    csv.writerow(values)
out.close()

So here's the thing. If I leave out the utf-8 encoding and just write the values straight out, the formatting is exactly how I want it. But then when people type in special characters, the program crashes because it can't handle them:

Traceback (most recent call last):
  File "tweets.py", line 34, in <module>
    csv.writerow(values)
  File "C:\Python33\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f3c0' in position 153: character maps to <undefined>

Adding the utf-8 encoding avoids the crash, but it produces the kind of output shown above, with all these extra characters. Does anyone have any thoughts on this?

asked Mar 14 '13 21:03

brian

1 Answer

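You are writing byte data instead of Unicode text to your file, because you are encoding the values yourself. In Python 3, csv.writer converts any value that is not a string with str(), and str() of a bytes object is its repr; that is where the b'...' wrapper comes from. When the bytes contain a single quote, the repr switches to double quotes, which the writer then has to escape, giving fields like "b""I'm at ...""".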
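A minimal sketch of that behaviour, assuming Python 3. The in-memory buffer and the sample value 'userid' are only for illustration; they stand in for the real output file and tweet data.

```python
import csv
import io

# In-memory text buffer standing in for the output file (illustrative only).
buffer = io.StringIO()
writer = csv.writer(buffer, dialect='excel', delimiter='\t')

# The first field is encoded by hand (bytes); the second is left as str.
writer.writerow(['userid'.encode('utf-8'), 'userid'])

# The bytes field is written as its repr, the str field as plain text.
line = buffer.getvalue()
print(repr(line))
```

Only the hand-encoded field picks up the b prefix and the quotes; the plain string is written unchanged.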

Remove the encode calls altogether and let Python handle this for you; open the file with the UTF-8 encoding and the rest takes care of itself:

out = open(filename, 'w', encoding='utf8')
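Put together, the writing code from the question might look something like the sketch below. The helper name write_rows and its parameters are hypothetical, not part of the original code, and newline='' is added because the csv module documentation recommends it for files passed to csv.writer.

```python
import csv


def write_rows(filename, header, rows):
    """Write a header and data rows as tab-delimited UTF-8 text.

    Hypothetical helper: values are passed through as str (or numbers),
    with no manual .encode() calls.
    """
    # encoding='utf-8' lets the file object do the encoding;
    # newline='' leaves line endings to the csv writer.
    with open(filename, 'w', encoding='utf-8', newline='') as out:
        writer = csv.writer(out, dialect='excel', delimiter='\t')
        writer.writerow(header)
        writer.writerows(rows)
```

With the question's variables, this would be called as write_rows(filename, header, zip(ids, times, texts, ...)), where header is the list of column names.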

This is documented in the csv module documentation:

Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:

import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
The same applies to writing in something other than the system default encoding: specify the encoding argument when opening the output file.
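The traceback in the question is consistent with this: the system default encoding there was cp1252, which has no mapping for U+1F3C0. A small sketch of the difference, assuming Python 3 (the variable names are illustrative, and the printed default encoding depends on the platform):

```python
import locale

basketball = '\U0001f3c0'  # the character named in the traceback

# What open() falls back to when no encoding is given (platform dependent).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# cp1252 cannot represent this character, so encoding fails.
try:
    basketball.encode('cp1252')
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# UTF-8 can represent every Unicode code point.
utf8_bytes = basketball.encode('utf-8')
print(cp1252_ok, utf8_bytes)
```

The UTF-8 bytes are the same \xf0\x9f\x8f\x80 sequence that shows up in the question's sample output.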

answered Oct 05 '22 15:10

Martijn Pieters
