
Processing a Django UploadedFile as UTF-8 with universal newlines

Tags:

python

django

In my Django application, I provide a form that lets users upload a file. The file can be in one of several formats (Excel, CSV), come from any of several platforms (Mac, Linux, Windows), and use one of several encodings (ASCII, UTF-8).

For the purpose of this question, let's assume that I have a view which is receiving request.FILES['file'], which is an instance of InMemoryUploadedFile, called file. My problem is that InMemoryUploadedFile objects (like file):

  1. Do not support UTF-8 encoding (I see a \xef\xbb\xbf at the beginning of the file, which, as I understand it, is the UTF-8 byte order mark, a flag meaning 'this file is UTF-8').
  2. Do not support universal newlines (which most of the files uploaded to this system will probably need).
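To make the two points above concrete, here is a minimal Python 3 sketch (the code in this question is Python 2). The byte string is a made-up stand-in for an upload, not real data. It only illustrates what the BOM and mixed line endings look like once decoded; splitting on lines this way is not a substitute for a CSV parser, since quoted fields may contain newlines.

```python
import codecs

# Hypothetical upload contents: a UTF-8 BOM, then rows with mixed line endings.
raw = codecs.BOM_UTF8 + "name,city\r\nZo\u00eb,K\u00f6ln\rAda,Paris\n".encode("utf-8")

# The three leading bytes \xef\xbb\xbf are the UTF-8 byte order mark.
has_bom = raw.startswith(b"\xef\xbb\xbf")

# The "utf-8-sig" codec decodes UTF-8 and drops a leading BOM if present.
text = raw.decode("utf-8-sig")

# str.splitlines() treats \r\n, \r and \n (among others) as line boundaries.
lines = text.splitlines()
```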

Complicating the issue is that I wish to pass the file to the Python csv module, which does not natively support Unicode. I will happily accept answers that avoid this issue - once I get Django playing nicely with UTF-8, I'm sure I can bludgeon csv into doing the same. (Similarly, please ignore the requirement to support Excel - I am waiting until CSV works before I tackle parsing Excel files.)

I have tried using StringIO, mmap, codecs, and a wide variety of ways of accessing the data in an InMemoryUploadedFile object. Each approach has produced different errors, and none so far has worked perfectly. This is the code that I feel came the closest:

import csv
import codecs

class CSVParser:
    def __init__(self,file):
        # 'file' is assumed to be an InMemoryUploadedFile object.
        dialect = csv.Sniffer().sniff(codecs.EncodedFile(file,"utf-8").read(1024))
        file.open() # seek to 0
        self.reader = csv.reader(codecs.EncodedFile(file,"utf-8"),
                                 dialect=dialect)
        try:
            self.field_names = self.reader.next()
        except StopIteration:
            # The file was empty - this is not allowed.
            raise ValueError('Unrecognized format (empty file)')

        if len(self.field_names) <= 1:
            # This probably isn't a CSV file at all.
            # Note that the csv module will (incorrectly) parse ALL files, even
            # binary data. This will catch most such files.
            raise ValueError('Unrecognized format (too few columns)')

        # Additional methods snipped, unrelated to issue

Please note that I haven't spent much time on the actual parsing algorithm, so it may be wildly inefficient; right now I'm more concerned with getting the encoding to work as expected.

The problem is that the results still do not appear to be UTF-8, despite the file being wrapped in the codecs.EncodedFile wrapper.

EDIT: It turns out the above code does in fact work. codecs.EncodedFile(file,"utf-8") is the ticket. The reason I thought it didn't work was that the terminal I was using does not support UTF-8. Live and learn!
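One way to check a decoded value without depending on what the terminal can display is to print an escaped form of it. A minimal Python 3 sketch, using a made-up cell value rather than anything from the real upload:

```python
# Hypothetical cell value as raw UTF-8 bytes ("Zoë", where ë is two bytes).
cell = b"Zo\xc3\xab"

# Decoding yields three characters from four bytes if the data is valid UTF-8.
decoded = cell.decode("utf-8")

# ascii() escapes every non-ASCII character, so the output is readable
# even on a terminal that cannot render UTF-8.
escaped = ascii(decoded)
print(escaped)
```

If the escaped form shows the expected code points, the decoding is fine and any garbled output is a display problem.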

asked Jan 18 '11 at 22:01 by eblume


2 Answers

As mentioned above, the code snippet I provided was in fact working as intended - the problem was with my terminal, and not with Python's encoding handling.

If your view needs to access a UTF-8 UploadedFile, you can just use utf8_file = codecs.EncodedFile(request.FILES['file_field'],"utf-8") to open a file object in the correct encoding.

I also noticed that, at least for InMemoryUploadedFiles, opening the file through the codecs.EncodedFile wrapper does NOT reset the read position of the underlying file. To return to the beginning of the file (again, this may be specific to InMemoryUploadedFile), I just used request.FILES['file_field'].open() to send the position back to 0.
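The position behaviour can be reproduced outside Django. In this Python 3 sketch, io.BytesIO stands in for the uploaded file (an assumption for illustration), and a plain seek(0) plays the role that open() plays above:

```python
import codecs
import io

# io.BytesIO stands in for the uploaded file (assumption for this sketch).
raw = io.BytesIO(b"name,city\nAda,Paris\n")

# Read a sample through the wrapper, as the sniffing step does.
first_wrapper = codecs.EncodedFile(raw, "utf-8")
sample = first_wrapper.read(1024)
position_after_read = raw.tell()  # the underlying stream has advanced

# Creating a second wrapper does not rewind the underlying stream.
second_wrapper = codecs.EncodedFile(raw, "utf-8")
position_after_rewrap = raw.tell()

# Rewind explicitly before parsing from the start.
raw.seek(0)
position_after_seek = raw.tell()
```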

answered Nov 12 '22 at 03:11 by eblume


I use csv.DictReader and it appears to be working well. My code snippet is below; it is basically the same as the other answer here.

import csv as csv_mod
import codecs

file = request.FILES['file']    
dialect = csv_mod.Sniffer().sniff(codecs.EncodedFile(file,"utf-8").read(1024))
file.open() 
csv = csv_mod.DictReader( codecs.EncodedFile(file,"utf-8"), dialect=dialect )
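
The snippet above is Python 2. For comparison, a rough Python 3 sketch of the same idea follows; it swaps codecs.EncodedFile for io.TextIOWrapper, because the Python 3 csv module expects text rather than bytes. The function name read_rows and the sample bytes are hypothetical, and io.BytesIO stands in for the upload; how a given Django upload object should be passed in is not verified here.

```python
import csv
import io


def read_rows(binary_file):
    """Decode a binary CSV stream as UTF-8 and return its rows as dicts."""
    # utf-8-sig drops a leading BOM; newline="" leaves line endings to csv.
    text = io.TextIOWrapper(binary_file, encoding="utf-8-sig", newline="")
    dialect = csv.Sniffer().sniff(text.read(1024))
    text.seek(0)  # rewind after sniffing
    rows = list(csv.DictReader(text, dialect=dialect))
    text.detach()  # keep the underlying stream open for the caller
    return rows


# Made-up sample: BOM, a header row, and two data rows with \r\n endings.
upload = io.BytesIO(
    b"\xef\xbb\xbfname,city\r\nZo\xc3\xab,K\xc3\xb6ln\r\nAda,Paris\r\n"
)
rows = read_rows(upload)
```

Note that csv.Sniffer can raise csv.Error when it cannot determine a delimiter, so real code would likely want to handle that case.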

answered Nov 12 '22 at 03:11 by wilblack