
How to create a temporary file with Unicode encoding?

When I use open() to open a file, I am not able to write Unicode strings to it. I have learned that I need to use codecs and open the file with a Unicode encoding (see http://docs.python.org/howto/unicode.html#reading-and-writing-unicode-data).

Now I need to create some temporary files. I tried to use the tempfile library, but it doesn't have any encoding option. When I try to write a Unicode string to a temporary file created with tempfile, it fails:

#!/usr/bin/python2.6
# -*- coding: utf-8 -*-
import tempfile
with tempfile.TemporaryFile() as fh:
  fh.write(u"Hello World: ä")
  fh.seek(0)
  for line in fh:
    print line

How can I create a temporary file with Unicode encoding in Python?

Edit:

  1. I am using Linux and the error message that I get for this code is:

    Traceback (most recent call last):
      File "tmp_file.py", line 5, in <module>
        fh.write(u"Hello World: ä")
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 13: ordinal not in range(128)
    
  2. This is just an example. In practice I am trying to write a string that some API returned.

dbarbosa asked May 08 '12 00:05



2 Answers

Everyone else's answers are correct; I just want to clarify what's going on.

In Python 2, the difference between the literal 'foo' and the literal u'foo' is that the former is a string of bytes and the latter is a Unicode object.

First, understand that Unicode is the character set and UTF-8 is an encoding. A Unicode object is about the former: it's a Unicode string, not necessarily a UTF-8 one. In your case, the encoding of a plain (non-u) string literal will be UTF-8, because you specified it in the first lines of the file.

To get a byte string from a Unicode string, you call the .encode() method (and .decode() goes the other way):

>>> u"ひらがな".encode("utf-8") == "ひらがな"
True

Similarly, you could call .encode() on your string in the write call and achieve the same effect as just removing the u.
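
As a small illustration of the two directions, here is a sketch written for Python 3, where the byte type is called bytes rather than str. The variable names are only for illustration and do not come from the question.

```python
# -*- coding: utf-8 -*-
# Sketch (Python 3): a Unicode string and its UTF-8 byte representation.
text = u"ひらがな"             # Unicode string: four code points
data = text.encode("utf-8")    # byte string: each of these characters takes 3 bytes

# Decoding with the same encoding recovers the original text.
restored = data.decode("utf-8")

print(len(text), len(data), restored == text)
```

The lengths differ because one counts characters and the other counts bytes, which is the distinction between the character set and the encoding.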

If you didn't specify the encoding at the top, or if you were reading the Unicode data from another file, you would have to state what encoding the data was in before it became a Python string. That encoding determines how the text is represented in bytes (i.e., the str type).

The error you're getting, then, is only because the file object returned by the tempfile module is expecting a str object. When it is handed a Unicode object instead, Python 2 falls back to the default ASCII codec, which cannot encode ä. This doesn't mean the file can't handle Unicode, just that it expects you to pass in a byte string rather than a Unicode object, because without you specifying an encoding, it wouldn't know how to write the text to the temp file.
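
The implicit ASCII step can be reproduced explicitly. The following is a sketch in Python 3 syntax, not the code Python 2 runs internally: it encodes the question's string with the ascii codec by hand and records where the failure happens.

```python
# Sketch: encoding the question's string with the ASCII codec fails
# at the first character outside the 0-127 range.
text = u"Hello World: \xe4"   # \xe4 is the character ä

try:
    text.encode("ascii")
    failed_at = None
except UnicodeEncodeError as exc:
    failed_at = exc.start     # index of the first character that could not be encoded

# Choosing an encoding that covers the character avoids the error.
encoded = text.encode("utf-8")
print(failed_at, encoded)
```

The recorded index matches the "position 13" reported in the traceback from the question.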

dfb answered Sep 22 '22 15:09


tempfile.TemporaryFile has an encoding option in Python 3:

#!/usr/bin/python3
# -*- coding: utf-8 -*-
import tempfile
with tempfile.TemporaryFile(mode='w+', encoding='utf-8') as fh:
  fh.write("Hello World: ä")
  fh.seek(0)
  for line in fh:
    print(line)

Note that you now need to specify mode='w+' instead of the default binary mode ('w+b'). Also note that string literals are Unicode by default in Python 3, so no u prefix is needed.
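
To make the difference between the two modes concrete, here is a minimal Python 3 sketch. The function names are hypothetical and chosen only for this illustration: one shows what happens when text is written in the default binary mode, the other round-trips text through a text-mode temporary file.

```python
import tempfile


def write_str_to_binary_tempfile():
    """Return the name of the exception raised when text is written in binary mode."""
    with tempfile.TemporaryFile() as fh:  # default mode is 'w+b'
        try:
            fh.write("Hello World: ä")    # a str, not bytes
        except TypeError as exc:
            return type(exc).__name__
    return None


def roundtrip_text_tempfile(text):
    """Write text to a UTF-8 text-mode temporary file and read it back."""
    with tempfile.TemporaryFile(mode="w+", encoding="utf-8") as fh:
        fh.write(text)
        fh.seek(0)
        return fh.read()


print(write_str_to_binary_tempfile())
print(roundtrip_text_tempfile("Hello World: ä"))
```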

If you're stuck with Python 2.6, temporary files are always binary, and you need to encode the Unicode string before writing it to the file:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import tempfile
with tempfile.TemporaryFile() as fh:
  fh.write(u"Hello World: ä".encode('utf-8'))
  fh.seek(0)
  for line in fh:
    print line.decode('utf-8')

Unicode specifies the character set, not the encoding, so in either case you need a way to specify how to encode the Unicode characters!
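
Since the question mentions codecs, one further option is to wrap the binary temporary file in a codecs writer and reader so the encoding is stated once instead of at every write. This is only a sketch under that assumption, not something the answer above proposes; roundtrip_text is a hypothetical helper name.

```python
import codecs
import tempfile


def roundtrip_text(text, encoding="utf-8"):
    """Write text through a codecs writer to a binary temp file, then read it back."""
    with tempfile.TemporaryFile() as fh:           # binary mode by default
        writer = codecs.getwriter(encoding)(fh)    # encodes text before writing bytes
        writer.write(text)
        fh.seek(0)
        reader = codecs.getreader(encoding)(fh)    # decodes bytes while reading
        return reader.read()


print(roundtrip_text(u"Hello World: \xe4") == u"Hello World: \xe4")
```

The wrapper keeps the file itself binary, which is what tempfile provides by default, and moves the encode and decode calls into one place.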

Seppo Enarvi answered Sep 18 '22 15:09