
str encoding from latin-1 to utf-8 arbitrarily

I have some code that grabs strings from one environment and reproduces them in another. I am using Python 3.5. I keep running into this kind of error:

UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 112: Body ('–') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.

...and I want to avoid it. This error is coming from the requests module. The problem is that I am dealing with literally tens of thousands of strings, and new ones are added all the time. People are cutting and pasting from Excel and whatnot, and I have no idea what characters I will bump into, so I can't just run a str.replace(). I would like to make sure that every string I get from environment 1 is properly UTF-8 encoded before I send it to environment 2.

I tried str('yadayada').encode('utf-8').decode('utf-8') and that didn't work. I tried str('yadaya', 'utf-8') and that didn't work. I tried declaring "# -*- coding: UTF-8 -*-" and that didn't work.

Daniel Dow asked Dec 08 '16 01:12




1 Answer

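In Python 3, a str is a sequence of Unicode code points with no encoding attached, so it is never "Latin-1" or "UTF-8" until it is turned into bytes. The problem is that when you pass a str as the request body, it gets encoded for transfer on your behalf, and the fallback is Latin-1, which cannot represent characters such as '–' (U+2013). To give requests enough information, encode the text to UTF-8 bytes yourself and declare the charset in the Content-Type header.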

import requests

headers = {'Content-Type': 'text/plain; charset=utf-8'}
requests.post(url, data=text.encode('utf-8'), headers=headers)
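To see why the explicit encode helps, here is a minimal sketch that reproduces the Latin-1 failure without touching the network and then builds a UTF-8 body. The helper name to_utf8_body and the sample string are made up for illustration, and text/plain is an assumption about what the receiving side expects.

```python
def to_utf8_body(text):
    """Hypothetical helper: return (body_bytes, headers) for an HTTP POST."""
    body = text.encode('utf-8')  # UTF-8 can represent every Unicode code point
    # Assumed content type; adjust to whatever the receiving service expects.
    headers = {'Content-Type': 'text/plain; charset=utf-8'}
    return body, headers


sample = 'pages 10\u201312'  # contains an en dash (U+2013), as in the error above

# Encoding the same string as Latin-1 reproduces the failure from the question.
try:
    sample.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

body, headers = to_utf8_body(sample)
```

The resulting pair can then be passed along as requests.post(url, data=body, headers=headers). Because the body is already bytes, nothing is left to be encoded implicitly, so arbitrary pasted characters no longer matter.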
Marek Grác answered Oct 19 '22 05:10