
Python: Got \xa0 instead of space in CSV and cannot remove or convert

I have an encoding-related problem in Python (IPython notebook). Problems of this kind seem to be common and simple, but I still cannot fix this one.

I have a CSV file here; as you can see, it contains many '\xa0' and '\n' sequences.

I used

import io

with io.open(train_fname) as f:
    for line in f:
        line = line.encode("ascii", "replace")

But it does not work; I always get the following output:

Imagine being able say, you know what, no sanctions, no forever hearings on IEAA regulations, no more hiding\xa0under\xa0the pretense of friendly nuclear energy. \xa0You have 2 days to; \xa0i.e. \xa0let in the inspectors, quit killing the civilians.

I tried other methods, like

line.replace(u"\xa0", " ")

but that does not work either. I also tried opening this CSV file in my text editor (Sublime Text) with windows-1252, utf-8 and other encodings, but I always see \xa0 in the text when viewing the file.

Does this mean the \xa0 is already written in this CSV file as literal text, so it is not a Python encoding problem? If that is the case, why can't I simply use the replace method to replace this string? Which encoding does the \xa0 indicate the file is in? Does it mean the file is written in utf-8 but I tried to open it as ascii or something else?

I searched many questions, but they don't seem to provide much help. Please ask if my question is not clear. Thank you very much!

Dexter Ju asked May 29 '16 18:05



1 Answer

The \xa0 that you see is a sequence of 4 characters: \ x a 0. All these characters are plain ASCII, so no character set problem here.
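A quick way to check this on your own data is to compare the two spellings directly. This is only a sketch: `sample` is a made-up line standing in for one line read from the CSV file, not the actual file contents.

```python
# Hypothetical sample standing in for one line read from the CSV file.
# The raw-string prefix keeps the backslashes as literal characters.
sample = r"no more hiding\xa0under\xa0the pretense"

escaped = "\\xa0"  # four characters: backslash, x, a, 0
nbsp = "\xa0"      # one character: U+00A0 NO-BREAK SPACE

print(len(escaped), len(nbsp))            # 4 1
print(escaped in sample)                  # True: the literal text is present
print(nbsp in sample)                     # False: no real non-breaking space
print(all(ord(c) < 128 for c in sample))  # True: everything is plain ASCII
```

If the second check prints True and the third prints False for your lines, the file really contains the escape sequence as text, which is why replacing u"\xa0" finds nothing.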

Apparently, you are supposed to interpret these escape sequences. Your idea of replacing them with a space is good, but you have to be careful about the backslash character. When it appears in a string literal, it has to be written \\. So try this:

line.replace("\\xa0", " ")

or:

line.replace(r"\xa0", " ")

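The r in front of the string makes it a raw string literal, in which a backslash is taken literally instead of starting an escape sequence. Also note that replace returns a new string rather than changing line in place, so you have to assign the result back to line.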
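Put together, the replacement could look like the following sketch. The helper name `strip_escaped_nbsp` and the sample text are made up for illustration; only `str.replace` itself comes from the answer above.

```python
def strip_escaped_nbsp(line):
    # Hypothetical helper: replace the literal four-character text
    # (backslash, x, a, 0) with a plain space.
    # str.replace returns a new string, so the result has to be
    # returned or assigned; the original string is left unchanged.
    return line.replace(r"\xa0", " ")


raw = r"no more hiding\xa0under\xa0the pretense"
cleaned = strip_escaped_nbsp(raw)
print(cleaned)  # no more hiding under the pretense
```

Inside the reading loop this would be used as line = strip_escaped_nbsp(line).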


Note that the data in the CSV file is full of inconsistencies. Examples:

  • \n probably means a linebreak.
  • \\n also appears, and it probably means a linebreak also.
  • \xa0 is a nonbreaking space, encoded in ISO-8859-1.
  • \xc2\xa0 is a nonbreaking space, encoded in UTF-8.
  • \\xc2\\xa0 also appears, with the same meaning.
  • \\\\n also appears.

So to get meaningful content out of that file, you should repeatedly interpret the escape sequences until nothing changes. After that, try to interpret the resulting byte sequence as UTF-8. If it works, fine. If not, interpret it as Codepage 1252 (which agrees with ISO-8859-1 everywhere except the range 0x80 to 0x9F, where it defines printable characters instead of control codes).
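The procedure described above can be sketched roughly as follows. This is one possible reading of it, not a definitive implementation: the names `unescape_once` and `decode_line` are made up, the input is assumed to be pure ASCII text, and only the escapes \xHH, \n, \r, \t and \\ are handled.

```python
import re

# Matches a backslash followed by xHH, n, r, t or another backslash.
_ESCAPE = re.compile(rb"\\(x[0-9a-fA-F]{2}|[nrt\\])")
_SIMPLE = {b"n": b"\n", b"r": b"\r", b"t": b"\t", b"\\": b"\\"}


def unescape_once(data):
    """Interpret one level of escape sequences in a bytes object."""
    def repl(match):
        token = match.group(1)
        if token[:1] == b"x":
            # \xHH becomes the single byte with that hex value.
            return bytes([int(token[1:], 16)])
        return _SIMPLE[token]
    return _ESCAPE.sub(repl, data)


def decode_line(text):
    """Unescape repeatedly, then decode as UTF-8 with a cp1252 fallback."""
    data = text.encode("ascii")  # assumption: the line is ASCII-only
    while True:
        unescaped = unescape_once(data)
        if unescaped == data:  # nothing changed, so we are done
            break
        data = unescaped
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # A few bytes are undefined in cp1252, hence errors="replace".
        return data.decode("cp1252", errors="replace")


print(repr(decode_line(r"hiding\xc2\xa0under")))  # 'hiding\xa0under'
print(repr(decode_line(r"hiding\xa0under")))      # 'hiding\xa0under'
print(repr(decode_line(r"line one\\\\nline two")))  # 'line one\nline two'
```

Both spellings end up as a real non-breaking space (U+00A0), which can then be turned into an ordinary space with decode_line(line).replace(u"\xa0", " "). Be aware that unescaping repeatedly will also rewrite any backslash that was meant literally, so this only makes sense for data as inconsistent as this file.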

Roland Illig answered Nov 14 '22 22:11
