
Split function adds \xef\xbb\xbf and \n to my list

Tags: python, split

I want to open my file.txt and split all data from this file.

Here is my file.txt:

some_data1 some_data2 some_data3 some_data4 some_data5 

and here is my python code:

    >>> file_txt = open("file.txt", 'r')
    >>> data = file_txt.read()
    >>> data_list = data.split(' ')
    >>> print data
    some_data1 some_data2 some_data3 some_data4 some_data5
    >>> print data_list
    ['\xef\xbb\xbfsome_data1', 'some_data2', 'some_data3', 'some_data4', 'some_data5\n']

As you can see, when I print data_list, the list contains \xef\xbb\xbf at the start and \n at the end. What are these, and how can I remove them from my list?

Thanks.

Michael asked Sep 06 '13 18:09


People also ask

What is b'\xef\xbb\xbf'?

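The \xef\xbb\xbf is the byte order mark (BOM) for UTF-8. Each \x is an escape sequence indicating that the next two characters are hex digits giving one byte value. The \n is a newline character. To remove the newline, use rstrip(). It returns a new string rather than changing data in place, so assign the result: data = data.rstrip(), then data_list = data.split(' '). Note that rstrip() does not remove the BOM at the start.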
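To make the escape sequences concrete, here is a small sketch that checks what those three bytes are. It only uses the standard codecs module; the variable names are made up for illustration.

```python
import codecs

# Three bytes written with \x escapes: 0xEF, 0xBB, 0xBF.
bom = b"\xef\xbb\xbf"

# The standard library exposes the same byte sequence as a constant.
is_utf8_bom = (bom == codecs.BOM_UTF8)

# Decoded as UTF-8, the three bytes form the single code point U+FEFF.
bom_char = bom.decode("utf-8")
```

So the three "extra characters" in the list are really one invisible character at the very start of the file.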

What is utf8 with BOM?

The UTF-8 file signature (commonly also called a "BOM") identifies the encoding format rather than the byte order of the document. UTF-8 is a linear sequence of bytes, not a sequence of 2-byte or 4-byte units where byte order matters. Encoded in UTF-8, the BOM is the byte sequence EF BB BF.

What is SIG utf8?

"sig" in "utf-8-sig" is short for "signature" (i.e. a UTF-8 file with a signature). Reading a file with utf-8-sig treats a leading BOM as file metadata and skips it instead of returning it as part of the text.
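The difference between the two codecs is easy to see on a byte string. This is a minimal sketch with a made-up sample value standing in for the file contents.

```python
# Bytes as they would sit on disk (sample data, not the real file).
raw = b"\xef\xbb\xbfsome_data1 some_data2\n"

# Plain utf-8 keeps the BOM as a leading U+FEFF character.
with_bom = raw.decode("utf-8")

# utf-8-sig drops a leading BOM if there is one.
without_bom = raw.decode("utf-8-sig")

# Input without a BOM decodes the same way under utf-8-sig.
plain = b"some_data1".decode("utf-8-sig")
```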


1 Answer

Your file contains a UTF-8 BOM at the beginning.

To get rid of it, first decode your file contents to unicode.

    fp = open("file.txt")
    data = fp.read().decode("utf-8-sig").encode("utf-8")

It is better, though, not to encode it back to UTF-8 and to work with the Unicode text instead. A good rule: decode all input text to Unicode as early as possible, work only with Unicode, and encode output to the required encoding as late as possible. This will save you many headaches.
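As a rough sketch of that rule (the sample bytes and the comma-joined output are invented for illustration), decoding happens once on the way in and encoding once on the way out:

```python
# Hypothetical input bytes, e.g. what fp.read() returns in binary mode.
raw_in = b"\xef\xbb\xbfsome_data1 some_data2\n"

# Decode once, at the input boundary; utf-8-sig also drops the BOM.
text = raw_in.decode("utf-8-sig")

# All processing happens on Unicode text.
words = text.split()
result = u",".join(words)

# Encode once, at the output boundary.
raw_out = result.encode("utf-8")
```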

To read bigger files in a certain encoding, use io.open or codecs.open.
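For example, io.open accepts an encoding argument and decodes while reading, so the file never has to be loaded and decoded in one piece. The sketch below first writes a temporary sample file (a stand-in for file.txt) so that it can run on its own.

```python
import io
import os
import tempfile

# Create a sample file that starts with a UTF-8 BOM (stand-in for file.txt).
path = os.path.join(tempfile.mkdtemp(), "file.txt")
with io.open(path, "wb") as fp:
    fp.write(b"\xef\xbb\xbfsome_data1 some_data2 some_data3\n")

# io.open decodes while reading; utf-8-sig drops the leading BOM.
with io.open(path, encoding="utf-8-sig") as fp:
    lines = [line for line in fp]  # iterating yields one decoded line at a time
```

This works the same way on Python 2 and Python 3, where the built-in open is io.open.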

Also check this.

Use str.strip() or str.rstrip() to get rid of the newline character \n.
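Applied to the data from the question (assumed here to be already decoded, so the BOM is gone), that could look like this:

```python
data = u"some_data1 some_data2 some_data3 some_data4 some_data5 \n"

# strip() removes leading and trailing whitespace, including the newline.
data_list = data.strip().split(" ")

# split() with no argument splits on any run of whitespace,
# so it ignores the trailing space and newline by itself.
same_list = data.split()
```

Both variants give a list of the five words with no trailing \n.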

warvariuc answered Oct 04 '22 17:10