I want to open my file.txt and split all the data from this file. Here is my file.txt:
some_data1 some_data2 some_data3 some_data4 some_data5
and here is my Python code:
>>> file_txt = open("file.txt", 'r')
>>> data = file_txt.read()
>>> data_list = data.split(' ')
>>> print data
some_data1 some_data2 some_data3 some_data4 some_data5
>>> print data_list
['\xef\xbb\xbfsome_data1', 'some_data2', 'some_data3', 'some_data4', 'some_data5\n']
As you can see, when I print my data_list, it adds this to my list: \xef\xbb\xbf and this: \n. What are these, and how can I clean my list of them? Thanks.
The \xef\xbb\xbf is a Byte Order Mark (BOM) for UTF-8 - the \x is an escape sequence indicating that the next two characters are a hex value representing the character code. The \n is a newline character. To remove the newline, you can use rstrip() (it returns a new string, so reassign the result):

data = data.rstrip()
data_list = data.split(' ')
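Note that rstrip() only removes the trailing newline; the BOM at the start of the first item stays. A minimal sketch (assuming Python 2 and the file.txt from the question) that removes both:

file_txt = open("file.txt", 'r')
data = file_txt.read()
data = data.rstrip()                 # drop the trailing '\n'
data = data.decode("utf-8-sig")      # drop the BOM; the result is unicode
data_list = data.split(' ')
print data_list                      # [u'some_data1', u'some_data2', u'some_data3', u'some_data4', u'some_data5']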
The UTF-8 file signature (commonly also called a "BOM") identifies the encoding format rather than the byte order of the document. UTF-8 is a linear sequence of bytes, not a sequence of 2-byte or 4-byte units where the byte order is important.

Encoding    Encoded BOM
UTF-8       EF BB BF
"sig" in "utf-8-sig" is the abbreviation of "signature" (i.e. signature utf-8 file). Using utf-8-sig to read a file will treat BOM as file info. instead of a string.
Your file contains a UTF-8 BOM at the beginning. To get rid of it, first decode your file contents to unicode:
fp = open("file.txt") data = fp.read().decode("utf-8-sig").encode("utf-8")
But better, don't encode it back to utf-8; work with unicode text instead. There is a good rule: decode all your input text data to unicode as soon as possible, and work only with unicode; encode the output data to the required encoding as late as possible. This will save you from many headaches.
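A sketch of that rule (assuming Python 2; the output filename is made up for illustration):

data = open("file.txt").read().decode("utf-8-sig")  # decode input as early as possible
words = data.rstrip().split(u' ')                   # work only with unicode in between
result = u' '.join(words)
out = open("output.txt", "w")                       # hypothetical output file
out.write(result.encode("utf-8"))                   # encode as late as possible
out.close()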
To read bigger files in a certain encoding, use io.open or codecs.open.
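For instance, a sketch using io.open (available since Python 2.6), which decodes incrementally instead of loading the whole file into memory:

import io
with io.open("file.txt", encoding="utf-8-sig") as fp:
    for line in fp:            # each line is already unicode, with the BOM stripped
        print line.rstrip()

codecs.open("file.txt", encoding="utf-8-sig") can be used the same way.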
Use str.strip() or str.rstrip() to get rid of the newline character \n.
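For example, on the last item of the list from the question:

last = 'some_data5\n'
print repr(last.rstrip())     # 'some_data5'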