 

Split \xef\xbb\xbf in a list read from a file [duplicate]

I am trying to read a large data file, file.txt, and strip out all the commas, periods, and other punctuation, so I read the file with this Python code:

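import re
from nltk.corpus import stopwords  # assumed: stopwords is NLTK's stopword corpus

file = open("file.txt", "r")
importantWords = []
for i in file.readlines():
    line = i[:-1].split(" ")  # drop the trailing newline, then split on spaces
    for word in line:
        # remove punctuation; the hyphen goes last so it is a literal, not a range
        word = re.sub(r'[!@#$%^&*/,.;:-]', '', word)
        word.lower()  # note: the return value is discarded, so word is not lowercased
        if word not in stopwords.words('spanish'):
            importantWords.append(word)
print importantWords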

It prints ['\xef\xbb\xbfdataText1', 'dataText2', ..., 'dataTextn'].

How can I remove the leading \xef\xbb\xbf? I'm using Python 2.7.

Bakke Medina asked Mar 13 '23 21:03


1 Answer

Those three bytes are the UTF-8 encoded BOM (byte order mark):

>>> import codecs
>>> codecs.BOM_UTF8
'\xef\xbb\xbf'

You can use codecs.open with encoding='utf-8-sig' to skip the BOM sequence:

with codecs.open("file.txt", "r", encoding="utf-8-sig") as f:
    for line in f:
        ...

SIDENOTE: Instead of calling file.readlines, just iterate over the file object. file.readlines builds an unnecessary temporary list of every line when all you want is to loop over them.
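To check the behaviour end to end, here is a minimal sketch that writes a small file starting with the BOM and reads it back with the utf-8-sig codec. The helper name read_words and the temporary demo file are hypothetical, made up for illustration. It uses io.open rather than codecs.open only so that the same snippet runs on both Python 2.7 and Python 3; the encoding argument is the same.

```python
import codecs
import io
import os
import tempfile


def read_words(path):
    """Return the whitespace-separated words of a UTF-8 file, skipping a leading BOM."""
    words = []
    # "utf-8-sig" decodes UTF-8 and drops the BOM if one is present
    with io.open(path, "r", encoding="utf-8-sig") as f:
        for line in f:  # iterate over the file directly, no readlines()
            words.extend(line.split())
    return words


# Demo: create a temporary file whose first bytes are the UTF-8 BOM
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(codecs.BOM_UTF8 + b"dataText1 dataText2\n")

demo_words = read_words(demo_path)
os.remove(demo_path)
print(demo_words)
```

Note that on Python 2.7 the words come back as unicode objects rather than byte strings, because the file is decoded while it is read.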

falsetru answered Mar 23 '23 10:03