I am trying to read a big file, file.txt, and strip out all the commas, periods, and other punctuation. I read the file with this Python code:
import re
from nltk.corpus import stopwords

file = open("file.txt", "r")
importantWords = []
for i in file.readlines():
    line = i[:-1].split(" ")            # drop the trailing newline and split on spaces
    for word in line:
        word = re.sub('[\!@#$%^&*-/,.;:]', '', word)   # strip punctuation characters
        word.lower()
        if word not in stopwords.words('spanish'):
            importantWords.append(word)
print importantWords
and it printed ['\xef\xbb\xbfdataText1', 'dataText2' .. 'dataTextn']. How can I get rid of that \xef\xbb\xbf? I'm using Python 2.7.
That's the UTF-8 encoded BOM (byte order mark):
>>> import codecs
>>> codecs.BOM_UTF8
'\xef\xbb\xbf'
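If you have already read the raw bytes, one option is to strip that prefix by hand before decoding. A minimal sketch (not the only way to do it):

import codecs

with open("file.txt", "rb") as f:
    data = f.read()

# Drop the leading BOM bytes if the file starts with them.
if data.startswith(codecs.BOM_UTF8):
    data = data[len(codecs.BOM_UTF8):]
text = data.decode("utf-8")

The cleaner fix, though, is to let the decoder handle it for you.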
You can use codecs.open with encoding='utf-8-sig' to skip the BOM sequence:
with codecs.open("file.txt", "r", encoding="utf-8-sig") as f:
    for line in f:
        ...
SIDENOTE: Instead of using file.readlines, just iterate over the file object. file.readlines creates an unnecessary temporary list of all the lines when all you want to do is iterate over them.
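Putting the two suggestions together, the loop from the question might look roughly like this. It is only a sketch: it keeps the question's NLTK Spanish stopword list and regular expression, builds the stopword set once, and assigns the result of lower() back to the variable (lower() returns a new string rather than modifying it in place):

import codecs
import re
from nltk.corpus import stopwords

spanish_stopwords = set(stopwords.words('spanish'))   # build the set once, not per word
importantWords = []

with codecs.open("file.txt", "r", encoding="utf-8-sig") as f:   # BOM is skipped on read
    for line in f:                                    # iterate directly, no readlines()
        for word in line.strip().split(" "):
            word = re.sub(u'[!@#$%^&*-/,.;:]', u'', word).lower()   # strip punctuation, lowercase
            if word and word not in spanish_stopwords:
                importantWords.append(word)

print importantWords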