I am trying to extract Indonesian titles from a wiki titles dump stored in a text file, using word2vec-gensim in Python 3. The wiki dump also contains titles in other languages and some symbols. Below is my code:
if len(sys.argv) != 3:
    namaFileInput = "idwiki-latest-pages-articles.xml.bz2"
    namaFileOutput = "wiki.id.case.text"
    sys.exit(1)
inp, outp = sys.argv[1:3]
space = " "
i = 0
output = open(namaFileOutput, 'w')
# lower=False: keep upper and lower case distinct
wiki = WikiCorpus(namaFileInput, lemmatize=False, dictionary={}, lower=False)
for text in wiki.get_texts():
    if six.PY3:
        output.write(b' '.join(text).encode('utf-8') + '\n')
    else:
        output.write(space.join(text) + "\n")
    i = i + 1
    if i % 10000 == 0:
        logger.info("Saved " + str(i) + " articles")
output.close()
logger.info("Finished Saved " + str(i) + " articles")
But I am getting an error:
TypeError Traceback (most recent call last)
<ipython-input-17-d4c686a9093a> in <module>
     29 for text in wiki.get_texts():
     30     if six.PY3:
---> 31         output.write(b' '.join(text).encode('utf-8') + '\n')
     32     else:
     33         output.write(space.join(text) + "\n")
TypeError: sequence item 0: expected a bytes-like object, str found
I have searched online but could not find a solution. Any help will be appreciated.
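The problem is in this expression: b' '.join(text)
The variable text is a list of strings, but you are trying to join its items with a bytes separator (namely b' '), which doesn't work: bytes.join() accepts only bytes-like items. You can solve this by removing the b prefix so that the separator is a str too.
Note that in Python 3 the .encode('utf-8') call should be dropped as well. The file is opened in text mode ('w') and '\n' is a str, so encoding first would raise another TypeError when the resulting bytes are concatenated with '\n'. Open the file with an explicit encoding and write the string directly:
output = open(namaFileOutput, 'w', encoding='utf-8')
output.write(' '.join(text) + '\n')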
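To see the failure and the fix without downloading a wiki dump, here is a minimal sketch. It assumes each article arrives as a list of str tokens, which is what the traceback indicates; the token lists, the helper names join_with_bytes and write_articles, and the output file name are made up for illustration and are not part of gensim.

```python
import os
import tempfile

# Hypothetical stand-in for one article yielded by wiki.get_texts():
# a list of str tokens (an assumption based on the traceback).
tokens = ["Bahasa", "Indonesia", "adalah", "bahasa", "resmi"]


def join_with_bytes(words):
    """Reproduce the failure: a bytes separator cannot join str items."""
    try:
        return b" ".join(words)
    except TypeError as exc:
        return "TypeError: " + str(exc)


def write_articles(path, articles):
    """Write one space-joined article per line to a UTF-8 text file."""
    count = 0
    # Text mode with an explicit encoding: write() takes str, not bytes.
    with open(path, "w", encoding="utf-8") as out:
        for words in articles:
            out.write(" ".join(words) + "\n")
            count += 1
    return count


error_message = join_with_bytes(tokens)

out_path = os.path.join(tempfile.mkdtemp(), "wiki.sample.text")
written = write_articles(out_path, [tokens, ["Jakarta", "ibu", "kota"]])

# Read the file back to confirm one article per line.
with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()

print(error_message)
print(written, lines)
```

Using a with block also closes the file automatically, so the explicit output.close() is not needed in this variant.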