
Python NLTK download gives parser error

Tags:

python

nltk

I am trying to run the following code:

import nltk
nltk.download('all')

But I am getting this error:

Traceback (most recent call last):
  File "./update.py", line 3, in <module>
    nltk.download('all')
  File "/usr/lib/python3.6/site-packages/nltk/downloader.py", line 664, in download
    for msg in self.incr_download(info_or_id, download_dir, force):
  File "/usr/lib/python3.6/site-packages/nltk/downloader.py", line 534, in incr_download
    try: info = self._info_or_id(info_or_id)
  File "/usr/lib/python3.6/site-packages/nltk/downloader.py", line 508, in _info_or_id
    return self.info(info_or_id)
  File "/usr/lib/python3.6/site-packages/nltk/downloader.py", line 875, in info
    self._update_index()
  File "/usr/lib/python3.6/site-packages/nltk/downloader.py", line 825, in _update_index
    ElementTree.parse(compat.urlopen(self._url)).getroot())
  File "/usr/lib/python3.6/xml/etree/ElementTree.py", line 1196, in parse
    tree.parse(source, parser)
  File "/usr/lib/python3.6/xml/etree/ElementTree.py", line 597, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 23, column 143

I am new to Python, so I am not sure what I should do. I looked into the source module reported above and noticed that it is trying to download an XML file. So I ran the command below, and it did not give me any error.

compat.urlopen('https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml')

So I presume the issue is not in the download but in the parser. Can someone suggest how I should proceed from here?

user3602300 asked Apr 14 '17 13:04




2 Answers

index.xml had a typo. It is already patched. Just checked and nltk.download('all') works fine!

see: nltk/nltk_data#70

Skod answered Oct 01 '22 20:10


The problem is with the XML that NLTK has returned.

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 23, column 143

At 23:143 we see the problem, a missing '=':

... unzip="1" unzipped_size"1917" url="https...

NLTK will surely fix this soon; until then, I'm not sure what the best response is.
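To find the offending spot yourself, something along these lines may help. It is only a minimal sketch: `locate_parse_error` and its `context` parameter are names made up for illustration, not part of NLTK. It relies on the `position` attribute that `xml.etree.ElementTree.ParseError` carries, and it assumes you already have the XML as a string.

```python
import xml.etree.ElementTree as ET


def locate_parse_error(xml_text, context=30):
    """Return (line, column, snippet) for the first XML parse error.

    Returns None if the text parses cleanly. This is a hypothetical
    helper for illustration, not an NLTK function.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple; the line is 1-based.
        line, column = err.position
        lines = xml_text.splitlines()
        bad_line = lines[line - 1] if 0 < line <= len(lines) else ""
        # Show some text on either side of the reported column.
        start = max(column - context, 0)
        return line, column, bad_line[start:column + context]
    return None


# A small made-up document with the same kind of mistake (missing '=').
sample = '<root>\n  <package unzip="1" unzipped_size"1917" url="x"/>\n</root>'
result = locate_parse_error(sample)
if result is not None:
    print("line %d, column %d: %s" % result)
```

To check the real index, you could fetch its text first, for example with `urllib.request.urlopen(url).read().decode('utf-8')`, and pass that string to the helper.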

dbug12 answered Oct 01 '22 21:10