 

Genia Tagger file not found error in Anaconda/NLTK

I need to perform text pre-processing tasks such as sentence splitting, tokenization and tagging using NLTK. I want to use the GENIA tagger for tagging. I am using Anaconda version 3.10 and installed geniatagger with the following command.

python setup.py install

In the IPython console, I entered the following code.

import geniatagger
tagger =geniatagger.GeniaTagger('C:\Users\dell\Anaconda\geniatagger\geniatagger')
print tagger.parse('Welcome to natural language processing!')

The following error message appears when I press Enter.

---------------------------------------------------------------------------
WindowsError                              Traceback (most recent call last)
<ipython-input-2-52e4d65c2d02> in <module>()
----> 1 tagger = geniatagger.GeniaTagger('C:\Users\dell\Anaconda\geniatagger\geniatagger')
  2 print tagger.parse('Welcome to natural language processing!')
  3 

 C:\Users\dell\Anaconda\lib\site-packages\geniatagger_python-0.1-py2.7.egg\geniatagger.pyc in __init__(self, path_to_tagger)
 19         self._tagger = subprocess.Popen('./'+os.path.basename(path_to_tagger),
 20                                         cwd=self._dir_to_tagger,
 ---> 21                                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
 22 
 23     def parse(self, text):

 C:\Users\dell\Anaconda\lib\subprocess.pyc in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags)
708                                 p2cread, p2cwrite,
709                                 c2pread, c2pwrite,
--> 710                                 errread, errwrite)
711         except Exception:
712             # Preserve original exception in case os.close raises.

C:\Users\dell\Anaconda\lib\subprocess.pyc in _execute_child(self, args, executable, preexec_fn, close_fds, cwd, env, universal_newlines, startupinfo, creationflags, shell, to_close, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite)
956                                          env,
957                                          cwd,
--> 958                                          startupinfo)
959             except pywintypes.error, e:
960                 # Translate pywintypes.error to WindowsError, which is

WindowsError: [Error 2] The system cannot find the file specified

Why do I get this error message? How can I fix this?

If I use this tagging straight away, will it perform the tokenization part as well?

Note: geniatagger python file is inside the 'geniatagger' folder.

Dakshila Kamalsooriya asked Nov 27 '25 20:11



1 Answer

TL;DR:

# Build Genia Tagger (C++ code).
$ git clone https://github.com/saffsd/geniatagger && cd geniatagger && make && cd ..
# Install Genia Tagger (python wrapper)
$ git clone https://github.com/informationsea/geniatagger-python.git && cd geniatagger-python && sudo python setup.py install && cd ..
$ python
>>> from geniatagger import GeniaTagger
>>> tagger = GeniaTagger('./geniatagger/geniatagger')
>>> loading morphdic...done.
loading pos_models................done.
loading chunk_models....done.
loading named_entity_models..done.

>>> print tagger.parse('This is a pen.')
[('This', 'This', 'DT', 'B-NP', 'O'), ('is', 'be', 'VBZ', 'B-VP', 'O'), ('a', 'a', 'DT', 'B-NP', 'O'), ('pen', 'pen', 'NN', 'I-NP', 'O'), ('.', '.', '.', 'O', 'O')]

I'm not sure whether the Genia tagger packages work out of the box from conda, so I think a native python/pip fix is simpler.

Firstly, there's no support for Genia Tagger in NLTK (at least not yet =) ), so it isn't a problem with the NLTK installation or its modules.

The problem might lie in some outdated includes that the original GeniaTagger C++ code uses (http://www.nactem.ac.uk/tsujii/GENIA/tagger/).

So to resolve the problem, you have to add #include <cstdlib> to the original code. Thankfully, @saffsd has already done so and put it in his GitHub repo (https://github.com/saffsd/geniatagger/blob/master/morph.cpp).

Then comes installing the Python wrapper. You can either:

  • install from the official pypi with: pip install https://pypi.python.org/packages/source/g/geniatagger-python/geniatagger-python-0.1.tar.gz

  • or install from a GitHub repo, e.g. https://github.com/informationsea/geniatagger-python, which appears first in a Google search

Lastly, the GeniaTagger initialization in Python is rather odd: it takes the path to the tagger executable itself, not the path to the tagger's directory, and it assumes that the model files are in the same directory as the executable; see https://github.com/informationsea/geniatagger-python/blob/master/geniatagger.py#L19 .

And possibly it expects './' at the start of the path, so you would have to initialize the tagger like this: GeniaTagger('./geniatagger/geniatagger').
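The launch behaviour described above can be checked before constructing the tagger. The sketch below is not the wrapper's code: it mirrors the Popen call visible in the question's traceback (a command of './' plus the file name) and assumes that the working directory is simply the directory part of the path. plan_tagger_launch is a hypothetical helper name.

```python
import os


def plan_tagger_launch(path_to_tagger):
    """Return (cwd, command, exists) for a tagger path.

    Hypothetical helper, not part of geniatagger-python. The command
    mirrors the Popen call shown in the traceback; deriving cwd with
    os.path.dirname is an assumption about the wrapper.
    """
    cwd = os.path.dirname(path_to_tagger)
    command = './' + os.path.basename(path_to_tagger)
    exists = os.path.isfile(path_to_tagger)
    return cwd, command, exists


if __name__ == '__main__':
    cwd, command, exists = plan_tagger_launch('./geniatagger/geniatagger')
    print('working directory: %s' % cwd)
    print('command: %s' % command)
    print('binary present: %s' % exists)
```

If the last line reports False, the path does not point at a compiled tagger binary, and the constructor is likely to fail with a file-not-found error like the one in the question.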


Beyond the installation issues: if you use the Python wrapper for GeniaTagger, the GeniaTagger object has only one method, parse(). It takes one sentence string as input and returns a list of tuples, one per token. The items in each tuple are:

  • token (surface word)
  • lemma (see Stemmers vs Lemmatizers)
  • POS tag (looks like Penn Treebank tagset, see What are all possible pos tags of NLTK?)
  • Chunk tag, e.g. B-NP or B-VP (see Output results in conll format (POS-tagging, stanford pos tagger))
  • Named Entity chunk
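
As a minimal sketch of working with that output, the tuples can be given named fields. The sample data is the result printed for 'This is a pen.' in the TL;DR, so the snippet runs without the tagger installed; TaggedToken and its field names are illustrative choices, not part of the wrapper.

```python
from collections import namedtuple

# Illustrative field names for the five items in each tuple.
TaggedToken = namedtuple('TaggedToken',
                         ['word', 'lemma', 'pos', 'chunk', 'entity'])

# Output of tagger.parse('This is a pen.') as shown in the TL;DR.
parsed = [('This', 'This', 'DT', 'B-NP', 'O'),
          ('is', 'be', 'VBZ', 'B-VP', 'O'),
          ('a', 'a', 'DT', 'B-NP', 'O'),
          ('pen', 'pen', 'NN', 'I-NP', 'O'),
          ('.', '.', '.', 'O', 'O')]

tokens = [TaggedToken(*item) for item in parsed]

# The surface words are the first field of each tuple.
words = [t.word for t in tokens]

# Pair each word with its POS tag, the same shape nltk.pos_tag returns.
word_pos = [(t.word, t.pos) for t in tokens]
```

Note that the final '.' comes back as its own token, so the tagger tokenizes the sentence itself. Since parse() takes one sentence at a time, sentence splitting still has to happen before calling it.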
alvas answered Nov 29 '25 09:11



