I need to perform text pre-processing tasks such as sentence splitting, tokenization, and tagging using NLTK, and I want to use the GENIA tagger for the tagging step. I am using Anaconda version 3.10 and installed geniatagger with the following command:
python setup.py install
In the IPython console, I entered the following code:
import geniatagger
tagger = geniatagger.GeniaTagger('C:\Users\dell\Anaconda\geniatagger\geniatagger')
print tagger.parse('Welcome to natural language processing!')
The following error message appears when I press Enter:
---------------------------------------------------------------------------
WindowsError Traceback (most recent call last)
<ipython-input-2-52e4d65c2d02> in <module>()
----> 1 tagger = geniatagger.GeniaTagger('C:\Users\dell\Anaconda\geniatagger\geniatagger')
2 print tagger.parse('Welcome to natural language processing!')
3
C:\Users\dell\Anaconda\lib\site-packages\geniatagger_python-0.1-py2.7.egg\geniatagger.pyc in __init__(self, path_to_tagger)
19 self._tagger = subprocess.Popen('./'+os.path.basename(path_to_tagger),
20 cwd=self._dir_to_tagger,
---> 21 stdin=subprocess.PIPE, stdout=subprocess.PIPE)
22
23 def parse(self, text):
C:\Users\dell\Anaconda\lib\subprocess.pyc in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags)
708 p2cread, p2cwrite,
709 c2pread, c2pwrite,
--> 710 errread, errwrite)
711 except Exception:
712 # Preserve original exception in case os.close raises.
C:\Users\dell\Anaconda\lib\subprocess.pyc in _execute_child(self, args, executable, preexec_fn, close_fds, cwd, env, universal_newlines, startupinfo, creationflags, shell, to_close, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite)
956 env,
957 cwd,
--> 958 startupinfo)
959 except pywintypes.error, e:
960 # Translate pywintypes.error to WindowsError, which is
WindowsError: [Error 2] The system cannot find the file specified
Why do I get this error message? How can I fix this?
If I use this tagging straight away, will it perform the tokenization part as well?
Note: geniatagger python file is inside the 'geniatagger' folder.
TL;DR:
# Install Genia Tagger (C code).
$ git clone https://github.com/saffsd/geniatagger && cd geniatagger && make && cd ..
# Install Genia Tagger (python wrapper)
$ git clone https://github.com/informationsea/geniatagger-python.git && cd geniatagger-python && sudo python setup.py install && cd ..
$ python
>>> from geniatagger import GeniaTagger
>>> tagger = GeniaTagger('./geniatagger/geniatagger')
>>> loading morphdic...done.
loading pos_models................done.
loading chunk_models....done.
loading named_entity_models..done.
>>> print tagger.parse('This is a pen.')
[('This', 'This', 'DT', 'B-NP', 'O'), ('is', 'be', 'VBZ', 'B-VP', 'O'), ('a', 'a', 'DT', 'B-NP', 'O'), ('pen', 'pen', 'NN', 'I-NP', 'O'), ('.', '.', '.', 'O', 'O')]
I'm not sure whether the packages for the Genia tagger work out of the box from conda, so I think a native python/pip fix is simpler.
Firstly, there's no support for the Genia Tagger in NLTK (at least not yet =) ), so this isn't a problem with the NLTK installation or its modules.
The problem might lie in some outdated imports that the original GeniaTagger code uses (http://www.nactem.ac.uk/tsujii/GENIA/tagger/).
To resolve the problem, you have to add #include <cstdlib> to the original code, but thankfully @saffsd has already done so and published the patched source in his GitHub repo (https://github.com/saffsd/geniatagger/blob/master/morph.cpp).
Then comes installing the Python wrapper. You can either:
install from the official PyPI with: pip install https://pypi.python.org/packages/source/g/geniatagger-python/geniatagger-python-0.1.tar.gz
or install from a GitHub repo, e.g. https://github.com/informationsea/geniatagger-python, which appears first in a Google search.
Lastly, the GeniaTagger initialization in Python is rather unusual: it takes the path to the tagger executable itself, not to the directory that contains it, and it assumes that the model files are in the same directory as the executable; see https://github.com/informationsea/geniatagger-python/blob/master/geniatagger.py#L19 .
It also possibly expects a leading './' in the path, so you would have to initialize the tagger as GeniaTagger('./geniatagger/geniatagger').
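The launch logic can be read off the traceback in the question: the wrapper runs './' plus the basename of the path you give it, with the working directory set to the tagger's directory. Here is a minimal sketch of that path handling. The helper names launch_args and tagger_binary_exists are hypothetical, and deriving the directory with os.path.dirname is an assumption, since the traceback only shows the Popen call.

```python
import os


def launch_args(path_to_tagger):
    """Return (command, working_dir) roughly as the wrapper appears to build them."""
    # Assumption: the wrapper derives its working directory with dirname().
    working_dir = os.path.dirname(path_to_tagger)
    # From the traceback: the command is './' + basename of the given path.
    command = './' + os.path.basename(path_to_tagger)
    return command, working_dir


def tagger_binary_exists(path_to_tagger):
    """Check that the file the wrapper would try to execute is actually there."""
    return os.path.isfile(path_to_tagger)


command, working_dir = launch_args('./geniatagger/geniatagger')
print(command)      # ./geniatagger
print(working_dir)  # ./geniatagger
```

If tagger_binary_exists() returns False for the path you pass in, the Popen call would presumably fail with a "file not found" error like the one in the question, so it can serve as a quick sanity check before constructing the tagger.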
Beyond the installation issues: if you use the Python wrapper, the GeniaTagger object has only one method, parse(). It takes a single sentence string as input and returns a list of tuples, one per token. Going by the example output above, the items in each tuple are the word, its base form, the POS tag, the chunk tag, and the named-entity tag.
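To make the tuples easier to work with, you could wrap them in a namedtuple. This is only a sketch: the TaggedToken name and its field names are hypothetical labels, not part of the wrapper, and the meaning assigned to each position is inferred from the example output in the TL;DR session. The data below is that output pasted in as a literal, so the snippet runs without the tagger installed.

```python
from collections import namedtuple

# Hypothetical field names; the wrapper itself returns plain tuples.
TaggedToken = namedtuple('TaggedToken', ['word', 'base', 'pos', 'chunk', 'entity'])

# The parse() result for 'This is a pen.', copied from the TL;DR session.
parsed = [('This', 'This', 'DT', 'B-NP', 'O'), ('is', 'be', 'VBZ', 'B-VP', 'O'),
          ('a', 'a', 'DT', 'B-NP', 'O'), ('pen', 'pen', 'NN', 'I-NP', 'O'),
          ('.', '.', '.', 'O', 'O')]

tokens = [TaggedToken(*item) for item in parsed]

# Pull out just the tokens, or (word, POS tag) pairs.
words = [t.word for t in tokens]
pos_tags = [(t.word, t.pos) for t in tokens]
print(words)
print(pos_tags)
```

Note that the final period comes back as its own token, which suggests the tagger tokenizes the sentence itself, so you would only need to split your text into sentences beforehand.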