I have a .txt file of a Danish WordNet. Is there any way to use this with an NLP library for Python such as NLTK? If not, how might you go about natural language processing in a language that is not supported by a given library? Also, say you want to do named entity recognition in a language other than English or Dutch in a library like spaCy. Is there any way to do this?
Is there any way to use this with an NLP library for Python such as NLTK?
You can do this with NLTK, though it's a little awkward.
You'll need to convert your WordNet corpus to the Open Multilingual Wordnet format, which is a simple tab-delimited format. Note that they already have a Danish WordNet (DanNet).
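If I remember the format right, the data file is one lemma per line, keyed by Princeton WordNet offset-pos identifiers. The rows below are only illustrative; 02084071-n should be dog.n.01 in WordNet 3.0, and the Danish lemmas are my guesses, so check them against the real OMW files:

02084071-n	dan:lemma	hund
02084071-n	dan:lemma	vovse

Lines starting with # are skipped as comments, so you can keep a header line describing your corpus.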
Then you should install the WordNet and Open Multilingual Wordnet corpora in NLTK if you haven't done so already. This will create a directory like ~/nltk_data/corpora/omw/, with a subdirectory for each language file. You'll need to add your corpus by creating a directory for it and naming your file like this:
~/nltk_data/corpora/omw/xxx/wn-data-xxx.tab
The xxx can be anything, but it must be the same in both places. This filename pattern is hard-coded in NLTK here.
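To put the pieces together, here's a minimal setup sketch, assuming you name the directory dan and that your NLTK version still uses the omw package id (newer releases call it omw-1.4):

import shutil
from pathlib import Path

import nltk

# Fetch the English WordNet plus the Open Multilingual Wordnet data
nltk.download('wordnet')
nltk.download('omw')

# Drop your converted corpus where NLTK expects it
omw_dir = Path.home() / 'nltk_data' / 'corpora' / 'omw' / 'dan'
omw_dir.mkdir(parents=True, exist_ok=True)
shutil.copy('wn-data-dan.tab', omw_dir / 'wn-data-dan.tab')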
After that you can use your WordNet by specifying the xxx as the lang parameter. Here's an example from the documentation:
>>> wn.synset('dog.n.01').lemma_names('ita') # change 'ita' to 'xxx'
['cane', 'Canis_familiaris']
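If you named your directory dan, lookups should then work in both directions (hund is an assumed lemma here, not checked against the actual data):

>>> wn.synset('dog.n.01').lemma_names('dan')
>>> wn.synsets('hund', lang='dan')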
How might you go about natural language processing in a language that is not supported by a given library?
I've done this with Japanese frequently.
Some techniques look inside your tokens - that is, they check if a word is literally "say" or "be" or something. This is common with stemmers and lemmatizers for obvious reasons. Some systems use rules based on assumptions about how parts of speech interact in a given language (usually English). You might be able to translate these expectations to your language, but typically you just can't use these.
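To make that concrete: an English stemmer applied to Danish just strips the wrong suffixes, and such a tool is only usable if someone has written rules for your language. NLTK's Snowball stemmer happens to include a Danish rule set, but that's the exception rather than the rule:

from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

# English suffix rules make no sense for Danish morphology
PorterStemmer().stem('hundene')

# Snowball ships Danish rules, so this one is actually usable
SnowballStemmer('danish').stem('hundene')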
However, many useful techniques don't look inside your tokens at all - they just care whether two tokens are equal or not. These usually rely primarily on features like labels or collocation data. You might need to pre-tokenize your data, and you might want to train a generic language model on Wikipedia in the language, but that's it. Word vectors, NER, Document Similarity are example problems where lack of language support isn't usually an issue.
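For instance, training word vectors only requires pre-tokenized sentences, so it works for any language. A minimal sketch with gensim (note the parameter is size in gensim 3.x and vector_size in 4.x):

from gensim.models import Word2Vec

# Each sentence is just a list of tokens - the model never looks inside them
sentences = [
    ['hunden', 'løber', 'i', 'parken'],
    ['katten', 'sover', 'på', 'sofaen'],
]
model = Word2Vec(sentences, vector_size=50, min_count=1)
model.wv.most_similar('hunden')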
Also say you want to do named entity recognition in a language other than English or Dutch in a library like spaCy. Is there any way to do this?
spaCy provides a means of custom labelling for NER. Using it with an otherwise unsupported language is not documented and would be a bit tricky. However, since you don't need a full language model for NER, you can use an NER-specific tool with labelled examples.
Here's some example training data for CRF++ based on the CoNLL format:
He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
will MD B-VP
narrow VB I-VP
to TO B-PP
only RB B-NP
# # I-NP
1.8 CD I-NP
billion CD I-NP
in IN B-PP
September NNP B-NP
. . O
He PRP B-NP
reckons VBZ B-VP
..
This kind of format is supported by several CRF and other NER tools. CRFsuite is one with a Python wrapper.
For this kind of data, the algorithm doesn't really care what's in the first column, so language support isn't an issue.
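As a rough sketch of that workflow with python-crfsuite (the feature names and toy sequences here are made up for illustration):

import pycrfsuite

# One sequence per sentence; features are opaque strings built from the
# first two CoNLL columns, labels come from the last column. The tool
# never interprets the words, so the language is irrelevant.
xseq = [
    ['word=Han', 'postag=PRP'],
    ['word=regner', 'postag=VBZ'],
]
yseq = ['B-NP', 'B-VP']

trainer = pycrfsuite.Trainer(verbose=False)
trainer.append(xseq, yseq)
trainer.train('model.crfsuite')

tagger = pycrfsuite.Tagger()
tagger.open('model.crfsuite')
print(tagger.tag(xseq))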
Hope that helps!