Is it possible to add your own WordNet to a library?

I have a .txt file of a Danish WordNet. Is there any way to use this with an NLP library for Python, such as NLTK? If not, how might you go about natural language processing in a language that is not supported by a given library? Also, say you want to do named entity recognition in a language other than English or Dutch in a library like spaCy. Is there any way to do this?

asked Feb 23 '17 by sn3jd3r

People also ask

What is the use of WordNet?

WordNet is a lexical database of English, similar to a traditional thesaurus. NLTK includes the English WordNet, and we can use it as a reference for looking up the meanings of words, usage examples, and definitions.
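For example, in NLTK (output abridged):

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('dog')[0].definition()
'a member of the genus Canis (probably descended from the common wolf) that ...'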

What words are not allowed in WordNet?

WordNet only contains "open-class words": nouns, verbs, adjectives, and adverbs. Thus, excluded words include determiners, prepositions, pronouns, conjunctions, and particles.

Can I use WordNet from a C program?

WordNet provides a C API to use WordNet from a C program. The API documentation is available online and is distributed with the main WordNet packages. Interfaces for many other languages are available via the related projects page.

What is the morphological component of the WordNet library?

The morphological component of the WordNet library is unidirectional. Along with a set of irregular forms (e.g. children - child), it uses a sequence of simple rules, stripping common English endings until it finds a word form present in WordNet. Furthermore, it assumes its input is a valid inflected form.
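NLTK reimplements this component as wn.morphy(), which makes the behaviour easy to see:

>>> from nltk.corpus import wordnet as wn
>>> wn.morphy('children')  # irregular form, handled via the exception list
'child'
>>> wn.morphy('dogs')      # regular form, handled by suffix-stripping rules
'dog'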


1 Answer

Is there any way to use this with an NLP library for Python such as NLTK?

You can do this with NLTK, though it's a little awkward.

You'll need to convert your WordNet corpus to the Open Multilingual Wordnet (OMW) format, which is a simple tab-delimited format. Note that OMW already includes a Danish WordNet (DanNet).
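Each line of a wn-data file ties a Princeton WordNet synset ID (offset plus part of speech) to a lemma in your language. A sketch of what Danish entries might look like - the header fields, offsets, and words here are illustrative, not taken from DanNet:

# dan	<name>	<url>	<license>
02084071-n	dan:lemma	hund
02121620-n	dan:lemma	kat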

Then you should install the WordNet and Open Multilingual Wordnet corpora in NLTK if you haven't done so already. This will create a directory like ~/nltk_data/corpora/omw/, with a subdirectory for each language file. You'll need to add your corpus by creating a directory for it and naming your file like this:

~/nltk_data/corpora/omw/xxx/wn-data-xxx.tab

xxx can be anything, but it must be the same in both places. This filename pattern is hard-coded in NLTK's WordNet corpus reader.
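If you still need to fetch those corpora, the NLTK downloader handles it (note the OMW package name varies by NLTK version):

>>> import nltk
>>> nltk.download('wordnet')
>>> nltk.download('omw')   # called 'omw-1.4' in newer NLTK releases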

After that you can use your WordNet by specifying the xxx as a lang parameter. Here's an example from the documentation:

>>> from nltk.corpus import wordnet as wn
>>> wn.synset('dog.n.01').lemma_names('ita') # change 'ita' to your 'xxx'
['cane', 'Canis_familiaris']

How might you go about natural language processing in a language that is not supported by a given library?

I've done this with Japanese frequently.

Some techniques look inside your tokens - that is, they check if a word is literally "say" or "be" or something. This is common with stemmers and lemmatizers for obvious reasons. Some systems use rules based on assumptions about how parts of speech interact in a given language (usually English). You might be able to translate these expectations to your language, but typically you just can't use these.

However, many useful techniques don't look inside your tokens at all - they just care whether two tokens are equal or not. These usually rely primarily on features like labels or collocation data. You might need to pre-tokenize your data, and you might want to train a generic language model on Wikipedia in the language, but that's it. Word vectors, NER, and document similarity are examples of problems where lack of language support isn't usually an issue.
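As a sketch of that second kind of technique, training word vectors on pre-tokenized text with gensim might look like this (gensim and the toy sentences are my choice here, not something the problem requires):

from gensim.models import Word2Vec

# Toy pre-tokenized Danish sentences; in practice you'd feed in
# something like a tokenized Wikipedia dump.
sentences = [
    ["hunden", "leger", "i", "parken"],
    ["katten", "sover", "i", "solen"],
]

# The model never looks inside a token - each word is an opaque
# symbol - so the language itself doesn't matter.
model = Word2Vec(sentences, vector_size=50, min_count=1)  # 'size' in gensim < 4
print(model.wv.most_similar("hunden"))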

Also say you want to do named entity recognition in a language other than English or Dutch in a library like spaCy. Is there any way to do this?

spaCy provides a means of custom labelling for NER. Using it with an otherwise unsupported language is not documented and would be a bit tricky. However, since you don't need a full language model for NER, you can use an NER-specific tool with labelled examples.

Here's some example training data for CRF++ based on the CoNLL format:

He        PRP  B-NP
reckons   VBZ  B-VP
the       DT   B-NP
current   JJ   I-NP
account   NN   I-NP
deficit   NN   I-NP
will      MD   B-VP
narrow    VB   I-VP
to        TO   B-PP
only      RB   B-NP
#         #    I-NP
1.8       CD   I-NP
billion   CD   I-NP
in        IN   B-PP
September NNP  B-NP
.         .    O

He        PRP  B-NP
reckons   VBZ  B-VP
..

This kind of format is supported by several CRF or other NER tools. CRFSuite is one with a Python wrapper.

For this kind of data, the algorithm doesn't really care what's in the first column, so language support isn't an issue.
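Here's a minimal training sketch with python-crfsuite (the wrapper mentioned above); the feature template and the two-token example are illustrative, not prescribed by the tool:

import pycrfsuite

# One hypothetical labelled sentence as (token, POS, chunk-label) triples.
sent = [("He", "PRP", "B-NP"), ("reckons", "VBZ", "B-VP")]

# Features are plain strings; the token is just an opaque symbol,
# which is why the language doesn't matter.
def features(sent, i):
    word, pos = sent[i][0], sent[i][1]
    return ["word=" + word, "pos=" + pos]

xseq = [features(sent, i) for i in range(len(sent))]
yseq = [label for _, _, label in sent]

trainer = pycrfsuite.Trainer(verbose=False)
trainer.append(xseq, yseq)
trainer.train("model.crfsuite")

tagger = pycrfsuite.Tagger()
tagger.open("model.crfsuite")
print(tagger.tag(xseq))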

Hope that helps!

answered Sep 28 '22 by polm23