I have a .txt file of a Danish WordNet. Is there any way to use this with an NLP library for Python such as NLTK? If not, how might you go about natural language processing in a language that is not supported by a given library? Also, say you want to do named entity recognition in a language other than English or Dutch in a library like spaCy. Is there any way to do this?
Is there any way to use this with an NLP library for Python such as NLTK?
You can do this with NLTK, though it's a little awkward.
You'll need to convert your WordNet corpus to the Open Multilingual Wordnet format, which is a simple tab-delimited format. Note that they already have a Danish WordNet (DanNet).
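If I remember the format right, the data file is one lemma per line, keyed by Princeton WordNet offset-pos identifiers. The rows below are only illustrative; 02084071-n should be dog.n.01 in WordNet 3.0, and the Danish lemmas are my guesses, so check them against the real OMW files:

02084071-n	dan:lemma	hund
02084071-n	dan:lemma	vovse

Lines starting with # are skipped as comments, so you can keep a header line describing your corpus.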
Then you should install the WordNet and Open Multilingual Wordnet corpora in NLTK if you haven't done so already. This will create a directory like ~/nltk_data/corpora/omw/, with a subdirectory for each language file. You'll need to add your corpus by creating a directory for it and naming your file like this:
~/nltk_data/corpora/omw/xxx/wn-data-xxx.tab
The xxx can be anything, but it must be the same in both places. This filename pattern is hard-coded in NLTK here.
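To put the pieces together, here's a minimal setup sketch, assuming you name the directory dan and that your NLTK version still uses the omw package id (newer releases call it omw-1.4):

import shutil
from pathlib import Path

import nltk

# Fetch the English WordNet plus the Open Multilingual Wordnet data
nltk.download('wordnet')
nltk.download('omw')

# Drop your converted corpus where NLTK expects it
omw_dir = Path.home() / 'nltk_data' / 'corpora' / 'omw' / 'dan'
omw_dir.mkdir(parents=True, exist_ok=True)
shutil.copy('wn-data-dan.tab', omw_dir / 'wn-data-dan.tab')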
After that you can use your WordNet by specifying the xxx as the lang parameter. Here's an example from the documentation:
>>> wn.synset('dog.n.01').lemma_names('ita') # change 'ita' to 'xxx'
['cane', 'Canis_familiaris']
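If you named your directory dan, lookups should then work in both directions (hund is an assumed lemma here, not checked against the actual data):

>>> wn.synset('dog.n.01').lemma_names('dan')
>>> wn.synsets('hund', lang='dan')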
How might you go about natural language processing in a language that is not supported by a given library?
I've done this with Japanese frequently.
Some techniques look inside your tokens - that is, they check if a word is literally "say" or "be" or something. This is common with stemmers and lemmatizers for obvious reasons. Some systems use rules based on assumptions about how parts of speech interact in a given language (usually English). You might be able to translate these expectations to your language, but typically you just can't use these.
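To make that concrete: an English stemmer applied to Danish just strips the wrong suffixes, and such a tool is only usable if someone has written rules for your language. NLTK's Snowball stemmer happens to include a Danish rule set, but that's the exception rather than the rule:

from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

# English suffix rules make no sense for Danish morphology
PorterStemmer().stem('hundene')

# Snowball ships Danish rules, so this one is actually usable
SnowballStemmer('danish').stem('hundene')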
However, many useful techniques don't look inside your tokens at all - they just care whether two tokens are equal or not. These usually rely primarily on features like labels or collocation data. You might need to pre-tokenize your data, and you might want to train a generic language model on Wikipedia in the language, but that's it. Word vectors, NER, Document Similarity are example problems where lack of language support isn't usually an issue.
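For instance, training word vectors only requires pre-tokenized sentences, so it works for any language. A minimal sketch with gensim (note the parameter is size in gensim 3.x and vector_size in 4.x):

from gensim.models import Word2Vec

# Each sentence is just a list of tokens - the model never looks inside them
sentences = [
    ['hunden', 'løber', 'i', 'parken'],
    ['katten', 'sover', 'på', 'sofaen'],
]
model = Word2Vec(sentences, vector_size=50, min_count=1)
model.wv.most_similar('hunden')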
Also say you want to do named entity recognition in a language other than English or Dutch in a library like spaCy. Is there any way to do this?
spaCy provides a means of custom labelling for NER. Using it with an otherwise unsupported language is not documented and would be a bit tricky. However, since you don't need a full language model for NER, you can use an NER-specific tool with labelled examples.
Here's some example training data for CRF++ based on the CoNLL format:
He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
will MD B-VP
narrow VB I-VP
to TO B-PP
only RB B-NP
# # I-NP
1.8 CD I-NP
billion CD I-NP
in IN B-PP
September NNP B-NP
. . O
He PRP B-NP
reckons VBZ B-VP
..
This kind of format is supported by several CRF and other NER tools. CRFsuite is one with a Python wrapper.
For this kind of data, the algorithm doesn't really care what's in the first column, so language support isn't an issue.
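As a rough sketch of that workflow with python-crfsuite (the feature names and toy sequences here are made up for illustration):

import pycrfsuite

# One sequence per sentence; features are opaque strings built from the
# first two CoNLL columns, labels come from the last column. The tool
# never interprets the words, so the language is irrelevant.
xseq = [
    ['word=Han', 'postag=PRP'],
    ['word=regner', 'postag=VBZ'],
]
yseq = ['B-NP', 'B-VP']

trainer = pycrfsuite.Trainer(verbose=False)
trainer.append(xseq, yseq)
trainer.train('model.crfsuite')

tagger = pycrfsuite.Tagger()
tagger.open('model.crfsuite')
print(tagger.tag(xseq))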
Hope that helps!