I have 30,000+ French-language articles in a JSON file. I would like to perform some text analysis on both individual articles and on the set as a whole. Before I go further, I'm starting with simple goals:
The steps I've taken so far:
Imported the data into a Python list:
import json
json_articles = open('articlefile.json')
articlelist = json.load(json_articles)
Selected a single article to test, and concatenated the body text into a single string:
txt = ' '.join(articlelist[10000]['body'])
Loaded a French sentence tokenizer and split the string into a list of sentences:
import nltk
french_tokenizer = nltk.data.load('tokenizers/punkt/french.pickle')
sentences = french_tokenizer.tokenize(txt)
Attempted to split the sentences into words using the WhitespaceTokenizer:
from nltk.tokenize import WhitespaceTokenizer
wst = WhitespaceTokenizer()
tokens = [wst.tokenize(s) for s in sentences]
This is where I'm stuck, for the following reasons:
For English, I could tag and chunk the text like so:
tagged = [nltk.pos_tag(token) for token in tokens]
chunks = nltk.batch_ne_chunk(tagged)
My main options (in order of current preference) seem to be:
If I were to do (1), I imagine I would need to create my own tagged corpus. Is this correct, or would it be possible (and permitted) to use the French Treebank?
If the French Treebank corpus format (example here) is not suitable for use with nltk-trainer, is it feasible to convert it into such a format?
What approaches have French-speaking users of NLTK taken to PoS tag and chunk text?
There is also TreeTagger (which supports French) with a Python wrapper. This is the solution I am currently using, and it works quite well.
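If you go this route, the wiring is short. Here is a minimal sketch using the treetaggerwrapper package (one of several Python wrappers); it assumes TreeTagger itself is already installed and discoverable by the wrapper, and the sample sentence is just for illustration:

import treetaggerwrapper

# Assumes TreeTagger is installed; pass TAGDIR='/path/to/treetagger' if the
# wrapper cannot find it on its own.
tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr')
raw_tags = tagger.tag_text(u"L'économie française ralentit.")
tags = treetaggerwrapper.make_tags(raw_tags)  # (word, pos, lemma) tuples
print(tags)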
As of version 3.1.0 (January 2012), the Stanford POS tagger supports French.
It should be possible to use this French tagger in NLTK via Nitin Madnani's interface to the Stanford POS tagger.
I haven't tried this yet, but it sounds easier than the other approaches I've considered, and I should be able to control the entire pipeline from within a Python script. I'll comment on this post when I have an outcome to share.
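For anyone trying the same thing, the NLTK side might look roughly like this sketch. The class name matches the current nltk.tag.stanford interface (older NLTK releases called it POSTagger), and the jar/model paths are placeholders pointing at wherever you unpacked the Stanford tagger download:

from nltk.tag.stanford import StanfordPOSTagger

# Placeholder paths: point these at your Stanford tagger download
# (version 3.1.0+ ships a french.tagger model).
stanford_tagger = StanfordPOSTagger(
    '/path/to/stanford-postagger/models/french.tagger',
    path_to_jar='/path/to/stanford-postagger/stanford-postagger.jar',
    encoding='utf8')

print(stanford_tagger.tag([u'Le', u'chat', u'dort', u'.']))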
Here are some suggestions:
WhitespaceTokenizer is doing what it's meant to. If you want to split on apostrophes, try WordPunctTokenizer, check out the other available tokenizers, or roll your own with RegexpTokenizer or directly with the re module.
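To make the difference concrete, here is a small comparison on a French string with elisions; the regexp at the end is only one illustrative pattern for keeping the elided article attached to its apostrophe, not a recommendation:

from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer, RegexpTokenizer

s = u"L'article traite de l'économie française."

print(WhitespaceTokenizer().tokenize(s))
# L'article | traite | de | l'économie | française.

print(WordPunctTokenizer().tokenize(s))
# L | ' | article | traite | de | l | ' | économie | française | .

# One illustrative pattern that keeps the elided article with its apostrophe:
print(RegexpTokenizer(r"\w+'|\w+|\S").tokenize(s))
# L' | article | traite | de | l' | économie | française | .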
Make sure you've resolved text encoding issues (unicode or latin1), otherwise the tokenization will still go wrong.
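One way to catch that early is to decode explicitly when loading the file. A small sketch, assuming the JSON dump is UTF-8 (swap in latin-1 if that's what it actually is):

import codecs
import json

# Decode with an explicit encoding so problems surface here rather than as
# mangled tokens later in the pipeline.
with codecs.open('articlefile.json', encoding='utf-8') as f:
    articlelist = json.load(f)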
NLTK only comes with an English tagger, as you discovered. It sounds like using TreeTagger would be the least work, since it's (almost) ready to use.
Training your own is also a practical option. But you definitely shouldn't create your own training corpus! Use an existing tagged corpus of French. You'll get the best results if the genre of the training text matches your domain (articles). You can use nltk-trainer for this, or you can use the NLTK training features directly, as in the sketch below.
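A minimal sketch of the "NLTK directly" route, assuming you already have a reader that exposes French tagged_sents(); the 'NC' default tag is only a placeholder, so pick whatever fallback makes sense for the tagset you train on:

from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

def train_tagger(tagged_sents):
    # Simple backoff chain: bigram -> unigram -> default tag.
    t0 = DefaultTagger('NC')
    t1 = UnigramTagger(tagged_sents, backoff=t0)
    t2 = BigramTagger(tagged_sents, backoff=t1)
    return t2

# french_tagger = train_tagger(reader.tagged_sents())
# tagged = [french_tagger.tag(sentence) for sentence in tokens]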
You can use the French Treebank corpus for training, but I don't know if there's a reader that knows its exact format. If not, you can start with XMLCorpusReader and subclass it to provide a tagged_sents() method, along the lines of the sketch below.
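A rough sketch of what that subclass might look like. The SENT and w element names and the cat attribute are assumptions about the FTB XML layout; check the actual schema and adjust before relying on it:

from nltk.corpus.reader import XMLCorpusReader

class FrenchTreebankReader(XMLCorpusReader):
    """Hypothetical reader exposing tagged_sents() over FTB-style XML."""

    def tagged_sents(self, fileids=None):
        sents = []
        for fileid in (fileids or self.fileids()):
            root = self.xml(fileid)
            for sent in root.iter('SENT'):            # assumed sentence element
                sents.append([(w.text, w.get('cat'))  # assumed POS attribute
                              for w in sent.iter('w') if w.text])
        return sents

# reader = FrenchTreebankReader('/path/to/ftb', r'.*\.xml')
# print(reader.tagged_sents()[0])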
If you're not already on the nltk-users mailing list, I think you'll want to get on it.