How can I tag and chunk French text using NLTK and Python?

Tags: python, nlp, nltk

I have 30,000+ French-language articles in a JSON file. I would like to perform some text analysis on both individual articles and on the set as a whole. Before I go further, I'm starting with simple goals:

  • Identify important entities (people, places, concepts)
  • Find significant changes in the importance (~=frequency) of those entities over time (using the article sequence number as a proxy for time)

The steps I've taken so far:

  1. Imported the data into a Python list:

    import json
    json_articles = open('articlefile.json')
    articlelist = json.load(json_articles)
    
  2. Selected a single article to test, and concatenated the body text into a single string:

    txt = ' '.join(articlelist[10000]['body'])
    
  3. Loaded a French sentence tokenizer and split the string into a list of sentences:

    import nltk
    french_tokenizer = nltk.data.load('tokenizers/punkt/french.pickle')
    sentences = french_tokenizer.tokenize(txt)
    
  4. Attempted to split the sentences into words using the WhiteSpaceTokenizer:

    from nltk.tokenize import WhitespaceTokenizer
    wst = WhitespaceTokenizer()
    tokens = [wst.tokenize(s) for s in sentences]
    

This is where I'm stuck, for the following reasons:

  • NLTK doesn't have a built-in tokenizer which can split French into words. Whitespace doesn't work well, particularly because it won't correctly separate on apostrophes.
  • Even if I were to use regular expressions to split the text into individual words, there's no French PoS (part-of-speech) tagger that I can use to tag those words, and no way to chunk them into logical units of meaning.
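For reference, the elision problem can be handled with a plain regular expression before any tagging. A minimal sketch (this pattern is my own approximation, not an NLTK built-in; extend the accented-character class as your data requires):

```python
import re

# Treat a trailing apostrophe as the end of a token, so elided forms like
# "l'", "d'", "qu'" are split from the word that follows them.
LETTERS = r"a-zàâäçéèêëîïôöûùüÿœæ"
FRENCH_WORD = re.compile(
    r"[{0}]+'|[{0}]+|[0-9]+|[^\s\w]".format(LETTERS), re.IGNORECASE
)

def french_word_tokenize(sentence):
    return FRENCH_WORD.findall(sentence)

print(french_word_tokenize("Qu'est-ce que c'est ?"))
# ["Qu'", 'est', '-', 'ce', 'que', "c'", 'est', '?']
```

This keeps the elided form together with its apostrophe ("l'état" becomes "l'" and "état"), which is what taggers trained on French treebanks usually expect.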

For English, I could tag and chunk the text like so:

    tagged = [nltk.pos_tag(token) for token in tokens]
    chunks = nltk.batch_ne_chunk(tagged)

My main options (in order of current preference) seem to be:

  1. Use nltk-trainer to train my own tagger and chunker.
  2. Use the python wrapper for TreeTagger for just this part, as TreeTagger can already tag French, and someone has written a wrapper which calls the TreeTagger binary and parses the results.
  3. Use a different tool altogether.

If I were to do (1), I imagine I would need to create my own tagged corpus. Is this correct, or would it be possible (and permitted) to use the French Treebank?

If the French Treebank corpus format (example here) is not suitable for use with nltk-trainer, is it feasible to convert it into such a format?
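For what it's worth, if the Treebank's XML turns out to look roughly like <SENT> elements containing <w cat="..."> word elements (an assumption on my part; check the actual distribution's format), converting it to NLTK's [(word, tag), ...] convention is a short job with the standard library:

```python
import xml.etree.ElementTree as ET

def ftb_tagged_sents(xml_string):
    """Yield sentences as [(word, tag), ...] lists from FTB-style XML.

    Assumes <SENT> sentence elements containing <w cat="..."> word
    elements; verify against the real corpus before relying on this.
    """
    root = ET.fromstring(xml_string)
    for sent in root.iter('SENT'):
        yield [(w.text, w.get('cat')) for w in sent.iter('w')]

sample = "<TEXT><SENT><w cat='D'>le</w><w cat='N'>chat</w></SENT></TEXT>"
print(list(ftb_tagged_sents(sample)))
# [[('le', 'D'), ('chat', 'N')]]
```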

What approaches have French-speaking users of NLTK taken to PoS tag and chunk text?

asked Mar 12 '12 by Rahim


3 Answers

There is also TreeTagger (which supports French), with a Python wrapper. This is the solution I am currently using and it works quite well.
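TreeTagger prints one token per line as word, tag, and lemma separated by tabs, so converting its output into NLTK-style pairs is straightforward. A sketch (the example tags below are from the French parameter file's tagset as I remember it, so verify against your installation):

```python
def parse_treetagger(output):
    """Parse TreeTagger's word<TAB>tag<TAB>lemma lines into (word, tag) pairs."""
    tagged = []
    for line in output.splitlines():
        parts = line.split('\t')
        if len(parts) == 3:
            word, tag, lemma = parts
            tagged.append((word, tag))
    return tagged

# Example output for "Le chat dort." (exact tags depend on the parameter file)
sample = "Le\tDET:ART\tle\nchat\tNOM\tchat\ndort\tVER:pres\tdormir\n.\tSENT\t."
print(parse_treetagger(sample))
# [('Le', 'DET:ART'), ('chat', 'NOM'), ('dort', 'VER:pres'), ('.', 'SENT')]
```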

answered Sep 23 '22 by gaborous


As of version 3.1.0 (January 2012), the Stanford PoS tagger supports French.

It should be possible to use this French tagger in NLTK, via Nitin Madnani's interface to the Stanford POS tagger.

I haven't tried this yet, but it sounds easier than the other approaches I've considered, and I should be able to control the entire pipeline from within a Python script. I'll comment on this post when I have an outcome to share.

answered Sep 23 '22 by Rahim


Here are some suggestions:

  1. WhitespaceTokenizer is doing what it's meant to. If you want to split on apostrophes, try WordPunctTokenizer, check out the other available tokenizers, or roll your own with RegexpTokenizer or directly with the re module.

  2. Make sure you've resolved text encoding issues (unicode or latin1), otherwise the tokenization will still go wrong.

  3. NLTK only comes with an English tagger, as you discovered. It sounds like using TreeTagger would be the least work, since it's (almost) ready to use.

  4. Training your own is also a practical option. But you definitely shouldn't create your own training corpus! Use an existing tagged corpus of French. You'll get best results if the genre of the training text matches your domain (articles). Also, you can use nltk-trainer but you could also use the NLTK features directly.

  5. You can use the French Treebank corpus for training, but I don't know if there's a reader that knows its exact format. If not, you must start with XMLCorpusReader and subclass it to provide a tagged_sents() method.

  6. If you're not already on the nltk-users mailing list, I think you'll want to get on it.
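On suggestion 1: WordPunctTokenizer tokenizes on the pattern \w+|[^\w\s]+, which at least detaches the apostrophe from both halves of an elision. You can see the effect with the same pattern and the re module alone:

```python
import re

# The pattern behind WordPunctTokenizer: runs of word characters, or runs
# of punctuation (which is where apostrophes end up).
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

print(WORD_PUNCT.findall("C'est l'état de l'art."))
# ['C', "'", 'est', 'l', "'", 'état', 'de', 'l', "'", 'art', '.']
```

Note that this yields the apostrophe as its own token, whereas a tagger trained on French text may expect the elided form kept whole ("l'"), so a custom RegexpTokenizer pattern may still serve you better.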

answered Sep 22 '22 by alexis