I have 30,000+ French-language articles in a JSON file. I would like to perform some text analysis on both individual articles and on the set as a whole. Before I go further, I'm starting with simple goals:
The steps I've taken so far:
Imported the data into a Python list:
import json
json_articles = open('articlefile.json')
articlelist = json.load(json_articles)
Selected a single article to test, and concatenated the body text into a single string:
txt = ' '.join(articlelist[10000]['body'])
Loaded a French sentence tokenizer and split the string into a list of sentences:
import nltk
french_tokenizer = nltk.data.load('tokenizers/punkt/french.pickle')
sentences = french_tokenizer.tokenize(txt)
Attempted to split the sentences into words using the WhitespaceTokenizer:
from nltk.tokenize import WhitespaceTokenizer
wst = WhitespaceTokenizer()
tokens = [wst.tokenize(s) for s in sentences]
This is where I'm stuck, for the following reasons:
For English, I could tag and chunk the text like so:
tagged = [nltk.pos_tag(token) for token in tokens]
chunks = nltk.batch_ne_chunk(tagged)
My main options (in order of current preference) seem to be:
If I were to do (1), I imagine I would need to create my own tagged corpus. Is this correct, or would it be possible (and permitted) to use the French Treebank?
If the French Treebank corpus format (example here) is not suitable for use with nltk-trainer, is it feasible to convert it into such a format?
What approaches have French-speaking users of NLTK taken to PoS tag and chunk text?
There is also TreeTagger (which supports French) with a Python wrapper. This is the solution I am currently using, and it works quite well.
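If you go this route, the wiring is short. Here is a minimal sketch using the treetaggerwrapper package (one of several Python wrappers); it assumes TreeTagger itself is already installed and discoverable by the wrapper, and the sample sentence is just for illustration:

import treetaggerwrapper

# Assumes TreeTagger is installed; pass TAGDIR='/path/to/treetagger' if the
# wrapper cannot find it on its own.
tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr')
raw_tags = tagger.tag_text(u"L'économie française ralentit.")
tags = treetaggerwrapper.make_tags(raw_tags)  # (word, pos, lemma) tuples
print(tags)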
As of version 3.1.0 (January 2012), the Stanford POS tagger supports French.
It should be possible to use this French tagger in NLTK via Nitin Madnani's interface to the Stanford POS tagger.
I haven't tried this yet, but it sounds easier than the other approaches I've considered, and I should be able to control the entire pipeline from within a Python script. I'll comment on this post when I have an outcome to share.
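For anyone trying the same thing, the NLTK side might look roughly like this sketch. The class name matches the current nltk.tag.stanford interface (older NLTK releases called it POSTagger), and the jar/model paths are placeholders pointing at wherever you unpacked the Stanford tagger download:

from nltk.tag.stanford import StanfordPOSTagger

# Placeholder paths: point these at your Stanford tagger download
# (version 3.1.0+ ships a french.tagger model).
stanford_tagger = StanfordPOSTagger(
    '/path/to/stanford-postagger/models/french.tagger',
    path_to_jar='/path/to/stanford-postagger/stanford-postagger.jar',
    encoding='utf8')

print(stanford_tagger.tag([u'Le', u'chat', u'dort', u'.']))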
Here are some suggestions:
WhitespaceTokenizer is doing what it's meant to. If you want to split on apostrophes, try WordPunctTokenizer, check out the other available tokenizers, or roll your own with RegexpTokenizer or directly with the re module.
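To make the difference concrete, here is a small comparison on a French string with elisions; the regexp at the end is only one illustrative pattern for keeping the elided article attached to its apostrophe, not a recommendation:

from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer, RegexpTokenizer

s = u"L'article traite de l'économie française."

print(WhitespaceTokenizer().tokenize(s))
# L'article | traite | de | l'économie | française.

print(WordPunctTokenizer().tokenize(s))
# L | ' | article | traite | de | l | ' | économie | française | .

# One illustrative pattern that keeps the elided article with its apostrophe:
print(RegexpTokenizer(r"\w+'|\w+|\S").tokenize(s))
# L' | article | traite | de | l' | économie | française | .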
Make sure you've resolved text encoding issues (unicode or latin1), otherwise the tokenization will still go wrong.
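One way to catch that early is to decode explicitly when loading the file. A small sketch, assuming the JSON dump is UTF-8 (swap in latin-1 if that's what it actually is):

import codecs
import json

# Decode with an explicit encoding so problems surface here rather than as
# mangled tokens later in the pipeline.
with codecs.open('articlefile.json', encoding='utf-8') as f:
    articlelist = json.load(f)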
NLTK only comes with an English tagger, as you discovered. It sounds like using TreeTagger would be the least work, since it's (almost) ready to use.
Training your own is also a practical option. But you definitely shouldn't create your own training corpus! Use an existing tagged corpus of French. You'll get the best results if the genre of the training text matches your domain (articles). You can use nltk-trainer for this, or you can use the NLTK training features directly, as in the sketch below.
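A minimal sketch of the "NLTK directly" route, assuming you already have a reader that exposes French tagged_sents(); the 'NC' default tag is only a placeholder, so pick whatever fallback makes sense for the tagset you train on:

from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

def train_tagger(tagged_sents):
    # Simple backoff chain: bigram -> unigram -> default tag.
    t0 = DefaultTagger('NC')
    t1 = UnigramTagger(tagged_sents, backoff=t0)
    t2 = BigramTagger(tagged_sents, backoff=t1)
    return t2

# french_tagger = train_tagger(reader.tagged_sents())
# tagged = [french_tagger.tag(sentence) for sentence in tokens]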
You can use the French Treebank corpus for training, but I don't know if there's a reader that knows its exact format. If not, you can start with XMLCorpusReader and subclass it to provide a tagged_sents() method, along the lines of the sketch below.
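A rough sketch of what that subclass might look like. The SENT and w element names and the cat attribute are assumptions about the FTB XML layout; check the actual schema and adjust before relying on it:

from nltk.corpus.reader import XMLCorpusReader

class FrenchTreebankReader(XMLCorpusReader):
    """Hypothetical reader exposing tagged_sents() over FTB-style XML."""

    def tagged_sents(self, fileids=None):
        sents = []
        for fileid in (fileids or self.fileids()):
            root = self.xml(fileid)
            for sent in root.iter('SENT'):            # assumed sentence element
                sents.append([(w.text, w.get('cat'))  # assumed POS attribute
                              for w in sent.iter('w') if w.text])
        return sents

# reader = FrenchTreebankReader('/path/to/ftb', r'.*\.xml')
# print(reader.tagged_sents()[0])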
If you're not already on the nltk-users mailing list, I think you'll want to get on it.