
Fault-tolerant Python-based parser for WikiLeaks cables

Some time ago I started writing a BNF-based grammar for the cables that WikiLeaks released. However, I have now realized that my approach may not be the best one, and I'm looking for ways to improve it.

A cable consists of three parts. The head follows an RFC 2822-style format and usually parses correctly. The text part has a more informal specification. For instance, there is a REF line. It should start with REF:, but I have found several variants. The following regex catches most cases: ^\s*[Rr][Ee][Ff][Ss: ]. So there can be leading spaces, mixed case, and so on. The text part is mostly plain text with some specially formatted headings.
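To make the behaviour of that regex concrete, here is a small check in plain Python. The sample lines are invented for illustration and are not taken from real cables; only the pattern itself comes from the paragraph above.

```python
import re

# Pattern quoted in the question; everything else here is illustrative.
REF_LINE = re.compile(r"^\s*[Rr][Ee][Ff][Ss: ]")

# Made-up sample lines (assumptions, not real cable text).
samples = [
    "REF: 09STATE1234",       # canonical form
    "  Ref: A) STATE 1234",   # leading spaces, mixed case
    "refs A. KABUL 100",      # plural, no colon
    "REFTEL: STATE 5",        # "REF" followed by "T": not covered
    "REFERENCE is made to",   # "REF" followed by "E": not covered
]

# True where the line is recognized as a REF line.
matches = [bool(REF_LINE.match(s)) for s in samples]
```

The character class after REF only accepts S, s, a colon, or a space, so variants such as REFTEL are among the cases this pattern does not catch.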

We want to recognize each field (date, REF, etc.) and put it into a database. We chose Python's SimpleParse. At the moment the parser stops at each field it doesn't recognize. We are now looking for a more fault-tolerant solution. All fields follow some kind of order. When the parser doesn't recognize a field, it should add a 'not recognized' blob to the current field and go on. (Or maybe you have a better approach here.)
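The fallback behaviour described above could look roughly like the following line-oriented sketch. It uses plain Python regexes rather than SimpleParse, so it is a swapped-in technique and not the grammar-based approach itself. The field names, the SUBJECT pattern, the `parse_body` helper, and the sample text are all hypothetical; only the REF pattern comes from the question.

```python
import re

# Hypothetical field patterns. Only the REF pattern is from the question.
FIELD_PATTERNS = [
    ("subject", re.compile(r"^\s*SUBJ(?:ECT)?\s*:\s*", re.IGNORECASE)),
    ("ref", re.compile(r"^\s*[Rr][Ee][Ff][Ss: ]\s*")),
]


def parse_body(text):
    """Split text into (field_name, content) pairs without ever failing.

    A line that starts a known field opens a new entry. Any other line is
    treated as 'not recognized' and appended to the current field.
    """
    fields = [("preamble", [])]  # catches anything before the first field
    for line in text.splitlines():
        for name, pattern in FIELD_PATTERNS:
            m = pattern.match(line)
            if m:
                # Known field: start a new entry with the rest of the line.
                fields.append((name, [line[m.end():].strip()]))
                break
        else:
            # Unknown line: keep it with the current field and go on.
            fields[-1][1].append(line)
    return [(name, "\n".join(lines).strip()) for name, lines in fields]


# Invented sample text, not a real cable.
sample = "junk header\nSUBJECT: MEETING\nREF: STATE 1234\nsome odd line"
result = parse_body(sample)
```

Because unknown lines are never rejected, the whole input always ends up in some field; the trade-off is that a misspelled field header is silently absorbed into the preceding field and has to be detected afterwards.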

What kind of parser, or other kind of solution, would you suggest? Is there something better available?

qbi asked May 15 '11 08:05


1 Answer

Cablemap seems to do what you're looking for: http://pypi.python.org/pypi/cablemap.core/

michk answered Sep 23 '22 15:09