
Parsing a text file with a special markup

I need to parse a DSL file using Python. A DSL file is a plain-text dictionary file that uses the special tag markup of ABBYY Lingvo.

It looks like:

activate
    [m0][b]ac·ti·vate[/b] {{id=000000367}} [c rosybrown]\[[/c][c darkslategray][b]activate[/b][/c] [c darkslategray][b]activates[/b][/c] [c darkslategray][b]activated[/b][/c] [c darkslategray][b]activating[/b][/c][c rosybrown]\][/c] [p]BrE[/p] [c darkgray] [/c][c darkcyan]\[ˈæktɪveɪt\][/c] [s]z_activate__gb_1.wav[/s] [p]NAmE[/p] [c darkgray] [/c][c darkcyan]\[ˈæktɪveɪt\][/c] [s]z_activate__us_1.wav[/s] [c orange] verb[/c] [c darkgray] [/c][b]{{cf}}\~ sth{{/cf}} [/b]
    [m1]{{d}}to make sth such as a device or chemical process start working{{/d}}
    [m2][ex][*]• [/*][/ex][ex][*]{{x}}The burglar alarm is activated by movement.{{/x}} [/*][/ex]
    [m2][ex][*]• [/*][/ex][c darkgray] [/c][ex][*]{{x}}The gene is activated by a specific protein.{{/x}} [/*][/ex]
    {{Derived Word}}[m3][c darkslategray][u]Derived Word:[/u][/c] ↑<<activation>>{{/Derived Word}}
    {{side_verb_forms}}[m3][c darkslategray][u]Verb forms:[/u][/c] [s]x_verb_forms_activate.jpg[/s]{{/side_verb_forms}}

Right now the only option I see is to parse this file with regexps. But I doubt that will work, since the tags in this format form a hierarchy, with some of them nested inside others.

I can't use dedicated XML and HTML parsers. They are great at building a tree structure of the document, but they are designed for the specific tag syntax of HTML and XML.

What is the best way to parse a file in such a format? Is there any Python library for that purpose?

ovgolovin asked Nov 05 '22 14:11


2 Answers

"some engine which allows to create a tree basing on nesting tag structure".

Look at PLY (Python Lex-Yacc): http://www.dabeaz.com/ply/

You may be able to define the syntax quickly and easily as a set of lexical rules (for the tokens) and a few grammar productions (for the nesting).

If you don't like that one, here's a list of alternatives.

http://wiki.python.org/moin/LanguageParsing
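To make the "lexical rules plus grammar" idea concrete, here is a minimal sketch of the same two-step approach written with only the standard library rather than PLY. The token pattern and the names `TOKEN_RE`, `parse`, `add_text` and `to_text` are my own assumptions, derived only from the sample entry in the question and not from the full DSL specification.

```python
import re

# Lexical rules, guessed from the sample entry only (an assumption):
#   \x            escaped literal character, e.g. \[ or \~
#   [/name]       closing tag
#   [name args]   opening tag, e.g. [c darkcyan]
#   anything else plain text (a stray "[" or "\" is kept as text)
TOKEN_RE = re.compile(
    r"\\(?P<escaped>.)"
    r"|\[/(?P<close>[^\]\s]+)\]"
    r"|\[(?P<open>[^\]\s/][^\]\s]*)(?:\s+(?P<args>[^\]]*))?\]"
    r"|(?P<text>[^\\\[]+|\[|\\)"
)


def add_text(node, s):
    """Append text to a node, merging it with a preceding text child."""
    kids = node["children"]
    if kids and isinstance(kids[-1], str):
        kids[-1] += s
    else:
        kids.append(s)


def parse(line):
    """Build a tree of nested tags for one line, using a stack of open tags."""
    root = {"tag": None, "args": None, "children": []}
    stack = [root]
    for m in TOKEN_RE.finditer(line):
        if m.group("escaped") is not None:
            add_text(stack[-1], m.group("escaped"))
        elif m.group("open") is not None:
            node = {"tag": m.group("open"), "args": m.group("args"),
                    "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif m.group("close") is not None:
            # Close the nearest open tag with this name; a closing tag
            # with no matching open tag is silently ignored.
            for i in range(len(stack) - 1, 0, -1):
                if stack[i]["tag"] == m.group("close"):
                    del stack[i:]
                    break
        else:
            add_text(stack[-1], m.group("text"))
    return root


def to_text(node):
    """Flatten a node back to its text content, dropping the tags."""
    if isinstance(node, str):
        return node
    return "".join(to_text(child) for child in node["children"])


tree = parse(r"[c darkcyan]\[ˈæktɪveɪt\][/c] [p]BrE[/p]")
print(tree["children"][0]["tag"], tree["children"][0]["args"])
print(to_text(tree))
```

Tags that are never closed in the sample, such as `[m1]`, simply stay open until the end of the line, and the `{{...}}` markers are left in the text untouched. A real grammar (in PLY or any of the alternatives listed) would need explicit rules for both.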

S.Lott answered Nov 15 '22 06:11


Using regular expressions for anything beyond trivial use on this format will give you heartache and pain.

If you insist on using a RegEx (NOT RECOMMENDED), look at the methods used HERE on XML

If by ".dsl" you mean the ABBYY Lingvo dictionary format, you may want to look at StarDict. It can read the ABBYY DSL format.
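For a sense of where the "trivial use" line sits, here is a rough sketch of about the only thing I would trust regexes with here: stripping the markup to get plain text. The three patterns and the name `strip_markup` are my own guesses based on the sample in the question, not an official description of the format.

```python
import re

# All three patterns are assumptions based on the sample entry only.
MARKER_RE = re.compile(r"\{\{.*?\}\}")           # {{d}}, {{/d}}, {{id=...}}
TAG_RE = re.compile(r"(?<!\\)\[/?[^\]\\]*\]")    # unescaped [tag] or [/tag]
ESCAPE_RE = re.compile(r"\\(.)")                 # \[ -> [, \~ -> ~


def strip_markup(line):
    """Remove {{...}} markers and [tags], then unescape what is left."""
    line = MARKER_RE.sub("", line)
    line = TAG_RE.sub("", line)
    return ESCAPE_RE.sub(r"\1", line)


print(strip_markup(
    r"[m1]{{d}}to make sth such as a device or chemical process "
    r"start working{{/d}}"
))
```

Even this breaks down quickly: the file names inside `[s]...[/s]` stay in the output, and an escaped backslash directly before a tag confuses the lookbehind. Anything that needs to know which tag encloses which text calls for a real parser.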

dawg answered Nov 15 '22 06:11