I need to parse a DSL
file using Python. A DSL file is a text file with a text having a special markup with tags used by ABBYY Lingvo.
It looks like:
activate
[m0][b]ac·ti·vate[/b] {{id=000000367}} [c rosybrown]\[[/c][c darkslategray][b]activate[/b][/c] [c darkslategray][b]activates[/b][/c] [c darkslategray][b]activated[/b][/c] [c darkslategray][b]activating[/b][/c][c rosybrown]\][/c] [p]BrE[/p] [c darkgray] [/c][c darkcyan]\[ˈæktɪveɪt\][/c] [s]z_activate__gb_1.wav[/s] [p]NAmE[/p] [c darkgray] [/c][c darkcyan]\[ˈæktɪveɪt\][/c] [s]z_activate__us_1.wav[/s] [c orange] verb[/c] [c darkgray] [/c][b]{{cf}}\~ sth{{/cf}} [/b]
[m1]{{d}}to make sth such as a device or chemical process start working{{/d}}
[m2][ex][*]• [/*][/ex][ex][*]{{x}}The burglar alarm is activated by movement.{{/x}} [/*][/ex]
[m2][ex][*]• [/*][/ex][c darkgray] [/c][ex][*]{{x}}The gene is activated by a specific protein.{{/x}} [/*][/ex]
{{Derived Word}}[m3][c darkslategray][u]Derived Word:[/u][/c] ↑<<activation>>{{/Derived Word}}
{{side_verb_forms}}[m3][c darkslategray][u]Verb forms:[/u][/c] [s]x_verb_forms_activate.jpg[/s]{{/side_verb_forms}}
Now I see the only option to parse this file using regexps
. But I doubt if it can be achieved since tags in that format has some hierarchy, where some of them are inside others.
I can't use special xml
and html
parsers. They are perfect in creating a tree-structure of the document, but they are designed for special tags of html
and xml
.
What is the best way to parse a file in such a format? Is there any Python library for that purpose?
"some engine which allows to create a tree basing on nesting tag structure".
Look at http://www.dabeaz.com/ply/
You may be able to define the syntax quickly and easily as a set of Lexical rules and some grammar productions.
If you don't like that one, here's a list of alternatives.
http://wiki.python.org/moin/LanguageParsing
Using RegExp for this for something other than trivial use will give heartache and pain.
If you insist on using a RegEx (NOT RECOMMENDED), look at the methods used HERE on XML
If by ".dsl" you mean the ABBRY or Lingvo dict format, you may want to look at stardict. It can read the ABBRY dsl format.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With