For some time now I have been trying to write an org-mode parser in JavaScript. I had no trouble at all parsing the outline (which I did in a few minutes), but parsing the actual content is far more difficult, and I'm having trouble with nested lists, for example.
* This is a heading
  P1 Start a paragraph here but since it is the first indentation level
 the paragraph may have a lower indentation on the next line
    or a greater one for that matter.
  + LI1.1 I am beginning a list here
  + LI1.2 Here begins another list item
    which continues here
    and also here
  P2 but is broken here (this line becomes a paragraph
  outside of the first list).
  + LI2.1 P1 Second list item.
    - LI2.1.1 Inner list with a simple item
    - LI2.1.2 P1 and with an item containing several paragraphs.
      Here is the second line in the item, and now

      LI2.1.2 P2 I begin a new paragraph still in the same item.
        The indentation can be only higher
    LI2.1 P2 but if the indentation is lower, it breaks the item,
    (and the whole list), and this is a paragraph in the LI2.1
    list item
    - LI 2.2.1 You get the picture
  P3 Just plain text outside of the list.
(In the above example, the PX and LIX.Y markers are only there to show explicitly where new blocks begin; they would not be present in the actual document. P stands for paragraph and LI for list item. In the HTML world, PX would be the beginning of a <p> tag. The numbering is just there to help keep track of the nesting and of where one list ends and another begins.)
I have been wondering about the right strategy for parsing this kind of nested block structure, where whitespace is significant. Clearly I can parse line by line without any backtracking, so it should be quite simple, but for some reason I couldn't manage to do it. I looked for inspiration in Markdown parsers and similar tools that are supposed to support the same kind of nesting, but the ones I saw seemed very hacky and full of regexes, and I hoped I could write something cleaner. The org-mode "grammar" is quite large when you come to think about it, so the parser will grow little by little, and I'd like the whole thing to stay maintainable and make it easy to plug in new features.
Can anyone with experience in parsing such things give me some pointers?
There is a JavaScript org-mode parser available here.
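If you would rather write it yourself, one common way to handle indentation-nested blocks line by line, without backtracking, is to keep a stack of the containers that are currently open and to close them whenever a line is indented no further than their bullet. The following is only a minimal sketch of that idea, not how any existing org-mode parser works: the function name parseBody and the node shapes ({type: 'p'}, {type: 'list'}, {type: 'item'}) are made up for illustration, it only knows about paragraphs and "-" / "+" lists, it ignores headings, and it assumes indentation is made of spaces.

```javascript
// Hypothetical sketch: parses only paragraphs and "-"/"+" lists by indentation.
function parseBody(text) {
  const root = { type: 'root', children: [] };
  // Each stack entry is an open container (the root or a list item).
  //   bulletIndent: column of the item's bullet (-1 for the root, so it never closes)
  //   list: the list currently open directly inside this container, if any
  const stack = [{ node: root, bulletIndent: -1, list: null }];
  let para = null; // paragraph currently being continued, if any

  for (const line of text.split('\n')) {
    const trimmed = line.trim();
    if (trimmed === '') { para = null; continue; } // a blank line ends the paragraph

    const indent = line.length - line.trimStart().length; // assumes spaces, not tabs
    const isBullet = (trimmed[0] === '-' || trimmed[0] === '+') &&
      (trimmed.length === 1 || trimmed[1] === ' ');

    // Close every item whose bullet sits at or to the right of this line.
    while (stack.length > 1 && stack[stack.length - 1].bulletIndent >= indent) {
      stack.pop();
      para = null;
    }
    const top = stack[stack.length - 1];

    if (isBullet) {
      // Continue the open list if its bullets are at this column, else start a new one.
      if (!top.list || top.list.indent !== indent) {
        top.list = { type: 'list', indent, items: [] };
        top.node.children.push(top.list);
      }
      const item = { type: 'item', children: [] };
      top.list.items.push(item);
      stack.push({ node: item, bulletIndent: indent, list: null });
      const body = trimmed.slice(1).trim();
      para = body ? { type: 'p', text: body } : null;
      if (para) item.children.push(para);
    } else if (para) {
      para.text += ' ' + trimmed; // continuation line of the open paragraph
    } else {
      top.list = null; // plain text breaks any list open in this container
      para = { type: 'p', text: trimmed };
      top.node.children.push(para);
    }
  }
  return root;
}
```

Every line is looked at exactly once, and the only state is the stack plus the paragraph being continued. New block types (source blocks, tables, drawers) could be added as further branches or as small per-block handlers, which may be easier to keep maintainable than one large set of regexes.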