
How to parse blocks based on indentation

I'm writing a translator from a Markdown-like markup to HTML. I have completed the script, except for ordered/unordered list translation. I want to format lists based on significant whitespace (aka off-side rule). Example valid input is like this:

:: List item 
   top level
 :: List item level 2
 :: List item level 2
    :: List item level 3
      :: List item level 4
 :: List item level 2

:: List item top level

:: denotes a list item. Indentation levels may be arbitrary. Tabs are not significant. I have been working on solutions on paper, but I couldn't figure out a way to implement one. How should I go about this?

P.S.: Any arbitrary number of spaces, as long as it is more than one, denotes a new level, like in Python.

I'm using Python to implement this, but I'm not looking for code. I want an explanation of how to do it, and preferably I want to implement the complete thing myself, without any libraries. I'm going to use this markup for my Jekyll blog, but this is more than a little tool for me: I want to learn as much as I can about regular expressions and parsing from this project. Thanks in advance.


1 Answer

@delnan's link to the Python reference provides a good approach, but (as the reference itself suggests) Python allows correct indentation that is also confusing to read and (if you try to take advantage of its full liberality) potentially tricky to debug.
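For reference, the approach described in the Python language reference keeps a stack of indentation widths: a deeper line pushes its width, a shallower line pops until it finds a matching width, and a width that matches nothing on the stack is an error. Here is a minimal sketch of that idea, assuming each input line is already a complete list item and only leading spaces count (the name `indent_levels` is hypothetical, chosen for illustration):

```python
def indent_levels(lines):
    """Return the 1-based nesting level of each line, using an indent stack."""
    stack = [0]   # indentation widths of the levels that are currently open
    levels = []
    for line in lines:
        width = len(line) - len(line.lstrip(' '))
        if width > stack[-1]:
            # deeper than the current level: open a new level
            stack.append(width)
        else:
            # same or shallower: close levels until the widths match
            while stack[-1] > width:
                stack.pop()
            if stack[-1] != width:
                raise ValueError('inconsistent dedent: %r' % line)
        levels.append(len(stack))
    return levels
```

With this scheme the same width can mean different levels in different places: lines indented by 0, 4, 0 and 2 spaces come out as levels 1, 2, 1 and 2, which is legal but can be hard to read.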

For your application, it might be less confusing for the user if you required each unique number of indenting spaces to indicate a different list level. For those semantics, you can find the levels for the list in no more than four lines of Python 3. You didn't want to see a solution in code (though I'd be happy to post it if you'd like) so my approach was roughly as follows:

  1. count the number of spaces at the start of each line of the list (which doesn't need a regular expression).
  2. create a set and sort it to give a list of the number of indenting spaces used for each level of this list, ordered from least to most.
  3. create a dictionary that relates the number of indenting spaces used in each case to a list level.
  4. refer to that dictionary using the number of spaces at the start of each line of the list, which gives the list level for each line.

(EDITED to include the code and to handle multi-line list items)

Given:

:: List item
   (this is the second line of the first list item)
 :: List item level 2
 :: List item level 2
    :: List item level 3
      :: List item level 4
 :: List item level 2
:: List item top level

... the function below produces the list:

:: List item (this is the second line of the first list item)
 :: List item level 2
 :: List item level 2
  :: List item level 3
   :: List item level 4
 :: List item level 2
:: List item top level

... which I think was the intended result for this test case.

Here's the code, written to accept the list from standard input:

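import sys

def findIndent(lst):
    # Given a list of text strings, return a list containing the
    # indentation level (1 = top level) of each string.
    spcCount = [len(s) - len(s.lstrip(' ')) for s in lst]
    # Each distinct indent width, sorted ascending, is one list level.
    indent = sorted(set(spcCount))
    levelRef = {width: level for level, width in enumerate(indent, start=1)}
    return [levelRef[n] for n in spcCount]

lst = []
for li in sys.stdin:
    text = li.lstrip(' ').rstrip()
    if not text:
        continue  # skip blank lines between list items
    if text.startswith('::') or not lst:
        # A new list item. If the input does not start with '::', the
        # first line still starts an item instead of raising IndexError.
        lst.append(li.rstrip())
    else:
        # A continuation line: fold it into the previous item.
        lst[-1] = lst[-1] + ' ' + text

for level, li in zip(findIndent(lst), lst):
    # Levels start at 1, so top-level items are printed flush left,
    # matching the output shown above.
    print(' ' * (level - 1) + li.lstrip())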
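Since the goal is HTML, the levels can then drive the nesting of the list tags. The following is only a rough sketch under several assumptions: unordered lists only, items already merged as in `lst` above, every item starting with `::`, 1-based levels as returned by `findIndent`, and levels that never get deeper by more than one step at a time. `to_html` is a hypothetical helper, not part of the code above:

```python
from html import escape

def to_html(levels, items):
    """Render items with 1-based nesting levels as nested <ul> lists."""
    out = []
    depth = 0   # number of <ul> elements currently open
    for level, item in zip(levels, items):
        # Assumes the item starts with '::'; drop the marker and escape the text.
        text = escape(item.strip()[2:].strip())
        if level > depth + 1:
            raise ValueError('level jumps by more than one: %r' % item)
        if level == depth + 1:
            # First item at a deeper level: nest a list inside the open <li>.
            out.append('<ul>')
            depth += 1
        else:
            # Same or shallower level: close the previous item, then close
            # any nested lists together with the items that contain them.
            out.append('</li>')
            while depth > level:
                out.append('</ul></li>')
                depth -= 1
        out.append('<li>' + text)
    # Close whatever is still open at the end of the input.
    while depth > 0:
        out.append('</li></ul>')
        depth -= 1
    return ''.join(out)
```

It could be called as `to_html(findIndent(lst), lst)` once the items have been read.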
Simon answered Jan 01 '26 10:01