How to parse hierarchy based on indents with python

Question

I have an accounting tree that's stored with indents/spaces in the source:

Income
   Revenue
      IAP
      Ads
   Other-Income
Expenses
   Developers
      In-house
      Contractors
   Advertising
   Other Expenses

There are a fixed number of levels, so I'd like to flatten the hierarchy by using 3 fields (actual data has 6 levels, simplified for example):

L1       L2            L3
Income
Income   Revenue
Income   Revenue       IAP
Income   Revenue       Ads
Income   Other-Income
Expenses Developers    In-house
 ... etc

I can do this by checking the number of spaces prior to the account name:

for rownum in range(6,ws.max_row+1):
   accountName = str(ws.cell(row=rownum,column=1).value)
   indent = len(accountName) - len(accountName.lstrip(' '))
   if indent == 0:
      l1 = accountName
      l2 = ''
      l3 = ''
   elif indent == 3:
      l2 = accountName
      l3 = ''
   else:
      l3 = accountName

   w.writerow([l1,l2,l3])

Is there a more flexible way to achieve this based on the indentation of the current row compared to the previous row rather than assuming it's always 3 spaces per level? L1 will always have no indent, and we can trust that lower levels will be indented further than their parent, but maybe not always 3 spaces per level.

Update, ended up with this as the meat of the logic, since I ultimately wanted the account list with the content, it seemed easiest to just use the indent to decide whether to reset, append, or pop the list:

        if indent == 0:
            accountList = []
            accountList.append((indent,accountName))
        elif indent > prev_indent:
            accountList.append((indent,accountName))
        elif indent <= prev_indent:
            max_indent = int(max(accountList,key=itemgetter(0))[0])
            while max_indent >= indent:
                accountList.pop()
                max_indent = int(max(accountList,key=itemgetter(0))[0])
            accountList.append((indent,accountName))

So at each row of output the accountList is complete.

Right leg · Accepted Answer

You can mimick the way Python actually parses the indentation. First, create a stack that will contain the indentation levels. At each line:

If the indentation is bigger than the top of the stack, push it and increase the depth level.
If it is the same, continue at the same level.
If it is lower, pop the top of the stack while it is higher than the new indentation. If you find a lower indentation level before finding exactly the same, then there is an indentation error.

indentation = []
indentation.append(0)
depth = 0

f = open("test.txt", 'r')

for line in f:
    line = line[:-1]

    content = line.strip()
    indent = len(line) - len(content)
    if indent > indentation[-1]:
        depth += 1
        indentation.append(indent)

    elif indent < indentation[-1]:
        while indent < indentation[-1]:
            depth -= 1
            indentation.pop()

        if indent != indentation[-1]:
            raise RuntimeError("Bad formatting")

    print(f"{content} (depth: {depth})")

With a "test.txt" file whose content is as you provided:

Income
   Revenue
      IAP
      Ads
   Other-Income
Expenses
   Developers
      In-house
      Contractors
   Advertising
   Other Expenses

Here is the output:

Income (depth: 0)
Revenue (depth: 1)
IAP (depth: 2)
Ads (depth: 2)
Other-Income (depth: 1)
Expenses (depth: 0)
Developers (depth: 1)
In-house (depth: 2)
Contractors (depth: 2)
Advertising (depth: 1)
Other Expense (depth: 1)

So, what can you do with this? Suppose you want to build nested lists. First, create a data stack.

When you find an indentation, append a new list at the end of the data stack.
When you find an unindentation, pop the top list, and append it to the new top.

And regardless, for each line, append the content to the list at the top of the data stack.

Here is the corresponding implementation:

for line in f:
    line = line[:-1]

    content = line.strip()
    indent = len(line) - len(content)
    if indent > indentation[-1]:
        depth += 1
        indentation.append(indent)
        data.append([])

    elif indent < indentation[-1]:
        while indent < indentation[-1]:
            depth -= 1
            indentation.pop()
            top = data.pop()
            data[-1].append(top)

        if indent != indentation[-1]:
            raise RuntimeError("Bad formatting")

    data[-1].append(content)

while len(data) > 1:
    top = data.pop()
    data[-1].append(top)

Your nested list is at the top of your data stack. The output for the same file is:

['Income',
    ['Revenue',
        ['IAP',
         'Ads'
        ],
     'Other-Income'
    ],
 'Expenses',
    ['Developers',
        ['In-house',
         'Contractors'
        ],
     'Advertising',
     'Other Expense'
    ]
 ]

This is rather easy to manipulate, although quite deeply nested. You can access the data by chaining the item accesses:

>>> l = data[0]
>>> l
['Income', ['Revenue', ['IAP', 'Ads'], 'Other-Income'], 'Expenses', ['Developers', ['In-house', 'Contractors'], 'Advertising', 'Other Expense']]
>>> l[1]
['Revenue', ['IAP', 'Ads'], 'Other-Income']
>>> l[1][1]
['IAP', 'Ads']
>>> l[1][1][0]
'IAP'

How to parse hierarchy based on indents with python

Tags:

python

Hart CO

1 Answers

Right leg

Recent Activity

Donate For Us

How to parse hierarchy based on indents with python

Tags:

python

Hart CO

1 Answers

Right leg

Related questions

Recent Activity

Donate For Us