Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parse indented text tree in Java

Tags:

java

algorithm

I have an indented file that I need to parsed using java, I need some way to place this in a Section class as shown below

    root
     root1
       text1
         text1.1
         text1.2
       text2
         text2.1
         text2.2

     root2
       text1
         text1.1
         text1.2
       text2
         text2.1
         text2.2.2

I have the class for putting the indented stuff it looks like

public class Section 
{

    private List<Section> children;
    private String text;
    private int depth;
    public Section(String t)
    {
       text =t;
    }

    public List<Section> getChildren()
    {
        if (children == null)
      {
            children = new ArrayList<Section>();
       }
        return children;
}

public void setChildren(List<Section> newChildren)
{
    if (newChildren == null) {
        children = newChildren;
    } else {
        if (children == null) {
            children = new ArrayList<Section>();
        }
        for (Section child : newChildren) {
            this.addChild(child);
        }
    }
}

public void addChild(Section child)
{
    if (children == null) {
        children = new ArrayList<Section>();
    }
    if (child != null) {
        children.add(child);
    }
}

public String getText()
{
    return text;
}

public void setText(String newText)
{
    text =newText;
}
public String getDepth()
{
    return depth;
}

 public void setDepth(int newDepth)
 {
    depth = newDepth;
 }
}

I need some way to parse the file and place it in expected result which us a Section object which would look like below

Section= 

Text="Root"
Children
Child1: Text= "root1" 

        Child1: "text1"
            Child1="Text 1.1"
            Child2="Text 1.2"
        Child2: "text2"
            Child1="Text 2.1"
            Child2="Text 2.2"
            Children
Child2: Text= "root2" 
        Child1: "text1"
            Child1="Text 1.1"
            Child2="Text 1.2"
        Child2: "text2"
            Child1="Text 2.1"
            Child2="Text 2.2"


Here is some code that I have started
   int indentCount=0;
   while(String text = reader.readline()
   {
   indentCount=countLeadingSpaces(String word);
   //TODO create the section here
   }


public static int countLeadingSpaces(String word)
{
    int length=word.length();
    int count=0;

   for(int i=0;i<length;i++)
   {
       char first = word.charAt(i); 
        if(Character.isWhitespace(first))
        {
            count++;           
        }
        else
        {
            return count;
        }
   }

 return count;

}
like image 235
Helen Araya Avatar asked Feb 12 '14 17:02

Helen Araya


3 Answers

I added a parent pointer as well. Maybe the text can be parsed without it, but parent pointers make it easier. First of all, you need to have more constructors:

static final int root_depth = 4; // assuming 4 whitespaces precede the tree root

public Section(String text, int depth) {
    this.text     = text;
    this.depth    = depth;
    this.children = new ArrayList<Section>();
    this.parent   = null;
}

public Section(String text, int depth, Section parent) {
    this.text     = text;
    this.depth    = depth;
    this.children = new ArrayList<Section>();
    this.parent   = parent;
}

Then, when you start parsing the file, read it line by line:

Section prev = null;
for (String line; (line = bufferedReader.readLine()) != null; ) {
    if (prev == null && line begins with root_depth whitespaces) {
        Section root = new Section(text_of_line, root_depth);
        prev = root;
    }
    else {
        int t_depth = no. of whitespaces at the beginning of this line;
        if (t_depth > prev.getDepth())
            // assuming that empty sections are not allowed
            Section t_section = new Section(text_of_line, t_depth, prev);
            prev.addChild(t_section);
        }
        else if (t_depth == prev.getDepth) {
            Section t_section = new Section(text_of_line, t_depth, prev.getParent());
            prev.getParent().addChild(t_section);
        }
        else {
            while (t_depth < prev.getDepth()) {
                prev = prev.getParent();
            }
            // at this point, (t_depth == prev.getDepth()) = true
            Section t_section = new Section(text_of_line, t_depth, prev.getParent());
            prev.getParent().addChild(t_section);
        }
    }
}

I have glossed over some finer points of the pseudo-code, but I think you get the overall idea of how to go about this parsing. Do remember to implement the methods addChild(), getDepth(), getParent(), etc.

like image 115
Chthonic Project Avatar answered Nov 08 '22 02:11

Chthonic Project


Surprisingly complex problem... but here's a pseudocode

intialize a stack
push first line to stack
while (there are more lines to read) {
 S1 = top of stack // do not pop off yet
 S2 = read a line
 if depth of S1 < depth of S2 {
  add S2 as child of S1
  push S2 into stack
 }
 else {
  while (depth of S1 >= depth of S2 AND there are at least 2 elements in stack) {
   pop stack
   S1 = top of stack // do not pop
  }
  add S2 as child of S1
  push S2 into stack
 }
}
return bottom element of stack

where depth is the # leading whitespaces. You might have to either modify or wrap the Section class to store the depth of the line.

like image 4
Max Seo Avatar answered Nov 08 '22 03:11

Max Seo


An implementation in C#

Based on this answer, I've created a C# solution.

It allows for multiple roots and assumes an input structure like:

Test
    A
    B
    C
        C1
        C2
    D
Something
    One
    Two
    Three

An example usage of the code would be:

var lines = new[]
{
    "Test",
    "\tA",
    "\tB",
    "\tC",
    "\t\tC1",
    "\t\tC2",
    "\tD",
    "Something",
    "\tOne",
    "\tTwo",
    "\tThree"
};

var roots = IndentedTextToTreeParser.Parse(lines, 0, '\t');

var dump = IndentedTextToTreeParser.Dump(roots);
Console.WriteLine(dump);

You can specify the root indention (zero by default) and also the idention character (a tab \t by default).

The full code:

namespace MyNamespace
{
    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Text;

    public static class IndentedTextToTreeParser
    {
        // https://stackoverflow.com/questions/21735468/parse-indented-text-tree-in-java

        public static List<IndentTreeNode> Parse(IEnumerable<string> lines, int rootDepth = 0, char indentChar = '\t')
        {
            var roots = new List<IndentTreeNode>();

            // --

            IndentTreeNode prev = null;

            foreach (var line in lines)
            {
                if (string.IsNullOrEmpty(line.Trim(indentChar)))
                    throw new Exception(@"Empty lines are not allowed.");

                var currentDepth = countWhiteSpacesAtBeginningOfLine(line, indentChar);

                if (currentDepth == rootDepth)
                {
                    var root = new IndentTreeNode(line, rootDepth);
                    prev = root;

                    roots.Add(root);
                }
                else
                {
                    if (prev == null)
                        throw new Exception(@"Unexpected indention.");
                    if (currentDepth > prev.Depth + 1)
                        throw new Exception(@"Unexpected indention (children were skipped).");

                    if (currentDepth > prev.Depth)
                    {
                        var node = new IndentTreeNode(line.Trim(), currentDepth, prev);
                        prev.AddChild(node);

                        prev = node;
                    }
                    else if (currentDepth == prev.Depth)
                    {
                        var node = new IndentTreeNode(line.Trim(), currentDepth, prev.Parent);
                        prev.Parent.AddChild(node);

                        prev = node;
                    }
                    else
                    {
                        while (currentDepth < prev.Depth) prev = prev.Parent;

                        // at this point, (currentDepth == prev.Depth) = true
                        var node = new IndentTreeNode(line.Trim(indentChar), currentDepth, prev.Parent);
                        prev.Parent.AddChild(node);
                    }
                }
            }

            // --

            return roots;
        }

        public static string Dump(IEnumerable<IndentTreeNode> roots)
        {
            var sb = new StringBuilder();

            foreach (var root in roots)
            {
                doDump(root, sb, @"");
            }

            return sb.ToString();
        }

        private static int countWhiteSpacesAtBeginningOfLine(string line, char indentChar)
        {
            var lengthBefore = line.Length;
            var lengthAfter = line.TrimStart(indentChar).Length;
            return lengthBefore - lengthAfter;
        }

        private static void doDump(IndentTreeNode treeNode, StringBuilder sb, string indent)
        {
            sb.AppendLine(indent + treeNode.Text);
            foreach (var child in treeNode.Children)
            {
                doDump(child, sb, indent + @"    ");
            }
        }
    }

    [DebuggerDisplay(@"{Depth}: {Text} ({Children.Count} children)")]
    public class IndentTreeNode
    {
        public IndentTreeNode(string text, int depth = 0, IndentTreeNode parent = null)
        {
            Text = text;
            Depth = depth;
            Parent = parent;
        }

        public string Text { get; }
        public int Depth { get; }
        public IndentTreeNode Parent { get; }
        public List<IndentTreeNode> Children { get; } = new List<IndentTreeNode>();

        public void AddChild(IndentTreeNode child)
        {
            if (child != null) Children.Add(child);
        }
    }
}

I've also included an method Dump() to convert the tree back to a string in order to better debug the algorithm itself.

like image 3
Uwe Keim Avatar answered Nov 08 '22 04:11

Uwe Keim