Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parse 'ul' and 'ol' tags

I have to handle deep nesting of ul, ol, and li tags. I need to give the same view as we are giving in the browser. I want to achieve the following example in a pdf file:

 text = "
<body>
    <ol>
        <li>One</li>
        <li>Two

            <ol>
                <li>Inner One</li>
                <li>inner Two

                    <ul>
                        <li>hey

                            <ol>
                                <li>hiiiiiiiii</li>
                                <li>why</li>
                                <li>hiiiiiiiii</li>
                            </ol>
                        </li>
                        <li>aniket </li>
                    </li>
                </ul>
                <li>sup </li>
                <li>there </li>
            </ol>
            <li>hey </li>
            <li>Three</li>
        </li>
    </ol>
    <ol>
        <li>Introduction</li>
        <ol>
            <li>Introduction</li>
        </ol>
        <li>Description</li>
        <li>Observation</li>
        <li>Results</li>
        <li>Summary</li>
    </ol>
    <ul>
        <li>Introduction</li>
        <li>Description

            <ul>
                <li>Observation

                    <ul>
                        <li>Results

                            <ul>
                                <li>Summary</li>
                            </ul>
                        </li>
                    </ul>
                </li>
            </ul>
        </li>
        <li>Overview</li>
    </ul>
</body>"

I have to use prawn for my task. But prawn doesn't support HTML tags. So, I came up with a solution using nokogiri:. I am parsing and later removing the tags with gsub. The below solution I have written for a part of the above content but the problem is ul and ol can vary.

     RULES = {
  ol: {
    1 => ->(index) { "#{index + 1}. " },
    2 => ->(index) { "#{}" },
    3 => ->(index) { "#{}" },
    4 => ->(index) { "#{}" }
  },
  ul: {
    1 => ->(_) { "\u2022 " },
    2 => ->(_) { "" },
    3 => ->(_) { "" },
    4 => ->(_) { "" },
  }
}

def ol_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ol][deepness].call(i)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def ul_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ul][deepness].call(i)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def descend(item, deepness)
  item.search('> ol').each do |ol|
    ol_rule(ol, deepness: deepness)
  end
  item.search('> ul').each do |ul|
    ul_rule(ul, deepness: deepness)
  end
end

doc = Nokogiri::HTML.fragment(text)

doc.search('ol').each do |group|
  ol_rule(group, deepness: 1)
end

doc.search('ul').each do |group|
  ul_rule(group, deepness: 1)
end


  puts doc.inner_text


1. One
2. Two

1. Inner One
2. inner Two

• hey

1. hiiiiiiiii
2. why
3. hiiiiiiiii


• aniket 


3. sup 
4. there 

3. hey 
4. Three



1. Introduction

1. Introduction

2. Description
3. Observation
4. Results
5. Summary



• Introduction
• Description

• Observation

• Results

• Summary






• Overview

Problem

1) What I want to achieve is how to handle space when working with ul and ol tags
2) How to handle deep nesting when li come inside ul or li come inside ol

like image 593
Aniket Tiwari Avatar asked May 14 '18 08:05

Aniket Tiwari


2 Answers

I've come up with a solution that handles multiple identations with configurable numeration rules per level:

require 'nokogiri'
ROMANS = %w[i ii iii iv v vi vii viii ix]

RULES = {
  ol: {
    1 => ->(index) { "#{index + 1}. " },
    2 => ->(index) { "#{('a'..'z').to_a[index]}. " },
    3 => ->(index) { "#{ROMANS.to_a[index]}. " },
    4 => ->(index) { "#{ROMANS.to_a[index].upcase}. " }
  },
  ul: {
    1 => ->(_) { "\u2022 " },
    2 => ->(_) { "\u25E6 " },
    3 => ->(_) { "* " },
    4 => ->(_) { "- " },
  }
}

def ol_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ol][deepness].call(i)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def ul_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ul][deepness].call(i)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def descend(item, deepness)
  item.search('> ol').each do |ol|
    ol_rule(ol, deepness: deepness)
  end
  item.search('> ul').each do |ul|
    ul_rule(ul, deepness: deepness)
  end
end

doc = Nokogiri::HTML.fragment(text)

doc.search('ol:root').each do |group|
  binding.pry
  ol_rule(group, deepness: 1)
end

doc.search('ul:root').each do |group|
  ul_rule(group, deepness: 1)
end

You can then remove the tags or use doc.inner_text depending on your environment.

Two caveats though:

  1. Your entry selector must be carefully selected. I used your snippet verbatim without root element, thus i had to use ul:root/ol:root. Maybe "body > ol" works for your situation too. Maybe selecting each ol/ul but than walking each and only find those, that have no list parent.
  2. Using your example verbatim, Nokogiri does not handle the last 2 list items of the first group ol very well ("hey", "Three") When parsing with nokogiri, thus elements already "left" their ol tree and got placed in the root tree.

Current Output:

  1. One
  2. Two
      a. Inner One
      b. inner Two
        ◦ hey
        ◦ hey
      3. hey
      4. hey
  hey
  Three

  1. Introduction
    a. Introduction
  2. Description
  3. Observation
  4. Results
  5. Summary

  • Introduction
  • Description
      ◦ Observation
          * Results
              - Summary
  • Overview
like image 94
stwienert Avatar answered Nov 13 '22 08:11

stwienert


Firstly for handling space, I have used a hack in the lambda call. Also, I am using add_previous_sibling function given by nokogiri to append something in starting. Lastly Prawn doesn't handle space when we deal with ul & ol tags so for that I have used this gsub gsub(/^([^\S\r\n]+)/m) { |m| "\xC2\xA0" * m.size }. You can read more from this link

Note: Nokogiri doesn't handle invalid HTML so always provide valid HTML

RULES = {
  ol: {
    1 => ->(index) { "#{index + 1}. " },
    2 => ->(index) { "#{}" },
    3 => ->(index) { "#{}" },
    4 => ->(index) { "#{}" }
  },
  ul: {
    1 => ->(_) { "\u2022 " },
    2 => ->(_) { "" },
    3 => ->(_) { "" },
    4 => ->(_) { "" },
  },
  space: {
    1 => ->(index) { " "  },
    2 => ->(index) { "  " },
    3 => ->(index) { "   " },
    4 => ->(index) { "    " },
  }
}

def ol_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ol][deepness].call(i)
    space = RULES[:space][deepness].call(i)
    item.add_previous_sibling(space)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def ul_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    space = RULES[:space][deepness].call(i)
    prefix = RULES[:ul][deepness].call(i)
    item.add_previous_sibling(space)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def descend(item, deepness)
  item.search('> ol').each do |ol|
    ol_rule(ol, deepness: deepness)
  end
  item.search('> ul').each do |ul|
    ul_rule(ul, deepness: deepness)
  end
end

doc = Nokogiri::HTML.parse(text)

doc.search('ol').each do |group|
  ol_rule(group, deepness: 1)
end

doc.search('ul').each do |group|
  ul_rule(group, deepness: 1)
end

Prawn::Document.generate("hello.pdf") do
  #puts doc.inner_text
  text doc.at('body').children.to_html.gsub(/^([^\S\r\n]+)/m) { |m| "\xC2\xA0" * m.size }.gsub("<ul>","").gsub("<\/ul>","").gsub("<ol>","").gsub("<\/ol>","").gsub("<li>", "").gsub("</li>","").gsub("\\n","").gsub(/[\n]+/, "\n")
end
like image 42
Aniket Tiwari Avatar answered Nov 13 '22 09:11

Aniket Tiwari