Python convert html to text and mimic formatting

Tags:

python

html

beautifulsoup

I'm learning BeautifulSoup and have found many "html2text" solutions, but the one I'm looking for should mimic the formatting:

<ul>
<li>One</li>
<li>Two</li>
</ul>

Would become

* One
* Two

and

Some text
<blockquote>
More magnificent text here
</blockquote>
Final text

to

Some text

    More magnificent text here

Final text

I'm reading the docs, but I'm not seeing anything straightforward. Any help? I'm open to using something other than BeautifulSoup.


asked Mar 25 '13 05:03

Mikhail

3 Answers

Python's built-in html.parser (HTMLParser in earlier versions) module can be easily extended to create a simple translator that you can tailor to your exact needs. It lets you hook into certain events as the parser eats through the HTML.

Because the parser is event-driven, you can't navigate around the HTML tree as you could with Beautiful Soup (e.g. sibling, child, and parent nodes), but for a simple case like yours it should be enough.
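To make the event hooks concrete, here is a minimal sketch that only records which callbacks fire and in what order. The class name EventLogger and its events list are hypothetical names chosen for illustration; the handle_* methods are the standard html.parser callbacks.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records each parser callback in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        if data.strip():
            self.events.append(("data", data.strip()))


logger = EventLogger()
logger.feed("<ul><li>One</li><li>Two</li></ul>")
logger.close()
for event in logger.events:
    print(event)
```

Each tuple corresponds to one callback, so a translator only has to decide what text to emit for the events it cares about.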

html.parser homepage

In your case you could add the appropriate formatting whenever a start tag or end tag of a specific type is encountered:

from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    """Translate a small subset of HTML into formatted plain text."""

    def __init__(self):
        # The 'strict' argument was removed in Python 3.5, so it is not passed.
        super().__init__()
        self.output = ""

    def feed(self, in_html):
        # Reset the buffer so one parser instance can be reused.
        self.output = ""
        super().feed(in_html)
        return self.output

    def handle_data(self, data):
        self.output += data.strip()

    def handle_starttag(self, tag, attrs):
        # "\n" is used instead of os.linesep: print() in text mode already
        # translates newlines for the platform.
        if tag == 'li':
            self.output += '\n* '
        elif tag == 'blockquote':
            self.output += '\n\n\t'

    def handle_endtag(self, tag):
        if tag == 'blockquote':
            self.output += '\n\n'


parser = MyHTMLParser()
content = "<ul><li>One</li><li>Two</li></ul>"
print("\nExample 1:")
print(parser.feed(content))
content = "Some text<blockquote>More magnificent text here</blockquote>Final text"
print("\nExample 2:")
print(parser.feed(content))

answered Nov 09 '22 08:11

samaspin

Take a look at Aaron Swartz's html2text script (can be installed with pip install html2text). Note that the output is valid Markdown. If for some reason that doesn't fully suit you, some rather trivial tweaks should get you the exact output in your question:

In [1]: import html2text

In [2]: h1 = """<ul>
   ...: <li>One</li>
   ...: <li>Two</li>
   ...: </ul>"""

In [3]: print(html2text.html2text(h1))
  * One
  * Two

In [4]: h2 = """<p>Some text
   ...: <blockquote>
   ...: More magnificent text here
   ...: </blockquote>
   ...: Final text</p>"""

In [5]: print(html2text.html2text(h2))
Some text

> More magnificent text here

Final text
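As one possible reading of the "trivial tweaks" mentioned above, the Markdown can be post-processed line by line. This is only a sketch: markdown_to_plain is a hypothetical helper, not part of html2text, and the sample strings are copied from the session output above rather than produced by calling the library. It handles only simple, unnested lists and blockquotes.

```python
import re


def markdown_to_plain(markdown_text):
    """Hypothetical post-processing step for html2text output."""
    plain_lines = []
    for line in markdown_text.splitlines():
        if line.startswith(">"):
            # Blockquote: drop the ">" marker and indent by four spaces.
            body = line.lstrip(">").strip()
            plain_lines.append("    " + body if body else "")
        else:
            # List items: remove the indent that precedes "* ".
            plain_lines.append(re.sub(r"^\s+(?=\* )", "", line))
    return "\n".join(plain_lines)


# Sample inputs taken from the html2text output shown above.
list_markdown = "  * One\n  * Two\n"
quote_markdown = "Some text\n\n> More magnificent text here\n\nFinal text\n"
print(markdown_to_plain(list_markdown))
print(markdown_to_plain(quote_markdown))
```

In real use the argument would be the string returned by html2text.html2text(...).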

answered Nov 09 '22 10:11

root

I have code for a simpler task: remove HTML tags and insert newlines at the appropriate places. Maybe this can be a starting point for you.

Python's textwrap module might be helpful for creating indented blocks of text.

http://docs.python.org/2/library/textwrap.html

import re
from html import unescape  # Python 3 replacement for HTMLParser().unescape


class HtmlTool(object):
    """
    Algorithms to process HTML.
    """
    # Regular expressions to recognize different parts of HTML.
    # Internal style sheets or JavaScript
    script_sheet = re.compile(r"<(script|style).*?>.*?(</\1>)",
                              re.IGNORECASE | re.DOTALL)
    # HTML comments - can contain ">"
    comment = re.compile(r"<!--(.*?)-->", re.DOTALL)
    # HTML tags: <any-text>
    tag = re.compile(r"<.*?>", re.DOTALL)
    # Consecutive whitespace characters
    nwhites = re.compile(r"[\s]+")
    # <p>, <div>, <br> tags and associated closing tags;
    # \b keeps longer tag names such as <pre> from matching
    p_div = re.compile(r"</?(p|div|br)\b.*?>",
                       re.IGNORECASE | re.DOTALL)
    # Consecutive whitespace, but no newlines
    nspace = re.compile(r"[^\S\n]+")
    # At least two consecutive newlines
    n2ret = re.compile(r"\n\n+")
    # A return followed by a space
    retspace = re.compile(r"(\n )")

    @staticmethod
    def to_nice_text(html):
        """Remove all HTML tags, but produce a nicely formatted text."""
        if html is None:
            return ""
        text = str(html)
        text = HtmlTool.script_sheet.sub("", text)
        text = HtmlTool.comment.sub("", text)
        text = HtmlTool.nwhites.sub(" ", text)
        text = HtmlTool.p_div.sub("\n", text)  # convert <p>, <div>, <br> to "\n"
        text = HtmlTool.tag.sub("", text)      # remove all remaining tags
        text = unescape(text)                  # convert HTML entities to characters
        # Get whitespace right
        text = HtmlTool.nspace.sub(" ", text)
        text = HtmlTool.retspace.sub("\n", text)
        text = HtmlTool.n2ret.sub("\n\n", text)
        text = text.strip()
        return text

There might be some superfluous regexes left in the code.
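For the textwrap suggestion above, a minimal sketch of an indented block might look like this. indent_block, its width of 40 columns, and the four-space prefix are assumptions made for illustration; textwrap.fill and its initial_indent and subsequent_indent parameters are part of the standard library.

```python
import textwrap


def indent_block(text, width=40, prefix="    "):
    """Hypothetical helper: wrap text and indent every line with prefix."""
    # Collapse runs of whitespace so wrapping starts from clean input.
    normalized = " ".join(text.split())
    return textwrap.fill(normalized, width=width,
                         initial_indent=prefix, subsequent_indent=prefix)


quote = "More magnificent text here"
print("Some text\n")
print(indent_block(quote))
print("\nFinal text")
```

Longer quotes are wrapped at the given width, with every continuation line carrying the same prefix.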

answered Nov 09 '22 10:11

Eike
