lxml removes spaces and line breaks in <head>

This small program:

from lxml.html import tostring, fromstring
e = fromstring('''
<html><head>
        <link href="/comments.css" rel="stylesheet" type="text/css">
        <link href="/index.css" rel="stylesheet" type="text/css">
    </head>
    <body>
        <span></span>
        <span></span>
    </body>
</html>''')

print(tostring(e, encoding=str))  # use encoding=unicode on Python 2

will print:

<html><head><link href="/comments.css" rel="stylesheet" type="text/css"><link
href="/index.css" rel="stylesheet" type="text/css"></head><body>
        <span></span>
        <span></span>
    </body></html>

The spaces and line breaks in <head> are removed. This happens even if we place the two <link> elements in <body>. It seems that blank text nodes (\s*) between head elements are dropped.

How can I preserve the spaces and line breaks between the <link> elements? (I expect the output to be exactly the same as the input.)

Taha Jahangir asked Jun 24 '11 14:06

2 Answers

For me,

print (tostring(e, encoding=str))

raises an error:

>>> print (tostring(e, encoding=str))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 1493, in tostring
    encoding=encoding)
  File "lxml.etree.pyx", line 2836, in lxml.etree.tostring (src/lxml/lxml.etree.c:53416)
TypeError: descriptor 'upper' of 'str' object needs an argument

I cannot speak to the discrepancy, but I do suggest setting the pretty_print argument to True:

>>> etree.tostring(e, pretty_print=True)
'<html>\n  <head>\n    <link href="/comments.css" rel="stylesheet" type="text/css"/>\n    <link href="/index.css" rel="stylesheet" type="text/css"/>\n  </head>\n  <body>\n        <span/>\n        <span/>\n    </body>\n</html>\n'

You will need to import etree first: from lxml import etree

When the result is written to an output file, the spaces and newlines will be preserved. The same is true with print:

>>> print(etree.tostring(e, pretty_print=True))
<html>
  <head>
    <link href="/comments.css" rel="stylesheet" type="text/css"/>
    <link href="/index.css" rel="stylesheet" type="text/css"/>
  </head>
  <body>
        <span/>
        <span/>
    </body>
</html>

I am sure you have checked out the API, but in case you haven't, here is information on tostring(). It is also safe to assume you have seen the tutorial on the lxml website. I would love to see some more good resources. I am new to lxml myself, and anything new and good to read would be welcome.

Updated

You said you would consider sed if you could not find a good Python solution.

This should accomplish it with sed:

sed -i '1,2d;' input.html; sed -i '1 i\<html><head>' input.html

This runs two sed commands. The first deletes the first two lines of the file; the second inserts <html><head> as the new first line.
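
The same two-step edit can be sketched in Python. This is only an illustration: patch_first_lines is a hypothetical helper name, the sample string is a shortened stand-in for the real file, and the sketch assumes, as the sed command does, that the first two lines of the pretty-printed output are <html> and <head>.

```python
def patch_first_lines(text, replacement="<html><head>"):
    """Drop the first two lines of text and put replacement in their place."""
    lines = text.splitlines(keepends=True)
    # Equivalent of sed '1,2d': discard lines 1 and 2.
    rest = lines[2:]
    # Equivalent of sed '1 i\<html><head>': insert a new first line.
    return replacement + "\n" + "".join(rest)


# Shortened stand-in for the pretty-printed file contents (an assumption).
pretty = (
    "<html>\n"
    "  <head>\n"
    '    <link href="/comments.css" rel="stylesheet" type="text/css"/>\n'
    "  </head>\n"
    "</html>\n"
)
patched = patch_first_lines(pretty)
print(patched)
```

Like the sed version, this works by line position only, so it would silently damage a file whose first two lines are something else.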

UPDATE #2

I should have thought about this more. You can do this with Python:

>>> import re
>>> newString = re.sub('\n  ', '', etree.tostring(e, encoding=unicode, pretty_print=True), count=1)
>>> print(newString)
<html><head>
    <link href="/comments.css" rel="stylesheet" type="text/css"/>
    <link href="/index.css" rel="stylesheet" type="text/css"/>
  </head>
  <body>
        <span/>
        <span/>
    </body>
</html>
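
To see what the substitution does without needing lxml installed, here is a minimal sketch that applies the same re.sub call to a literal copy of the pretty-printed string shown earlier in this answer. The variable names are illustrative. count=1 limits the replacement to the first newline followed by two spaces, which is the break between <html> and <head>; every later line break is left alone.

```python
import re

# Literal copy of the pretty_print=True output shown earlier in the answer.
pretty = (
    '<html>\n'
    '  <head>\n'
    '    <link href="/comments.css" rel="stylesheet" type="text/css"/>\n'
    '    <link href="/index.css" rel="stylesheet" type="text/css"/>\n'
    '  </head>\n'
    '  <body>\n'
    '        <span/>\n'
    '        <span/>\n'
    '    </body>\n'
    '</html>\n'
)

# Remove only the first "\n  " so that <head> joins the <html> line.
joined = re.sub('\n  ', '', pretty, count=1)
print(joined)
```

Note that the result still contains XML-style self-closing tags (<link .../>, <span/>), so it is close to the original input but not identical to it.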

matchew answered Sep 19 '22 10:09


In the end, I used html5lib to parse the HTML and build an lxml-like tree from it:

import html5lib

parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("lxml"), namespaceHTMLElements=False)

Taha Jahangir answered Sep 18 '22 10:09