This small program: <pre class="prettyprint"><code>from lxml.html import tostring, fromstring e = fromstring(''' <html><head> <link href="/comments.css" rel="stylesheet" type="text/css"> <link href="/index.css" rel="stylesheet" type="text/css"> </head> <body> </body> </html>''') print (tostring(e, encoding=str)) #unicode on python 2</code></pre> will print: <pre class="prettyprint"><code><html><head><link href="/comments.css" rel="stylesheet" type="text/css"><link href="/index.css" rel="stylesheet" type="text/css"></head><body> </body></html></code></pre> The spaces and line breaks in head removed. This happens even if we place the two <link> elements in <body>. It seems blank text nodes (\s*) between head elements are removed. How I can preserve spaces and line breaks between <link>s? (I expect output to be exactly same as input)

for me <code>print (tostring(e, encoding=str))</code> returns <pre class="prettyprint"><code>>>> print (tostring(e, encoding=str)) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 1493, in tostring encoding=encoding) File "lxml.etree.pyx", line 2836, in lxml.etree.tostring (src/lxml/lxml.etree.c:53416) TypeError: descriptor 'upper' of 'str' object needs an argument </code></pre> I cannot speak to the descrepencey, but I do suggest setting the argument <code>pretty_print</code> to true <pre class="prettyprint"><code>>>> etree.tostring(e, pretty_print=True) '<html>\n <head>\n <link href="/comments.css" rel="stylesheet" type="text/css"/>\n <link href="/index.css" rel="stylesheet" type="text/css"/>\n </head>\n <body>\n \n \n </body>\n</html>\n' </code></pre> you will need to import etree <code>from lxml import etree</code> when outputted to an outfile the spaces and newlines will be perserved. Also with <code>print</code> <pre class="prettyprint"><code>>>> print(etree.tostring(e, pretty_print=True)) <html> <head> <link href="/comments.css" rel="stylesheet" type="text/css"/> <link href="/index.css" rel="stylesheet" type="text/css"/> </head> <body> </body> </html> </code></pre> I am sure you have checked out the API, but incase you haven't here is information on tostring(). It is also safe to assume you have seen the tutorial on the lxml website. I would love to see some more 'good' resources. I am new to lxml myself and anything new and good to read would be welcomed. Updated you said you wouldconsider <code>sed</code> if you could not find a good python solution. this should accomplish it with <code>sed</code> <code>sed -i '1,2d;' input.html; sed -i '1 i\<html><head>' input.html</code> this is running two <code>sed</code> procedures. the first deletes the first 2 lines. the second inserts <code><html><head></code> on the first line. UPDATE #2 I should have thought about this more. you can do this with python <pre class="prettyprint"><code> >>> import re >>> newString = re.sub('\n ', '', etree.tostring(e,encoding=unicode,pretty_print=True), count=1) >>> print(newString) <html><head> <link href="/comments.css" rel="stylesheet" type="text/css"/> <link href="/index.css" rel="stylesheet" type="text/css"/> </head> <body> </body> </html> </code></pre>

Finally, I used html5lib to parse html and generate lxml like tree with it. <code>parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("lxml"), namespaceHTMLElements=False)</code>

lxml removes spaces and line breaks in <head>

Tags:

python

python-3.x

html-parsing

lxml

This small program:

from lxml.html import tostring, fromstring
e = fromstring('''
<html><head>
        <link href="/comments.css" rel="stylesheet" type="text/css">
        <link href="/index.css" rel="stylesheet" type="text/css">
    </head>
    <body>
        <span></span>
        <span></span>
    </body>
</html>''')

print (tostring(e, encoding=str)) #unicode on python 2

will print:

<html><head><link href="/comments.css" rel="stylesheet" type="text/css"><link
href="/index.css" rel="stylesheet" type="text/css"></head><body>
        <span></span>
        <span></span>
    </body></html>

The spaces and line breaks in head removed. This happens even if we place the two <link> elements in <body>. It seems blank text nodes (\s*) between head elements are removed.

How I can preserve spaces and line breaks between <link>s? (I expect output to be exactly same as input)

272

asked Jun 24 '11 14:06

Taha Jahangir

2 Answers

for me

print (tostring(e, encoding=str))

returns

>>> print (tostring(e, encoding=str))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 1493, in tostring
    encoding=encoding)
  File "lxml.etree.pyx", line 2836, in lxml.etree.tostring (src/lxml/lxml.etree.c:53416)
TypeError: descriptor 'upper' of 'str' object needs an argument

I cannot speak to the descrepencey, but I do suggest setting the argument pretty_print to true

>>> etree.tostring(e, pretty_print=True)
'<html>\n  <head>\n    <link href="/comments.css" rel="stylesheet" type="text/css"/>\n    <link href="/index.css" rel="stylesheet" type="text/css"/>\n  </head>\n  <body>\n        <span/>\n        <span/>\n    </body>\n</html>\n'

you will need to import etree from lxml import etree

when outputted to an outfile the spaces and newlines will be perserved. Also with print

>>> print(etree.tostring(e, pretty_print=True))
<html>
  <head>
    <link href="/comments.css" rel="stylesheet" type="text/css"/>
    <link href="/index.css" rel="stylesheet" type="text/css"/>
  </head>
  <body>
        <span/>
        <span/>
    </body>
</html>

I am sure you have checked out the API, but incase you haven't here is information on tostring(). It is also safe to assume you have seen the tutorial on the lxml website. I would love to see some more 'good' resources. I am new to lxml myself and anything new and good to read would be welcomed.

Updated

you said you wouldconsider sed if you could not find a good python solution.

this should accomplish it with sed

sed -i '1,2d;' input.html; sed -i '1 i\<html><head>' input.html

this is running two sed procedures. the first deletes the first 2 lines. the second inserts <html><head> on the first line.

UPDATE #2

I should have thought about this more. you can do this with python

    >>> import re
    >>> newString = re.sub('\n  ', '', etree.tostring(e,encoding=unicode,pretty_print=True), count=1)
    >>> print(newString)
      <html><head>
            <link href="/comments.css" rel="stylesheet" type="text/css"/>
            <link href="/index.css" rel="stylesheet" type="text/css"/>
         </head>
         <body>
           <span/>
           <span/>
        </body>
   </html>

165

answered Sep 19 '22 10:09

matchew

Finally, I used html5lib to parse html and generate lxml like tree with it.

parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("lxml"), namespaceHTMLElements=False)

answered Sep 18 '22 10:09

Taha Jahangir

Related questions
                            
                                Is it possible to show tooltips on a networkx graph?
                            
                                Disable logging for a particular package
                            
                                Why is subprocess throwing OSError here?
                            
                                Django Model Inheritance and Admin System
                            
                                Fast conversion of C/C++ vector to Numpy array
                            
                                Create client/server with Twisted
                            
                                Obtain SSL certificate from peer without verification using Python
                            
                                How to troubleshoot/bypass locking problem which appears to be GIL related
                            
                                Extracting Blender Original Coordinates (ORCO)
                            
                                Feeling stupid while trying to implement lazy partitioning in Python
                            
                                How to generate the PEM serialization for the public RSA/DSA key
                            
                                apache poi vs python xlrd
                            
                                Regex: Match brackets both greedy and non greedy
                            
                                (python) matplotlib pyplot show() .. blocking or not?
                            
                                NumPy/SciPy: Move mask over Image and check for equality
                            
                                GC Doesn't Delete Circular References in WeakKeyDictionaries?
                            
                                Losing elements in python code while creating a dictionary from a list?
                            
                                Generator Function Performance
                            
                                Parsing pcap files with dpkt (Python)
                            
                                Python lxml iterfind w/ namespace but prefix=None

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

lxml removes spaces and line breaks in <head>

Tags:

python

python-3.x

html-parsing

lxml

Taha Jahangir

People also ask

2 Answers

matchew

Taha Jahangir

Recent Activity

Donate For Us