This small program:
from lxml.html import tostring, fromstring
e = fromstring('''
<html><head>
<link href="/comments.css" rel="stylesheet" type="text/css">
<link href="/index.css" rel="stylesheet" type="text/css">
</head>
<body>
<span></span>
<span></span>
</body>
</html>''')
print(tostring(e, encoding=str))  # use encoding=unicode on Python 2
will print:
<html><head><link href="/comments.css" rel="stylesheet" type="text/css"><link
href="/index.css" rel="stylesheet" type="text/css"></head><body>
<span></span>
<span></span>
</body></html>
The spaces and line breaks in head are removed. This happens even if we place the two <link> elements in <body>. It seems blank text nodes (\s*) between head elements are dropped.
How can I preserve the spaces and line breaks between the <link>s? (I expect the output to be exactly the same as the input.)
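A minimal sketch for checking where the whitespace goes, assuming lxml is installed. In lxml, text that follows an element is stored in that element's .tail, so printing the tails right after parsing shows whether the blank text nodes are already missing before tostring() runs. The helper name sibling_tails is hypothetical, and the exact values printed may depend on the lxml/libxml2 version.

```python
from lxml.html import fromstring

SOURCE = '''<html><head>
<link href="/comments.css" rel="stylesheet" type="text/css">
<link href="/index.css" rel="stylesheet" type="text/css">
</head>
<body>
<span></span>
<span></span>
</body>
</html>'''


def sibling_tails(root, tag):
    """Return the .tail (the text following the element) of every `tag` element."""
    return [el.tail for el in root.iter(tag)]


root = fromstring(SOURCE)
# Compare the whitespace kept after <link> elements with that kept after <span> elements.
print("link tails:", [repr(t) for t in sibling_tails(root, "link")])
print("span tails:", [repr(t) for t in sibling_tails(root, "span")])
```

If the link tails come back empty while the span tails still hold a newline, the whitespace was dropped by the parser rather than by the serializer.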
lxml is a Python library for handling XML and HTML files, and it can also be used for web scraping.
The lxml.etree module implements the extended ElementTree API for XML.
lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).
For me,
print (tostring(e, encoding=str))
returns
>>> print (tostring(e, encoding=str))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 1493, in tostring
    encoding=encoding)
  File "lxml.etree.pyx", line 2836, in lxml.etree.tostring (src/lxml/lxml.etree.c:53416)
TypeError: descriptor 'upper' of 'str' object needs an argument
I cannot speak to the discrepancy, but I do suggest setting the argument pretty_print to True:
>>> etree.tostring(e, pretty_print=True)
'<html>\n <head>\n <link href="/comments.css" rel="stylesheet" type="text/css"/>\n <link href="/index.css" rel="stylesheet" type="text/css"/>\n </head>\n <body>\n <span/>\n <span/>\n </body>\n</html>\n'
You will need to import etree first: from lxml import etree
When written to an output file, the spaces and newlines will be preserved. The same holds with print:
>>> print(etree.tostring(e, pretty_print=True))
<html>
<head>
<link href="/comments.css" rel="stylesheet" type="text/css"/>
<link href="/index.css" rel="stylesheet" type="text/css"/>
</head>
<body>
<span/>
<span/>
</body>
</html>
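As a rough sketch of the "write it to an output file" case, assuming lxml is installed: the helper write_pretty and the tiny input tree are made up for illustration. Note that pretty_print generates fresh indentation; it does not restore whatever whitespace the original document had.

```python
import os
import tempfile

from lxml import etree

# A tiny illustrative tree with no whitespace between its elements.
root = etree.fromstring(
    '<html><head><link href="/a.css"/><link href="/b.css"/></head></html>'
)


def write_pretty(element, path):
    """Serialize `element` with pretty_print=True and write it to `path`."""
    # etree.tostring() returns bytes by default, so the file is opened in binary mode.
    with open(path, "wb") as fh:
        fh.write(etree.tostring(element, pretty_print=True))


path = os.path.join(tempfile.mkdtemp(), "out.html")
write_pretty(root, path)

# Read the file back to confirm the line breaks made it to disk.
with open(path, "rb") as fh:
    written = fh.read().decode("utf-8")
print(written)
```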
I am sure you have checked out the API, but in case you haven't, here is information on tostring(). It is also safe to assume you have seen the tutorial on the lxml website. I would love to see some more 'good' resources. I am new to lxml myself, and anything new and good to read would be welcome.
Updated
You said you would consider sed if you could not find a good Python solution.
This should accomplish it with sed:
sed -i '1,2d;' input.html; sed -i '1 i\<html><head>' input.html
This runs two sed commands. The first deletes the first two lines; the second inserts <html><head> on the first line.
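For comparison, the same two steps can be sketched in plain Python on a string instead of a file. The function name replace_first_two_lines and the sample text are hypothetical; this is only an illustration of what the two sed commands do, not a replacement for them.

```python
def replace_first_two_lines(text, new_first_line="<html><head>"):
    """Drop the first two lines of `text` and put `new_first_line` in their place."""
    lines = text.splitlines(keepends=True)
    # lines[2:] mirrors sed's '1,2d'; the prepended line mirrors '1 i\<html><head>'.
    return new_first_line + "\n" + "".join(lines[2:])


pretty = "<html>\n  <head>\n    <link/>\n  </head>\n</html>\n"
result = replace_first_two_lines(pretty)
print(result)
```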
UPDATE #2
I should have thought about this more. You can do this with Python:
>>> import re
>>> newString = re.sub('\n ', '', etree.tostring(e,encoding=unicode,pretty_print=True), count=1)
>>> print(newString)
<html><head>
<link href="/comments.css" rel="stylesheet" type="text/css"/>
<link href="/index.css" rel="stylesheet" type="text/css"/>
</head>
<body>
<span/>
<span/>
</body>
</html>
Finally, I used html5lib to parse the HTML and build an lxml-style tree with it:
import html5lib
parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("lxml"), namespaceHTMLElements=False)
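A fuller sketch of that approach, assuming both html5lib and lxml are installed. The helper name parse_with_html5lib is hypothetical. html5lib follows the HTML5 parsing rules, which keep whitespace between elements inside head, so the line breaks between the <link>s should survive; whitespace in other places (for example around </body>) may still be moved, so the output is not guaranteed to match the input byte for byte.

```python
import html5lib
from lxml import etree

SOURCE = '''<html><head>
<link href="/comments.css" rel="stylesheet" type="text/css">
<link href="/index.css" rel="stylesheet" type="text/css">
</head>
<body>
<span></span>
<span></span>
</body>
</html>'''


def parse_with_html5lib(source):
    """Parse `source` with html5lib and return the root lxml element."""
    parser = html5lib.HTMLParser(
        tree=html5lib.getTreeBuilder("lxml"),
        namespaceHTMLElements=False,  # plain tag names such as "link"
    )
    doc = parser.parse(source)
    # Accept either an ElementTree or a bare element, to be safe.
    return doc.getroot() if hasattr(doc, "getroot") else doc


root = parse_with_html5lib(SOURCE)
links = list(root.iter("link"))
# The whitespace after each <link> is stored in its .tail.
print([link.tail for link in links])
print(etree.tostring(root, method="html", encoding="unicode"))
```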