
Set an attribute without a value with LXML xml

Tags:

python

lxml

I want:

<div data-a>

But LXML API seems to give me only this:

<div data-a=''>

How do I get value-less attributes?


It's annoying that lxml represents both blank values and null values as an empty string.

Setting the value to None does not help:

In [19]: from lxml.html import fromstring, tostring

In [20]: b = fromstring('<body class="meow" data-a="haha" data-b data-x="">text-fef27e87389e466fb99b5421629323f6</body>')

In [21]: b.attrib
Out[21]: {'data-a': 'haha', 'data-x': '', 'data-b': '', 'class': 'meow'}

In [24]: b.attrib['data-y'] = None
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-24-1f55133e3dc4> in <module>()
----> 1 b.attrib['data-y'] = None

/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._Attrib.__setitem__ (src/lxml/lxml.etree.c:58775)()

/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._setAttributeValue (src/lxml/lxml.etree.c:19025)()

/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._utf8 (src/lxml/lxml.etree.c:26460)()

TypeError: Argument must be bytes or unicode, got 'NoneType'


tag.attrib['data-a'] = None
TypeError: Argument must be bytes or unicode, got 'NoneType'
Jesvin Jose asked Sep 27 '22 13:09



2 Answers

IMHO, lxml is demonstrating the expected behavior. An attribute without a value is not well-formed XML, and a decent XML parser does not produce XML that is not well-formed:

  • about attribute without value in XML : Is an xml attribute without a value, valid?
  • about the term well-formed XML : Is there a difference between 'valid xml' and 'well formed xml'?
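
To illustrate the well-formedness point, here is a minimal sketch using Python's standard-library xml.etree.ElementTree. That module is an assumption made only to keep the example dependency-free; it is not part of the original answer. A value-less attribute is rejected by the XML parser, while an attribute with an empty value is accepted:

```python
import xml.etree.ElementTree as ET

# A value-less attribute is not well-formed XML, so parsing fails.
try:
    ET.fromstring('<div data-a/>')
    parsed_valueless = True
except ET.ParseError:
    parsed_valueless = False

# An attribute with an empty value is well-formed and parses normally.
elem = ET.fromstring('<div data-a=""/>')

print(parsed_valueless)   # False
print(elem.attrib)        # {'data-a': ''}
```

This is why an XML-oriented API only has an empty string to offer when asked for an attribute with "no value".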
har07 answered Nov 01 '22 12:11


Looks like you are actually trying to manipulate HTML and not XML. If that is true, then use lxml.html instead of lxml.etree.

You are trying to set a "boolean attribute", which is not to be confused with a "boolean value" (see boolean-attributes). As already stated in the other answer, the boolean attribute syntax is not allowed in XML.

However, since it seems obvious that you are trying to manipulate HTML, create the boolean attribute on an HTML Element, not an XML Element.

import unittest

import lxml.html

class HtmlBooleanAttribute(unittest.TestCase):

    def test_booleanAttribute(self):

        # !!! BE SURE TO CREATE AN ****HTML**** ELEMENT !!!
        div = lxml.html.Element('div')

        # Set a boolean attribute; omitting the value or providing None will
        # create a boolean attribute.
        div.set('data-a')
        div.set('data-b', None)

        # Setting the value to an empty string will not give you a boolean attribute
        div.set('data-c', '')

        # Set a normal attribute for comparison
        div.set('class','big red')

        print
        print lxml.html.tostring(div)
        print

        # Note that 'data-a' will be a zero-length string
        print 'data-a = ', div.get('data-a')
        print 'type(data-a) = ', type(div.get('data-a'))
        print 'len(data-a) = ', len(div.get('data-a'))

        print

        print 'data-c = ', div.get('data-c')
        print 'type(data-c) = ', type(div.get('data-c'))
        print 'len(data-c) = ', len(div.get('data-c'))


if __name__ == "__main__":
    #import sys;sys.argv = ['', 'Test.testName']
    unittest.main()

Output

<div data-a data-b data-c="" class="big red"></div>

data-a =  
type(data-a) =  <type 'str'>
len(data-a) =  0

data-c =  
type(data-c) =  <type 'str'>
len(data-c) =  0
.
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK

Note that data-a and data-c both read back as zero-length strings, but they are serialized differently.
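
For readers on Python 3, the same idea can be sketched more compactly. This assumes an lxml release whose lxml.html elements accept None as an attribute value (the traceback in the question suggests that older releases did not), and it uses encoding='unicode' so that tostring() returns a str rather than bytes:

```python
import lxml.html

# Build an HTML element (not a plain lxml.etree element).
div = lxml.html.Element('div')

div.set('data-a', None)   # boolean attribute: no value
div.set('data-c', '')     # ordinary attribute with an empty value

# encoding='unicode' makes tostring() return str instead of bytes.
markup = lxml.html.tostring(div, encoding='unicode')
print(markup)

# Both attributes read back as empty strings.
print(repr(div.get('data-a')), repr(div.get('data-c')))
```

The difference between the two attributes shows up only in the serialized markup, since get() returns an empty string for both.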

shrewmouse answered Nov 01 '22 12:11