
HTML inside node using ElementTree

I am using ElementTree to parse an XML file. Some fields contain HTML data. For example, consider the following declaration:

<Course>
    <Description>Line 1<br />Line 2</Description>
</Course>

Now, suppose _course is an Element variable which holds this Course element. I want to access this course's description, so I do:

desc = _course.find("Description").text

But then desc only contains "Line 1". I read something about the .tail attribute, so I also tried:

desc = _course.find("Description").tail

And I get the same output. What should I do to make desc be "Line 1
Line 2" (or literally anything between <Description> and </Description>)? In other words, I'm looking for something similar to the .innerText property in C# (and many other languages, I guess).

Rafael Almeida asked Feb 28 '23 13:02



1 Answer

Do you have any control over how the XML file is created? Element content that itself contains tags (XML or similar) or markup characters ('<', etc.) should be encoded to avoid this problem. You can do this with any of:

  • a CDATA section
  • Base64 or some other encoding (which doesn't include xml reserved characters)
  • Entity encoding ('<' == '&lt;')
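
To illustrate the CDATA and entity-encoding options, here is a minimal Python sketch using the standard xml.etree.ElementTree module. The two sample strings and the helper name description_text are made up for illustration; they simply re-encode the Description from the question.

```python
import xml.etree.ElementTree as ET

# The question's description, encoded two different ways (illustrative samples).
cdata_xml = (
    "<Course><Description><![CDATA[Line 1<br />Line 2]]></Description></Course>"
)
entity_xml = (
    "<Course><Description>Line 1&lt;br /&gt;Line 2</Description></Course>"
)


def description_text(source):
    """Hypothetical helper: parse a Course document and return its Description text."""
    course = ET.fromstring(source)
    return course.find("Description").text


for source in (cdata_xml, entity_xml):
    # The parser sees no child element here, so .text holds the whole string.
    print(description_text(source))  # Line 1<br />Line 2
```

In both cases the embedded HTML comes back as one plain string, which can then be handed to whatever HTML handling the application needs.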

If you can't make these changes, and ElementTree can't be made to ignore tags that are not part of the XML schema, then you will have to pre-process the file. Of course, you're out of luck if the schema's own tag names overlap with HTML's.
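
As a side note, when the embedded markup happens to be well-formed XML (as the <br /> in the question is), ElementTree parses it as a child element and stores the text that follows it in that child's .tail. The sketch below shows where each piece ends up and one possible way to stitch them together. The helper inner_text and its br_replacement parameter are hypothetical names for illustration, not part of ElementTree; itertext() is a standard Element method.

```python
import xml.etree.ElementTree as ET

xml_source = """<Course>
    <Description>Line 1<br />Line 2</Description>
</Course>"""

course = ET.fromstring(xml_source)
description = course.find("Description")

# .text covers only the text before the first child element.
first_part = description.text  # "Line 1"
# The text after <br /> is stored as the tail of the <br> element.
second_part = description.find("br").tail  # "Line 2"


def inner_text(element, br_replacement="\n"):
    """Hypothetical helper: join an element's text, turning <br> into a line break."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == "br":
            parts.append(br_replacement)
        else:
            parts.append(inner_text(child, br_replacement))
        # Text following the child belongs to the child's tail.
        parts.append(child.tail or "")
    return "".join(parts)


desc = inner_text(description)
print(desc)

# Built-in alternative that concatenates all text but drops the line break.
flat = "".join(description.itertext())
```

This only works while the HTML fragments stay well-formed; something like a bare <br> without a closing slash would make the parser reject the whole file, which is where encoding or pre-processing becomes necessary.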

Dana the Sane answered Mar 07 '23 10:03
