
Python lxml parsing svg file

Tags: python, svg, lxml

I'm trying to parse .svg files from http://kanjivg.tagaini.net/ , but I can't successfully extract the information inside.

Edit 1 (full file): http://www.filedropper.com/0f9ab

A part of 0f9ab.svg looks like this:

<svg xmlns="http://www.w3.org/2000/svg" width="109" height="109" viewBox="0 0 109 109">
<g id="kvg:StrokePaths_0f9ab" style="fill:none;stroke:#000000;stroke-width:3;stroke-linecap:round;stroke-linejoin:round;">
<g id="kvg:0f9ab" kvg:element="嶺">
    <g id="kvg:0f9ab-g1" kvg:element="山" kvg:position="top" kvg:radical="general">
        <path id="kvg:0f9ab-s1" kvg:type="㇑a" d="M53.26,9.38c0.99,0.99,1.12,2.09,1.12,3.12c0,0.67,0.06,8.38,0.06,13.01"/>
        <path id="kvg:0f9ab-s2" kvg:type="㇄a"
    </g>
</g>
</g>

My .py file:

import lxml.etree as ET

svg = ET.parse('0f9ab.svg')
print(svg)  # <lxml.etree._ElementTree object at 0x7f3a2f659ec8>

# AttributeError: 'lxml.etree._ElementTree' object has no attribute 'tag'
print(svg.tag)

# TypeError: 'lxml.etree._ElementTree' object is not subscriptable
print(svg[0])

# TypeError: 'lxml.etree._ElementTree' object is not iterable
for child in svg:
    print(child)

# None
print(svg.find("./svg"))

# []
print(svg.findall("//g"))

# []
print(svg.xpath("//g"))

Purpose

I tried every operation I could think of, but nothing gets me any data from the .svg file. I want to extract the kanji (Japanese characters) stored in the kvg:element="..." attributes, which occur at different depth levels.

Question

  1. Is lxml the wrong package for this?
  2. If not, how do I extract information from my parsed .svg file?

Other solutions

  • I could of course just read the file as a string and search for kvg:element=", but I would like to use the proper way of extracting data from XML / SVG.
  • I used xmltodict before, but my code for extracting kvg:element became really messy, because the attributes were at different depth levels.
NumesSanguis asked Nov 07 '16 16:11


1 Answer

.parse() returns an ElementTree, which represents the tree as a whole. To query individual nodes, you need an Element, most likely the root element of the tree.

Replace part of your code with this:

xml = ET.parse('0f9ab.svg')
svg = xml.getroot()
print(svg)  # <Element {http://www.w3.org/2000/svg}svg at 0x...>

and I think you'll have some success.
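
As a minimal sketch of the difference, the snippet below parses a small inline SVG string instead of 0f9ab.svg (the string and its ids are illustrative assumptions, used only to keep the example self-contained). It shows that the root Element supports .tag, indexing, and iteration, which the ElementTree wrapper does not:

```python
import io
import lxml.etree as ET

# Hypothetical, trimmed-down stand-in for 0f9ab.svg.
SVG_SAMPLE = b"""<svg xmlns="http://www.w3.org/2000/svg" width="109" height="109">
<g id="outer"><g id="inner"/></g>
</svg>"""

tree = ET.parse(io.BytesIO(SVG_SAMPLE))  # _ElementTree: the document as a whole
root = tree.getroot()                    # _Element: the <svg> node itself

root_tag = root.tag                      # tag name, qualified with its namespace
first_child = root[0]                    # Elements can be indexed
child_ids = [child.get("id") for child in root]  # and iterated over

print(root_tag)
print(first_child.get("id"))
print(child_ids)
```

Note that the tag comes back as {http://www.w3.org/2000/svg}svg rather than plain svg, which is why the namespace matters in the searches that follow.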

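Note also that .findall() on an Element requires a relative path (one starting with ./ or .//). Because the <svg> root declares a default namespace, every element name must also be namespace-qualified; a bare g only matches elements that have no namespace, which is why your searches returned nothing: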

print(svg.findall(".//{http://www.w3.org/2000/svg}g"))
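
To get at the kvg:element values themselves, the attribute name needs to be namespace-qualified as well. The following is a sketch under assumptions: it uses a trimmed-down inline sample rather than the real file, and it assumes the kvg: prefix is bound to http://kanjivg.tagaini.net (declared explicitly in the sample here; for the real file, check what the parsed attributes actually look like). The names SVG_NS, KVG_NS, and SVG_SAMPLE are illustrative only.

```python
import io
import lxml.etree as ET

SVG_NS = "http://www.w3.org/2000/svg"
KVG_NS = "http://kanjivg.tagaini.net"  # assumed URI for the kvg: prefix

# Trimmed-down stand-in for 0f9ab.svg; xmlns:kvg is declared inline here
# so that the sample parses on its own.
SVG_SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:kvg="http://kanjivg.tagaini.net" width="109" height="109">
<g id="kvg:StrokePaths_0f9ab">
<g id="kvg:0f9ab" kvg:element="嶺">
    <g id="kvg:0f9ab-g1" kvg:element="山" kvg:position="top">
        <path id="kvg:0f9ab-s1" kvg:type="㇑a" d="M53.26,9.38"/>
    </g>
</g>
</g>
</svg>""".encode("utf-8")

root = ET.parse(io.BytesIO(SVG_SAMPLE)).getroot()

# Option 1: findall() with "{namespace-uri}localname" names.
# ".//" searches at every depth, so nesting level does not matter.
groups = root.findall(".//{%s}g" % SVG_NS)
kanji = [g.get("{%s}element" % KVG_NS) for g in groups
         if g.get("{%s}element" % KVG_NS) is not None]

# Option 2: xpath() with an explicit prefix-to-URI mapping.
ns = {"svg": SVG_NS, "kvg": KVG_NS}
kanji_xpath = root.xpath("//svg:g/@kvg:element", namespaces=ns)

print(kanji)
print(kanji_xpath)
```

Both lists contain 嶺 and 山 for this sample. With the real file, printing the .attrib of one of the <g> elements is a quick way to confirm which namespace URI the kvg: attributes carry before relying on KVG_NS.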
Robᵩ answered Sep 21 '22 12:09