I'm looking for a good-quality HTML Microdata parser in Python. It doesn't have to be blazing fast, but I'd like it to support as much of the spec as possible, including itemref.
Here's what I've found so far:
Have you used any of these libraries? What were the pros and cons?
I'm also curious about parsing poorly formatted HTML documents. Have you found a Microdata parser that handles messy input or do you run the input through something like BeautifulSoup first?
What format do you want the Microdata parsed to?
https://github.com/RDFLib/pymicrodata will parse to RDF.
If you want JSON instead you should use https://github.com/edsu/microdata, which has recently gotten some attention and should be more conformant to the spec.
https://pypi.python.org/pypi/pelican-microdata/0.1 looks like a way to generate Microdata for a particular static site generator, so I don't think it will help with parsing.
I don't know how tolerant of poorly formatted HTML either of the above parsers is. If you know of some poorly formatted markup in the wild that uses Microdata, I'd be interested in seeing how well the Ruby parsers handle these cases.
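If it helps to have a baseline when comparing parsers on itemref and messy input, here is a minimal sketch using only the Python standard library. It is not taken from either project above and covers only part of the spec: it does not resolve relative URLs, does not handle itemid, and does not guard against itemref cycles. The names `parse_microdata`, `TreeBuilder`, `Node` and `SAMPLE` are made up for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
# Elements whose property value comes from an attribute instead of their text.
VALUE_ATTR = {"meta": "content", "a": "href", "area": "href", "link": "href",
              "img": "src", "audio": "src", "video": "src", "source": "src",
              "embed": "src", "iframe": "src", "track": "src",
              "object": "data", "data": "value", "meter": "value"}

class Node:
    def __init__(self, tag, attrs):
        self.tag, self.attrs, self.children = tag, dict(attrs), []

class TreeBuilder(HTMLParser):
    """Builds a rough tree and tolerates unclosed or stray tags."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)

def walk(node):
    for child in node.children:
        if isinstance(child, Node):
            yield child
            yield from walk(child)

def text_of(node):
    return "".join(c if isinstance(c, str) else text_of(c) for c in node.children)

def value_of(node, ids):
    if "itemscope" in node.attrs:          # nested item
        return item_of(node, ids)
    if node.tag in VALUE_ATTR:
        return node.attrs.get(VALUE_ATTR[node.tag]) or ""
    if node.tag == "time" and node.attrs.get("datetime"):
        return node.attrs["datetime"]
    return text_of(node).strip()           # simplification: the spec does not strip

def collect(node, ids, props):
    if "itemprop" in node.attrs:
        for name in (node.attrs["itemprop"] or "").split():
            props.setdefault(name, []).append(value_of(node, ids))
    if "itemscope" not in node.attrs:      # do not descend into other items
        for child in node.children:
            if isinstance(child, Node):
                collect(child, ids, props)

def item_of(node, ids):
    props = {}
    roots = [c for c in node.children if isinstance(c, Node)]
    # itemref: also crawl the elements whose ids are listed on the item.
    roots += [ids[r] for r in (node.attrs.get("itemref") or "").split() if r in ids]
    for root in roots:
        collect(root, ids, props)
    return {"type": (node.attrs.get("itemtype") or "").split(), "properties": props}

def parse_microdata(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    nodes = list(walk(builder.root))
    ids = {}
    for n in nodes:
        if n.attrs.get("id"):
            ids.setdefault(n.attrs["id"], n)
    # Top-level items have itemscope but are not a property of another item.
    return [item_of(n, ids) for n in nodes
            if "itemscope" in n.attrs and "itemprop" not in n.attrs]

# Deliberately messy sample: the trailing <p> and <span> are never closed.
SAMPLE = """
<div itemscope itemtype="http://schema.org/Person" itemref="extra">
  <span itemprop="name">Ada Lovelace</span>
  <a itemprop="url" href="http://example.com/ada">home</a>
</div>
<p id="extra">Works at <span itemprop="worksFor">Example Corp
"""
items = parse_microdata(SAMPLE)
```

Feeding the same sample to the libraries above and comparing the result against `items` is one way to check whether they pick up the `worksFor` property, which is only reachable through itemref.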