
Which Microdata parser should I use in Python [closed]

I'm looking for a good quality HTML Microdata parser in Python. It doesn't have to be blazing fast but I'd like it to support as much of the spec as possible including itemref.

Here's what I've found so far:

  • https://github.com/edsu/microdata
  • https://github.com/RDFLib/pymicrodata
  • https://pypi.python.org/pypi/pelican-microdata/0.1

Have you used any of these libraries? What were the pros and cons?

I'm also curious about parsing poorly formatted HTML documents. Have you found a Microdata parser that handles messy input or do you run the input through something like BeautifulSoup first?

Shawn Simister asked Apr 02 '13 07:04


1 Answer

What format do you want the Microdata parsed to?

https://github.com/RDFLib/pymicrodata will parse to RDF.

If you want JSON instead you should use https://github.com/edsu/microdata, which has recently gotten some attention and should be more conformant to the spec.

https://pypi.python.org/pypi/pelican-microdata/0.1 looks like a way to generate Microdata for a particular static site generator, so I don't think it will help with parsing.

I don't know how tolerant of poorly formatted HTML either of the above parsers is. If you know of some poorly formatted markup in the wild that uses Microdata, I'd be interested in seeing how well the Ruby parsers handle these cases.
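
If you want a rough feel for what "tolerant" means here, the following is a minimal sketch that uses only Python's standard-library html.parser. It is not how either library above works. The names MicrodataSketch, extract_items and MESSY_HTML are hypothetical, and the sample markup is made up for illustration. The output shape (a "type" list plus a "properties" mapping) is loosely modelled on the JSON form described in the Microdata spec. The sketch deliberately ignores itemref, itemid, URL resolution and HTML5's implied end-tag rules, so treat it as a toy for experimenting rather than a conformant parser.

```python
import json
from html.parser import HTMLParser

# Elements that never have an end tag, so they are not tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose property value comes from a URL attribute (left unresolved here).
URL_ATTRS = {"a": "href", "area": "href", "link": "href", "img": "src",
             "audio": "src", "video": "src", "source": "src", "embed": "src",
             "iframe": "src", "track": "src", "object": "data"}


class MicrodataSketch(HTMLParser):
    """Toy Microdata extractor: no itemref, no itemid, no URL resolution."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.items = []   # top-level items found so far
        self.stack = []   # currently open elements, outermost first

    def _current_item(self):
        # Nearest enclosing element that carries itemscope, if any.
        for entry in reversed(self.stack):
            if entry["item"] is not None:
                return entry["item"]
        return None

    @staticmethod
    def _add(owner, names, value):
        for name in names:
            owner["properties"].setdefault(name, []).append(value)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        names = (attrs.get("itemprop") or "").split()
        owner = self._current_item()
        entry = {"tag": tag, "item": None, "text": None}
        if "itemscope" in attrs:
            item = {"type": (attrs.get("itemtype") or "").split(),
                    "properties": {}}
            entry["item"] = item
            if names and owner is not None:
                self._add(owner, names, item)   # nested item as a property value
            else:
                self.items.append(item)         # top-level item
        elif names and owner is not None:
            if tag == "meta":
                self._add(owner, names, attrs.get("content") or "")
            elif tag in URL_ATTRS:
                self._add(owner, names, attrs.get(URL_ATTRS[tag]) or "")
            else:
                # Text property: collect character data until the element closes.
                entry["text"] = (owner, names, [])
        if tag not in VOID_TAGS:
            self.stack.append(entry)

    def handle_data(self, data):
        for entry in self.stack:
            if entry["text"] is not None:
                entry["text"][2].append(data)

    def _close(self, entry):
        if entry["text"] is not None:
            owner, names, chunks = entry["text"]
            self._add(owner, names, "".join(chunks).strip())

    def handle_endtag(self, tag):
        # Close the nearest matching open element and anything left unclosed
        # inside it; end tags with no matching open element are ignored.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                while len(self.stack) > i:
                    self._close(self.stack.pop())
                break

    def finish(self):
        self.close()
        while self.stack:   # flush elements that were never closed
            self._close(self.stack.pop())
        return self.items


def extract_items(html):
    parser = MicrodataSketch()
    parser.feed(html)
    return parser.finish()


# Made-up messy input: unquoted attribute, unclosed <span>, upper-case end tag.
MESSY_HTML = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop=name>Jane Doe</span>
  <img itemprop="image" src="jane.png">
  <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="addressLocality">Seattle
  </div>
  <a itemprop="url" href="http://example.com/jane">site</a>
</DIV>
"""

if __name__ == "__main__":
    print(json.dumps(extract_items(MESSY_HTML), indent=2))
```

Feeding the same messy snippets to a real parser and comparing the results against a sketch like this is one way to see where each one gives up. Markup that relies on implied end tags (for example consecutive unclosed p elements) will trip this sketch, which is the kind of case where a full HTML5 tree builder earns its keep.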

Jason R answered Oct 19 '22 01:10
