I'm parsing HTML pages with lxml. The pages have meta tags as follows:
<meta property="og:locality" content="Detroit" />
<meta property="og:country-name" content="USA" />
How can I use lxml to find the value of the og:locality
meta tag on each page, efficiently?
I've currently got the following, which just manually matches up meta tags by property:
    for meta in doc3.cssselect('meta'):
        prop = meta.get('property')
        if prop == 'og:locality':
            locality = meta.get('content')
But it doesn't feel very efficient.
lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).
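As a rough sketch, the event-driven (step-by-step) XML API looks like this; the XML snippet and the item tag below are placeholders for illustration, not part of the question:

    import io
    from lxml import etree

    # Step-by-step parsing with iterparse: elements are handed back as their
    # closing tags are seen, so a large file never has to sit fully in memory.
    xml = b"<feed><item>one</item><item>two</item></feed>"
    for event, element in etree.iterparse(io.BytesIO(xml), events=('end',), tag='item'):
        print(element.text)   # 'one', then 'two'
        element.clear()       # release the element once it has been processed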
lxml is a Python library for handling XML and HTML documents, and it is also widely used for web scraping. There are plenty of off-the-shelf XML parsers, but when you need more control or better performance than a generic parser offers, lxml comes into play.
To save users from having to choose a parser library in advance, lxml can interface with the parsing capabilities of BeautifulSoup through the lxml.html.soupparser module. It provides three main functions: fromstring() and parse() to parse a string or file using BeautifulSoup into an lxml.html document, and convert_tree() to convert an existing BeautifulSoup tree into a list of top-level Elements.
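For example, a minimal sketch of the BeautifulSoup-backed parser; this assumes BeautifulSoup (bs4) is installed, and the markup is just the example tag from the question:

    from lxml.html import soupparser  # requires BeautifulSoup (bs4) to be installed

    # Parse possibly broken markup through BeautifulSoup into lxml elements.
    root = soupparser.fromstring('<meta property="og:locality" content="Detroit" />')
    meta = root.find('.//meta')
    print(meta.get('content'))  # Detroit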
Parsing with lxml.html.fromstring() gives us an object of type HtmlElement. This object has an xpath() method which we can use to query the HTML document, giving us a structured way to extract information from it.
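A minimal sketch, using the two meta tags from the question as the input:

    from lxml import html

    page = '''
    <html><head>
    <meta property="og:locality" content="Detroit" />
    <meta property="og:country-name" content="USA" />
    </head><body></body></html>
    '''

    doc = html.fromstring(page)            # returns an HtmlElement
    print(doc.xpath('//meta/@property'))   # ['og:locality', 'og:country-name']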
You could use this XPath selector: //meta[@property='og:locality']/@content
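Applied to a parsed document (doc3 in the question, or any HtmlElement), the /@content step returns the attribute values directly, so no manual loop over the meta tags is needed:

    from lxml import html

    doc3 = html.fromstring(
        '<head><meta property="og:locality" content="Detroit" />'
        '<meta property="og:country-name" content="USA" /></head>'
    )

    values = doc3.xpath("//meta[@property='og:locality']/@content")
    locality = values[0] if values else None  # guard against pages missing the tag
    print(locality)  # Detroit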