
Parsing meta tags efficiently with lxml?

I'm parsing HTML pages with lxml. The pages have meta tags as follows:

<meta property="og:locality" content="Detroit" />
<meta property="og:country-name" content="USA" />

How can I use lxml to find the value of the og:locality meta tag on each page, efficiently?

I've currently got the following, which just manually matches up meta tags by property:

for meta in doc3.cssselect('meta'):
    prop = meta.get('property')
    # Keep the content of the Open Graph locality tag
    if prop == 'og:locality':
        lat = meta.get('content')

But it doesn't feel very efficient.

Richard asked Nov 15 '11 18:11



1 Answer

You could use this XPath selector: //meta[@property='og:locality']/@content
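As a minimal sketch of how that selector might be used from lxml, assuming the page has been parsed with lxml.html: the SAMPLE markup and the meta_content helper below are made up for illustration and are not part of lxml. The property name is passed as an XPath variable, so it does not need to be quoted into the expression by hand.

```python
from lxml import html

# Hypothetical sample page, modelled on the meta tags in the question.
SAMPLE = """
<html><head>
<meta property="og:locality" content="Detroit" />
<meta property="og:country-name" content="USA" />
</head><body></body></html>
"""


def meta_content(doc, prop):
    """Return the content of the first <meta property=prop>, or None."""
    # $prop is an XPath variable; lxml fills it from the keyword argument.
    values = doc.xpath("//meta[@property=$prop]/@content", prop=prop)
    return values[0] if values else None


doc = html.fromstring(SAMPLE)
locality = meta_content(doc, "og:locality")
print(locality)
```

If the same lookup runs over many pages, it may be worth compiling the expression once with lxml.etree.XPath and calling the compiled object on each document, though whether that helps measurably depends on the workload.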

Acorn answered Oct 28 '22 05:10