
Is it possible for lxml to work in a case-insensitive manner?

I'm trying to scrape META keywords and description tags from arbitrary websites. I obviously have no control over those sites, so I have to take what I'm given. They use a variety of casings for the tag and its attributes, which means I need to work case-insensitively. I can't believe that the lxml authors are so stubborn as to insist on strict standards compliance when it rules out so many uses of their library.

I'd like to be able to say doc.cssselect('meta[name=description]') (or some XPath equivalent), but this will not catch <meta name="Description" Content="..."> tags due to the capital D.

I'm currently using this as a workaround, but it's horrible!

for meta in doc.cssselect('meta'):
    name = meta.get('name')
    content = meta.get('content')

    if name and content:
        if name.lower() == 'keywords':
            keywords = content
        if name.lower() == 'description':
            description = content

It seems that the tag name meta is treated case-insensitively, but the attribute values are not. It would be even more annoying if meta were case-sensitive too!

Mat asked Nov 14 '09 12:11


2 Answers

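Attribute values are case-sensitive, so a case-insensitive match has to be requested explicitly.

You can use an arbitrary regular expression to select an element, through the EXSLT regular-expressions namespace: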

#!/usr/bin/env python
from lxml import html

doc = html.fromstring('''
    <meta name="Description">
    <meta name="description">
    <META name="description">
    <meta NAME="description">
''')
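# re:test() comes from the EXSLT regular-expressions extension;
# the "i" flag makes the match case-insensitive.
for meta in doc.xpath('//meta[re:test(@name, "^description$", "i")]',
                      namespaces={"re": "http://exslt.org/regular-expressions"}):
    print(html.tostring(meta, encoding="unicode", with_tail=False))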

Output:

<meta name="Description">
<meta name="description">
<meta name="description">
<meta name="description">
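
If a regular expression feels heavy for this, plain XPath 1.0 can fold case with translate(). The following is only a sketch and not part of the original answer: the sample markup and the helper name meta_content are made up for illustration, and translate() folds only the ASCII letters listed in the two alphabets.

```python
from lxml import html

# Hypothetical sample page with mixed casing in tags, attributes and values.
doc = html.fromstring('''
<html><head>
    <meta name="Description" content="A sample page">
    <meta NAME="KEYWORDS" Content="lxml, xpath">
    <meta name="author" content="someone">
</head><body></body></html>
''')

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"


def meta_content(doc, name):
    """Return the content of the first <meta> whose name matches, ignoring ASCII case."""
    # lxml binds keyword arguments to XPath variables ($upper, $lower, $name).
    # The HTML parser has already lowercased tag and attribute names,
    # so only the attribute value needs folding.
    hits = doc.xpath(
        '//meta[translate(@name, $upper, $lower) = $name]/@content',
        upper=UPPER, lower=LOWER, name=name.lower())
    return hits[0] if hits else None


description = meta_content(doc, "description")
keywords = meta_content(doc, "keywords")
print(description)
print(keywords)
```

Passing the name as an XPath variable rather than formatting it into the expression avoids quoting problems when the value comes from elsewhere.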
jfs answered Nov 26 '22 14:11


lxml is primarily an XML parser, and XML is case-sensitive. You are parsing HTML, so you should use a parser built for HTML. BeautifulSoup is very popular; its main drawback is that it can be slow.
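
To illustrate what a dedicated HTML parser buys you, here is a rough sketch. It does not use BeautifulSoup; it swaps in html.parser from the Python standard library so that it runs without extra packages. The class name MetaCollector and the sample markup are invented for the example. The parser lowercases tag and attribute names itself, so only the attribute value still needs lower().

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs, keyed by lowercased name."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive already lowercased.
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name")
        content = attrs.get("content")
        if name and content:
            # Keep the first occurrence of each name.
            self.meta.setdefault(name.lower(), content)


collector = MetaCollector()
collector.feed('<META NAME="Description" Content="A sample page">'
               '<meta name="KEYWORDS" content="lxml, html">')
description = collector.meta.get("description")
keywords = collector.meta.get("keywords")
print(description)
print(keywords)
```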

Ned Batchelder answered Nov 26 '22 14:11