
find_all with camelCase tag names with BeautifulSoup 4

I'm trying to scrape an XML file with BeautifulSoup 4.4.0. The file has camelCase tag names, and find_all doesn't seem to be able to find them. Example code:

from bs4 import BeautifulSoup

xml = """
<hello>
    world
</hello>
"""
soup = BeautifulSoup(xml, "lxml")

for x in soup.find_all("hello"):
    print(x)

xml2 = """
<helloWorld>
    :-)
</helloWorld>
"""
soup = BeautifulSoup(xml2, "lxml")

for x in soup.find_all("helloWorld"):
    print(x)

The output I get is:

$ python soup_test.py
<hello>
    world
</hello>

What's the correct way to look up camel cased/uppercased tag names?

Paul Johnson asked Jul 21 '15 23:07


1 Answer

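For case-sensitive tag matching with BeautifulSoup, parse the document in "xml" mode. The "lxml" mode you are using is an HTML parser, and since HTML tag names are case-insensitive it lowercases them: <helloWorld> ends up in the tree as <helloworld>, so find_all("helloWorld") matches nothing. In your case, instead of "lxml", pass "xml", which selects lxml's XML parser and preserves case: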

from bs4 import BeautifulSoup

xml2 = """
<helloWorld>
    :-)
</helloWorld>
"""
soup = BeautifulSoup(xml2, "xml")

for x in soup.find_all("helloWorld"):
    print(x)
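
The difference between HTML and XML handling of tag case can also be seen without BeautifulSoup. The following is a minimal sketch that uses only the Python 3 standard library, not the lxml parsers discussed in this answer, so it illustrates the general behaviour rather than what bs4 does internally. The class name TagCollector is made up for the example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

doc = "<helloWorld>:-)</helloWorld>"


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start tag name the HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes the tag name already converted to lower case
        self.tags.append(tag)


collector = TagCollector()
collector.feed(doc)
collector.close()
html_tags = collector.tags

# An XML parser keeps the tag name exactly as written
xml_tag = ET.fromstring(doc).tag

print(html_tags)  # ['helloworld']
print(xml_tag)    # helloWorld
```

If you have to stay with an HTML parser for some reason, searching for the lowercased name is the corresponding workaround, but for real XML input the "xml" mode shown above is the better fit.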
heinst answered Oct 20 '22 21:10