
Finding a nonrecursive DOM subnode in Python using BeautifulSoup

Is there any way to find a nonrecursive DOM subnode in Python using BeautifulSoup?

E.g. consider parsing a pom.xml file:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

    <parent>
        <groupId>com.parent</groupId>
        <artifactId>parent</artifactId>
        <version>1.0-SNAPSHOT</version>
        <relativePath>../pom.xml</relativePath>
    </parent>

    <modelVersion>2.0.0</modelVersion>
    <groupId>com.parent.somemodule</groupId>
    <artifactId>some_module</artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>Some Module</name>
    ...

If I want to get groupId at the top level (specifically project->groupId, not project->parent->groupId), I use:

from bs4 import BeautifulSoup

with open(pom) as pomHandle:
    soup = BeautifulSoup(pomHandle)

# Tag names are lowercased by the HTML parser, hence 'groupid'
groupId = soup.groupid.text

Unfortunately, that returns the first occurrence of groupId in the file regardless of its level in the hierarchy, which here is project->parent->groupId. What I actually want is a non-recursive find that looks ONLY at the direct children of a specific node, not at deeper descendants. Is there a way to do that in BeautifulSoup?

amphibient asked Jan 15 '14 20:01




1 Answer

You can restrict the search to the direct children of the "project" node by passing recursive=False to find():

groupId = soup.project.find('groupid', recursive=False).text
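
To make the difference concrete, here is a minimal, self-contained sketch. It assumes a trimmed-down copy of the pom.xml from the question (namespaces dropped, held in a hypothetical POM string) and the built-in "html.parser", which lowercases tag names just like the snippet in the question does; treat it as an illustration, not the only way to do this.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the pom.xml from the question (namespaces omitted).
# POM is a hypothetical name used only for this illustration.
POM = """
<project>
    <parent>
        <groupId>com.parent</groupId>
        <artifactId>parent</artifactId>
    </parent>
    <groupId>com.parent.somemodule</groupId>
    <artifactId>some_module</artifactId>
</project>
"""

# html.parser ships with Python and lowercases tag names,
# which is why the lookups below use 'groupid' rather than 'groupId'.
soup = BeautifulSoup(POM, "html.parser")

# Default (recursive) lookup: first match in document order, at any depth.
first_anywhere = soup.project.find("groupid").text

# recursive=False: only direct children of <project> are considered.
top_level = soup.project.find("groupid", recursive=False).text

# The same flag works on find_all(); True matches every tag.
direct_children = [tag.name for tag in soup.project.find_all(True, recursive=False)]

print(first_anywhere)   # com.parent
print(top_level)        # com.parent.somemodule
print(direct_children)  # ['parent', 'groupid', 'artifactid']
```

If you parse with BeautifulSoup(pomHandle, "xml") instead (which needs lxml installed), tag names keep their original case, so the lookup would be find('groupId', recursive=False).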

Hope that helps.

alecxe answered Oct 20 '22 19:10
