beautifulsoup findall

I have some xml:

<article>
<uselesstag></uselesstag>
<topic>oil, gas</topic>
<body>body text</body>
</article>

<article>
<uselesstag></uselesstag>
<topic>food</topic>
<body>body text</body>
</article>

<article>
<uselesstag></uselesstag>
<topic>cars</topic>
<body>body text</body>
</article>

There are many, many useless tags. I want to use BeautifulSoup to collect all of the text in the body tags, together with each body's associated topic text, so I can create some new XML.

I am new to Python, but I suspect that some form of

import arff
from xml.etree import ElementTree
import re
from StringIO import StringIO

from BeautifulSoup import BeautifulSoup

totstring = ""

with open('reut2-000.sgm', 'r') as inF:
    for line in inF:
        # strip out characters I don't want before parsing
        string = re.sub("[^0-9a-zA-Z<>/\s=!-\"\"]+", "", line)
        totstring += string
# (the with-block closes the file, so no explicit close is needed)

soup = BeautifulSoup(totstring)

body = soup.find("body")

for anchor in soup.findAll('body'):
    # stick the body and its topic in an associated array?
    pass

will work.

1) How do I do it? 2) Should I add a root node to the XML? Otherwise it's not proper XML, is it?

Thanks very much

Edit:

What I want to end up with is:

<article>
<topic>oil, gas</topic>
<body>body text</body>
</article>

<article>
<topic>food</topic>
<body>body text</body>
</article>

<article>
<topic>cars</topic>
<body>body text</body>
</article>

There are many, many useless tags.

asked May 09 '12 by RNs_Ghost


2 Answers

OK, here is the solution.

First, make sure you have 'beautifulsoup4' installed: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup

Here is my code to get all the body and topic tags:

from bs4 import BeautifulSoup
html_doc= """
<article>
<topic>oil, gas</topic>
<body>body text</body>
</article>

<article>
<topic>food</topic>
<body>body text</body>
</article>

<article>
<topic>cars</topic>
<body>body text</body>
</article>
"""
soup = BeautifulSoup(html_doc)

# pull the text out of every <body> and <topic> tag
bodies = [a.get_text() for a in soup.find_all('body')]
topics = [a.get_text() for a in soup.find_all('topic')]
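
If you also need each body paired with the topic from the same article (as in the desired output in the question), here is a small variation on the same idea. This is only a sketch under the assumption that every <article> contains exactly one <topic> and one <body>; I also pass 'html.parser' explicitly so the literal <body> tags are not rearranged by a stricter parser:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

# collect (topic text, body text) pairs, one per <article>
pairs = []
for article in soup.find_all('article'):
    topic = article.find('topic')
    body = article.find('body')
    if topic is not None and body is not None:
        pairs.append((topic.get_text(), body.get_text()))

for topic, body in pairs:
    print topic + ' -> ' + body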
answered Nov 17 '22 by Arthur Neves


Another way to remove empty XML or HTML tags is to use a recursive function that searches for empty tags and removes them with .extract(). This way, you don't have to manually list which tags you want to keep, and it also cleans up empty tags that are nested.

from bs4 import BeautifulSoup
import re
nonwhite=re.compile(r'\S+',re.U)

html_doc1="""
<article>
<uselesstag2>
<uselesstag1>
</uselesstag1>
</uselesstag2>
<topic>oil, gas</topic>
<body>body text</body>
</article>

<p>21.09.2009</p> 
<p> </p> 
<p1><img src="http://www.www.com/"></p1> 
<p></p> 

<!--- This article is about cars--->
<article>
<topic>cars</topic>
<body>body text</body>
</article>
"""

def nothing_inside(thing):
    # select only tags to examine, leave comments/strings
    try:
        # check for img empty tags
        if thing.name == 'img' and thing['src'] != '':
            return False
        else:
            pass
        # check if any non-whitespace contents
        for item in thing.contents:
            if nonwhite.match(item):
                return False
            else:
                pass
        return True
    except:
        # anything unexpected (a tag child among the contents, a missing
        # 'src' attribute, ...): treat the tag as non-empty and keep it
        return False

def scrub(thing):
    # loop function as long as an empty tag exists
    while thing.find_all(nothing_inside, recursive=True) != []:
        for emptytag in thing.find_all(nothing_inside,recursive=True):
            emptytag.extract()
            scrub(thing)
    return thing

soup=BeautifulSoup(html_doc1)
print scrub(soup)

Result:

<article>

<topic>oil, gas</topic>
<body>body text</body>
</article>
<p>21.09.2009</p>

<p1><img src="http://www.www.com/"/></p1>

<!--- This article is about cars--->
<article>
<topic>cars</topic>
<body>body text</body>
</article>
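
On the asker's second question (adding a root node): a fragment with several top-level <article> elements is not a well-formed XML document, so if the cleaned output has to be fed to an XML parser it needs a single root element. A minimal sketch, reusing the scrubbed soup from above; the <articles> root name is just an invented placeholder:

# 'soup' has already been scrubbed by the call above; wrap the surviving
# <article> elements in a single (made-up) <articles> root element so that
# the output is a well-formed XML document
parts = ['<articles>']
for article in soup.find_all('article'):
    parts.append(str(article))
parts.append('</articles>')

print '\n'.join(parts)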
answered Nov 17 '22 by Kao