beautifulsoup findall

I have some xml:

<article>
<uselesstag></uselesstag>
<topic>oil, gas</topic>
<body>body text</body>
</article>

<article>
<uselesstag></uselesstag>
<topic>food</topic>
<body>body text</body>
</article>

<article>
<uselesstag></uselesstag>
<topic>cars</topic>
<body>body text</body>
</article>

There are many, many useless tags. I want to use BeautifulSoup to collect all of the text in the body tags, together with each body's associated topic text, so I can create some new XML.

I am new to Python, but I suspect that some form of

import arff
from xml.etree import ElementTree
import re
from StringIO import StringIO

from BeautifulSoup import BeautifulSoup

totstring = ""

with open('reut2-000.sgm', 'r') as inF:
    for line in inF:
        # strip out characters I don't want before parsing
        string = re.sub("[^0-9a-zA-Z<>/\s=!-\"\"]+", "", line)
        totstring += string
# (the with-block closes the file, so no explicit close is needed)

soup = BeautifulSoup(totstring)

body = soup.find("body")

for anchor in soup.findAll('body'):
    # stick the body and its topic in an associated array?
    pass

will work.

1) How do I do it? 2) Should I add a root node to the XML? Otherwise it's not proper XML, is it?

Thanks very much

Edit:

What I want to end up with is:

<article>
<topic>oil, gas</topic>
<body>body text</body>
</article>

<article>
<topic>food</topic>
<body>body text</body>
</article>

<article>
<topic>cars</topic>
<body>body text</body>
</article>

There are many, many useless tags.

asked May 09 '12 by RNs_Ghost


2 Answers

OK, here is the solution.

First, make sure you have 'beautifulsoup4' installed: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup

Here is my code to get all the body and topic tags:

from bs4 import BeautifulSoup
html_doc= """
<article>
<topic>oil, gas</topic>
<body>body text</body>
</article>

<article>
<topic>food</topic>
<body>body text</body>
</article>

<article>
<topic>cars</topic>
<body>body text</body>
</article>
"""
soup = BeautifulSoup(html_doc)

# pull the text out of every <body> and <topic> tag
bodies = [a.get_text() for a in soup.find_all('body')]
topics = [a.get_text() for a in soup.find_all('topic')]
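
If you also need each body paired with the topic from the same article (as in the desired output in the question), here is a small variation on the same idea. This is only a sketch under the assumption that every <article> contains exactly one <topic> and one <body>; I also pass 'html.parser' explicitly so the literal <body> tags are not rearranged by a stricter parser:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

# collect (topic text, body text) pairs, one per <article>
pairs = []
for article in soup.find_all('article'):
    topic = article.find('topic')
    body = article.find('body')
    if topic is not None and body is not None:
        pairs.append((topic.get_text(), body.get_text()))

for topic, body in pairs:
    print topic + ' -> ' + body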
answered Nov 17 '22 by Arthur Neves


Another way to remove empty XML or HTML tags is to use a recursive function that searches for empty tags and removes them with .extract(). This way, you don't have to manually list which tags you want to keep, and it also cleans up empty tags that are nested.

from bs4 import BeautifulSoup
import re
nonwhite=re.compile(r'\S+',re.U)

html_doc1="""
<article>
<uselesstag2>
<uselesstag1>
</uselesstag1>
</uselesstag2>
<topic>oil, gas</topic>
<body>body text</body>
</article>

<p>21.09.2009</p> 
<p> </p> 
<p1><img src="http://www.www.com/"></p1> 
<p></p> 

<!--- This article is about cars--->
<article>
<topic>cars</topic>
<body>body text</body>
</article>
"""

def nothing_inside(thing):
    # select only tags to examine, leave comments/strings
    try:
        # check for img empty tags
        if thing.name == 'img' and thing['src'] != '':
            return False
        else:
            pass
        # check if any non-whitespace contents
        for item in thing.contents:
            if nonwhite.match(item):
                return False
            else:
                pass
        return True
    except:
        # anything unexpected (a tag child among the contents, a missing
        # 'src' attribute, ...): treat the tag as non-empty and keep it
        return False

def scrub(thing):
    # loop function as long as an empty tag exists
    while thing.find_all(nothing_inside, recursive=True) != []:
        for emptytag in thing.find_all(nothing_inside,recursive=True):
            emptytag.extract()
            scrub(thing)
    return thing

soup=BeautifulSoup(html_doc1)
print scrub(soup)

Result:

<article>

<topic>oil, gas</topic>
<body>body text</body>
</article>
<p>21.09.2009</p>

<p1><img src="http://www.www.com/"/></p1>

<!--- This article is about cars--->
<article>
<topic>cars</topic>
<body>body text</body>
</article>
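
On the asker's second question (adding a root node): a fragment with several top-level <article> elements is not a well-formed XML document, so if the cleaned output has to be fed to an XML parser it needs a single root element. A minimal sketch, reusing the scrubbed soup from above; the <articles> root name is just an invented placeholder:

# 'soup' has already been scrubbed by the call above; wrap the surviving
# <article> elements in a single (made-up) <articles> root element so that
# the output is a well-formed XML document
parts = ['<articles>']
for article in soup.find_all('article'):
    parts.append(str(article))
parts.append('</articles>')

print '\n'.join(parts)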
answered Nov 17 '22 by Kao