I have some xml:
<article>
<uselesstag></uselesstag>
<topic>oil, gas</topic>
<body>body text</body>
</article>
<article>
<uselesstag></uselesstag>
<topic>food</topic>
<body>body text</body>
</article>
<article>
<uselesstag></uselesstag>
<topic>cars</topic>
<body>body text</body>
</article>
There are many, many useless tags. I want to use BeautifulSoup to collect all of the text in the body tags, together with the associated topic text, to create some new XML.
I am new to Python, but I suspect that some form of

import arff
import re
from xml.etree import ElementTree
from StringIO import StringIO
from BeautifulSoup import BeautifulSoup

totstring = ""
with open('reut2-000.sgm', 'r') as inF:
    for line in inF:
        string = re.sub("[^0-9a-zA-Z<>/\s=!-\"\"]+", "", line)
        totstring += string

soup = BeautifulSoup(totstring)
body = soup.find("body")
for anchor in soup.findAll('body'):
    # Stick body and its topics in an associated array?

will work.
1) How do I do it? 2) Should I add a root node to the XML? Otherwise it's not proper XML, is it?
Thanks very much
Edit:
What i want to end up with is:
<article>
<topic>oil, gas</topic>
<body>body text</body>
</article>
<article>
<topic>food</topic>
<body>body text</body>
</article>
<article>
<topic>cars</topic>
<body>body text</body>
</article>
There are many, many useless tags.
OK, here is the solution.
First, make sure you have 'beautifulsoup4' installed: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
Here is my code to get all body and topic tags:
from bs4 import BeautifulSoup
html_doc= """
<article>
<topic>oil, gas</topic>
<body>body text</body>
</article>
<article>
<topic>food</topic>
<body>body text</body>
</article>
<article>
<topic>cars</topic>
<body>body text</body>
</article>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
bodies = [a.get_text() for a in soup.find_all('body')]
topics = [a.get_text() for a in soup.find_all('topic')]
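Collecting the two tags into separate lists loses the pairing between each body and its topic. To keep them associated (and to answer the new-XML part of the question), you can iterate per `<article>` and rebuild the output with `xml.etree.ElementTree` — a minimal sketch; the `<articles>` root name and the inline sample document are my own choices, not from the original data:

```python
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET

doc = """
<article>
<uselesstag></uselesstag>
<topic>oil, gas</topic>
<body>body text</body>
</article>
<article>
<uselesstag></uselesstag>
<topic>food</topic>
<body>body text</body>
</article>
"""

soup = BeautifulSoup(doc, 'html.parser')

root = ET.Element('articles')  # single root node makes the output proper XML
for article in soup.find_all('article'):
    topic = article.find('topic')
    body = article.find('body')
    if topic is None or body is None:
        continue  # skip articles missing either tag
    out = ET.SubElement(root, 'article')
    ET.SubElement(out, 'topic').text = topic.get_text()
    ET.SubElement(out, 'body').text = body.get_text()

print(ET.tostring(root, encoding='unicode'))
```

Because the loop visits one `<article>` at a time, each topic stays attached to its own body regardless of how many useless tags sit between them.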
Another way to remove empty XML or HTML tags is to use a recursive function that searches for empty tags and removes them with .extract(). This way you don't have to list out manually which tags you want to keep, and it also cleans empty tags that are nested.
from bs4 import BeautifulSoup
import re
nonwhite=re.compile(r'\S+',re.U)
html_doc1="""
<article>
<uselesstag2>
<uselesstag1>
</uselesstag1>
</uselesstag2>
<topic>oil, gas</topic>
<body>body text</body>
</article>
<p>21.09.2009</p>
<p> </p>
<p1><img src="http://www.www.com/"></p1>
<p></p>
<!--- This article is about cars--->
<article>
<topic>cars</topic>
<body>body text</body>
</article>
"""
def nothing_inside(thing):
    # Only tags are examined; comments/strings make the checks raise,
    # which falls through to False so they are left alone.
    try:
        # keep <img> tags that actually have a src
        if thing.name == 'img' and thing['src'] != '':
            return False
        # check for any non-whitespace contents
        for item in thing.contents:
            if nonwhite.match(item):
                return False
        return True
    except Exception:
        return False

def scrub(thing):
    # loop as long as an empty tag exists, so tags that become
    # empty after their children are removed get caught too
    while thing.find_all(nothing_inside, recursive=True):
        for emptytag in thing.find_all(nothing_inside, recursive=True):
            emptytag.extract()
        scrub(thing)
    return thing

soup = BeautifulSoup(html_doc1, 'html.parser')
print(scrub(soup))
Result:
<article>
<topic>oil, gas</topic>
<body>body text</body>
</article>
<p>21.09.2009</p>
<p1><img src="http://www.www.com/"/></p1>
<!--- This article is about cars--->
<article>
<topic>cars</topic>
<body>body text</body>
</article>
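On question 2: yes — a file that is just a sequence of `<article>` elements has no single root, so strict XML parsers will reject it. One simple workaround is to wrap the text in a root element yourself before parsing; a sketch, where the `<articles>` wrapper name is my own choice:

```python
import xml.etree.ElementTree as ET

fragment = """
<article>
<topic>cars</topic>
<body>body text</body>
</article>
<article>
<topic>food</topic>
<body>body text</body>
</article>
"""

# Wrap the fragment so the document has exactly one root element.
root = ET.fromstring('<articles>' + fragment + '</articles>')
topics = [a.findtext('topic') for a in root.findall('article')]
print(topics)  # → ['cars', 'food']
```

BeautifulSoup itself is forgiving about missing roots, but writing the root into your output XML means any standard parser can read it later.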