How can I grab CData out of BeautifulSoup?

Tags: python, beautifulsoup, cdata, screen-scraping

I have a website that I'm scraping that has a structure similar to the following. I'd like to be able to grab the info out of the CData block.

I'm using BeautifulSoup to pull other info off the page, so if the solution can work with that, it would help keep my learning curve down, as I'm a Python novice. Specifically, I want to get at the two different types of data hidden in the CData statement. The first is just text, and I'm pretty sure I can throw a regex at it and get what I need. For the second type, if I could drop the data that has HTML elements into its own BeautifulSoup, I could parse that.

I'm just learning Python and BeautifulSoup, so I'm struggling to find the magical incantation that will give me just the CData by itself.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">  
<head>  
<title>
   Cows and Sheep
  </title>
</head>
<body>
 <div id="main">
  <div id="main-precontents">
   <div id="main-contents" class="main-contents">
    <script type="text/javascript">
       //<![CDATA[var _ = g_cow;_[7654]={cowname_enus:'cows rule!',leather_quality:99,icon:'cow_level_23'};_[37357]={sheepname_enus:'baa breath',wool_quality:75,icon:'sheep_level_23'};_[39654].cowmeat_enus = '<table><tr><td><b class="q4">cows rule!</b><br></br>
       <!--ts-->
       get it now<table width="100%"><tr><td>NOW</td><th>NOW</th></tr></table><span>244 Cows</span><br></br>67 leather<br></br>68 Brains
       <!--yy-->
       <span class="q0">Cow Bonus: +9 Cow Power</span><br></br>Sheep Power 60 / 60<br></br>Sheep 88<br></br>Cow Level 555</td></tr></table>
       <!--?5695:5:40:45-->
       ';
        //]]>
      </script>
     </div>
     </div>
    </div>
 </body>
</html>

asked Jan 09 '10 02:01

hary wilke

2 Answers

One thing to be careful of when grabbing CData with BeautifulSoup: do not use the lxml parser.

By default, the lxml parser strips CDATA sections from the tree and replaces them with their plain text content, so there is no CData node left to find.

# Trying it with html.parser
>>> from bs4 import BeautifulSoup
>>> import bs4
>>> s='''<?xml version="1.0" ?>
<foo>
    <bar><![CDATA[
        aaaaaaaaaaaaa
    ]]></bar>
</foo>'''
>>> soup = BeautifulSoup(s, "html.parser")
>>> soup.find(text=lambda tag: isinstance(tag, bs4.CData)).string.strip()
'aaaaaaaaaaaaa'
>>>
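
One caveat for the page in the question, sketched here under the assumption that bs4 is used with html.parser: there the CDATA markers sit inside a script element, and html.parser hands script contents over as raw text, so no separate CData node shows up. The page below is a shortened stand-in for the original, not the full document.

```python
from bs4 import BeautifulSoup, CData

# Shortened stand-in for the page in the question (an assumption for illustration).
page = """<div id="main-contents">
<script type="text/javascript">
//<![CDATA[
var _ = g_cow;_[7654]={cowname_enus:'cows rule!',leather_quality:99};
//]]>
</script>
</div>"""

soup = BeautifulSoup(page, "html.parser")

# Collect every string node and keep only the CData ones.
cdata_nodes = [s for s in soup.find_all(string=True) if isinstance(s, CData)]
print(cdata_nodes)

# The CDATA markers are simply part of the script's raw text.
script_text = soup.find("script").string
print("<![CDATA[" in script_text)
```

So for a script block like this, it is probably easier to take the script's string and strip the markers than to look for a CData node.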

answered Sep 21 '22 11:09

iMath

BeautifulSoup sees CData as a special case (subclass) of "navigable strings". So for example:

from bs4 import BeautifulSoup, CData

txt = '''<foobar>We have
       <![CDATA[some data here]]>
       and more.
       </foobar>'''

# html.parser keeps CDATA sections as CData nodes in the tree.
soup = BeautifulSoup(txt, "html.parser")
for cd in soup.find_all(string=True):
    if isinstance(cd, CData):
        print('CData contents: %r' % cd)

In your case of course you could look in the subtree starting at the div with the 'main-contents' ID, rather than all over the document tree.
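
A minimal sketch of that subtree approach for the page in the question, assuming bs4 on Python 3 with html.parser. Because the CDATA there sits inside a script element, the sketch reads the script's text, strips the `//<![CDATA[` and `//]]>` markers, pulls the plain-text value out with a regex, and hands the embedded HTML to a second BeautifulSoup. The shortened page and both regexes are illustrative assumptions, not part of the original answer.

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for the page in the question (an assumption for illustration).
page = """<div id="main">
 <div id="main-contents" class="main-contents">
  <script type="text/javascript">
   //<![CDATA[var _ = g_cow;_[7654]={cowname_enus:'cows rule!',leather_quality:99,icon:'cow_level_23'};_[39654].cowmeat_enus = '<table><tr><td><b class="q4">cows rule!</b><br></br>
   <span class="q0">Cow Bonus: +9 Cow Power</span></td></tr></table>
   ';
   //]]>
  </script>
 </div>
</div>"""

soup = BeautifulSoup(page, "html.parser")

# Search only the subtree under the div with id "main-contents".
container = soup.find("div", id="main-contents")
script_text = container.find("script").string

# Drop the commented-out CDATA markers to leave plain JavaScript.
js = re.sub(r"//\s*(<!\[CDATA\[|\]\]>)", "", script_text).strip()

# First kind of data: a plain-text field, pulled out with a regex.
cow_name = re.search(r"cowname_enus:'([^']*)'", js).group(1)

# Second kind: the HTML inside the quoted string gets its own soup.
html_fragment = re.search(r"cowmeat_enus\s*=\s*'(.*?)';", js, re.DOTALL).group(1)
fragment = BeautifulSoup(html_fragment, "html.parser")
bonus = fragment.find("span", class_="q0").get_text()

print(cow_name)
print(bonus)
```

The regexes rely on the embedded HTML using only double quotes, as it does in the sample; a page that puts single quotes inside the string would need a more careful pattern.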

answered Sep 18 '22 11:09

Alex Martelli
