BeautifulSoup getText from between <p>, not picking up subsequent paragraphs

Firstly, I am a complete newbie when it comes to Python. However, I have written a piece of code to look at an RSS feed, open each link, and extract the text from the article. This is what I have so far:

    from BeautifulSoup import BeautifulSoup
    import feedparser
    import re
    import urllib

    # Dictionaries
    links = {}
    titles = {}

    # Variables
    n = 0

    rss_url = "feed://www.gfsc.gg/_layouts/GFSC/GFSCRSSFeed.aspx?Division=ALL&Article=All&Title=News&Type=doc&List=%7b66fa9b18-776a-4e91-9f80-30195001386c%7d%23%7b679e913e-6301-4bc4-9fd9-a788b926f565%7d%23%7b0e65f37f-1129-4c78-8f59-3db5f96409fd%7d%23%7bdd7c290d-5f17-43b7-b6fd-50089368e090%7d%23%7b4790a972-c55f-46a5-8020-396780eb8506%7d%23%7b6b67c085-7c25-458d-8a98-373e0ac71c52%7d%23%7be3b71b9c-30ce-47c0-8bfb-f3224e98b756%7d%23%7b25853d98-37d7-4ba2-83f9-78685f2070df%7d%23%7b14c41f90-c462-44cf-a773-878521aa007c%7d%23%7b7ceaf3bf-d501-4f60-a3e4-2af84d0e1528%7d%23%7baf17e955-96b7-49e9-ad8a-7ee0ac097f37%7d%23%7b3faca1d0-be40-445c-a577-c742c2d367a8%7d%23%7b6296a8d6-7cab-4609-b7f7-b6b7c3a264d6%7d%23%7b43e2b52d-e4f1-4628-84ad-0042d644deaf%7d"

    # Parse the RSS feed
    feed = feedparser.parse(rss_url)

    # View the entire feed, one entry at a time
    for post in feed.entries:
        # Create variables from posts
        link = post.link
        title = post.title
        # Add the link to the dictionary
        n += 1
        links[n] = link

    for k, v in links.items():
        # Open the linked article
        page = urllib.urlopen(v).read()
        page = str(page)
        soup = BeautifulSoup(page)

        # Find all of the text between paragraph tags and strip out the html
        page = soup.find('p').getText()

        # Strip ampersand codes and WATCH:
        page = re.sub('&\w+;', '', page)
        page = re.sub('WATCH:', '', page)

        # Print Page
        print(page)
        print(" ")

        # To stop after 3rd article, just whilst testing ** to be removed **
        if (k >= 3):
            break

This produces the following output:

    >>> (executing lines 1 to 45 of "RSS_BeautifulSoup.py")
    Total deposits held with Guernsey banks at the end of June 2012 increased 2.1% in sterling terms by £2.1 billion from the end of March 2012 level of £101 billion, up to £103.1 billion. This is 9.4% lower than the same time a year ago.  Total assets and liabilities increased by £2.9 billion to £131.2 billion representing a 2.3% increase over the quarter though this was 5.7% lower than the level a year ago.  The higher figures reflected the effects both of volume and exchange rate factors.

    The net asset value of total funds under management and administration has increased over the quarter ended 30 June 2012 by £711 million (0.3%) to reach £270.8 billion.For the year since 30 June 2011, total net asset values decreased by £3.6 billion (1.3%).

    The Commission has updated the warranties on the Form REG, Form QIF and Form FTL to take into account the Commission’s Guidance Notes on Personal Questionnaires and Personal Declarations.  In particular, the following warranty (varies slightly dependent on the application) has been inserted in the aforementioned forms,

    >>>

The problem is that this prints only the first paragraph of each article, but I need to show the entire article. Any help would be gratefully received.

asked Sep 17 '12 00:09

Darren Wadley

1 Answer

You are getting close!

    # Find all of the text between paragraph tags and strip out the html
    page = soup.find('p').getText()

Using find (as you've noticed) stops after finding one result. You need find_all if you want all the paragraphs (in BeautifulSoup 3, which your import uses, the method is spelled findAll). If the pages are formatted consistently (I only looked at one), you could also use something like

    soup.find('div', {'id': 'ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField'})

to zero in on the body of the article.
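
Putting the two suggestions together, a minimal sketch might look like the following. It assumes the newer bs4 package (Beautiful Soup 4) rather than the BeautifulSoup 3 import used in the question, and it runs against a small made-up HTML string instead of the real article pages, so SAMPLE_HTML and the helper name article_text are illustrative only. The div id is the one quoted above and may not match every page on the site.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one downloaded article page; in the question
# the real HTML comes from urllib.urlopen(v).read().
SAMPLE_HTML = """
<html><body>
<p>Site navigation</p>
<div id="ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField">
  <p>First paragraph of the article.</p>
  <p>Second paragraph of the article.</p>
</div>
<p>Footer text</p>
</body></html>
"""

# The container id suggested in the answer (assumed to wrap the article body).
ARTICLE_DIV_ID = "ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField"


def article_text(html):
    """Return the text of every <p> in the article body, one per line."""
    soup = BeautifulSoup(html, "html.parser")

    # Narrow the search to the article container when it is present;
    # otherwise fall back to searching the whole document.
    container = soup.find("div", {"id": ARTICLE_DIV_ID})
    if container is None:
        container = soup

    # find_all returns every matching tag, whereas find returns only the first.
    paragraphs = container.find_all("p")
    return "\n".join(p.get_text().strip() for p in paragraphs)


print(article_text(SAMPLE_HTML))
```

In the question's loop, the single soup.find('p').getText() line would be replaced by a join over all the paragraphs in the same way. Scoping to the div first keeps navigation and footer paragraphs out of the result, but only if the pages really do share that id.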

167

answered Sep 19 '22 16:09

Amanda

Tags: python, beautifulsoup, python-2.7
