
BeautifulSoup getText from between <p>, not picking up subsequent paragraphs

Firstly, I am a complete newbie when it comes to Python. However, I have written a piece of code to look at an RSS feed, open each link and extract the text from the article. This is what I have so far:

from BeautifulSoup import BeautifulSoup
import feedparser
import urllib
import re  # needed for the re.sub calls below (missing from the original script)

# Dictionaries
links = {}
titles = {}

# Variables
n = 0

rss_url = "feed://www.gfsc.gg/_layouts/GFSC/GFSCRSSFeed.aspx?Division=ALL&Article=All&Title=News&Type=doc&List=%7b66fa9b18-776a-4e91-9f80-30195001386c%7d%23%7b679e913e-6301-4bc4-9fd9-a788b926f565%7d%23%7b0e65f37f-1129-4c78-8f59-3db5f96409fd%7d%23%7bdd7c290d-5f17-43b7-b6fd-50089368e090%7d%23%7b4790a972-c55f-46a5-8020-396780eb8506%7d%23%7b6b67c085-7c25-458d-8a98-373e0ac71c52%7d%23%7be3b71b9c-30ce-47c0-8bfb-f3224e98b756%7d%23%7b25853d98-37d7-4ba2-83f9-78685f2070df%7d%23%7b14c41f90-c462-44cf-a773-878521aa007c%7d%23%7b7ceaf3bf-d501-4f60-a3e4-2af84d0e1528%7d%23%7baf17e955-96b7-49e9-ad8a-7ee0ac097f37%7d%23%7b3faca1d0-be40-445c-a577-c742c2d367a8%7d%23%7b6296a8d6-7cab-4609-b7f7-b6b7c3a264d6%7d%23%7b43e2b52d-e4f1-4628-84ad-0042d644deaf%7d"

# Parse the RSS feed
feed = feedparser.parse(rss_url)

# View the entire feed, one entry at a time
for post in feed.entries:
    # Create variables from posts
    link = post.link
    title = post.title
    # Add the link to the dictionary
    n += 1
    links[n] = link

for k, v in links.items():
    # Open the article link from the RSS feed
    page = urllib.urlopen(v).read()
    page = str(page)
    soup = BeautifulSoup(page)

    # Find all of the text between paragraph tags and strip out the html
    page = soup.find('p').getText()

    # Strip ampersand codes and WATCH:
    page = re.sub('&\w+;', '', page)
    page = re.sub('WATCH:', '', page)

    # Print page
    print(page)
    print(" ")

    # To stop after 3rd article, just whilst testing ** to be removed **
    if (k >= 3):
        break

This produces the following output:

>>> (executing lines 1 to 45 of "RSS_BeautifulSoup.py")
Total deposits held with Guernsey banks at the end of June 2012 increased 2.1% in sterling terms by £2.1 billion from the end of March 2012 level of £101 billion, up to £103.1 billion. This is 9.4% lower than the same time a year ago.

Total assets and liabilities increased by £2.9 billion to £131.2 billion representing a 2.3% increase over the quarter though this was 5.7% lower than the level a year ago. The higher figures reflected the effects both of volume and exchange rate factors.

The net asset value of total funds under management and administration has increased over the quarter ended 30 June 2012 by £711 million (0.3%) to reach £270.8 billion. For the year since 30 June 2011, total net asset values decreased by £3.6 billion (1.3%).

The Commission has updated the warranties on the Form REG, Form QIF and Form FTL to take into account the Commission's Guidance Notes on Personal Questionnaires and Personal Declarations. In particular, the following warranty (varies slightly dependent on the application) has been inserted in the aforementioned forms,
>>>

The problem is that this only prints the first paragraph of each article; I need to show the entire article. Any help would be gratefully received.

asked Sep 17 '12 by Darren Wadley

People also ask

Is navigable string editable in BeautifulSoup?

A NavigableString object represents the text contents of a tag. To access those contents, use the tag's .string attribute. You can replace the string with another string, but you can't edit the existing string in place.
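A minimal sketch of this, assuming BeautifulSoup 4 (bs4) and a made-up snippet of HTML purely for illustration:

from bs4 import BeautifulSoup  # assumes bs4, not the older BeautifulSoup 3 import used in the question

soup = BeautifulSoup("<p><b>old text</b></p>")
tag = soup.find('b')

print(tag.string)                     # "old text" - a NavigableString
tag.string.replace_with("new text")   # swap the whole string; you can't edit it character by character
print(soup.p)                         # <p><b>new text</b></p>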

How do you get a href value in BeautifulSoup?

To get href values with BeautifulSoup in Python, create a soup object by calling the BeautifulSoup class with the HTML string, then find the a elements that carry an href attribute by calling find_all with 'a' and href set to True.
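A short sketch of that, again assuming bs4 and some example HTML:

from bs4 import BeautifulSoup  # assumes bs4

html = '<a href="http://example.com">Example</a><a name="anchor-only">No href</a>'
soup = BeautifulSoup(html)

# href=True restricts find_all to <a> tags that actually have an href attribute
for a in soup.find_all('a', href=True):
    print(a['href'])   # prints http://example.com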


1 Answer

You are getting close!

# Find all of the text between paragraph tags and strip out the html
page = soup.find('p').getText()

Using find (as you've noticed) stops after finding one result. You need find_all if you want all the paragraphs. If the pages are formatted consistently (I just looked over one), you could also use something like

soup.find('div',{'id':'ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField'}) 

to zero in on the body of the article.
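For example, here is a minimal sketch of that approach inside the question's loop, assuming BeautifulSoup 4 (where the method is find_all; with the BeautifulSoup 3 import used in the question, the equivalent is findAll):

from bs4 import BeautifulSoup  # assumes bs4
import urllib

# v is one article URL from the links dictionary, as in the question's loop
page = urllib.urlopen(v).read()
soup = BeautifulSoup(page)

# Narrow the search to the article body using the div id above,
# then collect every paragraph inside it instead of only the first one
body = soup.find('div', {'id': 'ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField'})
paragraphs = body.find_all('p') if body is not None else soup.find_all('p')

# Join the text of all paragraphs into the full article
article_text = "\n\n".join(p.getText() for p in paragraphs)
print(article_text)

The fallback to soup.find_all('p') is just a safety net in case a page doesn't contain that div.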

answered Sep 19 '22 by Amanda