
Get Feeds from FeedParser and Import to Pandas DataFrame

I'm learning Python. As practice, I'm building an RSS scraper with feedparser, putting the output into a pandas DataFrame, and then trying to mine it with NLTK. As a first step, I'm getting a list of articles from multiple RSS feeds.

I used this post on how to pass multiple feeds and combined it with an answer I got previously to another question on how to get it into a Pandas dataframe.

The problem is that I want to see the data from all the feeds in my dataframe. Currently I can only access the first item in the list of feeds.

feedparser seems to be doing its job, but when I put the results into the pandas DataFrame, it only seems to grab the first RSS feed in the list.

import feedparser
import pandas as pd

rawrss = [
    'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
    'https://www.yahoo.com/news/rss/',
    'http://www.huffingtonpost.co.uk/feeds/index.xml',
    'http://feeds.feedburner.com/TechCrunch/',
    ]

feeds = []
for url in rawrss:
    feeds.append(feedparser.parse(url))

for feed in feeds:
    for post in feed.entries:
        print(post.title, post.link, post.summary)

df = pd.DataFrame(columns=['title', 'link', 'summary'])

for i, post in enumerate(feed.entries):
    df.loc[i] =  post.title, post.link, post.summary

df.shape

df
Nick Duddy asked Dec 13 '22 21:12



1 Answer

Your code loops through each post and prints its data, but the part that adds the post data to the dataframe is not inside that loop (in Python, indentation is meaningful!). By the time that part runs, the variable feed still refers to the last feed the loop visited, so you only see the data from one feed in your dataframe.
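To see the effect in isolation, here is a minimal sketch that uses plain lists as stand-ins for parsed feeds (the data and the name collected are made up for illustration, not taken from your feeds):

```python
# Stand-in data: each inner list plays the role of one feed's entries.
feeds = [["a1", "a2"], ["b1"], ["c1", "c2", "c3"]]

for feed in feeds:
    pass  # the original code printed every post at this point

# The loop has finished, but `feed` is still bound to the last item it visited,
# so a second, un-nested loop only ever sees that one feed.
collected = [post for post in feed]
print(collected)
```

Only the entries of the last stand-in feed are collected, which mirrors what happens when the dataframe loop sits outside the loop over feeds.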

You can build a list of posts as you loop through the feeds, and then create a dataframe at the end:

import feedparser
import pandas as pd

rawrss = [
    'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
    'https://www.yahoo.com/news/rss/',
    'http://www.huffingtonpost.co.uk/feeds/index.xml',
    'http://feeds.feedburner.com/TechCrunch/',
    ]

feeds = [] # list of feed objects
for url in rawrss:
    feeds.append(feedparser.parse(url))

posts = [] # list of posts [(title1, link1, summary1), (title2, link2, summary2) ... ]
for feed in feeds:
    for post in feed.entries:
        posts.append((post.title, post.link, post.summary))

df = pd.DataFrame(posts, columns=['title', 'link', 'summary']) # pass data to init

You could simplify this a little by combining the two for loops, so each feed is parsed and its posts are collected in the same pass:

posts = []
for url in rawrss:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((post.title, post.link, post.summary))
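One thing to watch for: not every feed is guaranteed to provide a summary for every entry, and attribute access such as post.summary fails when the field is missing. A hedged sketch of a more tolerant version follows. The helper collect_posts and the constant FIELDS are hypothetical names introduced here, and the fake feeds are stand-ins so the example runs without network access. It relies on feed entries supporting dict-style .get(), which feedparser entries do.

```python
from types import SimpleNamespace

FIELDS = ("title", "link", "summary")  # hypothetical constant for this sketch


def collect_posts(feeds, fields=FIELDS):
    """Return one tuple per entry across all feeds; missing fields become None."""
    posts = []
    for feed in feeds:
        for post in feed.entries:
            # .get() returns None instead of raising when a field is absent
            posts.append(tuple(post.get(name) for name in fields))
    return posts


# Stand-in feeds: objects with an `entries` list of dict-like posts.
fake_feeds = [
    SimpleNamespace(entries=[
        {"title": "A", "link": "http://example.com/a", "summary": "first"},
    ]),
    SimpleNamespace(entries=[
        {"title": "B", "link": "http://example.com/b"},  # no summary
    ]),
]

posts = collect_posts(fake_feeds)
print(posts)
```

With real data, the result of collect_posts(feedparser.parse(url) for url in rawrss) could be passed to pd.DataFrame(posts, columns=list(FIELDS)) in the same way as above.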
beenjaminnn answered Dec 16 '22 11:12
