I'm trying to make a desktop notifier, and for that I'm scraping news from a site. When I run the program, I get the following error.
news[child.tag] = child.encode('utf8')
AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'encode'
How do I resolve it? I'm completely new to this. I tried searching for solutions, but none of them worked for me.
Here is my code:
import requests
import xml.etree.ElementTree as ET

# url of news rss feed
RSS_FEED_URL = "http://www.hindustantimes.com/rss/topnews/rssfeed.xml"

def loadRSS():
    '''
    utility function to load RSS feed
    '''
    # create HTTP request response object
    resp = requests.get(RSS_FEED_URL)
    # return response content
    return resp.content

def parseXML(rss):
    '''
    utility function to parse XML format rss feed
    '''
    # create element tree root object
    root = ET.fromstring(rss)
    # create empty list for news items
    newsitems = []
    # iterate news items
    for item in root.findall('./channel/item'):
        news = {}
        # iterate child elements of item
        for child in item:
            # special checking for namespace object content:media
            if child.tag == '{http://search.yahoo.com/mrss/}content':
                news['media'] = child.attrib['url']
            else:
                news[child.tag] = child.encode('utf8')
        newsitems.append(news)
    # return news items list
    return newsitems

def topStories():
    '''
    main function to generate and return news items
    '''
    # load rss feed
    rss = loadRSS()
    # parse XML
    newsitems = parseXML(rss)
    return newsitems
You're trying to convert a str to bytes, and then store those bytes in a dictionary.
The problem is that the object you're doing this to is an xml.etree.ElementTree.Element, not a str.
You probably meant to get the text from within or around that element, and then encode() that.
The docs suggest using the itertext() method:
''.join(child.itertext())
This will evaluate to a str, which you can then encode().
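As a rough sketch of how that change fits into the question's parsing loop, the version below swaps the failing encode() call for ''.join(child.itertext()). The sample feed, the MEDIA_CONTENT_TAG constant, and the parse_items name are made up for illustration so the example runs without a network request; the real feed may have more fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, standing in for the real HTTP response body.
SAMPLE_RSS = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Example headline</title>
      <link>http://example.com/story.html</link>
      <media:content url="http://example.com/image.jpg"/>
    </item>
  </channel>
</rss>"""

# Namespaced tag name for media:content, as used in the question's code.
MEDIA_CONTENT_TAG = '{http://search.yahoo.com/mrss/}content'


def parse_items(rss):
    """Return one dict per <item>, mapping child tag names to their text."""
    root = ET.fromstring(rss)
    newsitems = []
    for item in root.findall('./channel/item'):
        news = {}
        for child in item:
            if child.tag == MEDIA_CONTENT_TAG:
                # media:content carries its URL in an attribute, not in text
                news['media'] = child.attrib['url']
            else:
                # join all text inside the element into a single str
                news[child.tag] = ''.join(child.itertext())
        newsitems.append(news)
    return newsitems


items = parse_items(SAMPLE_RSS)
print(items)
```

Every value in the resulting dictionaries is a str, so nothing needs to be encoded just to store it.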
Note that the text and tail attributes might not contain text (emphasis added):
"Their values are usually strings but may be any application-specific object."
If you want to use those attributes, you'll have to handle None or non-string values:
head = '' if child.text is None else str(child.text)
tail = '' if child.tail is None else str(child.tail)
# Do something with head and tail...
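To see why the None check matters, here is a small sketch on a made-up element. The text_or_empty helper is a hypothetical name introduced only for this example; it wraps the same conditional shown above.

```python
import xml.etree.ElementTree as ET


def text_or_empty(value):
    """Hypothetical helper: map None to '' and coerce anything else to str."""
    return '' if value is None else str(value)


# <b/> is an empty element sitting between two runs of text inside <p>.
root = ET.fromstring('<p>before<b/>after</p>')
b = root[0]

head = text_or_empty(b.text)   # <b/> has no inner text, so b.text is None
tail = text_or_empty(b.tail)   # b.tail is the text that follows <b/>

print(repr(root.text), repr(head), repr(tail))
```

Without the check, code that calls a string method directly on b.text would fail, because an empty element reports its text as None rather than as an empty string.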
Even this is not really enough.
If text or tail contains bytes objects of some unexpected (or plain wrong) encoding, this will raise a UnicodeEncodeError.
I suggest leaving the text as a str, and not encoding it at all.
Encoding text to a bytes object is intended as the last step before writing it to a binary file, a network socket, or some other hardware.
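A minimal sketch of that "encode at the boundary" idea follows. The io.BytesIO object is only a stand-in for a binary file or socket, and the title string is an invented example value.

```python
import io

# Inside the program the title stays a str.
title = 'Ayushmann Khurrana: “a fun ride”'

# BytesIO stands in for a binary file or a network socket.
sink = io.BytesIO()

# Encode only at the moment the text leaves the program.
sink.write(title.encode('utf-8'))

data = sink.getvalue()                 # bytes, as a receiver would see them
round_tripped = data.decode('utf-8')   # decode again when reading back in

print(type(data).__name__, round_tripped == title)
```

Keeping the dictionary values as str means the rest of the notifier can compare, slice, and display them without worrying about encodings.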
For more on the difference between bytes and characters, see Ned Batchelder's "Pragmatic Unicode, or, How Do I Stop the Pain?" (36 minute video from PyCon US 2012). He covers both Python 2 and 3.
Using the child.itertext() method, and not encoding the strings, I got a reasonable-looking list of dictionaries from topStories():
[
 ...,
 {'description': 'Ayushmann Khurrana says his five-year Bollywood journey has '
                 'been “a fun ride”; adds success is a lousy teacher while '
                 'failure is “your friend, philosopher and guide”.',
  'guid': 'http://www.hindustantimes.com/bollywood/i-am-a-hardcore-realist-and-that-s-why-i-feel-my-journey-has-been-a-joyride-ayushmann-khurrana/story-KQDR7gMuvhD9AeQTA7tbmI.html',
  'link': 'http://www.hindustantimes.com/bollywood/i-am-a-hardcore-realist-and-that-s-why-i-feel-my-journey-has-been-a-joyride-ayushmann-khurrana/story-KQDR7gMuvhD9AeQTA7tbmI.html',
  'media': 'http://www.hindustantimes.com/rf/image_size_630x354/HT/p2/2017/06/26/Pictures/actor-ayushman-khurana_24f064ae-5a5d-11e7-9d38-39c470df081e.JPG',
  'pubDate': 'Mon, 26 Jun 2017 10:50:26 GMT ',
  'title': "I am a hardcore realist, and that's why I feel my journey "
           'has been a joyride: Ayushmann...'},
]