
Exceptions with DateTime parsing in RSS feed using SyndicationFeed in C#

Tags:

c#

rss

I'm trying to parse RSS 2.0 and Atom feeds using SyndicationFeed objects, but I get an XmlException when parsing a DateTime field such as pubDate with a value like this:

2012-01-17 08:01:06

// Requires: System.Collections.Generic, System.Linq,
//           System.ServiceModel.Syndication, System.Xml
public static List<SyndicationItem> getRssData(string url)
{
    // Dispose the reader once the feed has been loaded.
    using (XmlReader reader = XmlReader.Create(url))
    {
        // Throws XmlException when a date such as pubDate cannot be parsed.
        SyndicationFeed feed = SyndicationFeed.Load(reader);
        return feed.Items.ToList();
    }
}

The feed URL is http://news.163.com/special/00011K6L/rss_newstop.xml

<item id="2">
    <title>...</title>
    <link>...</link>
    <description>......</description>
    <pubDate>2012-01-17 12:09:29</pubDate> <!-- exception is thrown here -->
</item>

Is there a better way to achieve this? Please help. Thanks.

wangyan9110 asked Jan 17 '12 07:01


2 Answers

There is a workaround for this known issue: Rss20FeedFormatter throws an exception when it tries to read some DateTime formats.

To work around this problem, create a custom XML reader that recognizes different date formats. The following example loads a feed through such a reader:

XmlReader r = new MyXmlReader(url);
SyndicationFeed feed = SyndicationFeed.Load(r);
Rss20FeedFormatter rssFormatter = feed.GetRss20Formatter();
XmlTextWriter rssWriter = new XmlTextWriter("rss.xml", Encoding.UTF8);
rssWriter.Formatting = Formatting.Indented;
rssFormatter.WriteTo(rssWriter);
rssWriter.Close();

And the class used in the code above:

class MyXmlReader : XmlTextReader
{
    private bool readingDate = false;
    const string CustomUtcDateTimeFormat = "ddd MMM dd HH:mm:ss Z yyyy"; // Wed Oct 07 08:00:07 GMT 2009

    public MyXmlReader(Stream s) : base(s) { }

    public MyXmlReader(string inputUri) : base(inputUri) { }

    public override void ReadStartElement()
    {
        if (string.Equals(base.NamespaceURI, string.Empty, StringComparison.InvariantCultureIgnoreCase) &&
            (string.Equals(base.LocalName, "lastBuildDate", StringComparison.InvariantCultureIgnoreCase) ||
            string.Equals(base.LocalName, "pubDate", StringComparison.InvariantCultureIgnoreCase)))
        {
            readingDate = true;
        }
        base.ReadStartElement();
    }

    public override void ReadEndElement()
    {
        if (readingDate)
        {
            readingDate = false;
        }
        base.ReadEndElement();
    }

    public override string ReadString()
    {
        if (readingDate)
        {
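            string dateString = base.ReadString();
            DateTime dt;
            // Try the general parser first (culture-independent), then fall back to the custom format.
            if (!DateTime.TryParse(dateString, CultureInfo.InvariantCulture, DateTimeStyles.None, out dt))
                dt = DateTime.ParseExact(dateString, CustomUtcDateTimeFormat, CultureInfo.InvariantCulture);
            // Re-emit the value in RFC 1123 format ("R"), which the RSS formatter can read.
            return dt.ToUniversalTime().ToString("R", CultureInfo.InvariantCulture);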
        }
        else
        {
            return base.ReadString();
        }
    }
}
Michał Powaga answered Oct 16 '22 14:10


Basically, that RSS feed is invalid. If you look at the RSS 2.0 specification it states that:

All date-times in RSS conform to the Date and Time Specification of RFC 822, with the exception that the year may be expressed with two characters or four characters (four preferred).

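The string "2012-01-17 12:09:29" doesn't comply with the "Date and Time" part of RFC 822. It should be something like "Tue, 17 Jan 2012 12:09:29 GMT" (the day of the week is optional, and the zone should be whatever actually applies to the feed).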
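
To illustrate the difference, here is a minimal sketch that converts the feed's date string into the RFC 822 style form. The class and method names (RssDateDemo, ToRfc822) are made up for this example, and it assumes the feed's times are UTC, which the feed itself does not state; adjust the zone handling if that assumption is wrong.

```csharp
using System;
using System.Globalization;

static class RssDateDemo
{
    // The pattern the feed in the question uses (not valid per RFC 822).
    const string FeedFormat = "yyyy-MM-dd HH:mm:ss";

    // Converts a feed date to the RFC 1123 pattern ("R").
    // Assumption: the value is UTC, because the feed carries no zone information.
    public static string ToRfc822(string feedDate)
    {
        DateTime utc = DateTime.ParseExact(
            feedDate,
            FeedFormat,
            CultureInfo.InvariantCulture,
            DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal);

        // "R" only formats the value; it does not convert it, so it must already be UTC.
        return utc.ToString("R", CultureInfo.InvariantCulture);
    }

    static void Main()
    {
        Console.WriteLine(ToRfc822("2012-01-17 12:09:29"));
        // prints: Tue, 17 Jan 2012 12:09:29 GMT
    }
}
```

ParseExact is used here instead of Parse so that the accepted input format is explicit and does not depend on the machine's culture settings.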

Jon Skeet answered Oct 16 '22 15:10