Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:

feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")

for item in feed:  
 print repr(item.title[0:-1])

I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.

A particular repr(item.title[0:-1]) printed in the interactive window looks like this:

'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'

The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.

I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently is repr()s? Is there a better way? I was thinking I'd use a loop like (and this is my pseudoPython, just kind of notes I'm writing):


     list = bandRaw,venue,date,latLong  
     for item in feed:  
      parse item.title for bandRaw, venue, date  
       if bandRaw == str(band)   
        send venue name + ", Dallas, TX" to google for geocoding  
        return lat,long  
      list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long  
     else  

In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:

band,venue,date,lat,long  
randy travis,Billy Bobs,3/21,1234.5678,1234.5678  
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765

I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.

So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?

like image 326
Alan Avatar asked Nov 29 '22 12:11

Alan


1 Answers

Don't let regex scare you off... it's well worth learning.

Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:

import re
pat = re.compile('([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print info.groups()

('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')

To get at each group individual, just call them on the info object:

print info.group(1) # or info.groups()[0]

print '"%s","%s","%s"' % (info.group(1), info.group(2), info.group(3))
"Michael Schenker Group","House of Blues Dallas","3/26"

The hard thing about regex in this case is making sure you know all the known possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.

The pattern above breaks down as follows, which is parsed left to right:

([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.

\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.

([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.

(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.

\) : Closing parenthesis for the above.

The python intro to regex is quite good, and you might want to spend an evening going over it http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.

EDIT: See zacherates below, who has some nice edits. Two heads are better than one!

like image 152
Jarret Hardie Avatar answered Dec 13 '22 07:12

Jarret Hardie