I would like to parse the webpage http://dcsd.nutrislice.com/menu/meadow-view/lunch/ to grab today's lunch menu. (I've built an Adafruit #IoT Thermal Printer and I'd like to automatically print the menu each day.)
I initially approached this using BeautifulSoup but it turns out that most of the data is loaded in JavaScript and I'm not sure BeautifulSoup can handle it. If you view source you'll see the relevant data stored in bootstrapData['menuMonthWeeks']
.
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://dcsd.nutrislice.com/menu/meadow-view/lunch/"
soup = BeautifulSoup(urllib2.urlopen(url).read())
This is an easy way to get the source and review.
My question is: what is the easiest way to extract this data so that I can do something with it? Literally, all I want is a string something like:
Southwest Cheese Omelet, Potato Wedges, The Harvest Bar (THB), THB - Cheesy Pesto Bread, Ham Deli Sandwich, Red Pepper Sticks, Strawberries
I've thought about using webkit to process the page and get the HTML (i.e. what a browser does) but that seems unnecessarily complex. I'd rather simply find something that can parse the bootstrapData['menuMonthWeeks']
data.
Beautiful Soup doesn't mimic a client. Javascript is code that runs on the client. With Python, we simply make a request to the server, and get the server's response, which is the starting text, along of course with the javascript, but it's the browser that reads and runs that javascript. Thus, we need to do that.
Basically, the BeautifulSoup 's text attribute will return a string stripped of any HTML tags and metadata.
Beautiful Soup is a Python package for parsing HTML and XML documents (including having malformed markup, i.e. non-closed tags, so named after tag soup). It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.
Something like PhantomJS may be more robust, but here's some basic Python code to extract it the full menu:
import json
import re
import urllib2
text = urllib2.urlopen('http://dcsd.nutrislice.com/menu/meadow-view/lunch/').read()
menu = json.loads(re.search(r"bootstrapData\['menuMonthWeeks'\]\s*=\s*(.*);", text).group(1))
print menu
After that, you'll want to search through the menu for the date you're interested in.
EDIT: Some overkill on my part:
import itertools
import json
import re
import urllib2
text = urllib2.urlopen('http://dcsd.nutrislice.com/menu/meadow-view/lunch/').read()
menus = json.loads(re.search(r"bootstrapData\['menuMonthWeeks'\]\s*=\s*(.*);", text).group(1))
days = itertools.chain.from_iterable(menu['days'] for menu in menus)
day = next(itertools.dropwhile(lambda day: day['date'] != '2014-01-13', days), None)
if day:
print '\n'.join(item['food']['description'] for item in day['menu_items'])
else:
print 'Day not found.'
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With