I am trying to parse some data from the web using BeautifulSoup. So far I have gotten the data that I need from a table using the following code:
import urllib
from bs4 import BeautifulSoup

def webParsing(canvas):
    url = 'http://www.cmu.edu/dining/hours/index.html'
    try:
        page = urllib.urlopen(url)
    except:
        print 'Error while opening html file. Please ensure that you',
        print 'have a working internet connection.'
        return
    sourceCode = page.read()
    soup = BeautifulSoup(sourceCode)
    #heading = soup.html.body.div
    tableData = soup.table.tbody
    parseTable(canvas, tableData)
def parseTable(canvas, tableData):
    canvas.data.hoursOfOperation = dict()
    rowTag = 'tr'
    colTag = 'td'
    for row in tableData.find_all(rowTag):
        row_text = []
        for item in row.find_all(colTag):
            text = item.text.strip()
            row_text.append(text)
        # first column holds the location names, second column the hours
        (locations, hoursOpen) = (row_text[0], row_text[1])
        locations = locations.split(',')
        for location in locations:
            canvas.data.hoursOfOperation[location] = hoursOpen
    print canvas.data.hoursOfOperation
As you can see, the 'items' in the first column are mapped to those in the second column using a dictionary. The data is pretty much exactly how I would want it when printed, however in Python the strings contain a lot of formatting such as '\n' or '\xe9' or '\n\xa0'. Is there any way to remove all of that formatting? In other words, remove all of the newline characters, anything that represents a specific encoding, anything that represents an accented character, and just get the plain string? I do not need the most efficient or safest method; I am a beginner programmer, so the easiest method would be appreciated! Thanks!
Here's a trick: you can encode it to ASCII and remove all the rest:
>>> 'abc\xe9'.encode('ascii', errors='ignore')
b'abc'
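Note that this only drops characters outside the ASCII range; something like '\n' is plain ASCII and passes straight through, which is what the edit below addresses. A quick illustration (Python 3 string semantics, same as the snippet above):

>>> 'line1\nline2\xe9'.encode('ascii', errors='ignore')
b'line1\nline2'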
Edit:
Ah, I forgot that you don't want the ASCII control characters (such as '\n') either. Use this instead:
''.join(s for s in string if ord(s)>31 and ord(s)<126)
Hope this helps!
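If it helps, here is a minimal sketch of how that filter could be dropped into the parseTable from the question, right where the cell text is read (everything else stays as in the original code):

for item in row.find_all(colTag):
    raw = item.text.strip()
    # keep only printable ASCII: drops '\n', '\xa0', accented characters, etc.
    text = ''.join(c for c in raw if ord(c) > 31 and ord(c) < 126)
    row_text.append(text)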
From this question you can try something like this:
def removeNonAscii(s): return "".join(i for i in s if ord(i)<126 and ord(i)>31)
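For example, called on a string like the ones the question describes (the value here is just made up for illustration):

>>> removeNonAscii('Caf\xe9 \n\xa0Hours')
'Caf Hours'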