
Remove formatting from strings

I am trying to parse some data from the web using BeautifulSoup. So far I have gotten the data that I need from a table using the following code:

import urllib
from bs4 import BeautifulSoup

def webParsing(canvas):
    url='http://www.cmu.edu/dining/hours/index.html'
    try:
        page= urllib.urlopen(url)
    except IOError:
        print 'Error while opening html file. Please ensure that you',
        print 'have a working internet connection.'
        return
    sourceCode=page.read()
    soup=BeautifulSoup(sourceCode)
    #heading=soup.html.body.div
    tableData=soup.table.tbody
    parseTable(canvas,tableData)

def parseTable(canvas,tableData):
    canvas.data.hoursOfOperation=dict()
    rowTag='tr'
    colTag='td'
    for row in tableData.find_all(rowTag):
        row_text=[]
        for item in row.find_all(colTag):
            text=item.text.strip()
            row_text.append(text)
        (locations,hoursOpen)=(row_text[0],row_text[1])
        locations=locations.split(',')
        for location in locations:
            canvas.data.hoursOfOperation[location]=hoursOpen
    print canvas.data.hoursOfOperation

As you can see, the items in the first column are mapped to those in the second column using a dictionary. When printed, the data is pretty much exactly how I want it, but in Python the strings contain a lot of formatting such as '\n', '\xe9', or '\n\xa0'. Is there any way to remove all of it? In other words, I want to remove all newline characters, anything that represents a specific encoding, and anything that represents an accented character, and just get the plain string. I do not need the most efficient or safe method; I am a beginner programmer, so the easiest method would be appreciated. Thanks!

user3029704 asked Dec 26 '22 17:12



2 Answers

Here's a trick: you can encode the string to ASCII and ignore every character that cannot be encoded:

>>> 'abc\xe9'.encode('ascii', errors='ignore')
b'abc'
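
If the base letter of an accented character should be kept rather than dropped, one common option (not part of the original answer) is to normalize the text with the standard `unicodedata` module before encoding. The sketch below assumes Python 3 and a `str` input; the helper name `strip_accents` and the sample string are made up for illustration.

```python
import unicodedata

def strip_accents(text):
    """Return text with accents removed and other non-ASCII characters dropped."""
    # NFKD splits an accented letter such as '\xe9' into 'e' plus a combining
    # accent, and maps the non-breaking space '\xa0' to a plain space.
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding with errors='ignore' discards the combining accents and anything
    # else outside ASCII; decoding turns the bytes back into a str.
    return decomposed.encode('ascii', errors='ignore').decode('ascii')

print(strip_accents('Caf\xe9\xa0menu'))  # prints: Cafe menu
```

The final `.decode('ascii')` is only there to get a `str` back instead of the `bytes` object that `encode` returns.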

Edit:

Ah, I forgot that you don't want the standard special characters either. Use this instead, which keeps only printable ASCII characters:

''.join(s for s in string if 31 < ord(s) < 127)
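
One caveat with a pure character filter is that dropping '\n' outright can glue neighbouring words together. A minimal sketch of one way around that, assuming Python 3, is to collapse whitespace into single spaces first and then filter. The helper name `clean_text` and the sample dictionary are hypothetical, standing in for the scraped `hoursOfOperation` data.

```python
def clean_text(text):
    """Collapse whitespace, then keep only printable ASCII characters."""
    # split() with no arguments splits on any run of whitespace, including
    # '\n' and the non-breaking space '\xa0'; joining restores single spaces.
    collapsed = ' '.join(text.split())
    # Printable ASCII runs from code point 32 (space) through 126 ('~').
    return ''.join(ch for ch in collapsed if 32 <= ord(ch) <= 126)

# Hypothetical sample resembling one scraped table row.
hours = {'Caf\xe9\n\xa0Exchange': '8:00 AM -\n3:00 PM'}
cleaned = {clean_text(k): clean_text(v) for k, v in hours.items()}
print(cleaned)  # prints: {'Caf Exchange': '8:00 AM - 3:00 PM'}
```

Note that the accented letter is removed entirely here, so 'Café' becomes 'Caf'.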

Hope this helps!

aIKid answered Dec 28 '22 09:12



From this question, you can try something like this:

def removeNonAscii(s):
    return "".join(i for i in s if 31 < ord(i) < 127)

4d4c answered Dec 28 '22 09:12
