
Optimizing BeautifulSoup (Python) code

I have code that uses the BeautifulSoup library for parsing, but it is very slow. The code is written in such a way that threads cannot be used. Can anyone help me with this?

I am using BeautifulSoup for parsing and then saving the results into a DB. If I comment out the save statement, it still takes a long time, so the database is not the bottleneck.

def parse(self,text):                
    soup = BeautifulSoup(text)
    arr = soup.findAll('tbody')                

    for i in range(0,len(arr)-1):
        data=Data()
        soup2 = BeautifulSoup(str(arr[i]))
        arr2 = soup2.findAll('td')

        c=0
        for j in arr2:                                       
            if str(j).find("<a href=") > 0:
                data.sourceURL = self.getAttributeValue(str(j),'<a href="')
            else:  
                if c == 2:
                    data.Hits=j.renderContents()

            #and few others...

            c = c+1

            data.save()

Any suggestions?

Note: I already asked this question here, but it was closed due to incomplete information.

developer asked Apr 26 '10 09:04


1 Answer

soup2 = BeautifulSoup(str(arr[i]))
arr2 = soup2.findAll('td')

Don't do this: it converts the tag to a string and then parses that string a second time. Just call arr2 = arr[i].findAll('td') instead.
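
A minimal sketch of why the re-parse is unnecessary. It assumes the newer `bs4` package (where `find_all` is the preferred spelling of `findAll`), the built-in `html.parser`, and a made-up HTML snippet; none of these come from the question. Searching a tag directly yields the same cells as re-parsing its string form, without building a second parse tree:

```python
from bs4 import BeautifulSoup

# Made-up sample markup, for illustration only.
html = """
<table>
  <tbody><tr><td>a</td><td>b</td></tr></tbody>
  <tbody><tr><td>c</td></tr></tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
tbodies = soup.find_all("tbody")

# Slow: serialize the tag to a string, then build a second parse tree from it.
reparsed_cells = BeautifulSoup(str(tbodies[0]), "html.parser").find_all("td")

# Fast: search the already-parsed tag in place.
direct_cells = tbodies[0].find_all("td")

print([td.get_text() for td in direct_cells])
```

Both lists hold equivalent `<td>` tags; only the second approach avoids the extra parse per `<tbody>`.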


This will also be slow:

if str(j).find("<a href=") > 0:
    data.sourceURL = self.getAttributeValue(str(j),'<a href="')

Assuming that getAttributeValue gives you the href attribute, use this instead:

a = j.find('a', href=True)       # find the first <a> that has an href attribute
if a:
    data.sourceURL = a['href']
else:
    pass  #....
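
To illustrate how this lookup behaves, here is a small sketch, again assuming the `bs4` package and a made-up table row. The helper name `cell_href` is hypothetical. `find` returns `None` when the cell has no `<a>` with an `href`, so a plain truth test is enough. Unlike the string search, it also matches anchors where `href` is not the first attribute:

```python
from bs4 import BeautifulSoup

# Made-up row: one cell with a link (href is not the first attribute),
# one cell without a link.
row = BeautifulSoup(
    '<tr><td><a class="ext" href="http://example.com/page">link</a></td>'
    "<td>42</td></tr>",
    "html.parser",
)


def cell_href(cell):
    """Return the href of the first <a href=...> inside cell, or None."""
    a = cell.find("a", href=True)
    return a["href"] if a else None


hrefs = [cell_href(td) for td in row.find_all("td")]
print(hrefs)
```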

In general, you shouldn't need to convert the BeautifulSoup object back into a string if all you want to do is parse it and extract values. Since the find and findAll methods give you back searchable objects, you can keep searching by invoking the find/findAll/etc. methods on the results.
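
Putting the suggestions together, the loop from the question could look roughly like the sketch below. This is not the asker's actual code. It assumes the `bs4` package with `html.parser`, uses a plain dict as a hypothetical stand-in for `Data()` and its `save()` method, uses `decode_contents()` in place of `renderContents()`, and visits every `<tbody>` (the original loop stops one short of the end, which may or may not be intentional):

```python
from bs4 import BeautifulSoup


def parse(text):
    """Return one record per <tbody>, parsing the document only once."""
    soup = BeautifulSoup(text, "html.parser")
    records = []
    for tbody in soup.find_all("tbody"):
        # Hypothetical stand-in for the question's Data() object.
        record = {"sourceURL": None, "Hits": None}
        for c, cell in enumerate(tbody.find_all("td")):
            a = cell.find("a", href=True)
            if a:
                record["sourceURL"] = a["href"]
            elif c == 2:
                # Inner markup of the third cell, as a string.
                record["Hits"] = cell.decode_contents()
        records.append(record)  # the question calls data.save() instead
    return records


# Made-up input, for illustration only.
sample = (
    "<table><tbody><tr>"
    '<td><a href="/x">x</a></td><td>n</td><td>7</td>'
    "</tr></tbody></table>"
)
print(parse(sample))
```

The document is parsed exactly once, and every nested search reuses the existing tree.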

interjay answered Nov 15 '22 21:11
