
How to scrape Instagram with BeautifulSoup

I want to scrape pictures from a public Instagram account. I'm pretty familiar with bs4 so I started with that. Using the element inspector on Chrome, I noted the pictures are in an unordered list and li has class 'photo', so I figure, what the hell -- can't be that hard to scrape with findAll, right?

Wrong: it doesn't return anything (code below). I soon noticed that the markup shown in the element inspector and the markup I got from requests were not the same; that is, there is no unordered list in the HTML I pulled with requests.

Any idea how I can get the code that shows up in element inspector?

Just for the record, this was my code to start, which didn't work because the unordered list was not there:

from bs4 import BeautifulSoup
import requests

r = requests.get('http://instagram.com/umnpics/')
soup = BeautifulSoup(r.text, 'html.parser')
# Finds nothing: the fetched HTML has no <li class="photo"> elements
for x in soup.findAll('li', {'class': 'photo'}):
    print(x)

Thank you for your help.

Frank Bi asked Aug 08 '13 15:08



1 Answer

If you look at the page source, you'll see that JavaScript generates the page content. What you see in the element inspector is the page after the script has run, whereas requests only fetches the original HTML file, and that is all BeautifulSoup gets to parse. To parse the rendered page, you'll need something like Selenium to render it for you.

So, for example, this is how it would look with Selenium:

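from bs4 import BeautifulSoup
from selenium import webdriver

url = 'http://instagram.com/umnpics/'
driver = webdriver.Firefox()  # launches a real Firefox browser
driver.get(url)

# page_source returns the browser's current DOM, not the raw HTML response
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()  # close the browser once the markup has been captured

for x in soup.findAll('li', {'class': 'photo'}):
    print(x)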

Now the soup should be what you are expecting.
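
A quick way to see the difference between the two sources, without a browser or network access, is to parse a page before and after its script has run. The sketch below is only an illustration: RAW_HTML and RENDERED_HTML are made-up strings, not Instagram's actual markup, and the helper names are hypothetical. It uses only the standard library's html.parser so it runs anywhere; with bs4 the same count would come from len(soup.findAll('li', {'class': 'photo'})).

```python
from html.parser import HTMLParser


class PhotoItemCounter(HTMLParser):
    """Counts <li> start tags whose class attribute includes 'photo'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "photo" in classes:
            self.count += 1


def count_photo_items(html):
    """Return how many <li class="photo"> elements appear in the markup."""
    parser = PhotoItemCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical: roughly what an HTTP client receives before any script runs.
RAW_HTML = """
<html><body>
<div id="feed"></div>
<script>renderPhotos(document.getElementById("feed"));</script>
</body></html>
"""

# Hypothetical: the same page after the browser has executed the script.
RENDERED_HTML = """
<html><body>
<div id="feed">
  <ul>
    <li class="photo"><img src="a.jpg"></li>
    <li class="photo"><img src="b.jpg"></li>
  </ul>
</div>
</body></html>
"""

print(count_photo_items(RAW_HTML))       # 0
print(count_photo_items(RENDERED_HTML))  # 2
```

The first count corresponds to what requests hands to BeautifulSoup, and the second to what driver.page_source provides once the browser has built the list.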

mr2ert answered Oct 17 '22 06:10
