
Web Scraping with Selenium Python [Twitter + Instagram]

I am trying to scrape both Instagram and Twitter by geolocation. I can run a search query, but I am having trouble reloading the web page to load more results and storing the fields in a data frame.

I did find a couple of examples of scraping Twitter and Instagram without API keys, but they search by hashtag keywords.

I am trying to scrape by geolocation and between past dates. This is how far I have come, writing code in Python 3.x with the latest versions of the packages in Anaconda.

'''
    Instagram - Components
    "id": "1478232643287060472", 
     "dimensions": {"height": 1080, "width": 1080}, 
     "owner": {"id": "351633262"}, 
     "thumbnail_src": "https://instagram.fdel1-1.fna.fbcdn.net/t51.2885-15/s640x640/sh0.08/e35/17439262_973184322815940_668652714938335232_n.jpg", 
     "is_video": false, 
     "code": "BSDvMHOgw_4", 
     "date": 1490439084, 
     "taken-at=213385402"
     "display_src": "https://instagram.fdel1-1.fna.fbcdn.net/t51.2885-15/e35/17439262_973184322815940_668652714938335232_n.jpg", 
     "caption": "Hakuna jambo zuri kama kumpa Mungu shukrani kwa kila jambo.. \ud83d\ude4f\ud83c\udffe\nIts weekend\n#lifeistooshorttobeunhappy\n#Godisgood \n#happysoul \ud83d\ude00", 
     "comments": {"count": 42}, 
     "likes": {"count": 3813}}, 
'''


import selenium
from selenium import webdriver
#from selenium import selenium
from bs4 import BeautifulSoup
import pandas

#geotags = pd.read_csv("geocodes.csv")
#parmalink = 
query = geocode%3A35.68501%2C139.7514%2C30km%20since:2016-03-01%20until:2016-03-02&f=tweets

twitterURL = 'https://twitter.com/search?q=' + query
#instaURL = "https://www.instagram.com/explore/locations/213385402/"


browser = webdriver.Firefox()
browser.get(twitterURL)
content = browser.page_source

soup = BeautifulSoup(content)
print (soup)

For the Twitter search query I am getting a syntax error.

For Instagram I am not getting any error, but I am not able to reload for more posts and write the results back to a CSV data frame.

I am also trying to search by latitude and longitude on both Twitter and Instagram.

I have a list of geo coordinates in a CSV file; I can use that as input or write a search query.

Any way to complete the scraping by location would be appreciated.

Appreciate the help!

Sitz Blogz asked Mar 26 '17 19:03


1 Answer

I managed to make it work using requests. Your code would look something like this:

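from bs4 import BeautifulSoup
import requests

# The query must be a quoted string; left unquoted it is a Python syntax error.
query = "geocode%3A35.68501%2C139.7514%2C30km%20since:2016-03-01%20until:2016-03-02&f=tweets"

twitter = 'https://twitter.com/search?q=' + query

# Fetch the search page and parse the returned HTML.
content = requests.get(twitter)
# Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning.
soup = BeautifulSoup(content.text, "html.parser")

print(soup)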

Then you can use the soup object to parse what you need. The same thing should work for Instagram, if your query is correct.
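
Since the question mentions a CSV of coordinates, one way to avoid hand-encoding each query is to build the URL with urllib.parse.quote from the standard library. This is only a sketch under assumptions: the function names build_geo_query_url and urls_from_csv, and the CSV column names lat and lon, are hypothetical and not from the original code. It only constructs URLs; whether Twitter returns results for them to a plain request is not something the sketch checks.

```python
import csv
import io
from urllib.parse import quote

SEARCH_URL = "https://twitter.com/search?q="


def build_geo_query_url(lat, lon, radius_km, since, until):
    """Build a search URL using the geocode/since/until operators from the question."""
    # Human-readable search text; the operators mirror the original query.
    raw = f"geocode:{lat},{lon},{radius_km}km since:{since} until:{until}"
    # safe="" percent-encodes ':' and ',' as well as spaces.
    return SEARCH_URL + quote(raw, safe="") + "&f=tweets"


def urls_from_csv(file_obj, radius_km, since, until):
    """Yield one search URL per CSV row (assumed columns: lat, lon)."""
    for row in csv.DictReader(file_obj):
        yield build_geo_query_url(row["lat"], row["lon"], radius_km, since, until)


# Single coordinate, same values as in the question.
url = build_geo_query_url(35.68501, 139.7514, 30, "2016-03-01", "2016-03-02")
print(url)

# In-memory stand-in for a file such as geocodes.csv (column names assumed).
sample_csv = io.StringIO("lat,lon\n35.68501,139.7514\n")
urls = list(urls_from_csv(sample_csv, 30, "2016-03-01", "2016-03-02"))
print(urls)
```

Each generated URL can then be passed to requests.get in place of the hard-coded one. Here the colons after since and until are percent-encoded too, which is equivalent to leaving them literal in a query string.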

Fernando Cezar answered Sep 27 '22 16:09