 

Python get request returning different HTML than view source

I'm trying to extract the fanfiction from an Archive of Our Own URL in order to use the NLTK library to do some linguistic analysis on it. However every attempt at scraping the HTML from the URL is returning everything BUT the fanfic (and the comments form, which I don't need).

First I tried with the built-in urllib library (and BeautifulSoup):

import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen("http://archiveofourown.org/works/6846694").read()
soup = BeautifulSoup(html, "html.parser")
soup.prettify()

Then I found out about the Requests library, and how the User Agent could be part of the problem, so I tried this with the same results:

import requests
headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
        'Content-Type': 'text/html',
}
requests.get("http://archiveofourown.org/works/6846694",headers=headers,timeout=5).text
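A quick way to confirm what the server actually sent back is to search the raw markup for the container that wraps the story body (AO3 uses a div with id "workskin"); if it is absent, the response is the adult-content interstitial rather than the fic. A minimal sketch, using a hypothetical stand-in string in place of the live response text:

```python
# Stand-in for the .text of the response above; on the real page the
# interstitial contains a "Proceed" link instead of the story body.
response_text = '<div class="caution">Proceed to see adult content</div>'

# AO3 wraps the fic itself in <div id="workskin">, so its absence means
# the server returned the warning page, not the work.
has_fic = 'id="workskin"' in response_text
print(has_fic)  # False: the fic body is not in the raw HTML
```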

Then I found out about Selenium and PhantomJS, so I installed those and tried this but again - same result:

from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.PhantomJS()
browser.get("http://archiveofourown.org/works/6846694")
soup = BeautifulSoup(browser.page_source, "html.parser")
soup.prettify()

Am I doing something wrong in any of these attempts, or is this an issue with the server?

asked Apr 17 '26 by Brianna Dardin


1 Answer

The last approach is a step in the right direction if you need the complete page source with all the JavaScript executed and async requests made. You are just missing one thing - you need to give PhantomJS time to load the page before reading the source (pun intentional).

You also need to click "Proceed" to confirm that you agree to see the adult content:

from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.PhantomJS()
driver.get("http://archiveofourown.org/works/6846694")

wait = WebDriverWait(driver, 10)

# click proceed
proceed = wait.until(EC.presence_of_element_located((By.LINK_TEXT, "Proceed")))
proceed.click()

# wait for the content to be present
wait.until(EC.presence_of_element_located((By.ID, "workskin")))

soup = BeautifulSoup(driver.page_source, "html.parser")
soup.prettify()
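Once the rendered source is in hand, the fic body can be pulled out of the "workskin" container and passed to NLTK for analysis. A minimal sketch, using a hypothetical stand-in string in place of driver.page_source (the tokenization step itself is left to NLTK):

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source after the waits above have completed;
# on the real page this div holds the full text of the work.
page_source = '<div id="workskin"><p>Once upon a time.</p></div>'

soup = BeautifulSoup(page_source, "html.parser")
# Extract just the story text, joining block elements with spaces.
fic_text = soup.find(id="workskin").get_text(separator=" ", strip=True)
print(fic_text)  # Once upon a time.
```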

answered Apr 19 '26 by alecxe