I'm trying to extract the fanfiction from an Archive of Our Own URL in order to use the NLTK library to do some linguistic analysis on it. However, every attempt at scraping the HTML from the URL returns everything BUT the fanfic itself (plus the comments form, which I don't need).
First I tried with the built-in urllib library (and BeautifulSoup):
from urllib import request
from bs4 import BeautifulSoup
html = request.urlopen("http://archiveofourown.org/works/6846694").read()
soup = BeautifulSoup(html, "html.parser")
soup.prettify()
Then I found out about the Requests library, and how the User-Agent header could be part of the problem, so I tried this with the same result:
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
'Content-Type': 'text/html',
}
requests.get("http://archiveofourown.org/works/6846694", headers=headers, timeout=5).text
Then I found out about Selenium and PhantomJS, so I installed those and tried this, but again, same result:
from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.PhantomJS()
browser.get("http://archiveofourown.org/works/6846694")
soup = BeautifulSoup(browser.page_source, "html.parser")
soup.prettify()
Am I doing something wrong in any of these attempts, or is this an issue with the server?
The last approach is a step in the right direction if you need the complete page source with all the JavaScript executed and async requests made. You are just missing one thing - you need to give PhantomJS time to load the page before reading the source (pun intended).
You also need to click "Proceed" to confirm that you agree to see the adult content:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.PhantomJS()
driver.get("http://archiveofourown.org/works/6846694")
wait = WebDriverWait(driver, 10)
# click proceed
proceed = wait.until(EC.presence_of_element_located((By.LINK_TEXT, "Proceed")))
proceed.click()
# wait for the content to be present
wait.until(EC.presence_of_element_located((By.ID, "workskin")))
soup = BeautifulSoup(driver.page_source, "html.parser")
soup.prettify()
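Once the page has loaded, the fic itself lives inside the element with id="workskin" (the one we waited for above), so you can pull out just that text for NLTK instead of the whole page. A minimal sketch of that step, using an inline HTML string as a stand-in for driver.page_source (the surrounding div ids here are illustrative, not AO3's exact markup):

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source; on AO3 the work text sits in <div id="workskin">.
html = """
<html><body>
  <div id="header">Archive of Our Own</div>
  <div id="workskin"><p>Chapter text goes here.</p><p>More prose.</p></div>
  <div id="comment_form">comment form</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
work = soup.find(id="workskin")
# get_text() strips the markup, leaving plain prose ready for NLTK tokenizers.
text = work.get_text(separator="\n", strip=True)
print(text)
```

In your script you would pass driver.page_source instead of the stand-in string, and the header and comment-form markup is dropped automatically because you only extract the workskin subtree.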