
How to scrape a page with Python urlopen after it finishes loading all search results?

I am trying to scrape air ticket info (including plane info, price info, etc.) from http://flight.qunar.com/ using Python 3 and BeautifulSoup. Below is the Python code I am using. In this code I try to scrape flight info from Beijing (北京) to Lijiang (丽江) on 2012-07-25.

import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

url = 'http://flight.qunar.com/site/oneway_list.htm'
# Search parameters: departure airport, arrival airport and departure date
values = {'searchDepartureAirport': '北京',
          'searchArrivalAirport': '丽江',
          'searchDepartureTime': '2012-07-25'}
encoded_param = urllib.parse.urlencode(values)
full_url = url + '?' + encoded_param
response = urllib.request.urlopen(full_url)
soup = BeautifulSoup(response, 'html.parser')  # specify the parser explicitly
print(soup.prettify())

What I get is the initial page returned right after submitting the request, while the page is still loading the search results. What I want is the final page after it finishes loading the search results. How can I achieve this in Python?

Asked Jul 25 '12 by Sam Wei



1 Answer

The problem is actually quite hard: the site generates its content dynamically via JavaScript, whereas urllib gets basically only what you would see in a browser with JavaScript disabled. So, what can we do?

Use

  • Selenium or
  • PhantomJS or
  • Crowbar

to fully render the webpage (they are essentially headless, automated browsers for testing and scraping); a minimal Selenium sketch follows below.
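
For example, Selenium drives a real browser and lets you wait explicitly until the JavaScript has filled in the results. The following is only a sketch: the CSS selector 'div.result_list' for the results container is an assumption and must be checked against the actual page.

import urllib.parse
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

values = {'searchDepartureAirport': '北京',
          'searchArrivalAirport': '丽江',
          'searchDepartureTime': '2012-07-25'}
full_url = ('http://flight.qunar.com/site/oneway_list.htm?'
            + urllib.parse.urlencode(values))

driver = webdriver.Chrome()   # needs chromedriver on your PATH
driver.get(full_url)

# Block until the (assumed) results container appears, at most 30 seconds
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.result_list'))
)

# driver.page_source now contains the HTML after JavaScript has run
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.prettify())
driver.quit()

In older Selenium versions the same idea works with PhantomJS by swapping webdriver.Chrome() for webdriver.PhantomJS().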

Or, if you want a (semi-)pure Python solution, use PyQt4.QtWebKit to render the page. It works approximately like this:

import sys
import signal

from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

url = "http://www.stackoverflow.com"

def page_to_file(ok):
    # loadFinished passes a bool; take the rendered HTML from the page itself
    with open("output.html", 'w') as f:
        f.write(page.mainFrame().toHtml())
    app.quit()

app = QApplication(sys.argv)                  # QApplication needs argv
page = QWebPage()
signal.signal(signal.SIGINT, signal.SIG_DFL)  # let Ctrl-C terminate the app
page.loadFinished.connect(page_to_file)       # fires once the page has loaded
page.mainFrame().load(QUrl(url))
sys.exit(app.exec_())

Edit: There's a nice explanation of how this works here.

PS: You may want to look into requests instead of urllib :)
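
For reference, a rough requests equivalent of the urlopen call from the question looks like this (it still fetches only the static HTML, so it does not by itself solve the JavaScript problem):

import requests
from bs4 import BeautifulSoup

url = 'http://flight.qunar.com/site/oneway_list.htm'
values = {'searchDepartureAirport': '北京',
          'searchArrivalAirport': '丽江',
          'searchDepartureTime': '2012-07-25'}

# requests builds and encodes the query string for you via params
response = requests.get(url, params=values)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())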

Answered Oct 01 '22 by Manuel Ebert