I am trying to scrape air ticket info (including plane info, price info, etc.) from http://flight.qunar.com/ using Python 3 and BeautifulSoup. Below is the Python code I am using. In it, I try to scrape flight info from Beijing (北京) to Lijiang (丽江) on 2012-07-25.
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup
url = 'http://flight.qunar.com/site/oneway_list.htm'
values = {'searchDepartureAirport':'北京', 'searchArrivalAirport':'丽江', 'searchDepartureTime':'2012-07-25'}
encoded_param = urllib.parse.urlencode(values)
full_url = url + '?' + encoded_param
response = urllib.request.urlopen(full_url)
soup = BeautifulSoup(response)
print(soup.prettify())
What I get is the initial page returned right after submitting the request, while the page is still loading the search results. What I want is the final page after it has finished loading the search results. How can I achieve this using Python?
The problem is actually quite hard: the site uses dynamically generated content that gets loaded via JavaScript, whereas urllib basically gets only what you would see in a browser with JavaScript disabled. So, what can we do?
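To illustrate the point, here is a minimal sketch using only the standard library. The HTML string is a made-up stand-in for a page like the one in the question (the `results` id and the script are hypothetical): the server sends a placeholder, and the script that would fill in the results never runs when the markup is merely downloaded and parsed.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for what the server sends before any JavaScript runs.
RAW_HTML = """
<html><body>
  <div id="results">Loading...</div>
  <script>
    document.getElementById("results").innerHTML = "<li>Flight A</li>";
  </script>
</body></html>
"""


class TextById(HTMLParser):
    """Collect the text inside the element that has a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while we are inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def static_text(html, target_id):
    """Return the text a non-JavaScript client sees inside target_id."""
    parser = TextById(target_id)
    parser.feed(html)
    return "".join(parser.chunks).strip()


# The parser sees only the placeholder; the script is never executed.
print(static_text(RAW_HTML, "results"))
```

A static parser such as BeautifulSoup behaves the same way on the real page: it sees the placeholder markup, not the results a browser would insert later.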
Use a browser automation tool to fully render the webpage (these tools are essentially headless, automated browsers for testing and scraping).
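As one concrete possibility, the sketch below uses Selenium with headless Chrome. This is an assumption on my part rather than something the answer specifies: it needs the third-party `selenium` package plus a working Chrome and chromedriver, and `ready_selector` is a hypothetical CSS selector that you would replace with an element that only appears once the results have loaded.

```python
import urllib.parse


def build_search_url(base_url, params):
    """Append URL-encoded query parameters to base_url."""
    return base_url + "?" + urllib.parse.urlencode(params)


def fetch_rendered_html(url, ready_selector, timeout=30):
    """Load url in headless Chrome and return the HTML after JavaScript ran.

    ready_selector is a CSS selector (hypothetical, site-specific) for an
    element that is only present once the dynamic content has loaded.
    """
    # Imported lazily so build_search_url works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the element appears, or raise after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ready_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling `fetch_rendered_html(build_search_url(url, values), ready_selector)` with the `url` and `values` from the question should then return markup that can be passed to BeautifulSoup as before.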
Or, if you want a (semi-)pure Python solution, use PyQt4.QtWebKit to render the page. It works approximately like this:
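import sys
import signal
from PyQt4.QtCore import QUrl, SIGNAL
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

url = "http://www.stackoverflow.com"

def page_to_file(ok):
    # loadFinished(bool) passes a success flag, not the page itself,
    # so the module-level `page` object is used here.
    with open("output", "w", encoding="utf-8") as f:
        f.write(page.mainFrame().toHtml())
    app.quit()  # stop the event loop once the page has been saved

app = QApplication(sys.argv)
page = QWebPage()
# Restore the default SIGINT handler so Ctrl+C can interrupt the Qt event loop.
signal.signal(signal.SIGINT, signal.SIG_DFL)
page.connect(page, SIGNAL('loadFinished(bool)'), page_to_file)
page.mainFrame().load(QUrl(url))
sys.exit(app.exec_())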
Edit: There's a nice explanation how this works here.
P.S. You may want to look into requests instead of using urllib :)
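For reference, a minimal sketch of the same HTTP call with requests, assuming the third-party `requests` package is installed (the function name `fetch_search_page` is just for illustration). This only tidies up the request itself: like urllib, requests does not run JavaScript, so it returns the same still-loading page.

```python
def fetch_search_page(url, params, timeout=10):
    """Return the raw (unrendered) HTML for url with the given query params."""
    import requests  # third-party package; imported lazily on first use

    # requests URL-encodes `params` itself, so no manual urlencode is needed.
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text
```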