Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scraping javascript-generated data using Python

I want to scrape some data of following url using Python. http://www.hankyung.com/stockplus/main.php?module=stock&mode=stock_analysis_infomation&itemcode=078340

It's about a summary of company information.

What I want to scrape is not shown on the first page. By clicking tab named "재무제표", you can access financial statement. And clicking tab named "현금흐름표', you can access "Cash Flow".

I want to scrape the "Cash Flow" data.

However, Cash flow data is generated by javascript across the url. The following link is that url which is hidden, http://stock.kisline.com/compinfo/financial/main.action?vhead=N&vfoot=N&vstay=&omit=&vwidth=

Cash flow data is generated by submitting some option value and cookie to this url.

As you perceived, itemcode=078340 in the first link means stock code and there are as many as 1680 stocks that I want gather cash flow data. I want make it a loop structure.

Is there good way to scrape cash flow data? I tried scrapy but scrapy is difficult to cope with my another scraping code already I'm using.

like image 259
trigger Avatar asked Apr 07 '12 06:04

trigger


People also ask

How to scrape a normal website using Python?

There are different ways to scrape any website using Python. Using the BeautifulSoup library, Scrapy Framework, and Selenium library with a headless web browser. Using BeautifulSoup or Scrapy Framework, we can easily scrap a normal website where the server generates the HTML files.

How to scrape data from an HTML page?

Before getting out any information from the HTML of the page, we must understand the structure of the page. This is needed to be done in order to select the desired data from the entire page. We can do this by right-clicking on the page we want to scrape and select inspect element.

Is there a way to load dynamic data in Python without JavaScript?

Many websites will supply data that is dynamically loaded via javascript. In Python, you can make use of jinja templating and do this without javascript, but many websites use javascript to populate data.

Is it possible to do web scraping with copy and paste?

In such situations, copy and paste will not work and that’s where you’ll need web scraping. In this article, we will discuss how to perform web scraping using the requests library and beautifulsoup library in Python.


2 Answers

There's also dryscape (a library written by me, so the recommendation is a bit biased, obviously :) which uses a fast Webkit-based in-memory browser to navigate around. It understands Javascript, too, but is a lot more lightweight than Selenium.

like image 157
Niklas B. Avatar answered Oct 02 '22 13:10

Niklas B.


If you need to scape the page content which is updated with AJAX and you are not in the control of this AJAX interface I would use Selenium browser automator for the task:

http://code.google.com/p/selenium/

  • Selenium has Python bindings

  • It launches a real browser instance so it can do and scrape 100% the same thing as you see with your own eyes

  • Get HTML document content after AJAX updates thru Selenium API

  • Use lxml + xpath / CSS selectors to parse out the relevant parts out of the document

like image 31
Mikko Ohtamaa Avatar answered Oct 02 '22 12:10

Mikko Ohtamaa