
Scraping data from investing.com for BTC/ETH using BeautifulSoup

I have written some code to scrape BTC/ETH time series from investing.com, and it works fine. However, I need to alter the requests call so that the downloaded data comes from Kraken rather than the Bitfinex default, and starts from 01/06/2016 instead of the default start date. These options can be set manually on the web page, but I have no idea how to send them via the requests call, except that it may involve using the "data" parameter. Grateful for any advice.

Thanks,

KM

Code already written in Python; it works fine for the defaults:

import requests
from bs4 import BeautifulSoup
import os
import numpy as np

# BTC scrape https://www.investing.com/crypto/bitcoin/btc-usd-historical-data
# ETH scrape https://www.investing.com/crypto/ethereum/eth-usd-historical-data

ticker_list = [x.strip() for x in open("F:\\System\\PVWAVE\\Crypto\\tickers.txt", "r").readlines()]
urlheader = {
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
  "X-Requested-With": "XMLHttpRequest"
}

print("Number of tickers: ", len(ticker_list))

for ticker in ticker_list:
    print(ticker)
    url = "https://www.investing.com/crypto/"+ticker+"-historical-data"
    req = requests.get(url, headers=urlheader)  # plain GET: returns the default exchange and date range
    soup = BeautifulSoup(req.content, "lxml")

    table = soup.find('table', id="curr_table")
    split_rows = table.find_all("tr")

    newticker=ticker.replace('/','\\')

    output_filename = "F:\\System\\PVWAVE\\Crypto\\{0}.csv".format(newticker)
    os.makedirs(os.path.dirname(output_filename), exist_ok=True)
    output_file = open(output_filename, 'w')
    header_list = split_rows[0:1]
    split_rows_rev = split_rows[:0:-1]

    for row in header_list:
        columns = list(row.stripped_strings)
        columns = [column.replace(',','') for column in columns]
        if len(columns) == 7:
            output_file.write("{0}, {1}, {2}, {3}, {4}, {5}, {6} \n".format(columns[0], columns[2], columns[3], columns[4], columns[1], columns[5], columns[6]))

    for row in split_rows_rev:
        columns = list(row.stripped_strings)
        columns = [column.replace(',','') for column in columns]
        if len(columns) == 7:
            output_file.write("{0}, {1}, {2}, {3}, {4}, {5}, {6} \n".format(columns[0], columns[2], columns[3], columns[4], columns[1], columns[5], columns[6]))

    output_file.close()

Data is downloaded for the default exchange and the default date range, but I want to specify Kraken and my own start and end dates (01/06/16 and the last full day, i.e. always yesterday).

SlartyBartFast asked Jan 26 '23 23:01



1 Answer

A little background

Many websites use forms to send data to the server in response to user activity, such as filling in a user name and password on a log-in page or clicking a button. Something like that is going on here.
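As a rough illustration of what a form submission looks like on the wire, the sketch below builds (but never sends) a POST request using only the standard library. The URL and field names are hypothetical and are not taken from investing.com.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and field names, used only for illustration.
url = "https://example.com/login"
form_fields = {"username": "alice", "password": "s3cret"}

# A form submission is a POST whose body holds the URL-encoded fields.
body = urlencode(form_fields).encode("ascii")
request = Request(url, data=body, method="POST")  # built here, never sent

print(request.get_method())   # POST
print(request.data.decode())  # username=alice&password=s3cret
```

The fields travel in the request body rather than in the URL, which is why the address bar does not change when such a form is submitted.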

How did I know it?

  • Switch from the default page to the Kraken historical data page. You will see that the URL has changed to https://www.investing.com/crypto/bitcoin/btc-usd-historical-data?cid=49799.
  • Now right-click on the page and click Inspect. In the top row of the panel that opens, click the Network tab. This tab shows the request/response cycle of any web page that you visit in the browser.
  • Find the Clear button just beside the red record button and click it. You now have a clean slate, so you will be able to see the request that is sent to the server when you change the date on the page.
  • Change the dates according to your needs and then click Apply. You will see that a request named HistoricalDataAjax was sent to the server (refer to the attached image below for more clarity). Click on it and scroll down in the Headers tab. You will see a section called Form Data. This is the extra hidden (yet not so hidden) information that is sent to the server. It is sent as a POST request, since you do not see any change in the URL.
  • You can also see in the same Headers section that the Request URL is https://www.investing.com/instruments/HistoricalDataAjax.
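If your browser's developer tools let you view the Form Data as a single URL-encoded string, you do not have to retype the fields by hand. The following is a minimal sketch, assuming such a string was copied from the Headers tab. The string is abbreviated, and its values are taken from the payload given further down in this answer.

```python
from urllib.parse import parse_qsl

# Abbreviated example of a raw Form Data string copied from the browser.
raw_form_data = (
    "header=BTC%2FUSD+Kraken+Historical+Data"
    "&st_date=12%2F01%2F2018&end_date=12%2F01%2F2018"
    "&action=historical_data&curr_id=49799"
)

# parse_qsl decodes the percent-escapes and returns (name, value) pairs.
payload = dict(parse_qsl(raw_form_data))

print(payload["header"])   # BTC/USD Kraken Historical Data
print(payload["st_date"])  # 12/01/2018
```

The resulting dictionary can be passed as the data argument of a POST request.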

[Image: the Network tab view]

What to do now?

You need to make three changes in your Python code.

  • Change the request from GET to POST.

  • Send the Form Data as the payload for that request.

  • Change the URL to the one you just saw in the Headers tab.

    url = "https://www.investing.com/instruments/HistoricalDataAjax"

    payload = {'header': 'BTC/USD Kraken Historical Data', 'st_date': '12/01/2018', 'end_date': '12/01/2018', 'sort_col': 'date', 'action': 'historical_data', 'smlID': '145284', 'sort_ord': 'DESC', 'interval_sec': 'Daily', 'curr_id': '49799'}

    req = requests.post(url, data=payload, headers=urlheader)

Make the above-mentioned changes and leave the rest of your code as it is. You will get the results you want. You can also modify the dates according to your needs.
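To avoid hard-coding the dates, the payload can be built by a small helper. This is only a sketch: build_payload is a hypothetical name, the MM/DD/YYYY date format is an assumption that you should check against the Form Data in the Network tab, and the start date 01/06/2016 from the question is read here as 1 June 2016.

```python
from datetime import date, timedelta


def build_payload(start, end=None):
    """Return the form fields for one request, with the dates filled in."""
    if end is None:
        end = date.today() - timedelta(days=1)  # last full day, i.e. yesterday
    fmt = "%m/%d/%Y"  # assumed MM/DD/YYYY; verify in the Network tab
    return {
        "header": "BTC/USD Kraken Historical Data",
        "st_date": start.strftime(fmt),
        "end_date": end.strftime(fmt),
        "sort_col": "date",
        "action": "historical_data",
        "smlID": "145284",
        "sort_ord": "DESC",
        "interval_sec": "Daily",
        "curr_id": "49799",
    }


# Start date from the question, read as 1 June 2016; end defaults to yesterday.
payload = build_payload(date(2016, 6, 1))
print(payload["st_date"], payload["end_date"])
```

The returned dictionary takes the place of the hard-coded payload in the POST call. The other field values are the ones observed for BTC/USD on Kraken, so a different ticker would need its own values read from the Network tab.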

Siddhartha answered Jan 29 '23 12:01
