Scraping data from a dynamic web table

I want to scrape data from a webpage with a dynamic table. The table contains information on train rides.

This is the website: https://www.laerm-monitoring.de/zug/?mp=3/

I tried to request the data with a requests session that has a retry adapter mounted, but I only got the basic HTML without the data from the table.

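import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def requests_retry_session(
    retries=3,
    backoff_factor=0.3,
    status_forcelist=(500, 502, 504, 429),
    session=None,
):
    # Reuse the caller's session if one is given, otherwise create a new one.
    session = session or requests.Session()
    # Retry failed connects/reads and the listed status codes, with backoff.
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    # Use the retrying adapter for both HTTP and HTTPS URLs.
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session


session = requests_retry_session()
response = session.get('https://www.laerm-monitoring.de/zug/?mp=3/')
response.content  # only the static HTML; the table rows are missing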

How can I do this correctly?

asked Apr 24 '21 18:04

3 Answers

The data is loaded dynamically from a different URL. This example shows how to load it with just requests and BeautifulSoup:

import json
import requests
from bs4 import BeautifulSoup

data = {
    "sort": "Einfahrtzeit-desc",
    "page": "1",
    "pageSize": "10",
    "group": "",
    "filter": "",
    "__RequestVerificationToken": "",
    "locid": "1",
}

headers = {"X-Requested-With": "XMLHttpRequest"}

url = "https://www.laerm-monitoring.de/zug/"
api_url = "https://www.laerm-monitoring.de/zug/train_read"

with requests.Session() as s:
    soup = BeautifulSoup(s.get(url).content, "html.parser")
    data["__RequestVerificationToken"] = soup.select_one(
        '[name="__RequestVerificationToken"]'
    )["value"]
    data = s.post(api_url, data=data, headers=headers).json()

# pretty print the data
print(json.dumps(data, indent=4))

Prints:

{
    "Data": [
        {
            "id": 2536954,
            "Einfahrtzeit": "2021-04-24T20:56:26.1703+02:00",
            "Gleis": 1,
            "Richtung": "Kiel",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 7.3,
            "Zugl\u00e4nge": 181.85884,
            "Geschwindigkeit": 115.57797,
            "Maximalpegel": 88.611084,
            "Vorbeifahrtpegel": 85.421326,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536944,
            "Einfahrtzeit": "2021-04-24T20:52:25.1703+02:00",
            "Gleis": 2,
            "Richtung": "Hamburg",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 6.3,
            "Zugl\u00e4nge": 211.10226,
            "Geschwindigkeit": 152.60104,
            "Maximalpegel": 91.81743,
            "Vorbeifahrtpegel": 87.95224,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536929,
            "Einfahrtzeit": "2021-04-24T20:44:31.4703+02:00",
            "Gleis": 1,
            "Richtung": "Kiel",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 5.3,
            "Zugl\u00e4nge": 104.69964,
            "Geschwindigkeit": 110.10052,
            "Maximalpegel": 82.100815,
            "Vorbeifahrtpegel": 79.98168,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536924,
            "Einfahrtzeit": "2021-04-24T20:42:30.3703+02:00",
            "Gleis": 1,
            "Richtung": "Kiel",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 2.9,
            "Zugl\u00e4nge": 49.305683,
            "Geschwindigkeit": 125.18,
            "Maximalpegel": 98.63289,
            "Vorbeifahrtpegel": 97.25019,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536925,
            "Einfahrtzeit": "2021-04-24T20:42:20.5703+02:00",
            "Gleis": 2,
            "Richtung": "Hamburg",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 0.0,
            "Zugl\u00e4nge": 0.0,
            "Geschwindigkeit": 0.0,
            "Maximalpegel": 0.0,
            "Vorbeifahrtpegel": 0.0,
            "G\u00fcltig": "-"
        },
        {
            "id": 2536911,
            "Einfahrtzeit": "2021-04-24T20:35:19.3703+02:00",
            "Gleis": 1,
            "Richtung": "Kiel",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 4.1,
            "Zugl\u00e4nge": 103.97647,
            "Geschwindigkeit": 132.2034,
            "Maximalpegel": 87.111984,
            "Vorbeifahrtpegel": 85.6776,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536907,
            "Einfahrtzeit": "2021-04-24T20:33:31.2703+02:00",
            "Gleis": 2,
            "Richtung": "Hamburg",
            "Category": "GZ",
            "Zugkategorie": "G\u00fcterzug",
            "Vorbeifahrtdauer": 23.8,
            "Zugl\u00e4nge": 583.19586,
            "Geschwindigkeit": 95.63598,
            "Maximalpegel": 88.02967,
            "Vorbeifahrtpegel": 85.02115,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536890,
            "Einfahrtzeit": "2021-04-24T20:25:36.1703+02:00",
            "Gleis": 2,
            "Richtung": "Hamburg",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 3.5,
            "Zugl\u00e4nge": 104.63446,
            "Geschwindigkeit": 160.47487,
            "Maximalpegel": 88.60612,
            "Vorbeifahrtpegel": 86.46721,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536882,
            "Einfahrtzeit": "2021-04-24T20:22:05.8703+02:00",
            "Gleis": 2,
            "Richtung": "Hamburg",
            "Category": "GZ",
            "Zugkategorie": "G\u00fcterzug",
            "Vorbeifahrtdauer": 26.6,
            "Zugl\u00e4nge": 653.52515,
            "Geschwindigkeit": 94.59859,
            "Maximalpegel": 91.9396,
            "Vorbeifahrtpegel": 85.50632,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536869,
            "Einfahrtzeit": "2021-04-24T20:16:24.3703+02:00",
            "Gleis": 1,
            "Richtung": "Kiel",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 3.3,
            "Zugl\u00e4nge": 87.8222,
            "Geschwindigkeit": 160.01207,
            "Maximalpegel": 91.3928,
            "Vorbeifahrtpegel": 89.54336,
            "G\u00fcltig": "OK"
        }
    ],
    "Total": 8657,
    "AggregateResults": null,
    "Errors": null
}
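
Once the JSON is available, the records under "Data" can be flattened into CSV. The following is only a minimal sketch using the standard library: the function name records_to_csv and the only_valid switch are made up for illustration, the sample is two records from the output above trimmed to a few fields, and treating "Gültig" == "OK" as "valid measurement" is an assumption based on the zero-valued row.

```python
import csv
import io

# Two records copied from the output above, trimmed to a few fields.
sample = {
    "Data": [
        {"id": 2536954, "Einfahrtzeit": "2021-04-24T20:56:26.1703+02:00",
         "Richtung": "Kiel", "Geschwindigkeit": 115.57797, "G\u00fcltig": "OK"},
        {"id": 2536925, "Einfahrtzeit": "2021-04-24T20:42:20.5703+02:00",
         "Richtung": "Hamburg", "Geschwindigkeit": 0.0, "G\u00fcltig": "-"},
    ],
    "Total": 8657,
}


def records_to_csv(payload, only_valid=True):
    """Return the payload's "Data" records as CSV text (hypothetical helper).

    With only_valid=True, rows whose "Gültig" field is not "OK" are dropped.
    That reading of the field is an assumption, not documented behaviour.
    """
    rows = payload.get("Data") or []
    if only_valid:
        rows = [row for row in rows if row.get("G\u00fcltig") == "OK"]
    if not rows:
        return ""
    buffer = io.StringIO()
    # Take the column order from the first record; all records share the same keys.
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = records_to_csv(sample)
print(csv_text)
```

Passing only_valid=False keeps the zero-valued rows as well, which may be preferable if you want to count every detected train.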

answered Oct 18 '22 05:10

Andrej Kesely

With a simple GET request you can retrieve the HTML of the landing page.

import requests

response = requests.get('https://www.laerm-monitoring.de/zug/')  # even without query-parameters: ?mp=3/
print(response.content)

Analyze the dynamic requests (browser)

This can also be done in any browser. In the source view (on Windows/Linux: Ctrl + U, on Mac: Cmd + U) you will find the token needed for all subsequent requests against the REST API: __RequestVerificationToken.

It's inside a hidden <input> form field on this page:

<input name="__RequestVerificationToken" type="hidden" value="CfDJ8B_eKmsiQC9Esc7ZjyC063dp6MzAtP3Sawnrfz3SCqxOMoPCYMV4sjDbrhDbuOsPcLnOiElgqQWTdMxCgfmhNVx1eC6oR81kZT3os2z3DJxtu6H9V7fKt9z9bdSJwB1ACYSSYWHsmPzt-AMWvSk4eYU" />
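
If you prefer not to depend on BeautifulSoup, the hidden field can also be read with Python's built-in html.parser. This is only a sketch: the names TokenFinder and extract_token are invented for illustration, and the token value in sample_html is a placeholder, not a real token.

```python
from html.parser import HTMLParser


class TokenFinder(HTMLParser):
    """Remember the value of the first <input> with the given name."""

    def __init__(self, field_name="__RequestVerificationToken"):
        super().__init__()
        self.field_name = field_name
        self.token = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        if tag != "input" or self.token is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == self.field_name:
            self.token = attributes.get("value")


def extract_token(html):
    """Return the token string, or None if the field is not present."""
    finder = TokenFinder()
    finder.feed(html)
    return finder.token


# Placeholder markup; on the real page, pass the text of the landing-page response.
sample_html = (
    '<form><input name="__RequestVerificationToken" '
    'type="hidden" value="CfDJ8-example" /></form>'
)
print(extract_token(sample_html))
```

On the real page you would pass the decoded HTML of the landing page (for example response.text) to extract_token and check for None before using the result.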

When the page loads in your browser this token will be used to load the data dynamically (as you already assumed) via JavaScript XMLHttpRequests (XHR).

To view these XHR requests open the Network tab of your browser's developer tools window (shortcut F12):

Chrome: Inspect network activity
Firefox: Network Monitor

[Screenshot: the browser's dev-tools Network tab shows two XHR requests]

Both requests are fetching the measured data as JSON. For security reasons the called web API requires a token which is sent using a POST request. It's submitted in the body as x-www-form-urlencoded along with the pagination parameters.
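
Because the response reports a "Total" next to each page of "Data", the pagination parameters can be stepped through until every record has been seen. The sketch below shows only that loop. The names iter_all_records and fake_fetch are hypothetical, and the network call is left to a function you supply, so the example runs offline against a fake 25-row data set. Whether the server accepts page sizes other than 10 is an assumption you would need to check.

```python
def iter_all_records(fetch_page, page_size=10, max_pages=None):
    """Yield records page by page until "Total" is reached.

    fetch_page(page, page_size) is supplied by the caller and must return
    the decoded JSON dict of one page (with "Data" and "Total" keys).
    """
    page = 1
    seen = 0
    while max_pages is None or page <= max_pages:
        payload = fetch_page(page, page_size)
        records = payload.get("Data") or []
        if not records:
            break  # empty page: nothing more to read
        for record in records:
            yield record
        seen += len(records)
        if seen >= payload.get("Total", 0):
            break  # all reported records have been yielded
        page += 1


def fake_fetch(page, page_size):
    """Stand-in for the real POST request: serves 25 fake rows."""
    all_rows = [{"id": i} for i in range(25)]
    start = (page - 1) * page_size
    return {"Data": all_rows[start:start + page_size], "Total": len(all_rows)}


ids = [record["id"] for record in iter_all_records(fake_fetch, page_size=10)]
print(len(ids))
```

For the real site, fetch_page would send the POST request with the token and the given page and pageSize values and return the parsed JSON. The max_pages argument is a safety cap so that a test run does not download all pages.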

See the following example from the command line using cURL:

curl -vi 'https://www.laerm-monitoring.de/zug/train_read' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' --data-raw 'sort=Einfahrtzeit-desc&page=1&pageSize=10&group=&filter=&__RequestVerificationToken=CfDJ8...'

(The token is shortened here for illustration.)

Hint: in the browser's Network tab you can usually right-click a request and copy it as a cURL command.

answered Oct 18 '22 03:10

hc_dev

I have used Selenium to do something similar with Python. I'm not sure whether that works for you. Basically, open the website, right-click the table, and choose "Inspect element". Then go to the div that the table belongs to and right-click it to copy the full XPath. Once you have the XPath, you can scrape the table using Selenium. See this answer.

The only drawback is that Selenium opens a real browser window by default instead of running in the background. It can also run the browser in headless mode (without a visible window), but I have never tried that.

Another issue is that websites can block you if repeated automated requests come from a single IP address. You can use Tor to send each request from a new IP address. I have done something like that with Twitter here.

answered Oct 18 '22 03:10

Aditya Singh Rathore

Tags:

python

python-requests

web-scraping

gython
