 

Using request(), returned page doesn't contain needed data yet – incomplete page is returned instead. How do I 'wait'?

I am trying to extract the year, make, model, colour and plate number from carjam.co.nz. An example of a URL I am scraping from is https://www.carjam.co.nz/car/?plate=JKY242.

If the plate has been recently requested, then the response will be an HTML document with the vehicle details.

Result where the plate details have been recently requested.

If the plate details haven't been recently requested (as is the case with most plates), the response is an HTML document containing "Trying to get some vehicle data". I'm guessing that this page displays while the information is fetched from the database, then the page is reloaded to show the vehicle details. This appears to be rendered server-side; I can't see any AJAX requests.

The URL is the same for each result.

Result where the plate details haven't been recently requested.

How do I 'wait' for the correct information?

I am using request (deprecated, I know, but it is what I am most comfortable using) on a Node.js server with Express.

My (very reduced) code:

app.get("/:numberPlate", (req, res) => {
  request("https://www.carjam.co.nz/car/?plate=" + req.params.numberPlate, function (error, response, body) {
    const $ = cheerio.load(body);
    res.status(200).send(JSON.stringify({
      year: $("[data-key=year_of_manufacture]").next().html(),
      make: toTitleCase($("[data-key=make]").next().html()),
      model: toTitleCase($("[data-key=model]").next().html()),
      colour: toTitleCase($("[data-key=main_colour]").next().html()),
    }));
  });
});

I have considered:

  • Making a request and discarding it, sleeping for 2-3 seconds, then making a second request. The advantage of this approach is that every request would work. The disadvantage is that every request takes 2-3 seconds (too slow).
  • Making a request and checking whether the body contains "Trying to get some vehicle data". If so, sleep a few seconds, make another request, and take action on the result of that second request (but how?).
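The second option can be sketched as a small check-and-retry helper that doesn't care which HTTP library is used. Everything here is a hypothetical name for illustration: `fetchUntilReady`, `fetchPage`, `looksIncomplete`, and the retry/delay defaults are assumptions, not part of any library's API.

```javascript
// Check-and-retry sketch: fetchPage() resolves with the page body,
// looksIncomplete() decides whether to wait and request again.
async function fetchUntilReady(fetchPage, looksIncomplete, maxRetry = 10, delayMs = 2000) {
  let body = await fetchPage();
  for (let i = 0; i < maxRetry && looksIncomplete(body); i++) {
    // Wait before re-requesting the same URL.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    body = await fetchPage();
  }
  return body;
}
```

With the deprecated request library, `fetchPage` could be request wrapped in a promise (e.g. via util.promisify); any promise-returning fetcher works.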

I'm sure this is a common problem with an easy answer, but I don't have enough experience to figure it out myself, or to know exactly what to Google!


To test: New Zealand has number plates in the format "ABC123" – three letters, three numbers. These are released in roughly alphabetical order; currently we have nothing past NLU999 (excluding custom number plates, number plates issued out of sequence, etc.).

To reproduce the "Trying to get some vehicle data", you need to find a new numberplate each time – most numberplates earlier in the sequence than NLU999 should work.

This code snippet should generate a valid numberplate.

console.log(
  Math.random().toString(36).replace(/[^a-n]+/g, "").substr(0, 1).toUpperCase() +
  Math.random().toString(36).replace(/[^a-z]+/g, "").substr(0, 2).toUpperCase() +
  Math.floor(Math.random() * 10).toString() +
  Math.floor(Math.random() * 10).toString() +
  Math.floor(Math.random() * 10).toString()
);

05 May 2021 update

Upon further thought, this pseudocode could be what I'm after, but I'm unsure how to implement it in practice.

request(url) {
  if (url body contains "Trying to get some vehicle data") {
    wait(2 seconds)
    request(url again) {
      return second_result
    }
  } else {
    return first_result
  }
}
then
  process(first_result or second_result)

My difficulty here: I am used to the format request().then(), taking action directly from the request.

Assuming this approach is correct, how would I conduct the following?

  1. Send the request, then
  2. Assess the response, then
  3. Pass this response on, or send another request then pass that response on
  4. Process response
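Keeping the request().then() style, steps 1–4 above can be chained directly. In this sketch, `getPage` and `getVehiclePage` are hypothetical names: `getPage` is assumed to be any `(url) => Promise<string>` helper that resolves with the HTML body (for example, request wrapped in a promise, or axios.get(url).then(r => r.data)), and the 2-second default delay matches the pseudocode.

```javascript
// getPage is a hypothetical (url) => Promise<string> helper; delayMs defaults
// to the 2-second wait from the pseudocode above.
function getVehiclePage(getPage, url, delayMs = 2000) {
  // Step 1: send the request.
  return getPage(url).then((body) => {
    // Step 2: assess the response.
    if (!body.includes("Trying to get some vehicle data")) {
      return body; // Step 3: pass the first response on unchanged.
    }
    // Step 3 (other branch): wait, then send a second request.
    return new Promise((resolve) => setTimeout(resolve, delayMs)).then(() =>
      getPage(url)
    );
  });
}

// Step 4: process whichever response came back.
// getVehiclePage(getPage, url).then((body) => { /* cheerio.load(body) ... */ });
```

Because both branches return from inside the same .then(), the caller sees a single promise and never needs to know whether one or two requests were made.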
Asked Apr 29 '21 by Anonymous



1 Answer

From the site's JavaScript, the page reloads itself every X seconds while the data is being fetched, with the maximum number of retries set to 10. The refresh interval in seconds is taken from the Refresh HTTP header.

You can reproduce this flow so that your scraper behaves exactly like the frontend code.

In the following example I'm using axios:

const axios = require("axios");
const cheerio = require("cheerio");

const rootUrl = "https://www.carjam.co.nz/car/";
const plate = "NLU975";
const maxRetry = 10;
const waitingString = "Waiting for a few more things";

async function getResult() {
  return axios.get(rootUrl, {
    params: {
      plate: plate,
    },
  });
}

async function processRetry(result) {
  const refreshSeconds = parseInt(result.headers["refresh"]);
  let retryCount = 0;
  while (retryCount < maxRetry) {
    console.log(
      `retry: ${retryCount} time, waiting for ${refreshSeconds} second(s)`
    );
    retryCount++;
    await timeout(refreshSeconds * 1000);
    result = await getResult();
    if (!result.data.includes(waitingString)) {
      break;
    }
  }
  return result;
}

(async () => {
  let result = await getResult();
  if (result.data.includes(waitingString)) {
    result = await processRetry(result);
  }
  const $ = cheerio.load(result.data);
  console.log({
    year: $("[data-key=year_of_manufacture]").next().html(),
    make: $("[data-key=make]").next().html(),
    model: $("[data-key=model]").next().html(),
    colour: $("[data-key=main_colour]").next().html(),
  });
})();

function timeout(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

repl.it link: https://replit.com/@bertrandmartel/ScrapeCarJam

Sample output:

retry: 0 time, waiting for 1 second(s)
retry: 1 time, waiting for 1 second(s)
retry: 2 time, waiting for 1 second(s)
{ year: 'XXXX', make: 'XXXXXX', model: 'XX', colour: 'XXXX' }

It uses async/await instead of promise chaining.

Note that request is deprecated.

Answered Nov 14 '22 by Bertrand Martel