I am trying to extract the year, make, model, colour and plate number from carjam.co.nz. An example of a URL I am scraping from is https://www.carjam.co.nz/car/?plate=JKY242.
If the plate has been recently requested, the response is an HTML document containing the vehicle details.
Result where the plate details have been recently requested.
If the plate details haven't been recently requested (as is the case with most plates), the response is an HTML document containing "Trying to get some vehicle data". I'm guessing that this page is displayed while the information is fetched from the database, and that the page is then reloaded to show the vehicle details. This appears to be rendered server-side, as I can't see any AJAX requests.
The URL is the same for each result.
Result where the vehicle hasn't been recently requested.
How do I 'wait' for the correct information?
I am using request (deprecated, I know, but it is what I am most comfortable using) on a Node.js server with Express.
My (very reduced) code:
app.get("/:numberPlate", (req, res) => {
  request("https://www.carjam.co.nz/car/?plate=" + req.params.numberPlate, function(error, response, body) {
    // Parse the returned HTML and pull out the fields of interest
    const $ = cheerio.load(body);
    res.status(200).send(JSON.stringify({
      year: $("[data-key=year_of_manufacture]").next().html(),
      make: toTitleCase($("[data-key=make]").next().html()),
      model: toTitleCase($("[data-key=model]").next().html()),
      colour: toTitleCase($("[data-key=main_colour]").next().html()),
    }));
  });
});
I have considered:
I'm sure this is a common problem with an easy answer, but I don't have enough experience to figure it out myself, or to know exactly what to Google!
To test: New Zealand has number plates in the format "ABC123": three letters, three numbers. These are released in alphabetical-ish order; currently we have nothing past NLU999 (excluding custom numberplates, numberplates issued out of sequence, etc).
To reproduce the "Trying to get some vehicle data", you need to find a new numberplate each time – most numberplates earlier in the sequence than NLU999 should work.
This code snippet should generate a valid numberplate.
console.log(Math.random().toString(36).replace(/[^a-n]+/g, '').substr(0, 1).toUpperCase() + Math.random().toString(36).replace(/[^a-z]+/g, '').substr(0, 2).toUpperCase() + Math.floor(Math.random() * 10).toString() + Math.floor(Math.random() * 10).toString() + Math.floor(Math.random() * 10).toString());
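As a more readable alternative to the one-liner above, here is a minimal sketch that treats a plate as an index into the AAA000, AAA001, ... sequence and picks a random index no later than NLU999. The names `plateFromIndex`, `indexFromPlate` and `randomPlate` are made up for illustration, and the sketch assumes a strictly alphabetical sequence with no skipped letter combinations, which is only approximately true of real plates.

```javascript
const LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

// Convert a sequence index (0 = AAA000, 1 = AAA001, ...) into a plate string.
function plateFromIndex(index) {
  const digits = String(index % 1000).padStart(3, "0");
  let prefix = Math.floor(index / 1000);
  let letters = "";
  for (let i = 0; i < 3; i++) {
    // Peel off the least significant base-26 "digit" each time
    letters = LETTERS[prefix % 26] + letters;
    prefix = Math.floor(prefix / 26);
  }
  return letters + digits;
}

// Inverse of plateFromIndex: "AAA000" -> 0, "AAA001" -> 1, ...
function indexFromPlate(plate) {
  let prefix = 0;
  for (const ch of plate.slice(0, 3)) {
    prefix = prefix * 26 + LETTERS.indexOf(ch);
  }
  return prefix * 1000 + Number(plate.slice(3));
}

// Pick a uniformly random plate between AAA000 and lastIssued (inclusive).
// Assumption: every combination in that range was actually issued.
function randomPlate(lastIssued = "NLU999") {
  const max = indexFromPlate(lastIssued);
  return plateFromIndex(Math.floor(Math.random() * (max + 1)));
}

console.log(randomPlate());
```

Unlike the one-liner, this never returns a plate later than the `lastIssued` argument (for example NZZ123) and never returns a string shorter than six characters.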
Upon further thought, this pseudocode could be what I'm after – but unsure how to practically implement.
request(url) {
  if (url body contains "Trying to get some vehicle data") {
    wait(2 seconds)
    request(url again) {
      return second_result
    }
  } else {
    return first_result
  }
}
then
process(first_result or second_result)
My difficulty here: I am used to the format request().then(), taking action directly from the request.
Assuming this approach is correct, how would I conduct the following?
According to this JavaScript file, the website reloads the page every X seconds if the data is not found, with a maximum of 10 retries. The refresh interval (in seconds) is read from the Refresh HTTP header.
You can reproduce this flow, so that you have exactly the same behaviour as the frontend code.
In the following example I'm using axios:
const axios = require("axios");
const cheerio = require("cheerio");

const rootUrl = "https://www.carjam.co.nz/car/";
const plate = "NLU975";
const maxRetry = 10;
// Text shown on the interim page while the vehicle data is being fetched
const waitingString = "Waiting for a few more things";

// Request the vehicle page for the given plate
async function getResult() {
  return axios.get(rootUrl, {
    params: {
      plate: plate,
    },
  });
}

// Re-request the page until the waiting text disappears or maxRetry is reached
async function processRetry(result) {
  // The Refresh header gives the delay in seconds; falling back to 1 second
  // when it is missing or unparsable is an assumption
  const refreshSeconds = parseInt(result.headers["refresh"], 10) || 1;
  let retryCount = 0;
  while (retryCount < maxRetry) {
    console.log(
      `retry: ${retryCount} time, waiting for ${refreshSeconds} second(s)`
    );
    retryCount++;
    await timeout(refreshSeconds * 1000);
    result = await getResult();
    if (!result.data.includes(waitingString)) {
      break;
    }
  }
  return result;
}

(async () => {
  let result = await getResult();
  if (result.data.includes(waitingString)) {
    result = await processRetry(result);
  }
  const $ = cheerio.load(result.data);
  console.log({
    year: $("[data-key=year_of_manufacture]").next().html(),
    make: $("[data-key=make]").next().html(),
    model: $("[data-key=model]").next().html(),
    colour: $("[data-key=main_colour]").next().html(),
  });
})();

// Promise-based sleep helper
function timeout(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
repl.it link: https://replit.com/@bertrandmartel/ScrapeCarJam
Sample output:
retry: 0 time, waiting for 1 second(s)
retry: 1 time, waiting for 1 second(s)
retry: 2 time, waiting for 1 second(s)
{ year: 'XXXX', make: 'XXXXXX', model: 'XX', colour: 'XXXX' }
It uses async/await instead of promise chains (.then()).
Note that request is deprecated.
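To plug the same retry flow into an Express route, or into code that still uses request, it may help to separate the polling logic from the HTTP client. The following is a minimal sketch under that assumption: `fetchUntilReady`, `refreshDelayMs`, `sleep` and the option names are all hypothetical, and the page-fetching function is passed in rather than hard-coded. Unlike the code above, which returns the last response even if it is still the waiting page, this version throws once the retries are exhausted.

```javascript
// Promise-based sleep helper
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Read the delay from a Refresh header such as "1" or "1; url=/car/...".
// Falls back to defaultDelayMs when the header is missing or unparsable.
function refreshDelayMs(headers, defaultDelayMs) {
  const seconds = parseInt(headers && headers["refresh"], 10);
  return Number.isFinite(seconds) && seconds >= 0
    ? seconds * 1000
    : defaultDelayMs;
}

// Call fetchPage() until isPending(result) is false or maxRetry is exhausted.
// fetchPage must return a promise of an object shaped like { data, headers },
// which is what axios.get() resolves to.
async function fetchUntilReady(fetchPage, isPending, options = {}) {
  const { maxRetry = 10, defaultDelayMs = 1000 } = options;
  let result = await fetchPage();
  for (let attempt = 0; attempt < maxRetry && isPending(result); attempt++) {
    await sleep(refreshDelayMs(result.headers, defaultDelayMs));
    result = await fetchPage();
  }
  if (isPending(result)) {
    throw new Error(`Still pending after ${maxRetry} retries`);
  }
  return result;
}

// Possible use inside an Express handler (not executed here):
//
// app.get("/:numberPlate", async (req, res) => {
//   try {
//     const result = await fetchUntilReady(
//       () => axios.get(rootUrl, { params: { plate: req.params.numberPlate } }),
//       (r) => r.data.includes(waitingString)
//     );
//     const $ = cheerio.load(result.data);
//     res.json({ year: $("[data-key=year_of_manufacture]").next().html() });
//   } catch (err) {
//     res.status(504).json({ error: err.message });
//   }
// });
```

Because the fetch function is injected, the retry logic can be tried out with a stub that returns a canned sequence of pages, without hitting the real site.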