
Is there a way to produce a unique selector for an element that won't match the wrong element on refresh or if the DOM position changes?

I'm scraping this website's user profiles using puppeteer. I have a list of profile links that I use to go to each profile page and scrape the twitter link, youtube link, and other information for each user.

example profiles

  • https://www.tradingview.com/u/QuantNomad/ - has youtube, twitter, website but not location
  • https://www.tradingview.com/u/CryptoRox/ - has twitter, website, and location but not youtube

This is the profile that I use to generate a unique selector for twitter, youtube, and the website link.

I use Chrome DevTools to get the unique selector, and the selector for youtube looks like this:

[Image: youtube scraping]

But on the other profile I shared, which doesn't have a youtube link, the same selector returns the twitter link. I want the result to be empty when there is no youtube link.

[Image: getting twitter link instead of youtube]

Not all users have a youtube link, a twitter link, etc., so these unique selectors pick up the wrong data on some profiles.

I know the selectors are just doing their job by getting the 4th item (because the selector is a:nth-child(4)), but how can I get a selector that returns only one kind of data? E.g. the youtube selector gets the youtube link, returns nothing if there is no such link, and so on.

Also keep in mind that the links can be arbitrary. Take website links, for example: each user has a different one, so you can't match the href or innerText against a predefined keyword.

Ruhul Amin asked Dec 30 '22 22:12


2 Answers

For the location, the <span> element right before it, where the marker icon lives, has a fairly distinctive class, tv-profile__title-info-icon--place, so you can grab the location text node with

const loc = document.querySelector('.tv-profile__title-info-icon--place').nextSibling.textContent;
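On a profile without a location, querySelector returns null and the line above throws a TypeError. A null-safe variant might look like the sketch below; `readLocation` and its `doc` parameter are hypothetical names introduced here, and the class name is the one quoted above.

```javascript
// Returns the location text, or '' when the profile has no location marker.
// `doc` defaults to the page's global `document`; accepting it as a parameter
// makes the function easy to try outside a browser.
function readLocation(doc = document) {
  const icon = doc.querySelector('.tv-profile__title-info-icon--place');
  // Optional chaining stops at the first missing piece instead of throwing.
  const text = icon?.nextSibling?.textContent ?? '';
  return text.trim();
}
```

With puppeteer, this should be callable as `await page.evaluate(readLocation)`, since the function is serialized and run inside the page, where `document` is defined.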

For the anchor elements, you know they will differ by their href attribute (that's why you want them, right?), so you can use that in the selector. For instance:

  • twitter link: a[href*="://twitter.com/"]
  • youtube link: a[href*="://www.youtube.com/"]

And the one link that won't match will be the personal site link:

a.tv-profile__title-info-item:not([href*="://twitter.com"]):not([href*="://www.youtube.com"])
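These selectors could be wired into puppeteer roughly as follows. This is a minimal sketch, not tested against the live site: `page` is assumed to be a puppeteer Page already navigated to a profile, and `ITEM`, `KNOWN_HOSTS`, `hostSelector`, `websiteSelector`, and `scrapeLinks` are hypothetical names introduced here. The twitter and youtube selectors are also scoped to the same class as the website selector, on the assumption that all three anchors share it.

```javascript
// Base selector for the profile's link items (class name taken from the selector above).
const ITEM = 'a.tv-profile__title-info-item';

// Hosts that get a dedicated field; any other link counts as the personal site.
const KNOWN_HOSTS = { twitter: 'twitter.com', youtube: 'www.youtube.com' };

// Selector for a link pointing at one specific host.
function hostSelector(host) {
  return `${ITEM}[href*="://${host}/"]`;
}

// Selector for the link that matches none of the known hosts.
function websiteSelector(hosts) {
  return ITEM + hosts.map((h) => `:not([href*="://${h}/"])`).join('');
}

// `page` is assumed to be a puppeteer Page on a profile URL.
async function scrapeLinks(page) {
  const hrefOrNull = async (selector) => {
    const handle = await page.$(selector); // resolves to null when nothing matches
    return handle ? handle.evaluate((el) => el.href) : null;
  };
  return {
    twitter: await hrefOrNull(hostSelector(KNOWN_HOSTS.twitter)),
    youtube: await hrefOrNull(hostSelector(KNOWN_HOSTS.youtube)),
    website: await hrefOrNull(websiteSelector(Object.values(KNOWN_HOSTS))),
  };
}
```

Because each field is looked up by what the link points to rather than by its position, a missing link comes back as null instead of a neighbouring element's href.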
Kaiido answered Jan 05 '23 16:01


If the list of external sites is finite, you could check whether each of them is present by giving querySelector part of the external site's URL:

document.querySelector('.tv-profile__title-info-item[href^="https://www.youtube.com"]')
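The same "finite list" idea can also be applied after the fact: collect every href once and sort them by hostname. The sketch below swaps the attribute selector for hostname matching, which is a different technique from the one-liner above; `KNOWN_HOSTS`, `hostOf`, and `classifyLinks` are hypothetical names, and the host list is an assumption to extend as needed.

```javascript
// Hostnames for the known link types (assumed list, without the "www." prefix).
const KNOWN_HOSTS = {
  twitter: ['twitter.com'],
  youtube: ['youtube.com'],
};

// Hostname of an absolute URL without a leading "www.", or null if it can't be parsed.
function hostOf(href) {
  try {
    return new URL(href).hostname.replace(/^www\./, '');
  } catch {
    return null;
  }
}

// Sorts hrefs into one slot per known type; anything else fills `website`.
// Slots with no matching link stay null, and the first match for a slot wins.
function classifyLinks(hrefs) {
  const result = { twitter: null, youtube: null, website: null };
  for (const href of hrefs) {
    const host = hostOf(href);
    if (host === null) continue; // skip values that are not absolute URLs
    const type = Object.keys(KNOWN_HOSTS).find((k) => KNOWN_HOSTS[k].includes(host));
    const key = type ?? 'website';
    if (result[key] === null) result[key] = href;
  }
  return result;
}
```

In puppeteer the input could come from something like `await page.$$eval('a.tv-profile__title-info-item', (els) => els.map((el) => el.href))`. Comparing parsed hostnames avoids false matches on URLs that merely contain "youtube.com" somewhere in their path or query.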
Vaviloff answered Jan 05 '23 18:01