Can't extract a link connected to `see all` button from a webpage

Question

I've created a script to log in to LinkedIn using requests. The script is working fine.

After logging in, I used the URL https://www.linkedin.com/groups/137920/ to scrape the group name, Marketing Intelligence Professionals, from that page.

The script parses the name flawlessly. However, what I want to do now is scrape the link connected to the See all button located at the bottom of that same page.

Group link: https://www.linkedin.com/groups/137920/ (you need to log in to access the content)

Here is what I've written so far (it scrapes the group name mentioned above):

import json
import requests
from bs4 import BeautifulSoup

link = 'https://www.linkedin.com/login?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin'
post_url = 'https://www.linkedin.com/checkpoint/lg/login-submit'
target_url = 'https://www.linkedin.com/groups/137920/'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['session_key'] = 'your email' #put your username here
    payload['session_password'] = 'your password' #put your password here
    r = s.post(post_url,data=payload)
    r = s.get(target_url)
    soup = BeautifulSoup(r.text,"lxml")
    items = soup.select_one("code:contains('viewerGroupMembership')").get_text(strip=True)
    print(json.loads(items)['data']['name']['text'])

How can I scrape the link connected to the See all button from that page?

Bertrand Martel · Accepted Answer

There is an internal REST API that is called when you click on "See all":

GET https://www.linkedin.com/voyager/api/search/blended

The keywords query parameter contains the title of the group you requested initially (the group title on the initial page).
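As a rough sketch of how that request URL might be assembled, the snippet below builds the query string with the group title in keywords. The helper name build_search_url and the idea of exposing start and count as arguments are assumptions made for illustration; the parameter names and values are the ones used in the example request further down. Passing safe='()' to urlencode keeps the parentheses literal while still percent-encoding the '>' character.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.linkedin.com/voyager/api/search/blended"


def build_search_url(group_name, start=0, count=10):
    """Build the search URL for a group title (hypothetical helper)."""
    params = {
        "count": count,
        "keywords": group_name,  # the group title from the initial page
        "origin": "SWITCH_SEARCH_VERTICAL",
        "q": "all",
        "start": start,
        "filters": "List(resultType->GROUPS)",
        "queryContext": "List(spellCorrectionEnabled->true)",
    }
    # safe='()' leaves the parentheses unencoded; '>' still becomes %3E.
    return f"{SEARCH_ENDPOINT}?{urlencode(params, safe='()')}"


print(build_search_url("Marketing Intelligence Professionals"))
```

Keeping every parameter in one dictionary avoids concatenating a hand-written suffix onto the encoded query string.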

To get the group name, you could scrape the HTML of the initial page, but there is also an API that returns the group information when you give it the group ID:

GET https://www.linkedin.com/voyager/api/groups/groups/urn:li:group:GROUP_ID

The group ID in your case is 137920, which can be extracted directly from the URL.
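One possible way to pull the ID out of the URL is sketched below. The function name extract_group_id is hypothetical, and the sketch assumes the URL path always has the shape /groups/<numeric id>/, which is what the URL in the question looks like.

```python
from urllib.parse import urlparse


def extract_group_id(group_url):
    """Return the numeric ID from a URL shaped like .../groups/<id>/ (hypothetical helper)."""
    # Split the path into its non-empty segments, e.g. ['groups', '137920'].
    segments = [part for part in urlparse(group_url).path.split("/") if part]
    if len(segments) < 2 or segments[0] != "groups" or not segments[1].isdigit():
        raise ValueError(f"not a group URL: {group_url!r}")
    return segments[1]


print(extract_group_id("https://www.linkedin.com/groups/137920/"))
```

Unlike a bare regular expression on the trailing path segment, this version also accepts a URL without the final slash and rejects URLs that are not group pages.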

An example:

import requests
from bs4 import BeautifulSoup
import re
from urllib.parse import urlencode

username = 'your username'
password = 'your password'

link = 'https://www.linkedin.com/login?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin'
post_url = 'https://www.linkedin.com/checkpoint/lg/login-submit'
target_url = 'https://www.linkedin.com/groups/137920/'

group_res = re.search('.*/(.*)/$', target_url)
group_id = group_res.group(1)

with requests.Session() as s:
    # login
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['session_key'] = username
    payload['session_password'] = password
    r = s.post(post_url, data = payload)

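    # API: the csrf-token header must carry the JSESSIONID cookie value
    # with its surrounding double quotes removed
    csrf_token = s.cookies.get_dict()["JSESSIONID"].replace("\"","")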
    r = s.get(f"https://www.linkedin.com/voyager/api/groups/groups/urn:li:group:{group_id}",
        headers= {
            "csrf-token": csrf_token
        })
    group_name = r.json()["name"]["text"]
    print(f"searching data for group {group_name}")
    params = {
        "count": 10,
        "keywords": group_name,
        "origin": "SWITCH_SEARCH_VERTICAL",
        "q": "all",
        "start": 0
    }
    r = s.get(f"https://www.linkedin.com/voyager/api/search/blended?{urlencode(params)}&filters=List(resultType-%3EGROUPS)&queryContext=List(spellCorrectionEnabled-%3Etrue)",
        headers= {
            "csrf-token": csrf_token,
            "Accept": "application/vnd.linkedin.normalized+json+2.1",
            "x-restli-protocol-version": "2.0.0"
        })
    result = r.json()["included"]
    print(result)
    print("list of groupName/link")
    print([
        (t["groupName"], f'https://www.linkedin.com/groups/{t["objectUrn"].split(":")[3]}') 
        for t in result
    ])
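The list comprehension above assumes every entry in the result has both a groupName and an objectUrn. A more defensive variant might look like the sketch below; the function name group_links and the sample data are made up for illustration, and the urn:li:group:<id> shape of objectUrn is an assumption based on the URN used in the group API call.

```python
def group_links(included):
    """Collect (group name, group URL) pairs, skipping entries without group data."""
    links = []
    for item in included:
        name = item.get("groupName")
        parts = item.get("objectUrn", "").split(":")
        # Assumed URN shape: urn:li:group:<id>
        if name is None or len(parts) != 4 or parts[2] != "group":
            continue
        links.append((name, f"https://www.linkedin.com/groups/{parts[3]}"))
    return links


# Made-up sample standing in for r.json()["included"]
sample = [
    {"groupName": "Marketing Intelligence Professionals", "objectUrn": "urn:li:group:137920"},
    {"note": "hypothetical entry without group fields"},
]
print(group_links(sample))
```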

A few notes:

These API calls require the session cookies.
These API calls require a specific csrf-token header carrying an XSRF token, whose value is the same as the JSESSIONID cookie value (with the double quotes removed).
The special media type application/vnd.linkedin.normalized+json+2.1 is necessary for the search call.
The parentheses inside the queryContext and filters parameters must not be URL-encoded, otherwise the API will not take these parameters into account.
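The header requirements in these notes could be gathered into one small helper, sketched below. The name voyager_headers and the for_search flag are hypothetical, and the cookie value in the sample is made up; the header names and values are the ones used in the example above.

```python
def voyager_headers(cookies, for_search=False):
    """Build request headers from a cookie dict such as s.cookies.get_dict()."""
    # The csrf-token header is the JSESSIONID cookie value without double quotes.
    headers = {"csrf-token": cookies["JSESSIONID"].replace('"', "")}
    if for_search:
        # The search call additionally needs the normalized JSON media type.
        headers["Accept"] = "application/vnd.linkedin.normalized+json+2.1"
        headers["x-restli-protocol-version"] = "2.0.0"
    return headers


sample_cookies = {"JSESSIONID": '"ajax:0123456789"'}  # made-up value
print(voyager_headers(sample_cookies, for_search=True))
```

With a logged-in requests.Session named s, the result could be passed as headers=voyager_headers(s.cookies.get_dict()) for the group call, and with for_search=True for the search call.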

Tags: python, python-3.x, beautifulsoup, python-requests, web-scraping
