Can't extract a link connected to `see all` button from a webpage

I've created a script that logs in to LinkedIn using requests. The login works fine.

After logging in, I requested this URL, https://www.linkedin.com/groups/137920/, to scrape the group name, Marketing Intelligence Professionals, from that page.

The script parses the name without problems. What I want to do now is scrape the link attached to the See all button at the bottom of that same page.

Group link (you have to log in to access the content)

What I've written so far (it scrapes the group name):

import json
import requests
from bs4 import BeautifulSoup

link = 'https://www.linkedin.com/login?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin'
post_url = 'https://www.linkedin.com/checkpoint/lg/login-submit'
target_url = 'https://www.linkedin.com/groups/137920/'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['session_key'] = 'your email' #put your username here
    payload['session_password'] = 'your password' #put your password here
    r = s.post(post_url,data=payload)
    r = s.get(target_url)
    soup = BeautifulSoup(r.text,"lxml")
    items = soup.select_one("code:contains('viewerGroupMembership')").get_text(strip=True)
    print(json.loads(items)['data']['name']['text'])

How can I scrape the link attached to the See all button on that page?

SMTH asked Jun 04 '20 07:06


1 Answer

There is an internal REST API that is called when you click on "See all":

GET https://www.linkedin.com/voyager/api/search/blended

The keywords query parameter contains the title of the group you requested initially (the group title on the initial page).

To get the group name, you could scrape the HTML of the initial page, but there is also an API that returns the group information when you give it the group ID:

GET https://www.linkedin.com/voyager/api/groups/groups/urn:li:group:GROUP_ID

The group ID in your case is 137920, which can be extracted directly from the URL.

An example:
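
As a minimal sketch of that extraction step, the snippet below pulls the numeric ID out of a group URL and builds the group API URL from it. The helper names extract_group_id and group_api_url are hypothetical (they are not part of LinkedIn or requests), and the sketch assumes the group URL always has the form /groups/<digits>/.

```python
import re
from urllib.parse import urlparse


def extract_group_id(group_url):
    """Return the numeric group ID from a group URL, or None if absent.

    Hypothetical helper: assumes the path looks like /groups/<digits>/.
    """
    path = urlparse(group_url).path
    match = re.search(r"/groups/(\d+)", path)
    return match.group(1) if match else None


def group_api_url(group_id):
    """Build the group API URL quoted in the answer for a given group ID."""
    return (
        "https://www.linkedin.com/voyager/api/groups/groups/"
        f"urn:li:group:{group_id}"
    )


group_id = extract_group_id("https://www.linkedin.com/groups/137920/")
print(group_id)
print(group_api_url(group_id))
```

Matching on /groups/<digits> rather than on "the last path segment" means a URL with a trailing query string or a missing final slash still yields the ID.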

import requests
from bs4 import BeautifulSoup
import re
from urllib.parse import urlencode

username = 'your username'
password = 'your password'

link = 'https://www.linkedin.com/login?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin'
post_url = 'https://www.linkedin.com/checkpoint/lg/login-submit'
target_url = 'https://www.linkedin.com/groups/137920/'

group_res = re.search('.*/(.*)/$', target_url)
group_id = group_res.group(1)

with requests.Session() as s:
    # login
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['session_key'] = username
    payload['session_password'] = password
    r = s.post(post_url, data = payload)

    # API
    csrf_token = s.cookies.get_dict()["JSESSIONID"].replace("\"","")
    r = s.get(f"https://www.linkedin.com/voyager/api/groups/groups/urn:li:group:{group_id}",
        headers= {
            "csrf-token": csrf_token
        })
    group_name = r.json()["name"]["text"]
    print(f"searching data for group {group_name}")
    params = {
        "count": 10,
        "keywords": group_name,
        "origin": "SWITCH_SEARCH_VERTICAL",
        "q": "all",
        "start": 0
    }
    r = s.get(f"https://www.linkedin.com/voyager/api/search/blended?{urlencode(params)}&filters=List(resultType-%3EGROUPS)&queryContext=List(spellCorrectionEnabled-%3Etrue)",
        headers= {
            "csrf-token": csrf_token,
            "Accept": "application/vnd.linkedin.normalized+json+2.1",
            "x-restli-protocol-version": "2.0.0"
        })
    result = r.json()["included"]
    print(result)
    print("list of groupName/link")
    print([
        (t["groupName"], f'https://www.linkedin.com/groups/{t["objectUrn"].split(":")[3]}') 
        for t in result
    ])

A few notes:

  • these API calls require the session cookies
  • these API calls require a csrf-token header carrying an XSRF token, whose value is the same as the JSESSIONID cookie value (with the surrounding double quotes removed)
  • a special media type, application/vnd.linkedin.normalized+json+2.1, is necessary for the search call
  • the parentheses inside the queryContext and filters fields shouldn't be URL-encoded, otherwise the API will not take these params into account
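
The notes above can be sketched as two small helpers, shown here without any network calls. The function names csrf_from_jsessionid and build_search_url are hypothetical, and the cookie value "ajax:123" is made up for illustration. The sketch only shows how the token is cleaned and how the plain parameters are encoded while the filters and queryContext parts are appended as-is.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.linkedin.com/voyager/api/search/blended"

# Appended verbatim so the parentheses are not percent-encoded.
RAW_SUFFIX = (
    "&filters=List(resultType-%3EGROUPS)"
    "&queryContext=List(spellCorrectionEnabled-%3Etrue)"
)


def csrf_from_jsessionid(cookie_value):
    """Strip the double quotes that wrap the JSESSIONID cookie value."""
    return cookie_value.replace('"', "")


def build_search_url(group_name, start=0, count=10):
    """Encode the plain parameters, then append the raw List(...) fields."""
    params = {
        "count": count,
        "keywords": group_name,
        "origin": "SWITCH_SEARCH_VERTICAL",
        "q": "all",
        "start": start,
    }
    return f"{SEARCH_URL}?{urlencode(params)}{RAW_SUFFIX}"


# "ajax:123" is an invented cookie value, used only for illustration.
token = csrf_from_jsessionid('"ajax:123"')
headers = {
    "csrf-token": token,
    "Accept": "application/vnd.linkedin.normalized+json+2.1",
    "x-restli-protocol-version": "2.0.0",
}
print(headers["csrf-token"])
print(build_search_url("Marketing Intelligence Professionals"))
```

Passing filters and queryContext through urlencode would turn the parentheses into %28 and %29, which is the case the last note warns about.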

Bertrand Martel answered Oct 17 '22 02:10