
How to make simple, fast requests with the "requests" module in Python?

I am a beginner in Python. I'm just trying to scrape the web with the requests module and BeautifulSoup; this is the website I make requests to.

And here is my simple code:

import requests, time, re, json
from bs4 import BeautifulSoup as BS

url = "https://www.jobstreet.co.id/en/job-search/job-vacancy.php?ojs=6"

def list_jobs():
    try:
        with requests.Session() as s:
            st = time.time()
            s.headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0'}
            req = s.get(url)
            soup = BS(req.text,'html.parser')
            attr = soup.findAll('div',class_='position-title header-text')
            pttr = r".?(.*)Rank=\d+"
            lists = {"status":200,"result":[]}
            for a in attr:
                sr = re.search(pttr, a.find("a")["href"])
                if sr:
                    title = a.find('a')['title'].replace("Lihat detil lowongan -","").replace("\r","").replace("\n","")
                    # use a different name so the module-level `url` is not shadowed
                    job_url = a.find('a')['href']
                    lists["result"].append({
                        "title":title,
                        "url":job_url,
                        "detail":detail_jobs(job_url)
                    })
            print(json.dumps(lists, indent=4))
            end = time.time() - st
            print(f"\n{end} second")
    except:
        pass

def detail_jobs(find_url):
    try:
        with requests.Session() as s:
            s.headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0'}
            req = s.get(find_url)
            soup = BS(req.text,'html.parser')
            position = soup.find('h1',class_='job-position').text
            name = soup.find('div',class_='company_name').text.strip("\t")
            try:
                addrs = soup.find('div',class_='map-col-wraper').find('p',{'id':'address'}).text
            except Exception:
                addrs = "Unknown"
            try:
                loct = soup.find('span',{'id':'single_work_location'}).text
            except Exception:
                loct = soup.find('span',{'id':'multiple_work_location_list'}).find('span',{'class':'show'}).text        
            dests = soup.findAll('div',attrs={'id':'job_description'})
            for select in dests:
                txt = select.text if not select.text.startswith("\n") or not select.text.endswith("\n") else select.text.replace("\n","")
                result = {
                    "name":name,
                    "location":loct,
                    "position":position,
                    "description":txt,
                    "address":addrs
                }
                return result
    except:
        pass

It all works well, but it takes very long to show results; the time is always above 13-17 seconds.

I don't know how to increase my request speed.

I tried searching Stack Overflow and Google; they said to use asyncio, but that approach is too hard for me.

If someone has a simple trick for increasing the speed, I would really appreciate it.

And sorry for my bad English.

asked Jan 12 '19 by katro coplax


1 Answer

Learning Python through projects such as web scraping is awesome. That is how I was introduced to Python. That said, to increase the speed of your scraping, you can do three things:

  1. Change the HTML parser to something faster. 'html.parser' is slow; try changing to 'lxml' (read https://www.crummy.com/software/BeautifulSoup/bs4/doc/). A quick timing sketch follows this list.


  2. Drop the loops and regex, as they slow down your script. Just use BeautifulSoup tools such as .text and .strip(), and find the right tags (see my script below).

  3. Since the bottleneck in web scraping is usually IO (waiting to receive data from a webpage), using async or multithreading will boost speed. In the script below, I have used multithreading. The aim is to pull data from multiple pages at the same time.

So if we know the maximum number of pages, we can chunk our requests into different ranges and pull them in batches :) (a short batching sketch follows the results below).
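To illustrate the first point, here is a minimal timing sketch (assuming the lxml package is installed; the URL is the one from the question):

import time

import requests
from bs4 import BeautifulSoup

# Fetch the page once, then parse the same HTML with each parser
html = requests.get('https://www.jobstreet.co.id/en/job-search/job-vacancy.php?ojs=6',
                    headers={'User-Agent': 'Mozilla/5.0'}).text

for parser in ('html.parser', 'lxml'):
    start = time.time()
    BeautifulSoup(html, parser)
    print(f'{parser}: {time.time() - start:.3f} s')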

Code example:

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

import requests
from bs4 import BeautifulSoup as bs

data = defaultdict(list)

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0'}

def get_data(data, headers, page=1):

    # Get start time
    start_time = datetime.now()
    url = f'https://www.jobstreet.co.id/en/job-search/job-vacancy/{page}/?src=20&srcr=2000&ojs=6'
    r = requests.get(url, headers=headers)

    # If the request is fine, proceed
    if r.ok:
        jobs = bs(r.content,'lxml').find('div',{'id':'job_listing_panel'})
        data['title'].extend([i.text.strip() for i in jobs.find_all('div',{'class':'position-title header-text'})])
        data['company'].extend([i.text.strip() for i in jobs.find_all('h3',{'class':'company-name'})])
        data['location'].extend([i['title'] for i in jobs.find_all('li',{'class':'job-location'})] )
        data['desc'].extend([i.text.strip() for i in jobs.find_all('ul',{'class':'list-unstyled hidden-xs '})])
    else:
        print('connection issues')
    print(f'Page: {page} | Time taken {datetime.now()-start_time}')
    return data
    

def multi_get_data(data,headers,start_page=1,end_page=20,workers=20):
    start_time = datetime.now()
    # Execute our get_data in multiple threads each having a different page number
    with ThreadPoolExecutor(max_workers=workers) as executor:
        [executor.submit(get_data, data=data,headers=headers,page=i) for i in range(start_page,end_page+1)]
    
    print(f'Page {start_page}-{end_page} | Time taken {datetime.now() - start_time}')
    return data


# Test page 10-15
k = multi_get_data(data,headers,start_page=10,end_page=15)

Results: each page's fetch time and the total time are printed (timing screenshot omitted).

Explaining the multi_get_data function:

This function calls the get_data function in different threads, passing the desired arguments. At the moment, each thread gets a different page number to fetch. The maximum number of workers is set to 20, meaning 20 threads. You can increase or decrease this accordingly.

We have created the variable data, a default dictionary that holds lists. All threads will populate this data. This variable can then be converted to JSON or a Pandas DataFrame :) (a small conversion sketch follows).
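For example, once the threads have finished, the collected data could be converted like this (a small sketch; building the DataFrame assumes all lists ended up the same length):

import json

import pandas as pd

# Dump the collected results as JSON ...
print(json.dumps(data, indent=4))

# ... or load them into a DataFrame for further analysis
df = pd.DataFrame(data)
print(df.head())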

In the test run, each request takes less than 2 seconds on its own, yet the total is still under 2 seconds, because the pages are fetched concurrently ;)
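If the site had many more pages, the same function could be reused over successive ranges to pull them in batches, as mentioned above (a rough sketch; the page range and batch size of 20 are arbitrary):

# Pull pages 1-100 in batches of 20 pages each
for start in range(1, 101, 20):
    multi_get_data(data, headers, start_page=start, end_page=start + 19)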

Enjoy web scraping.

Update: 22/12/2019

We could also gain some speed by using a session with a single headers update, so we don't have to open a new session with each call.

from requests import Session

s = Session()
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '\
                         'AppleWebKit/537.36 (KHTML, like Gecko) '\
                         'Chrome/75.0.3770.80 Safari/537.36'}
# Add headers
s.headers.update(headers)

# we can use s as we do requests
# s.get(...)
...
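The session above can also be combined with the threaded approach by handing it to the workers, so connections are reused across pages. Below is a rough sketch; get_page is a hypothetical helper, and note that sharing one Session across threads is a common pattern but is not officially documented as thread-safe:

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

data = defaultdict(list)

def get_page(page):
    # Same idea as get_data above, but reuses the shared session `s`
    url = f'https://www.jobstreet.co.id/en/job-search/job-vacancy/{page}/?src=20&srcr=2000&ojs=6'
    r = s.get(url)
    if r.ok:
        data['html'].append(r.text)  # parse with BeautifulSoup as before

with ThreadPoolExecutor(max_workers=20) as executor:
    for page in range(10, 16):
        executor.submit(get_page, page)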
answered Oct 06 '22 by Prayson W. Daniel