I am a beginner in Python, just trying to scrape the web with the requests module and BeautifulSoup.
This is the website I make requests to, and here is my simple code:
import requests, time, re, json
from bs4 import BeautifulSoup as BS

url = "https://www.jobstreet.co.id/en/job-search/job-vacancy.php?ojs=6"

def list_jobs():
    try:
        with requests.session() as s:
            st = time.time()
            s.headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0'}
            req = s.get(url)
            soup = BS(req.text,'html.parser')
            attr = soup.findAll('div',class_='position-title header-text')
            pttr = r".?(.*)Rank=\d+"
            lists = {"status":200,"result":[]}
            for a in attr:
                sr = re.search(pttr, a.find("a")["href"])
                if sr:
                    title = a.find('a')['title'].replace("Lihat detil lowongan -","").replace("\r","").replace("\n","")
                    link = a.find('a')['href']
                    lists["result"].append({
                        "title": title,
                        "url": link,
                        "detail": detail_jobs(link)
                    })
            print(json.dumps(lists, indent=4))
            end = time.time() - st
            print(f"\n{end} second")
    except:
        pass
def detail_jobs(find_url):
    try:
        with requests.session() as s:
            s.headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0'}
            req = s.get(find_url)
            soup = BS(req.text,'html.parser')
            position = soup.find('h1',class_='job-position').text
            name = soup.find('div',class_='company_name').text.strip("\t")
            try:
                addrs = soup.find('div',class_='map-col-wraper').find('p',{'id':'address'}).text
            except Exception:
                addrs = "Unknown"
            try:
                loct = soup.find('span',{'id':'single_work_location'}).text
            except Exception:
                loct = soup.find('span',{'id':'multiple_work_location_list'}).find('span',{'class':'show'}).text
            dests = soup.findAll('div',attrs={'id':'job_description'})
            for select in dests:
                txt = select.text if not select.text.startswith("\n") or not select.text.endswith("\n") else select.text.replace("\n","")
                result = {
                    "name": name,
                    "location": loct,
                    "position": position,
                    "description": txt,
                    "address": addrs
                }
                return result
    except:
        pass
It all works well, but it takes very long to show the results; the time is always around 13-17 seconds.
I don't know how to make my requests faster.
I tried searching on Stack Overflow and Google; people said to use asyncio, but that approach is too hard for me.
If someone has a simple trick to increase the speed, I would really appreciate it.
And sorry for my bad English.
Learning Python through projects such as web scraping is awesome. That is how I was introduced to Python. That said, to increase the speed of your scraping, you can do three things:
1. Drop the loops and regex, as they slow your script down. Just use BeautifulSoup tools, text and strip, and find the right tags (see my script below).
2. Since the bottleneck in web scraping is usually IO, waiting to get data from a webpage, using async or multithreading will boost speed. In the script below, I have used multithreading. The aim is to pull data from multiple pages at the same time.
3. If we know the maximum number of pages, we can chunk our requests into different ranges and pull them in batches :)
Code example:
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

import requests
from bs4 import BeautifulSoup as bs

data = defaultdict(list)
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0'}

def get_data(data, headers, page=1):
    # Get start time
    start_time = datetime.now()
    url = f'https://www.jobstreet.co.id/en/job-search/job-vacancy/{page}/?src=20&srcr=2000&ojs=6'
    r = requests.get(url, headers=headers)

    # If the request is fine, proceed
    if r.ok:
        jobs = bs(r.content,'lxml').find('div',{'id':'job_listing_panel'})
        data['title'].extend([i.text.strip() for i in jobs.find_all('div',{'class':'position-title header-text'})])
        data['company'].extend([i.text.strip() for i in jobs.find_all('h3',{'class':'company-name'})])
        data['location'].extend([i['title'] for i in jobs.find_all('li',{'class':'job-location'})])
        data['desc'].extend([i.text.strip() for i in jobs.find_all('ul',{'class':'list-unstyled hidden-xs '})])
    else:
        print('connection issues')

    print(f'Page: {page} | Time taken {datetime.now()-start_time}')
    return data

def multi_get_data(data, headers, start_page=1, end_page=20, workers=20):
    start_time = datetime.now()

    # Execute get_data in multiple threads, each with a different page number
    with ThreadPoolExecutor(max_workers=workers) as executor:
        [executor.submit(get_data, data=data, headers=headers, page=i) for i in range(start_page, end_page+1)]

    print(f'Page {start_page}-{end_page} | Time taken {datetime.now() - start_time}')
    return data

# Test pages 10-15
k = multi_get_data(data, headers, start_page=10, end_page=15)
Results:
Explaining the multi_get_data function:
This function calls the get_data function in different threads, passing the desired arguments. At the moment, each thread gets a different page number to call. The maximum number of workers is set to 20, meaning 20 threads. You can increase or decrease it accordingly.
We have created the variable data, a defaultdict, that holds lists. All threads will populate this data. This variable can then be cast to JSON or a Pandas DataFrame :)
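For instance, here is a minimal sketch of casting the collected data, assuming every page fills each of the four lists with the same number of entries (otherwise pandas will complain about mismatched lengths):

import json
import pandas as pd

# k is the defaultdict returned by multi_get_data above
as_json = json.dumps(k, indent=4)   # serialize to a JSON string
df = pd.DataFrame(k)                # columns: title, company, location, desc
print(df.head())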
As you can see, we make 6 requests (pages 10-15), each taking less than 2 seconds, yet the total is still under 2 seconds ;)
Enjoy web scraping.
Update: 22/12/2019
We could also gain some speed by using a single session with one headers update, so we don't have to start a new session with each call.
from requests import Session
s = Session()
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '\
'AppleWebKit/537.36 (KHTML, like Gecko) '\
'Chrome/75.0.3770.80 Safari/537.36'}
# Add headers
s.headers.update(headers)
# we can use s as we do requests
# s.get(...)
...
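To combine this with the threaded example above, get_data only needs to accept the session and call session.get instead of requests.get. The sketch below is my own adaptation, not part of the original script; sharing one Session across threads works in practice here, but note that requests does not formally guarantee that Session is thread-safe.

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

from bs4 import BeautifulSoup as bs
from requests import Session

s = Session()
s.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/75.0.3770.80 Safari/537.36'})

data = defaultdict(list)

def get_data(data, session, page=1):
    # Same parsing as before, but reuse the shared session's connection pool
    url = f'https://www.jobstreet.co.id/en/job-search/job-vacancy/{page}/?src=20&srcr=2000&ojs=6'
    r = session.get(url)
    if r.ok:
        jobs = bs(r.content, 'lxml').find('div', {'id': 'job_listing_panel'})
        data['title'].extend(i.text.strip() for i in jobs.find_all('div', {'class': 'position-title header-text'}))
    return data

with ThreadPoolExecutor(max_workers=20) as executor:
    [executor.submit(get_data, data=data, session=s, page=i) for i in range(10, 16)]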