 

How can I scrape faster?

The task here is to scrape an API of a site whose endpoints run from https://xxx.xxx.xxx/xxx/1.json to https://xxx.xxx.xxx/xxx/1417749.json and write each response to MongoDB. For that I have the following code:

import json
import time

import pymongo
import requests

client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
db = client["thread1"]
com = db["threadcol"]
start_time = time.time()
write_log = open("logging.log", "a")
min = 1
max = 1417749
for n in range(min, max):
    # One blocking HTTP request per document
    response = requests.get("https://xx.xxx.xxx/{}.json".format(n))
    if response.status_code == 200:
        parsed = json.loads(response.text)
        inserted = com.insert_one(parsed)
        write_log.write(str(n) + "\t" + str(inserted) + "\n")
        print(str(n) + "\t" + str(inserted) + "\n")
write_log.close()

But it is taking a lot of time to do the task. The question is: how can I speed up this process?

asked Dec 13 '19 by Tek Nath

People also ask

Why is web scraping slow?

The network delay is the first obvious bottleneck for any web scraping project. Transmitting a request to the web server takes time. Once the request is received, the web server will send the response, which again causes a delay.

How long does web scraping take?

Assuming you're running 100 million requests at the rate of 1 request per second per IP and using 1,000 data center IPs, your scraping project can take about 30 hours. However, if you are using a proxy network with a pool of 10,000,000 residential IPs, your scraping can theoretically take only 10 seconds.
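As a rough back-of-the-envelope check of that figure (this calculation is not from the original answer, just the numbers above plugged in):

total_requests = 100_000_000        # 100 million requests
requests_per_second = 1 * 1000      # 1 request/s per IP across 1,000 data center IPs
seconds = total_requests / requests_per_second   # 100,000 seconds
print(seconds / 3600)                            # ~27.8 hours, i.e. roughly "about 30 hours"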

Is Scrapy faster than beautiful soup?

Scrapy is incredibly fast. Its ability to send asynchronous requests makes it hands-down faster than BeautifulSoup. This means that you'll be able to scrape and extract data from many pages at once. BeautifulSoup doesn't have the means to crawl and scrape pages by itself.
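For illustration only, a minimal Scrapy spider for numbered JSON endpoints like those in the question might look like the sketch below; the URL pattern is the question's placeholder, the range is truncated, and Scrapy's scheduler issues the requests concurrently (governed by the CONCURRENT_REQUESTS setting).

import json
import scrapy

class JsonSpider(scrapy.Spider):
    # Minimal sketch; run with `scrapy runspider json_spider.py -o out.json`
    name = "json_spider"
    start_urls = [
        "https://xxx.xxx.xxx/xxx/{}.json".format(n) for n in range(1, 1001)
    ]

    def parse(self, response):
        # Each yielded dict can be routed to MongoDB through an item pipeline.
        yield json.loads(response.text)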


2 Answers

There are several things that you could do:

  1. Reuse the connection (see the Session sketch just after this list). According to the benchmark below, it is about 3 times faster.
  2. Scrape several URLs in parallel; the threaded worker-pool example below does this.
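
For the first point, here is a minimal sketch of the question's loop rewritten around requests.Session; the placeholder URL and the `com` collection are taken from the question, and error handling is omitted.

import requests

session = requests.Session()  # reuses the underlying TCP/TLS connection between requests

for n in range(1, 1417749):
    response = session.get("https://xxx.xxx.xxx/xxx/{}.json".format(n))
    if response.status_code == 200:
        com.insert_one(response.json())  # `com` is the MongoDB collection from the question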

Parallel code from here

import sys
from threading import Thread
from queue import Queue   # Python 3 module name ("Queue" in Python 2)

import requests

concurrent = 200  # number of worker threads

def doWork():
    # Example worker: each thread pulls URLs off the queue and fetches them.
    while True:
        url = q.get()
        try:
            print(requests.get(url).status_code, url)
        finally:
            q.task_done()

q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()
try:
    for url in open('urllist.txt'):
        q.put(url.strip())
    q.join()
except KeyboardInterrupt:
    sys.exit(1)

Timings from this question for a reused connection:

>>> timeit.timeit('_ = requests.get("https://www.wikipedia.org")', 'import requests', number=100)
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
...
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
52.74904417991638
>>> timeit.timeit('_ = session.get("https://www.wikipedia.org")', 'import requests; session = requests.Session()', number=100)
Starting new HTTPS connection (1): www.wikipedia.org
15.770191192626953
answered Oct 05 '22 by keiv.fly


You can improve your code in two respects:

  • Use a Session, so that the connection is not re-established at every request but kept open;

  • Use parallelism in your code with asyncio (see the aiohttp sketch below).

Have a look here: https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html
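
A minimal sketch along those lines, assuming aiohttp is installed; the URL is the question's placeholder, the batch size is arbitrary, and the MongoDB insert is left as a comment.

import asyncio
import aiohttp

BASE_URL = "https://xxx.xxx.xxx/xxx/{}.json"  # placeholder from the question

async def fetch(session, n):
    # Return (n, parsed JSON) or (n, None) on a non-200 response.
    async with session.get(BASE_URL.format(n)) as response:
        if response.status == 200:
            return n, await response.json()
        return n, None

async def main(first=1, last=1417749, batch_size=100):
    async with aiohttp.ClientSession() as session:
        for start in range(first, last + 1, batch_size):
            batch = range(start, min(start + batch_size, last + 1))
            results = await asyncio.gather(*(fetch(session, n) for n in batch))
            for n, doc in results:
                if doc is not None:
                    pass  # e.g. com.insert_one(doc) with the collection from the question

asyncio.run(main())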

answered Oct 05 '22 by albestro