The task here is to scrape an API of a site, starting from https://xxx.xxx.xxx/xxx/1.json
up to https://xxx.xxx.xxx/xxx/1417749.json,
and write each response exactly as-is to MongoDB. For that I have the following code:
import json
import time

import pymongo
import requests

client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
db = client["thread1"]
com = db["threadcol"]
start_time = time.time()
write_log = open("logging.log", "a")
min = 1
max = 1417749
for n in range(min, max):
    response = requests.get("https://xxx.xxx.xxx/xxx/{}.json".format(str(n)))
    if response.status_code == 200:
        parsed = json.loads(response.text)
        inserted = com.insert_one(parsed)
        write_log.write(str(n) + "\t" + str(inserted) + "\n")
        print(str(n) + "\t" + str(inserted) + "\n")
write_log.close()
But it is taking a lot of time to complete the task. The question here is: how can I speed up this process?
The network delay is the first obvious bottleneck for any web scraping project. Transmitting a request to the web server takes time. Once the request is received, the web server will send the response, which again causes a delay.
Assuming you're running 100 million requests at a rate of 1 request per second per IP and using 1,000 data center IPs, your scraping project would take about 30 hours (100,000,000 requests / 1,000 IPs = 100,000 seconds per IP, roughly 28 hours). However, with a proxy network offering a pool of 10,000,000 residential IPs, the same scrape could theoretically take only 10 seconds (10 requests per IP at 1 request per second).
Scrapy is incredibly fast. Its ability to send asynchronous requests makes it hands-down faster than BeautifulSoup. This means that you'll be able to scrape and extract data from many pages at once. BeautifulSoup doesn't have the means to crawl and scrape pages by itself.
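To make that concrete, a minimal Scrapy spider for a numbered-JSON endpoint like the one in the question might look like the sketch below; the spider name is made up, the xxx URL is the question's placeholder, and storing into MongoDB would be handled by an item pipeline:

import json

import scrapy


class ThreadSpider(scrapy.Spider):
    name = "threadcol"  # hypothetical name

    def start_requests(self):
        # Scrapy schedules these requests concurrently (CONCURRENT_REQUESTS, 16 by default).
        for n in range(1, 1417750):
            yield scrapy.Request("https://xxx.xxx.xxx/xxx/{}.json".format(n))

    def parse(self, response):
        # Yield the parsed document; a MongoDB item pipeline can insert it.
        yield json.loads(response.text)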
There are several things that you could do:
Parallelize the requests with a pool of worker threads; the classic queue-of-URLs pattern, updated for Python 3, looks like this:
import sys
import requests
from threading import Thread
from queue import Queue  # "Queue" was renamed to "queue" in Python 3

concurrent = 200  # number of worker threads

def doWork():
    while True:
        url = q.get()
        try:
            response = requests.get(url)
            print(url, response.status_code)
        finally:
            q.task_done()
q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()
try:
    for url in open('urllist.txt'):
        q.put(url.strip())
    q.join()
except KeyboardInterrupt:
    sys.exit(1)
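Applied to the question's task, the worker could fetch each numbered JSON document and insert it straight into MongoDB. pymongo's MongoClient is thread-safe, so the threads can share one client; the sketch below is only an illustration, with the xxx endpoint standing in for the real one and an arbitrary thread count:

import json
from queue import Queue
from threading import Thread

import pymongo
import requests

client = pymongo.MongoClient("mongodb://127.0.0.1:27017")  # MongoClient is thread-safe
com = client["thread1"]["threadcol"]
q = Queue(maxsize=400)

def doWork():
    session = requests.Session()  # one Session per thread: connection reuse without sharing
    while True:
        n = q.get()
        try:
            response = session.get("https://xxx.xxx.xxx/xxx/{}.json".format(n))
            if response.status_code == 200:
                com.insert_one(json.loads(response.text))
        finally:
            q.task_done()

for _ in range(200):  # arbitrary worker count; tune against the server's limits
    Thread(target=doWork, daemon=True).start()

for n in range(1, 1417750):
    q.put(n)
q.join()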
Timings from a related question, showing how much a reusable connection saves:
>>> timeit.timeit('_ = requests.get("https://www.wikipedia.org")', 'import requests', number=100)
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
...
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
52.74904417991638
>>> timeit.timeit('_ = session.get("https://www.wikipedia.org")', 'import requests; session = requests.Session()', number=100)
Starting new HTTPS connection (1): www.wikipedia.org
15.770191192626953
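Applied to the question's loop, that simply means creating one Session up front and calling get() on it, roughly like this (xxx is the question's placeholder endpoint):

import json

import pymongo
import requests

client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
com = client["thread1"]["threadcol"]

session = requests.Session()  # keeps the TCP/TLS connection open between requests
for n in range(1, 1417750):
    response = session.get("https://xxx.xxx.xxx/xxx/{}.json".format(n))
    if response.status_code == 200:
        com.insert_one(json.loads(response.text))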
You can improve your code in two respects:

- Using a Session, so that a connection is not re-established at every request but is kept open;
- Using parallelism in your code with asyncio (a sketch follows the link below).
Take a look here: https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html
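Along the lines of that article, an asyncio/aiohttp version of the scrape could look roughly like the sketch below; the xxx URL is the question's placeholder, the concurrency limit is arbitrary, and the Mongo insert still uses the blocking pymongo client for simplicity:

import asyncio
import json

import aiohttp
import pymongo

client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
com = client["thread1"]["threadcol"]

CONCURRENCY = 100  # arbitrary cap on simultaneous requests

async def fetch(session, sem, n):
    url = "https://xxx.xxx.xxx/xxx/{}.json".format(n)
    async with sem:
        async with session.get(url) as resp:
            if resp.status == 200:
                doc = json.loads(await resp.text())
                com.insert_one(doc)  # blocking call; a fully async setup would use the motor driver

async def main():
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        # Creating ~1.4M coroutines at once is memory-hungry; batching the range would be gentler.
        await asyncio.gather(*(fetch(session, sem, n) for n in range(1, 1417750)))

asyncio.run(main())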