The task here is to scrape an API of a site, starting from https://xxx.xxx.xxx/xxx/1.json
up to https://xxx.xxx.xxx/xxx/1417749.json,
and write each response exactly as-is to MongoDB. For that I have the following code:
import json
import time

import pymongo
import requests

client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
db = client["thread1"]
com = db["threadcol"]
start_time = time.time()
write_log = open("logging.log", "a")
min = 1
max = 1417749
for n in range(min, max):
    response = requests.get("https://xxx.xxx.xxx/xxx/{}.json".format(str(n)))
    if response.status_code == 200:
        parsed = json.loads(response.text)
        inserted = com.insert_one(parsed)
        write_log.write(str(n) + "\t" + str(inserted) + "\n")
        print(str(n) + "\t" + str(inserted) + "\n")
write_log.close()
But it is taking a lot of time to complete the task. The question here is: how can I speed up this process?
The network delay is the first obvious bottleneck for any web scraping project. Transmitting a request to the web server takes time. Once the request is received, the web server will send the response, which again causes a delay.
Assuming you're running 100 million requests at a rate of 1 request per second per IP and using 1,000 data center IPs, your scraping project would take about 30 hours (100,000,000 requests / 1,000 IPs = 100,000 seconds per IP, roughly 28 hours). However, with a proxy network offering a pool of 10,000,000 residential IPs, the same scrape could theoretically take only 10 seconds (10 requests per IP at 1 request per second).
Scrapy is incredibly fast. Its ability to send asynchronous requests makes it hands-down faster than BeautifulSoup. This means that you'll be able to scrape and extract data from many pages at once. BeautifulSoup doesn't have the means to crawl and scrape pages by itself.
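To make that concrete, a minimal Scrapy spider for a numbered-JSON endpoint like the one in the question might look like the sketch below; the spider name is made up, the xxx URL is the question's placeholder, and storing into MongoDB would be handled by an item pipeline:

import json

import scrapy


class ThreadSpider(scrapy.Spider):
    name = "threadcol"  # hypothetical name

    def start_requests(self):
        # Scrapy schedules these requests concurrently (CONCURRENT_REQUESTS, 16 by default).
        for n in range(1, 1417750):
            yield scrapy.Request("https://xxx.xxx.xxx/xxx/{}.json".format(n))

    def parse(self, response):
        # Yield the parsed document; a MongoDB item pipeline can insert it.
        yield json.loads(response.text)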
There are several things that you could do:
Parallelize the requests with a pool of worker threads; the classic queue-of-URLs pattern, updated for Python 3, looks like this:
import sys
import requests
from threading import Thread
from queue import Queue  # "Queue" was renamed to "queue" in Python 3

concurrent = 200  # number of worker threads

def doWork():
    while True:
        url = q.get()
        try:
            response = requests.get(url)
            print(url, response.status_code)
        finally:
            q.task_done()
q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()
try:
    for url in open('urllist.txt'):
        q.put(url.strip())
    q.join()
except KeyboardInterrupt:
    sys.exit(1)
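Applied to the question's task, the worker could fetch each numbered JSON document and insert it straight into MongoDB. pymongo's MongoClient is thread-safe, so the threads can share one client; the sketch below is only an illustration, with the xxx endpoint standing in for the real one and an arbitrary thread count:

import json
from queue import Queue
from threading import Thread

import pymongo
import requests

client = pymongo.MongoClient("mongodb://127.0.0.1:27017")  # MongoClient is thread-safe
com = client["thread1"]["threadcol"]
q = Queue(maxsize=400)

def doWork():
    session = requests.Session()  # one Session per thread: connection reuse without sharing
    while True:
        n = q.get()
        try:
            response = session.get("https://xxx.xxx.xxx/xxx/{}.json".format(n))
            if response.status_code == 200:
                com.insert_one(json.loads(response.text))
        finally:
            q.task_done()

for _ in range(200):  # arbitrary worker count; tune against the server's limits
    Thread(target=doWork, daemon=True).start()

for n in range(1, 1417750):
    q.put(n)
q.join()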
Timings from a related question, showing how much a reusable connection saves:
>>> timeit.timeit('_ = requests.get("https://www.wikipedia.org")', 'import requests', number=100)
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
...
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
52.74904417991638
>>> timeit.timeit('_ = session.get("https://www.wikipedia.org")', 'import requests; session = requests.Session()', number=100)
Starting new HTTPS connection (1): www.wikipedia.org
15.770191192626953
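Applied to the question's loop, that simply means creating one Session up front and calling get() on it, roughly like this (xxx is the question's placeholder endpoint):

import json

import pymongo
import requests

client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
com = client["thread1"]["threadcol"]

session = requests.Session()  # keeps the TCP/TLS connection open between requests
for n in range(1, 1417750):
    response = session.get("https://xxx.xxx.xxx/xxx/{}.json".format(n))
    if response.status_code == 200:
        com.insert_one(json.loads(response.text))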
You can improve your code in two respects:

- Using a Session, so that a connection is not re-established at every request but is kept open;
- Using parallelism in your code with asyncio (a sketch follows the link below).
Take a look here: https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html
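Along the lines of that article, an asyncio/aiohttp version of the scrape could look roughly like the sketch below; the xxx URL is the question's placeholder, the concurrency limit is arbitrary, and the Mongo insert still uses the blocking pymongo client for simplicity:

import asyncio
import json

import aiohttp
import pymongo

client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
com = client["thread1"]["threadcol"]

CONCURRENCY = 100  # arbitrary cap on simultaneous requests

async def fetch(session, sem, n):
    url = "https://xxx.xxx.xxx/xxx/{}.json".format(n)
    async with sem:
        async with session.get(url) as resp:
            if resp.status == 200:
                doc = json.loads(await resp.text())
                com.insert_one(doc)  # blocking call; a fully async setup would use the motor driver

async def main():
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        # Creating ~1.4M coroutines at once is memory-hungry; batching the range would be gentler.
        await asyncio.gather(*(fetch(session, sem, n) for n in range(1, 1417750)))

asyncio.run(main())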