I have a csv file ("SomeSiteValidURLs.csv") which lists all the links I need to scrape. The code works and will go through the urls in the csv, scrape the information and record/save it in another csv file ("Output.csv"). However, since I am planning to do this for a large portion of the site (>10,000,000 pages), speed is important. For each link, it takes about 1s to crawl and save the info into the csv, which is too slow for the magnitude of the project. So I have incorporated the threading module, and to my surprise it doesn't speed things up at all; it still takes about 1s per link. Did I do something wrong? Is there another way to speed up the processing?
Without multithreading:
import urllib2
import csv
from bs4 import BeautifulSoup
import threading

def crawlToCSV(FileName):
    with open(FileName, "rb") as f:
        for URLrecords in f:
            OpenSomeSiteURL = urllib2.urlopen(URLrecords)
            Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
            OpenSomeSiteURL.close()

            tbodyTags = Soup_SomeSite.find("tbody")
            trTags = tbodyTags.find_all("tr", class_="result-item ")

            placeHolder = []
            for trTag in trTags:
                tdTags = trTag.find("td", class_="result-value")
                tdTags_string = tdTags.string
                placeHolder.append(tdTags_string)

            with open("Output.csv", "ab") as f:
                writeFile = csv.writer(f)
                writeFile.writerow(placeHolder)

crawlToCSV("SomeSiteValidURLs.csv")
With multithreading:
import urllib2
import csv
from bs4 import BeautifulSoup
import threading

def crawlToCSV(FileName):
    with open(FileName, "rb") as f:
        for URLrecords in f:
            OpenSomeSiteURL = urllib2.urlopen(URLrecords)
            Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
            OpenSomeSiteURL.close()

            tbodyTags = Soup_SomeSite.find("tbody")
            trTags = tbodyTags.find_all("tr", class_="result-item ")

            placeHolder = []
            for trTag in trTags:
                tdTags = trTag.find("td", class_="result-value")
                tdTags_string = tdTags.string
                placeHolder.append(tdTags_string)

            with open("Output.csv", "ab") as f:
                writeFile = csv.writer(f)
                writeFile.writerow(placeHolder)

fileName = "SomeSiteValidURLs.csv"

if __name__ == "__main__":
    t = threading.Thread(target=crawlToCSV, args=(fileName, ))
    t.start()
    t.join()
You're not parallelizing this properly. What you actually want is for the work being done inside your for loop to happen concurrently across many workers. Right now you're moving all of the work into a single background thread, which still does everything sequentially. That's not going to improve performance at all (in fact, it will hurt it slightly).
Here's an example that uses a ThreadPool to parallelize the network operation and parsing. It's not safe to try to write to the csv file across many threads at once, so instead we return the data that would have been written back to the parent, and have the parent write all the results to the file at the end.
import urllib2
import csv
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count

def crawlToCSV(URLrecord):
    OpenSomeSiteURL = urllib2.urlopen(URLrecord)
    Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
    OpenSomeSiteURL.close()

    tbodyTags = Soup_SomeSite.find("tbody")
    trTags = tbodyTags.find_all("tr", class_="result-item ")

    placeHolder = []
    for trTag in trTags:
        tdTags = trTag.find("td", class_="result-value")
        tdTags_string = tdTags.string
        placeHolder.append(tdTags_string)

    return placeHolder

if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    pool = Pool(cpu_count() * 2)  # Creates a Pool with cpu_count * 2 threads
    with open(fileName, "rb") as f:
        results = pool.map(crawlToCSV, f)  # results is a list of all the placeHolder lists returned from each call to crawlToCSV
    with open("Output.csv", "ab") as f:
        writeFile = csv.writer(f)
        for result in results:
            writeFile.writerow(result)
Note that in Python, threads only really speed up I/O operations: because of the GIL, CPU-bound operations (like the parsing/searching BeautifulSoup is doing) can't actually be done in parallel via threads, since only one thread can execute Python bytecode at a time. So you still may not see the speedup you were hoping for with this approach. When you need to speed up CPU-bound operations in Python, you need to use multiple processes instead of threads. Luckily, it's easy to see how this script performs with multiple processes instead of multiple threads: just change "from multiprocessing.dummy import Pool" to "from multiprocessing import Pool". No other changes are required.
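To make the swap concrete, here's a minimal sketch of the only change involved, assuming the rest of the script above stays exactly as written:

# Thread-based Pool (what the example above uses) -- fine for the I/O-bound downloads:
# from multiprocessing.dummy import Pool

# Process-based Pool -- lets the CPU-bound BeautifulSoup parsing run on multiple cores:
from multiprocessing import Pool
from multiprocessing import cpu_count

One caveat with the process-based Pool: the function you pass to map has to be picklable, which in practice means it must be defined at the top level of the module. crawlToCSV above already satisfies that, so the swap really is just the import.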
Edit:
If you need to scale this up to a file with 10,000,000 lines, you're going to need to adjust this code a bit. Pool.map converts the iterable you pass to it into a list before sending it off to your workers, which obviously isn't going to work very well with a 10,000,000-entry list; having that whole thing in memory will probably bog down your system. The same issue applies to storing all the results in a list. Instead, you should use Pool.imap:
imap(func, iterable[, chunksize])
A lazier version of map().
The chunksize argument is the same as the one used by the map() method. For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.
if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    FILE_LINES = 10000000
    NUM_WORKERS = cpu_count() * 2
    # Try to get a good chunksize. You're probably going to have to tweak this,
    # though; try smaller and larger values and see how performance changes.
    chunksize = FILE_LINES // (NUM_WORKERS * 4)
    pool = Pool(NUM_WORKERS)

    with open(fileName, "rb") as f, open("Output.csv", "ab") as out:
        writeFile = csv.writer(out)
        # imap consumes f lazily and yields results as they become available,
        # so keep the input file open while we iterate over the results.
        for result in pool.imap(crawlToCSV, f, chunksize=chunksize):
            writeFile.writerow(result)
With imap, we never put all of f into memory at once, nor do we store all the results in memory at once. The most we ever have in memory is chunksize lines of f, which should be more manageable.
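If you want to see that laziness for yourself without hitting the network, here's a toy sketch (square is just a hypothetical stand-in for crawlToCSV) showing that imap hands back its first result long before the whole iterable has been consumed, whereas map would first materialize the generator into a 10,000,000-element list:

from multiprocessing.dummy import Pool

def square(n):
    # Hypothetical stand-in for crawlToCSV; no network or parsing work.
    return n * n

if __name__ == "__main__":
    pool = Pool(4)
    numbers = (n for n in xrange(10000000))  # a generator; map() would turn this into a full list

    # imap pulls items from the generator in chunks and yields results as they finish.
    result_iter = pool.imap(square, numbers, chunksize=1000)
    print next(result_iter)  # 0 -- available almost immediately
    pool.terminate()         # this was only a demonstration, so drop the remaining work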