
Python3 can't pickle _thread.RLock objects on list with multiprocessing

I'm trying to parse websites that contain car properties (154 kinds of properties). I have a huge list (named liste_test) that consists of 280,000 used-car listing URLs.

def araba_cekici(liste_test,headers,engine):
    for link in liste_test:
        try:
            page = requests.get(link, headers=headers)
        .....
        .....

When I run my code like this:

araba_cekici(liste_test,headers,engine)

It works and returns results, but in about 1 hour I could only obtain the properties for 1,500 URLs. That is very slow, so I need to use multiprocessing.

I found a multiprocessing solution on here and applied it to my code, but unfortunately it doesn't work.

import sys

import numpy as np
import multiprocessing as multi

def chunks(n, page_list):
    """Splits the list into n chunks"""
    return np.array_split(page_list,n)

cpus = multi.cpu_count()

workers = []   
page_bins = chunks(cpus, liste_test)


for cpu in range(cpus):
    sys.stdout.write("CPU " + str(cpu) + "\n")
    # Process that will send the corresponding list of pages
    # to the function araba_cekici
    worker = multi.Process(name=str(cpu), 
                           target=araba_cekici, 
                           args=(page_bins[cpu],headers,engine))
    worker.start()
    workers.append(worker)

for worker in workers:
    worker.join()

And it gives:

TypeError: can't pickle _thread.RLock objects

I found several answers about this error, but none of them work (at least I couldn't apply them to my code). I also tried a Python multiprocessing Pool, but it hangs in Jupyter Notebook and the code seems to run forever.

Orhan Solak asked May 17 '18 12:05


1 Answer

Late answer, but since this question turns up when searching on Google: multiprocessing has to pickle whatever it sends to the worker processes (for example, the arguments of a Process under the spawn start method, or tasks handed to a Pool), so all data/objects sent must be picklable.
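If you are not sure which argument is the culprit, one quick check is to try pickling each of them yourself. The sketch below uses only the standard library; is_picklable is a hypothetical helper name, and the dict and the RLock are stand-ins for your real headers and engine, which the question doesn't show.

```python
import pickle
import threading


def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise (hypothetical helper)."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True


# A plain dict of strings, like typical HTTP request headers, pickles fine.
print(is_picklable({"User-Agent": "Mozilla/5.0"}))  # True

# A lock (or any object holding one) does not.
print(is_picklable(threading.RLock()))  # False
```

Running the same check on your own headers and engine should show which one triggers the TypeError.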

In your code, you pass headers and engine, whose implementations you don't show. (Since headers holds the HTTP request headers, I suspect that engine is the issue here.) To solve your issue, you either have to make engine picklable or instantiate engine only within the worker process.
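A minimal sketch of the second option, under some assumptions: FakeEngine and make_engine are hypothetical stand-ins for however your real engine is built, headers is assumed to be a plain dict, and the loop body is a placeholder for your actual request and parsing code. The point is only that the worker receives picklable arguments and builds the unpicklable object itself.

```python
import multiprocessing as multi
import threading


class FakeEngine:
    """Hypothetical stand-in for `engine`; it holds a lock, so it cannot be pickled."""

    def __init__(self):
        self._lock = threading.RLock()
        self.rows = []

    def save(self, row):
        with self._lock:
            self.rows.append(row)


def make_engine():
    """Hypothetical factory; replace with however your real engine is created."""
    return FakeEngine()


def araba_cekici(links, headers):
    # The engine is created here, inside the worker, so it is never pickled.
    engine = make_engine()
    for link in links:
        # Placeholder for the real requests.get(...) call and parsing logic.
        engine.save((link, headers.get("User-Agent")))
    return len(engine.rows)


def run_parallel(links, headers, n_workers):
    # Only a list slice and a dict of strings are sent to each process.
    bins = [links[i::n_workers] for i in range(n_workers)]
    workers = [
        multi.Process(target=araba_cekici, args=(chunk, headers))
        for chunk in bins
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
```

You would then call something like run_parallel(liste_test, headers, multi.cpu_count()), ideally under an `if __name__ == "__main__":` guard so that the spawn start method does not re-run the launching code in each child.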

IonicSolutions answered Sep 20 '22 16:09