
How to assign a separate Python requests session to each process in a multiprocessing pool?

Consider the following code example:

import multiprocessing
import requests

session = requests.Session()
data_to_be_processed = [...]

def process(arg):
    # do stuff with arg and get url
    response = session.get(url)
    # process response and generate data...
    return data

with multiprocessing.Pool() as pool:
    results = pool.map(process, data_to_be_processed)

In this example, the Session is assigned to a global variable, so after the Pool creates its processes it will be copied into each subprocess. I am not sure whether a session is thread-safe or how connection pooling in a session works, so I would like to assign a separate session object to each process in the pool.

I am aware that I could just use requests.get(url) instead of session.get(url), but I would like to work with a session, and I am also considering using requests-html (https://html.python-requests.org/).

I am not very familiar with Python's multiprocessing. So far I have used only Pool, because it seemed the best way to process data in parallel without a critical section, so I am open to different solutions.

Is there a clean and straightforward way to do it?

Nixwill asked Nov 17 '22 00:11


1 Answer

Short answer: you can use the global namespace to share data between the initializer and the worker function:

import multiprocessing
import requests

session = None
data_to_be_processed = [...]

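def init_process():
    # Runs once in every worker process and rebinds that worker's global.
    global session
    session = requests.Session()

def process(arg):
    # Reading a global needs no `global` statement; only rebinding does.
    # do stuff with arg and get url
    response = session.get(url)
    # process response and generate data...
    return data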

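# The guard keeps worker processes from re-running this block when the
# module is imported in a child (spawn and forkserver start methods).
if __name__ == "__main__":
    with multiprocessing.Pool(initializer=init_process) as pool:
        results = pool.map(process, data_to_be_processed)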

Long answer: Python starts child processes with one of three start methods (spawn, fork, or forkserver). All of them keep in-memory objects separate between the parent process and its child processes. In our case this means that changes to the global namespace made in processes run by Pool() will not propagate back to the parent process or to sibling processes.

For object destruction we can rely on the garbage collector, which steps in once a child process finishes its work. Because multiprocessing.Pool() offers no explicit per-worker cleanup hook, it cannot be used with objects that the GC cannot destroy on its own (like the Pool() itself; see the warning in the multiprocessing documentation). Judging from the requests docs, it is perfectly OK to use requests.Session without an explicit close() call.

Timofey Chernousov answered Dec 18 '22 15:12
