I'm working on a project to parse multiple XML files concurrently in Python using lxml. When I start a worker process, I want my main class to do some work on the XML before it passes the etree object to that process. However, when the etree object arrives in the new process, the object itself survives but the XML inside it is gone, and getroot() returns None.
I know that I can only pass picklable data through a queue, but does the same restriction apply to what I pass to the process in the 'args' field?
Here's my code:
import multiprocessing, multiprocessing.pool, time
from lxml import etree

def compute(tree):
    print("Start Process")
    print(type(tree)) # Returns <class 'lxml.etree._ElementTree'>
    print(id(tree)) # Returns new ID 44637320 as expected
    print(tree.getroot()) # Returns None

def pool_init(queue):
    # see http://stackoverflow.com/a/3843313/852994
    compute.queue = queue

class Main():
    def __init__(self):
        pass

    def main(self):
        tree = etree.parse('test.xml')
        print(id(tree)) # Returns object ID 43998536
        print(tree.getroot()) # Returns <Element SymCLI_ML at 0x29f5dc8>
        self.queue = multiprocessing.Queue()
        self.pool = multiprocessing.Pool(processes=1, initializer=pool_init, initargs=(self.queue,))
        self.pool.apply_async(func=compute, args=(tree,))
        time.sleep(10)

if __name__ == '__main__':
    Main().main()
Any and all help much appreciated.
UPDATE/ANSWER
Based on the answer below, I modified the approach slightly and got it working with a much lower memory footprint and without using StringIO. The etree.tostring method returns a bytes object, which can be pickled; to unpickle it, etree parses those bytes again.
import multiprocessing, multiprocessing.pool, time, copyreg
from io import BytesIO
from lxml import etree

def compute(tree):
    print("Start Process")
    print(type(tree)) # Returns <class 'lxml.etree._ElementTree'>
    print(tree.getroot()) # Returns <Element SymCLI_ML at 0x29f5dc8>. Success!

def pool_init(queue):
    # see http://stackoverflow.com/a/3843313/852994
    compute.queue = queue

def elementtree_unpickler(data):
    # Rebuild the tree from the serialized bytes in the receiving process
    return etree.parse(BytesIO(data))

def elementtree_pickler(tree):
    # Serialize the tree to bytes, which pickle can handle
    return elementtree_unpickler, (etree.tostring(tree),)

copyreg.pickle(etree._ElementTree, elementtree_pickler, elementtree_unpickler)

class Main():
    def __init__(self):
        pass

    def main(self):
        tree = etree.parse('test.xml')
        print(tree.getroot()) # Returns <Element SymCLI_ML at 0x29f5dc8>
        self.queue = multiprocessing.Queue()
        self.pool = multiprocessing.Pool(processes=1, initializer=pool_init, initargs=(self.queue,))
        self.pool.apply_async(func=compute, args=(tree,))
        time.sleep(10)

if __name__ == '__main__':
    Main().main()
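A variation on the same idea, as a minimal sketch rather than the code above: instead of registering a pickler, serialize the root element to bytes in the parent and rebuild the tree in the worker. The helper names to_payload and from_payload are hypothetical, and the fallback to the standard library's ElementTree is an assumption added only so the sketch runs where lxml is not installed. Note that serializing only the root element drops anything outside it, such as a DOCTYPE.

```python
import pickle

try:
    from lxml import etree  # the library used in this post
except ImportError:
    # Assumption: fall back to the standard library so the sketch still runs
    import xml.etree.ElementTree as etree


def to_payload(tree):
    """Serialize the tree's root element to bytes (hypothetical helper)."""
    return etree.tostring(tree.getroot())


def from_payload(data):
    """Rebuild an ElementTree from serialized bytes (hypothetical helper)."""
    return etree.ElementTree(etree.fromstring(data))


def compute(data):
    """Worker function: receives bytes, parses them, returns the root tag."""
    tree = from_payload(data)
    return tree.getroot().tag


if __name__ == "__main__":
    tree = etree.ElementTree(etree.fromstring(b"<SymCLI_ML><Item>1</Item></SymCLI_ML>"))
    # Simulate what multiprocessing does with each argument: pickle, then unpickle
    payload = pickle.loads(pickle.dumps(to_payload(tree)))
    print(compute(payload))
```

With a pool, the call would then look like pool.apply_async(compute, args=(to_payload(tree),)), so only plain bytes cross the process boundary.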
UPDATE 2
After benchmarking memory usage, I found that passing large objects prevents them from being reclaimed by garbage collection in the main process. This probably isn't an issue at small scale, but my etree objects were on the order of several hundred MB in memory. Once an async task has been called with an XML object as an argument, that object cannot be cleared from memory when it is deleted in the main process, even by manually invoking garbage collection. As a result, I've reverted to closing the XML in the main process and passing the file name to the sub-process.
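A minimal sketch of that file-name approach, assuming the worker only needs to read the file. The function name parse_all and the returned root tag are illustrative, not from my actual project, and the standard-library fallback is an assumption added so the sketch runs without lxml.

```python
import multiprocessing

try:
    from lxml import etree  # the library used in this post
except ImportError:
    # Assumption: fall back to the standard library so the sketch still runs
    import xml.etree.ElementTree as etree


def compute(path):
    """Parse the file inside the worker, so the tree never lives in the parent."""
    tree = etree.parse(path)
    return tree.getroot().tag


def parse_all(paths, processes=2):
    """Send only file names to the pool; each one is a small, picklable string."""
    with multiprocessing.Pool(processes=processes) as pool:
        return pool.map(compute, paths)
```

Calling parse_all(['test.xml']) from under the if __name__ == '__main__': guard would return a list with one root tag per file. Any pre-processing that the main class used to do on the tree has to move into compute or be written back to disk first.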
The lxml.etree module implements the extended ElementTree API for XML.
lxml is an XML toolkit for Python that provides a Pythonic binding for two C libraries: libxslt and libxml2. It combines the XML feature set of those libraries with their speed.
lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).
Use the following code to register simple picklers/unpicklers for lxml Element/ElementTree objects. I used that in the past with lxml and zmq.
# Written for Python 2. In Python 3, copy_reg is named copyreg,
# and io.BytesIO takes the place of StringIO here.
import copy_reg
try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO
from lxml import etree

def element_unpickler(data):
    return etree.fromstring(data)

def element_pickler(element):
    data = etree.tostring(element)
    return element_unpickler, (data,)

copy_reg.pickle(etree._Element, element_pickler, element_unpickler)

def elementtree_unpickler(data):
    data = StringIO(data)
    return etree.parse(data)

def elementtree_pickler(tree):
    data = StringIO()
    tree.write(data)
    return elementtree_unpickler, (data.getvalue(),)

copy_reg.pickle(etree._ElementTree, elementtree_pickler, elementtree_unpickler)