I am parsing JavaScript source files in Python 3.5. A loop checks out every commit of a GitHub repository, and the script iterates over all changed files. When a file differs between two subsequent checkouts (i.e., it changed in the commit), the script can hang on the with open(...) line for seconds, even for moderately sized (~5-8 MB) files. I have created an example script that imitates the problem:
import time

test_data = "./sample.js"
for _ in range(10):
    start1 = time.time()
    with open(file=test_data, mode="rb", buffering=1) as f:
        end1 = time.time()
        start2 = time.time()
        line_content = f.readlines()
        ## Do some processing
        end2 = time.time()
    print("Processing file {} is done.".format(test_data))
    print("Time spent on open is {0:10f}.".format(end1 - start1))
    print("Time reading is {0:10f}.".format(end2 - start2))
    with open(test_data, mode="a", encoding="utf-8") as fw:
        fw.write("test")
The sample.js file is around 7 MB. Here is the output:
Processing file ./sample.js is done.
Time spent on open is 0.000000.
Time reading is 0.017001.
Processing file ./sample.js is done.
Time spent on open is 1.683999.
Time reading is 0.013999.
Processing file ./sample.js is done.
Time spent on open is 1.651003.
Time reading is 0.012030.
Processing file ./sample.js is done.
Time spent on open is 1.638999.
Time reading is 0.014997.
Processing file ./sample.js is done.
Time spent on open is 2.282346.
Time reading is 0.013001.
Processing file ./sample.js is done.
Time spent on open is 1.701004.
Time reading is 0.011998.
Processing file ./sample.js is done.
Time spent on open is 1.689004.
Time reading is 0.012995.
Processing file ./sample.js is done.
Time spent on open is 1.707036.
Time reading is 0.012959.
Processing file ./sample.js is done.
Time spent on open is 1.701031.
Time reading is 0.012969.
Processing file ./sample.js is done.
Time spent on open is 1.653999.
Time reading is 0.019003.
I have tried using Process from multiprocessing, calling the garbage collector manually, and using ExitStack from contextlib, but nothing helped.
Any idea what could cause this behaviour?
EDIT: It seems the problem is Windows-specific (at least, it was not nearly as significant on Linux and macOS).
Your OS is the culprit!
This is why the multiprocessing documentation includes a dedicated paragraph for Windows in its Programming Guidelines. I highly recommend reading the Programming Guidelines, as they already contain all the information required to write portable multiprocessing code.
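The key Windows-specific point in those guidelines is that a spawned child process re-imports the main module, so any code that starts processes must sit under an if __name__ == "__main__": guard, or the script will recursively spawn itself. A minimal sketch of that pattern (the worker function and the use of the script's own file as input are illustrative, not part of the original code):

```python
import multiprocessing


def count_lines(path):
    # Worker task: count lines in a file, read in binary mode.
    with open(path, mode="rb") as f:
        return len(f.readlines())


def worker(path, queue):
    # Runs in the child process; sends the result back via the queue.
    queue.put(count_lines(path))


if __name__ == "__main__":
    # On Windows, multiprocessing spawns a fresh interpreter that re-imports
    # this module, so process creation must live under this guard.
    queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(__file__, queue))
    p.start()
    print("line count:", queue.get())
    p.join()
```

Without the guard this runs fine on Linux (which forks), but fails or spawns endlessly on Windows, which is exactly the kind of portability trap the guidelines warn about.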