I mistakenly had two scripts running at the same time, both writing a pandas DataFrame in chunks to the same CSV file. Since the CSV file was meant to be appended to, neither script refuses to write to it when it already exists. I didn't catch the mistake until it was too late.
Kinda like this:
script1.py
for i, chunk in enumerate(datachunks):
    result_df = do_something(chunk)
    # write mode for the first chunk, append mode for the rest
    result_df.to_csv('csvfile.csv', mode='w' if i == 0 else 'a', header=(i == 0))
script2.py
for i, chunk in enumerate(datachunks2):
    result_df = do_something(chunk)
    # write mode for the first chunk, append mode for the rest
    result_df.to_csv('csvfile.csv', mode='w' if i == 0 else 'a', header=(i == 0))  # should have been csvfile2.csv
Each script takes around 12 hours to run because of the sheer volume of data it processes, so I think it would be faster to split the combined CSV file into two and recover the output each script should have produced, rather than rerun everything. This should work unless the file contains unintended duplicates or rows that never got written.
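If it helps, here is roughly how I picture splitting the file. This is just a sketch, assuming each script's rows can be told apart by a key column; the 'id' column and the id sets below are hypothetical and would have to be rebuilt from the original input chunks.

import pandas as pd

# Sketch only: assumes the combined file has an 'id' column (hypothetical)
# whose values indicate which script a row came from.
combined = pd.read_csv('csvfile.csv')

# Hypothetical: the ids each script was supposed to produce.
ids_script1 = {1, 2, 3}
ids_script2 = {4, 5, 6}

# Keep only the rows belonging to each script and write them back out.
combined[combined['id'].isin(ids_script1)].to_csv('csvfile_script1.csv', index=False)
combined[combined['id'].isin(ids_script2)].to_csv('csvfile2.csv', index=False)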
Both scripts finished without any errors, if that's relevant.
Is there any chance of duplicates or missing data in this csvfile.csv?
I decided to just rerun the scripts and compare the outputs. It doesn't look promising: I lost a lot of rows.
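For reference, this is the kind of comparison I ran. Again just a sketch, assuming the rerun outputs are in csvfile_rerun1.csv and csvfile_rerun2.csv (made-up names) and that rows can be compared as whole-row matches.

import pandas as pd

# Sketch: compare the damaged combined file against the rerun outputs.
combined = pd.read_csv('csvfile.csv')
expected = pd.concat([pd.read_csv('csvfile_rerun1.csv'),
                      pd.read_csv('csvfile_rerun2.csv')])

# Rows the reruns produced that never made it into the combined file.
missing = expected.merge(combined.drop_duplicates(), how='left', indicator=True)
missing = missing[missing['_merge'] == 'left_only']

# Rows that appear more than once in the combined file.
dupes = combined[combined.duplicated(keep=False)]

print(f"missing rows: {len(missing)}, duplicated rows: {len(dupes)}")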