I have a situation in some code where a huge function parses records line by line, validates them, and writes them to another file.
If there are errors in a record, it calls another function that rejects the record and writes the reject reason.
Due to a memory leak in the program, it crashes with SIGSEGV. One solution for "restarting" the file from where it crashed was to write the last processed record number to a simple file.
To achieve this, the current record number in the processing loop needs to be written to a file. How do I make sure the data in that file is overwritten on every iteration of the loop?
Does using fseek to the first position / rewind within a loop degrade performance?
The number of records can be large at times (up to 500K).
Thanks.
EDIT: The memory leak has already been fixed. The restart solution was suggested as an additional safety measure and a means to provide a restart mechanism, alongside a SKIP n records solution. Sorry for not mentioning it earlier.
When faced with this kind of problem, you can adopt one of two methods:
Method #1: after processing each record, write a bookmark (the current record number, or the ftell position on the input file) to a separate bookmark file. To ensure that you resume exactly where you left off, so as not to introduce duplicate records, you must fflush after every write (to both the bookmark and the output/reject files.) This, and unbuffered write operations in general, slows down the typical (no-failure) scenario significantly. For completeness' sake, note that you have three ways of writing to your bookmark file:

- fopen(..., "w") / fwrite / fclose - extremely slow
- rewind / truncate / fwrite / fflush - marginally faster
- rewind / fwrite / fflush - somewhat faster; you may skip the truncate since the record number (or ftell position) will always be as long as or longer than the previous record number (or ftell position), and will completely overwrite it, provided you truncate the file once at startup (this answers your original question)

Method #2: do not fflush the files, or at least not so often. You still need to fflush the main output file before switching to writing to the rejects file, and fflush the rejects file before switching back to writing to the main output file (probably a few hundred or a few thousand times for a 500k-record input.) On restart, simply remove the last unterminated line from the output/reject files; everything up to that line will be consistent.

I strongly recommend method #2. The writing entailed by method #1 (whichever of the three possibilities) is extremely expensive compared to any additional (buffered) reads required by method #2 (fflush can take several milliseconds; multiply that by 500k and you get minutes - whereas counting the number of lines in a 500k-record file takes mere seconds and, what's more, the filesystem cache is working with you, not against you, on that.)
EDIT: Just wanted to clarify the exact steps you need to take to implement method #2:
normal operation

- when writing to the output and rejects files respectively, you only need to flush when switching from writing to one file to writing to the other. These flushes-on-file-switch are necessary because, if records going to the two files are not flushed in the order they were produced, a crash could lose a buffered record in one file while later records in the other file already made it to disk - and trimming the last unterminated line would no longer restore a consistent state.
- if you are not happy with the interval between the runtime's automatic flushes, you may also do manual flushes every 100 or every 1000 records. This depends on whether processing a record is more expensive than flushing or not (if processing is more expensive, flush often, maybe after each record; otherwise only flush when switching between output/rejects.)
resuming from failure

- open each of your output and reject files and read it record by record, incrementing a counter (records_resume_counter) until you reach the end of file
- after each complete record, note the position of its end (ftell); let's call it last_valid_record_ends_here
- a record is complete only if it ends with the record terminator (\n or \r); if the last line is unterminated, fseek back to last_valid_record_ends_here and stop reading from this output/reject file; do not increment the counter; proceed to the next output or rejects file, unless you've gone through all of them
- open the input file and skip records_resume_counter records from it
- resume normal processing; everything will be appended to the output/reject files right where the complete records end (last_valid_record_ends_here) - you will have no duplicate, garbage or missing records.

If you can change the code to have it write the last processed record to a file, why can't you change it to fix the memory leak?
It seems to me to be a better solution to fix the root cause of the problem rather than treat the symptoms.
fseek() and fwrite() will degrade the performance, but nowhere near as much as an open/write/close-type operation.
I'm assuming you'll be storing the ftell() value in the second file (so you can pick up where you left off). You should always fflush() the file as well to ensure that data is written from the C runtime library down to the OS buffers. Otherwise your SEGV will ensure the value isn't up to date.