
Random corruption in file created/updated from shell script by a single client on an NFS mount

We have a bash script (job wrapper) that writes to a file, launches a job, and at job completion appends information about the job to that file. The wrapper runs on one of several thousand batch nodes, but the problem has only cropped up on a handful of batch machines (RHEL6, I believe) accessing one NFS server, plus at least one known instance of a different batch job on a different batch node using a different NFS server. In all cases, only one client host is writing to the files in question. Some jobs take hours to run, others minutes.

In the same time period, this has occurred in roughly 10-50 of 100,000+ jobs.

Here is what I believe is effectively a distilled version of the job wrapper:

#!/bin/bash
## cwd is /nfs/path/to/jobwd
## This file is /nfs/path/to/jobwd/job_wrapper

gotEXIT()
{
    ## end of script, however gotEXIT is called because we trap EXIT
    END="EndTime: `date`\nStatus: Ended"
    echo -e "${END}" >> job_info
    cat job_info | sendmail [email protected]
}
trap gotEXIT EXIT

function jobSetVar { echo "job.$1: $2" >> job_info; }
export -f jobSetVar

MSG="${email_metadata}\n${job_metadata}"
echo -e "${MSG}\nStatus: Started" | sendmail [email protected]
echo -e "${MSG}" > job_info

## At the job's end, the output from the `time` command is the first non-corrupt data in job_info
/usr/bin/time -f "Elapsed: %e\nUser: %U\nSystem: %S" -a -o job_info job_command

## 10-360 minutes later…
RC=$?
echo -e "ExitCode: ${RC}" >> job_info
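
In case the long-lived append handle on the NFS file turns out to be involved, here is a sketch of a defensive variant (the `append_summary` helper is hypothetical, not part of the original wrapper): have `time` write its summary to a local temp file, then append it to job_info in one short-lived write.

```shell
#!/bin/bash
# Sketch only: collect the `time` summary on local disk, then append
# it to the NFS-backed job_info in a single short write, so no process
# holds job_info open for the hours the job runs.
append_summary() {
    local tmp=$1 target=$2
    cat "$tmp" >> "$target"   # one short append to the NFS file
    rm -f "$tmp"
}

# Intended use (job_command is a placeholder for the real job):
#   tmp=$(mktemp)   # mktemp defaults to local /tmp, not NFS
#   /usr/bin/time -f "Elapsed: %e\nUser: %U\nSystem: %S" -o "$tmp" job_command
#   RC=$?
#   append_summary "$tmp" job_info
#   echo -e "ExitCode: ${RC}" >> job_info
```

This keeps the wrapper's behavior (same format string, same append order) while shortening the window in which any process has job_info open.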

So I think there are two possibilities:

  1. echo -e "${MSG}" > job_info
    This command writes out corrupt data.

  2. /usr/bin/time -f "Elapsed: %e\nUser: %U\nSystem: %S" -a -o job_info job_command
    This corrupts the existing data, then outputs its own data correctly.

However, some jobs, but not all, call jobSetVar, and its output doesn't end up corrupt.

So, I dug into time.c (from GNU time 1.7) to see when the file is opened. To summarize, time.c is effectively this:

FILE *outfp;

int main (int argc, char **argv) {
  const char **command_line;
  RESUSE res;

  /* internally, getargs opens "job_info", so outfp = fopen ("job_info", "a") */
  command_line = getargs (argc, argv);
  /* run_command doesn't care about outfp */
  run_command (command_line, &res);
  /* internally, summarize calls fprintf and putc on the outfp FILE pointer */
  summarize (outfp, output_format, command_line, &res);
  fflush (outfp);
}

So, time keeps FILE *outfp (the job_info handle) open for the entire duration of the job. It writes the summary at the end of the job and then doesn't appear to ever close the file (I'm not sure whether that matters after fflush). I have no clue whether bash also holds the file handle open concurrently.

EDIT:

A corrupted file typically consists of the corrupted part followed by the non-corrupted part. The corrupted section, which comes first, is largely a run of 0x00 bytes, sometimes with cyclic garbage mixed in.

Here's an example hexdump:

40000000 00000000 00000000 00000000 
00000000 00000000 C8B450AC 772B0000
01000000 00000000 C8B450AC 772B0000
[ 361 x 0x00] 

Then, at the 409th byte, it continues with the non-corrupted section:

Elapsed: 879.07
User: 0.71
System: 31.49
ExitCode: 0
EndTime: Fri Dec  6 15:29:27 PST 2013
Status: Ended
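
For what it's worth, that shape (a run of NULs up to a fixed offset, then intact text) is exactly what a file hole looks like: if the data from the initial write were lost but a later writer still wrote at the old end-of-file offset, the missing range would read back as zeros. A throwaway illustration of the byte pattern only (demo_info is a scratch file, and this is not a claim about the cause):

```shell
# Illustration only: writing at an offset past end-of-file leaves a
# hole, which reads back as NUL bytes -- the same shape as the corrupt
# job_info above (NULs up to a fixed offset, intact text from there on).
rm -f demo_info
printf 'Elapsed: 879.07\n' |
    dd of=demo_info bs=1 seek=408 conv=notrunc 2>/dev/null
od -An -c demo_info | head -2   # leading bytes read back as \0
```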

Another file looks like this:

01000000 04000000 805443FC 9D2B0000 E04144FC 9D2B0000 E04144FC 9D2B0000
[96 x 0x00]
[Repeat above 3 times ]
01000000 04000000 805443FC 9D2B0000 E04144FC 9D2B0000 E04144FC 9D2B0000

Followed by the non corrupted section:

Elapsed: 12621.27
User: 12472.32
System: 40.37
ExitCode: 0
EndTime: Thu Nov 14 08:01:14 PST 2013
Status: Ended

There are other files whose corrupt sections are much more random, but more than a few were cyclic, similar to the above.
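
Since all of the corrupt files start with NUL bytes, a simple first-byte scan is enough to flag them after the fact. A sketch (`check_nul` is a hypothetical helper; it assumes valid job_info files start with printable text):

```shell
# Sketch: flag files exhibiting the corruption above, i.e. files whose
# first byte is NUL. Assumes standard coreutils (head, od, tr).
check_nul() {
    [ "$(head -c 1 "$1" | od -An -tx1 | tr -d ' ')" = "00" ]
}

# Example scan (the path pattern is illustrative):
#   for f in /nfs/path/to/*/job_info; do
#       check_nul "$f" && echo "CORRUPT: $f"
#   done
```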

EDIT 2: The first email, sent by the echo -e statement, goes through fine, so MSG isn't corrupt at that point. The last email is never sent, because the corruption destroys the email metadata in job_info. Presumably job_info isn't corrupt at the start either, but we haven't been able to verify that yet.

This is a production system which hasn't had major code modifications, and I have verified through an audit that no jobs were run concurrently that would touch this file. The problem seems to be somewhat recent (the last 2 months), but it's possible it happened before and slipped through. The error prevents reporting, which means jobs are considered failed and are typically resubmitted, but one user in particular has ~9-hour jobs for which this is especially frustrating.

I wish I could come up with more info or a way of reproducing this at will, but I was hoping somebody has maybe seen a similar problem, especially recently. I don't manage the NFS servers, but I'll try to talk to the admins to see what the NFS servers (RHEL6, I believe) were running at the time of these issues.

Asked Nov 11 '22 by Brian

1 Answer

Well, the emails corresponding to the corrupt job_info files should tell you what was in MSG (which will probably be business as usual).

You may want to check how NFS is being run: there's a remote possibility that you are running NFS over UDP without checksums. That could explain some corrupt data. I also hear that UDP/TCP checksums are not strong enough and data can still end up corrupt -- maybe you are hitting such a problem (I have seen corrupt packets slip through a network stack at least once before, and I'm quite sure some checksumming was going on). Presumably MSG goes out as a single packet, and there might be something about it that makes checksum collisions with the garbage you see more likely.

Of course it could also be an NFS bug (client or server), a server-side filesystem bug, a busted piece of RAM... the possibilities are almost endless here (although I see how the fact that it's always MSG that gets corrupted makes some of those quite unlikely). The problem might be related to seeking (which happens during the append). You could also have a bug somewhere else in the system causing multiple clients to open the same job_info file, making it a jumble.
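
To check the transport point, the mount options in /proc/mounts (or `nfsstat -m` on RHEL6) show whether `proto=udp` or `proto=tcp` is in use. A small illustrative helper to pull the value out of an options string (the helper name and the sample strings are made up for the example):

```shell
# Illustrative helper: extract the proto= option from an NFS mount
# options string as it appears in /proc/mounts.
nfs_proto() {
    echo "$1" | tr ',' '\n' | sed -n 's/^proto=//p'
}

# On a live system, the options come from /proc/mounts, e.g.:
#   awk '$3 ~ /^nfs/ {print $2, $4}' /proc/mounts
```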

Answered Nov 15 '22 by mornfall