
Why does calling the Tesseract process cause this service to crash randomly?

I have a .NET Core 2.1 service which runs on an Ubuntu 18.04 VM and calls Tesseract OCR 4.00 via a Process instance. I would like to use an API wrapper, but I could only find one available and it is only in beta for the latest version of Tesseract -- the stable wrapper uses version 3 instead of 4. In the past, this service worked well enough, but I have been changing it so that document/image data is written and read from disk less frequently in an attempt to improve speed. The service used to call many more external processes (such as ImageMagick) which were unnecessary due to the presence of an API, so I have been replacing those with API calls.

Recently I've been testing this with a sample file taken from real data. It's a faxed PDF document with 133 pages, yet it is only 5.8 MB because of its grayscale color and resolution. The service takes a document, splits it into individual pages, then uses Parallel.For to assign multiple threads (one thread per page) that call Tesseract and process the pages. The thread limits are configurable. I am aware that Tesseract has its own multithreading environment variable (OMP_THREAD_LIMIT). I found in prior testing that setting it to "1" is ideal for our setup at the moment, but in my recent testing for this issue I have tried leaving it unset (dynamic value) with no improvement.

The issue is that unpredictably, when Tesseract is called, the service will hang for about a minute and then crash, with the only error showing in journalctl being:

dotnet[32328]: Error while reaping child. errno = 10
dotnet[32328]:    at System.Environment.FailFast(System.String, System.Exception)
dotnet[32328]:    at System.Environment.FailFast(System.String)
dotnet[32328]:    at System.Diagnostics.ProcessWaitState.TryReapChild()
dotnet[32328]:    at System.Diagnostics.ProcessWaitState.CheckChildren(Boolean)
dotnet[32328]:    at System.Diagnostics.Process.OnSigChild(Boolean)

I can't find anything at all online for this particular error. It would seem to me, based on related research I've done on the Process class, that this is occurring when the process is exiting and dotnet is trying to clean up the resources it was using. I'm really at a loss as to how to even approach this problem, although I have tried a number of "guesses" such as changing thread limit values. There is no cross-over between threads. Each thread has its own partition of pages (based on how Parallel.For partitions a collection) and it sets to work on those pages, one at a time.

Here is the process call, called from within multiple threads (8 is the limit we normally set):

private bool ProcessOcrPage(IMagickImage page, int pageNumber, object instanceId)
        {
            var inputPageImagePath = Path.Combine(_fileOps.GetThreadWorkingDirectory(instanceId), $"ocrIn_{pageNumber}.{page.Format.ToString().ToLower()}");
            string outputPageFilePathWithoutExt = Path.Combine(_fileOps.GetThreadOutputDirectory(instanceId),
                    $"pg_{pageNumber.ToString().PadLeft(3, '0')}");
            page.Write(inputPageImagePath);

            var cmdArgs = $"-l eng \"{inputPageImagePath}\" \"{outputPageFilePathWithoutExt}\" pdf";
            bool success;

            _logger.LogStatement($"[Thread {instanceId}] Executing the following command:{Environment.NewLine}tesseract {cmdArgs}", LogLevel.Debug);

            var psi = new ProcessStartInfo("tesseract", cmdArgs)
            {
                RedirectStandardError = true,
                RedirectStandardOutput = true,
                UseShellExecute = false,
                CreateNoWindow = true
            };

            // 0 is not the default value for this environment variable. It should remain unset if there 
            // is no config value, as it is determined dynamically by default within OpenMP.
            if (_processorConfig.TesseractThreadLimit > 0) 
                psi.EnvironmentVariables.Add("OMP_THREAD_LIMIT", _processorConfig.TesseractThreadLimit.ToString());

            using (var p = new Process() { StartInfo = psi })
            {
                string standardErr, standardOut;
                int exitCode;
                p.Start();
                standardOut = p.StandardOutput.ReadToEnd();
                standardErr = p.StandardError.ReadToEnd();        
                p.WaitForExit();
                exitCode = p.ExitCode;

                if (!string.IsNullOrEmpty(standardOut))
                    _logger.LogStatement($"Tesseract stdOut:\n{standardOut}", LogLevel.Debug, nameof(ProcessOcrPage));
                if (!string.IsNullOrEmpty(standardErr))
                    _logger.LogStatement($"Tesseract stdErr:\n{standardErr}", LogLevel.Debug, nameof(ProcessOcrPage));
                success = p.ExitCode == 0;
            }

            return success;
        }

EDIT 4: After much testing and discussion with Clint in chat, here is what we learned. The error is raised from a Process event "OnSigChild," that much is obvious from the stack trace, but there is no way to hook into the same event that raises this error. The process never times out given a timeout of 10 seconds (Tesseract typically only takes a few seconds to process a given page). Curiously, if the process timeout is removed and I wait on the standard output and error streams to close, it will hang for a good 20-30 seconds, but the process does not appear in ps auxf during this hang time. From the best that I can tell, Linux is able to determine that the process is done executing, but .NET is not. Otherwise, the error seems to be raised at the very moment that the process is done executing.

The most baffling thing to me is still that the process handling part of the code really hasn't changed very much compared to the working version of this code we have in production. This suggests that it's an error I made somewhere, but I am simply unable to find it. I think I will have to open up an issue on the dotnet GitHub tracker.

Lucas Leblanc asked Oct 15 '22 05:10



1 Answer

"Error while reaping child"

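A terminated process keeps holding some kernel resources (it stays a zombie) until its parent collects its exit status, which is known as reaping the child. On Unix, when the parent dies first, the orphaned child is adopted by the init process, which then becomes responsible for reaping it. .NET Core reaps child processes as soon as they terminate. On Linux, errno 10 is ECHILD ("No child processes"), which suggests that the runtime tried to reap a child that was no longer available to wait on.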

"I have discovered that removing the stdout and stderr stream ReadToEnd calls causes the processes to end immediately instead of hang, with the same error"

The error occurs because p.ExitCode is being read before the process has finished; the ReadToEnd calls only delay that read.
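The ordering described above can be sketched as follows. This is a minimal illustration, not the service's actual code: ProcessSketch and RunAndGetExitCode are hypothetical names, and it assumes a plain synchronous caller. It drains both redirected streams, waits for the process to exit, and only then reads ExitCode.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class ProcessSketch
{
    // Hypothetical helper (not part of the original service).
    public static int RunAndGetExitCode(string fileName, string arguments,
                                        out string stdOut, out string stdErr)
    {
        var psi = new ProcessStartInfo(fileName, arguments)
        {
            RedirectStandardOutput = true,
            RedirectStandardError = true,
            UseShellExecute = false
        };

        using (var p = Process.Start(psi))
        {
            // Read stderr asynchronously so that a full stderr pipe cannot
            // block the child while stdout is being read synchronously.
            Task<string> errTask = p.StandardError.ReadToEndAsync();
            stdOut = p.StandardOutput.ReadToEnd();
            stdErr = errTask.Result;

            // Read ExitCode only after the process has actually exited.
            p.WaitForExit();
            return p.ExitCode;
        }
    }
}
```

As a usage example, calling RunAndGetExitCode("sh", "-c \"echo out; exit 3\"", out var o, out var e) on Linux runs a small stand-in shell command instead of Tesseract and returns the shell's exit code.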

Summary of updated code

  • StartInfo.FileName should point to the file that you want to start.
  • Set UseShellExecute to false if the process should be created directly from the executable file, and to true if the shell should be used to start the process.
  • Added asynchronous read operations on the standard output and error streams.
  • Added AutoResetEvents to signal when the output and error read operations complete.
  • Process.Close() releases the resources.
  • It is easier to set and use the ArgumentList property than the Arguments property.
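The last point can be sketched like this. It is an illustration under assumptions: TesseractArgs and Build are hypothetical names, and the arguments mirror the "-l eng <input> <output> pdf" command from the question. ProcessStartInfo.ArgumentList is available from .NET Core 2.1 onward and passes each entry as a separate argument, so paths containing spaces need no manual quoting.

```csharp
using System.Diagnostics;

public static class TesseractArgs
{
    // Hypothetical helper: builds the start info for one page.
    public static ProcessStartInfo Build(string inputPath, string outputBase)
    {
        var psi = new ProcessStartInfo("tesseract")
        {
            RedirectStandardOutput = true,
            RedirectStandardError = true,
            UseShellExecute = false
        };

        // Each Add call is one argument; no escaping of quotes or spaces is needed.
        psi.ArgumentList.Add("-l");
        psi.ArgumentList.Add("eng");
        psi.ArgumentList.Add(inputPath);
        psi.ArgumentList.Add(outputBase);
        psi.ArgumentList.Add("pdf");
        return psi;
    }
}
```

Note that Arguments and ArgumentList should not both be set on the same ProcessStartInfo; pick one of the two.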

Redhat Blog on NetProcess on Linux

Revised Module

private bool ProcessOcrPage(IMagickImage page, int pageNumber, object instanceId)
{
    StringBuilder output = new StringBuilder();
    StringBuilder error = new StringBuilder();
    int exitCode;
    var inputPageImagePath = Path.Combine(_fileOps.GetThreadWorkingDirectory(instanceId), $"ocrIn_{pageNumber}.{page.Format.ToString().ToLower()}");
    string outputPageFilePathWithoutExt = Path.Combine(_fileOps.GetThreadOutputDirectory(instanceId),
        $"pg_{pageNumber.ToString().PadLeft(3, '0')}");
    page.Write(inputPageImagePath);

    var cmdArgs = $"-l eng \"{inputPageImagePath}\" \"{outputPageFilePathWithoutExt}\" pdf";
    bool success;

    _logger.LogStatement($"[Thread {instanceId}] Executing the following command:{Environment.NewLine}tesseract {cmdArgs}", LogLevel.Debug);


    using (var outputWaitHandle = new AutoResetEvent(false))
    using (var errorWaitHandle = new AutoResetEvent(false))
    {
        try
        {
            using (var process = new Process())
            {
                process.StartInfo = new ProcessStartInfo
                {
                    WindowStyle = ProcessWindowStyle.Hidden,
                    FileName = "tesseract", // On Ubuntu the executable is "tesseract"; verify that it is on the PATH
                    RedirectStandardOutput = true,
                    RedirectStandardError = true,
                    UseShellExecute = false,
                    CreateNoWindow = true,
                    Arguments = cmdArgs,
                    // Assumption: run from the directory that contains the input page image
                    WorkingDirectory = Path.GetDirectoryName(inputPageImagePath)
                };



                if (_processorConfig.TesseractThreadLimit > 0) 
                    process.StartInfo.EnvironmentVariables.Add("OMP_THREAD_LIMIT", _processorConfig.TesseractThreadLimit.ToString());


                process.OutputDataReceived += (sender, e) =>
                {
                    if (e.Data == null)
                    {
                        outputWaitHandle.Set();
                    }
                    else
                    {
                        output.AppendLine(e.Data);
                    }
                };
                process.ErrorDataReceived += (sender, e) =>
                {
                    if (e.Data == null)
                    {
                        errorWaitHandle.Set();
                    }
                    else
                    {
                        error.AppendLine(e.Data);
                    }
                };

                process.Start();

                process.BeginOutputReadLine();
                process.BeginErrorReadLine();

                // Wait for the process to exit first, then for both streams to be fully drained.
                // ProcessTimeOutMiliseconds is assumed to be a field or constant defined elsewhere in the class.
                bool finished = process.WaitForExit(ProcessTimeOutMiliseconds)
                    && outputWaitHandle.WaitOne(ProcessTimeOutMiliseconds)
                    && errorWaitHandle.WaitOne(ProcessTimeOutMiliseconds);

                if (!finished)
                {
                    // Cancel the read operations if the process is still writing after the timeout;
                    // this prevents an ObjectDisposedException in the event handlers
                    process.CancelOutputRead();
                    process.CancelErrorRead();

                    _logger.LogStatement("Tesseract timed out", LogLevel.Debug, nameof(ProcessOcrPage));

                    // Release the resources allocated for the Process
                    process.Close();
                    return false;
                }

                // Safe to read now: the process has exited and both streams are closed
                exitCode = process.ExitCode;

                if (output.Length > 0)
                    _logger.LogStatement($"Tesseract stdOut:\n{output}", LogLevel.Debug, nameof(ProcessOcrPage));
                if (error.Length > 0)
                    _logger.LogStatement($"Tesseract stdErr:\n{error}", LogLevel.Debug, nameof(ProcessOcrPage));

                process.Close();

                return exitCode == 0;
            }
        }
        catch
        {
            // Handle the exception (for example, log it), then report the page as failed
            return false;
        }
    }
}
Clint answered Nov 15 '22 06:11
