Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How Can a Stack Trace Point to the Wrong Line (the "return" Statement) - 40 Lines Off

I have twice now seen a NullReferenceException logged from a Production ASP.NET MVC 4 web application - and logged on the wrong line. Not wrong by a line or two (like you would get with a PDB mismatch), but wrong by the length of the entire controller action. Example:

public ActionResult Index()
{
    var someObject = GetObjectFromService();
    if (someObject.SomeProperty == "X") { // NullReferenceException here if someObject == null
        // do something
    }

    // about 40 more lines of code

    return View();    // Stack trace shows NullReferenceException here
}

This has happened twice for actions on the same controller. The second case was logged on

// someObject is known non-null because of earlier dereferences
return someObject.OtherProperty
    ? RedirecToAction("ViewName", "ControllerName")
    : RedirectToAction("OtherView", "OtherController");

This is very disturbing. NullReferenceException is very easy to fix once you know what line it occurs on. It's not quite so easy if the exception could have happened anywhere within the controller action!

Has anyone ever seen anything like this at all, either in ASP.NET MVC or elsewhere? I'm willing to believe it's the difference between a Release build and a Debug build, but still, to be off by 40 lines?


EDIT:

To be clear: I'm the original author of "What is a NullReferenceException and how do I fix it?". I know what a NullReferenceException is. This question is about why could the stack trace be so far off. I have seen cases where a stack trace is off by a line or two due to PDB mismatch. I have seen cases where there is no PDB, so you don't get line numbers. But I have never seen a case where the stack trace is off by 32 lines.

EDIT 2:

Note that this has happened with two separate controller actions within the same controller. Their code is quite different from each other. In fact, in the first case, the NullReferenceException didn't even occur in a conditional - it was more like this:

SomeMethod(someObject.SomeProperty);

There was some chance that the code had been reorganized during optimization so that the actual NullReferenceException occurred closer to the return, and the PDB was in fact only off by a few lines. But I don't see an opportunity to rearrange a method call in a way that would cause the code to move by 32 lines. In fact, I just looked at the decompiled source, and it does not appear to have been rearranged.

What these two cases have in common are:

  1. They occur in the same controller (so far)
  2. In both cases, the stack trace points to the return statement, and in both cases, the NullReferenceException occurred 30 or more lines away from the return statement.

EDIT 3:

I just did an experiment - I just rebuilt the solution using the "Production" build configuration that we have deployed to our Production servers. I ran the solution on my local IIS, without changing the IIS configuration at all.

The stack trace showed the correct line number.

EDIT 4:

I don't know if this is relevant, but the circumstance causing the NullReferenceException is as unusual as this "wrong line number" issue itself. We appear to be losing session state for no good reason (no restarts or anything). That's not too strange. The strange part is that our Session_Start should be redirecting to the login page when that happens. Any attempt to reproduce the session loss causes the redirect to the login page. Subsequently using the browser "Back" button or manually entering the previous URL goes right back to the login page without hitting the controller in question.

So maybe two weird problems is really one very weird problem.

EDIT 5:

I was able to obtain the .PDB file, and to look at it with dia2dump. I thought it was possible that the PDB was messed up, and only had line 72 for the method. That's not the case. All the line numbers are present in the PDB.

EDIT 6:

For the record, this just happened again, in a third controller. The stack trace points directly to the return statement of a method. This return statement is simply return model;. I don't think there's any way for that to cause a NullReferenceException.

Edit 6a:

In fact, I just looked more closely at the log and found several exceptions which are not NullReferenceException, and which still have the stack trace point at the return statement. Both of these cases are in methods called from the controller action, not directly in the action method itself. One of these was an explicitly thrown InvalidOperationException, and one was a simple FormatException.


Here are a couple of facts I haven't thought relevant until now:

  1. The Application_Error in the global.asax is what causes these exceptions to be logged. It picks up the exceptions by using Server.GetLastError().
  2. The logging mechanism logs the message and stack trace separately (instead of logging ex.ToString(), which would have been my recommendation). In particular, the stack trace that I've been asking about comes from ex.StackTrace.
  3. The FormatException was raised in System.DateTime.Parse, called from System.Convert.ToDate, called from our code. The stack trace line that points to our code is the line that points to "return model;".
like image 803
John Saunders Avatar asked Nov 21 '14 21:11

John Saunders


People also ask

How does a stack trace work?

Simply put, a stack trace is a representation of a call stack at a certain point in time, with each element representing a method invocation. The stack trace contains all invocations from the start of a thread until the point it's generated. This is usually a position at which an exception takes place.

What is stack trace error?

Stack trace error is a generic term frequently associated with long error messages. The stack trace information identifies where in the program the error occurs and is helpful to programmers. For users, the long stack track information may not be very useful for troubleshooting web errors.

What does end of stack trace from previous location where exception was thrown mean?

When the exception is restored, the following string is inserted in the stack trace to indicate the restore point: "End of stack trace from the previous location where the exception was thrown". This is similar to the way inner exceptions or marshaled exceptions are indicated in stack traces.


4 Answers

Can PDBs be off more than 2 or 3 lines?

You give the statement that you have never seen PDBs off more than a few lines. 40 lines seem to be too much, especially when the decompiled code doesn't look much of a difference.

However, that's not true and can be proven by a 2 liner: create a String object, set it to nulland call ToString(). Compile and run. Next, insert a 30 line comment, save the file, but do not recompile. Run the application again. The application still crashes, but gives a 30 lines difference in what it reports (line 14 vs. 44 in the screenshot).

It is not related at all to the code that gets compiled. Such things can easily happen:

  • code reformat, which e.g. sorts the methods by visibility, so the method moved up 40 lines
  • code reformat, which e.g. breaks long lines at 80 characters, usually this moves things down
  • optimize usings (R#) which removes 30 lines of unneeded imports, so the method moved up
  • insertion of comments or newlines
  • having switched to a branch while the deployed version (matching the PDB) is from trunk (or similar)

PDBs off by 30 lines

How can this happen in your case?

If it's really as you say and you seriously reviewed your code, there are two potential issues:

  • EXE or DLL does not match to PDBs, which can easily be checked
  • PDBs do not match to source code, which is harder to identify

Multithreading can set objects to null when you least expect it, even if it has been initialized before. In such a case, NullReferenceExceptions can not only be 40 lines away, it can even be in a totally different class and therefore file.

How to continue

Capture a dump

I'd first try to get a dump of the situation. This allows you to capture the state and look at everything in detail without the need of reproducing it on your developer machine.

For ASP.NET, see the MSDN blog Steps to Trigger a User Dump of a Process with DebugDiag when a Specific .net Exception is Thrown or Tess' blog.

In any case, always capture a dump including full memory. Also remember to collect all necessary files (SOS.dll and mscordacwks.dll) from the machine where the crash occured. You can use MscordacwksCollector (Disclaimer: I'm the author of it).

Check the symbols

See if the EXE/DLL really matches to your PDBs. In WinDbg, the following commands are helpful

!sym noisy
.reload /f
lm
!lmi <module>

Outside WinDbg, but still using debugging tools for Windows:

symchk /if <exe> /s <pdbdir> /av /od /pf

3rd party tool, ChkMatch:

chkmatch -c <exe> <pdb>

Check the source code

If PDBs match to DLLs, the next step is to check whether the source code belongs to the PDBs. This is best possible if you commit PDBs to version control together with the source code. If you did that, you can search the matching PDBs in source control and then get the same revision of source code and PDBs.

If you didn't do that, you're unlucky and you should probably not use source code but work with PDBs only. In case of .NET, this works out pretty well. I'm debugging a lot in 3rd party code with WinDbg without receiving the source code and I can get pretty far.

If you go with WinDbg, the following commands are useful (in this order)

.symfix c:\symbols
.loadby sos clr
!threads    
~#s
!clrstack
!pe

Why code is so important on StackOverflow

Also, I looked at the code of the View() method, and there is no way for it to throw a NullReferenceException

Well, other people have made similar statements before. It's easy to overlook something.

The following is a real world example, just minimized and in pseudo code. In the first version, the lock statement didn't exist yet and DoWork() could be called from several threads. Quite soon, the lock statement was introduced and everything went well. When leaving the lock, someobj will always be a valid object, right?

var someobj = new SomeObj(); 
private void OnButtonClick(...)
{
    DoWork();
}

var a = new object();   
private void DoWork()
{
    lock(a) {
        try {
            someobj.DoSomething();
            someobj = null;
            DoEvents();             
        }
        finally
        {
            someobj = new SomeObj();
        }
    }   
}

Until one user reported the same bug again. We were sure that the bug was fixed and this was impossible to happen. However, this was a "double-click user", i.e. someone who does a double click on anything that can be clicked.

The DoEvents() call, which was of course not in such a prominent place, caused the lock to be entered again by the same thread (which is legal). This time, someobj was null, causing a NullReferenceException in a place where it seemed impossible to be null.

That second time, it was return boolValue ? RedirectToAction("A1","C1") : RedirectToAction("A2", "C2"). The boolValue was an expression which could not have thrown the NullReferenceException

Why not? What is boolValue? A property with a getter and setter? Also consider the following (maybe a bit off) case, where RedirectToAction takes only constant parameters, looks like a method, throws an exception but is still not on the callstack. This is why it is so important to see code on StackOverflow...

Screenshot: method with constant parameters not on callstack

like image 59
Thomas Weller Avatar answered Oct 03 '22 19:10

Thomas Weller


I have seen this kind of behavior in production code once. Although the details are a little vague (It was approximately 2 years ago, and although I can find the email, I don't have access to the code anymore, and neither the dumps etc.)

FYI, this is what I wrote to the team (very small parts from the large mail) -

// Code at TeamProvider.cs:line 34
Team securedTeam = TeamProvider.GetTeamByPath(teamPath); // Static method call.

"No way null reference exception can happen here."

Later, after more dump diving

"Findings -

  1. The issue was happening in DBI because it did not have root/BRH team. UI is not handling the null returned by CLib gracefully, and hence the exception.
  2. The stack trace shown on UI was misleading, and was due to the fact that Jitter and CPU can optimize / reorder instructions, causing stack traces to "lie".

Digging into a process dump revealed the issue, and has been confirmed that DBI indeed did not have above mentioned team."


I think, the thing to note here is the statement in bold above, in contrast with your analysis and statement -

"I just looked at the decompiled source, and it does not appear to have been rearranged.", or

"Production build running on my local machine shows the correct line number."

The idea is that optimizations can happen at different levels.. and those done at compile time are just some of them. Today, especially with Managed environment like .Net, there are actually relatively fewer optimizations done while emitting IL (Why should 10 compilers for 10 different .Net languages try to do same set of optimizations, when the emitted Intermediate Language code will be further transformed into machine code, either by ngen, or Jitter).

Hence what you have observed, can only be confirmed by looking at the jitted machine code (aka assembly) from a dump from the production machine.


One question I can see coming is - Why would Jitter emit different code on Production machine, compared to your machine, for the same build?

Answer - I don't know. I am not a Jit expert, but I do believe it can... because as I said above.. Today these things are way more sophisticated compared to technologies used 5-10 years ago. Who knows, what all factors.. like "memory, number of CPUs, CPU load, 32 bit vs 64 bit, Numa vs Non-Numa, Number of times a method has executed, how small or big a method is, Who calls it, what does it call, how many times, access patterns for memory locations, etc etc" does it look at while making these optimizations.

For your case, so far only you can reproduce it, and only you have access to your jitted code in production. Hence, (If I may say so :)) this is the best answer anyone can come up with.


EDIT: An important difference between the Jitter on one machine vs the other, can also be the version of the jitter itself. I'd imagine that as several patches and KBs are released for .net framework, who knows what differences of optimization behavior jitter with even minor version differences may have.

In other words, it is not sufficient to assume that both machines have same Major version of the framework (lets say .Net 4.5 SP1). Production may not have patches which are released every day, but your dev / private machine may have patch released last Tuesday.


EDIT 2: Proof of concept - i.e. Jitter optimizations can lead to lying stack traces.

Run following code yourself, Release build, x64, Optimizations on, all TRACE and DEBUG turned off, Visual Studio Hosting Process turned off. Compile from visual studio, but run from explorer. And try to guess which line the stack trace will tell you the exception is at?

class Program
{
    static void Main(string[] args)
    {
        string bar = ReturnMeNull();

        for (int i = 0; i < 100; i++)
        {
            Console.WriteLine(i);
        }

        for (int i = 0; i < bar.Length; i++)
        {
            Console.WriteLine(i);
        }

        Console.ReadLine();

        return;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static string ReturnMeNull()
    {
        return null;
    }
}

Unfortunately, after few attempts, I still cannot reproduce exact problem you have seen (i.e. error on return statement), because only you have access to exact code, and any specific code pattern it may have. Or, once again, it is some other Jitter optimization, which is not documented, and hence hard to guess.

like image 26
Vikas Gupta Avatar answered Sep 29 '22 19:09

Vikas Gupta


Just a thought, but the one thing that I can think of is that perhaps there is a possibility that your build definition/configuration is pushing out an out of synch compiled version of your application dll(s) and this is why you see the discrepancy on your machine when you look up the line number from the stacktrace.

like image 20
mreyeros Avatar answered Sep 29 '22 19:09

mreyeros


The problem and its symptoms smells of a hardware problem, eg:

We appear to be losing session state for no good reason (no restarts or anything).

If using InProc Session State Storage switch to out of process. This will help you isolate the problem of losing sessions from the symptom of mismatched PDB line numbers on the NRE you're reporting. If using out of process storage, run some diagnostic utilities on the server.

ps post the output from DebugDiag. I probably should have put this answer as a comment but there are too many already, need to space them out and comment on different diagnostic steps separately.

like image 34
Jeremy Thompson Avatar answered Sep 29 '22 19:09

Jeremy Thompson