DevOps Zone is brought to you in partnership with:

Ayende Rahien is working for Hibernating Rhinos LTD, a Israeli based company producing developer productivity tools for OLTP applications such as NHibernate Profiler (nhprof.com), Linq to SQL Profiler(l2sprof.com), Entity Framework Profiler (efprof.com) and more. Ayende is a DZone MVB and is not an employee of DZone and has posted 479 posts at DZone. You can read more from them at their website. View Full User Profile

Diagnosing a Production Issue on Our Servers

09.26.2013
| 2850 views |
  • submit to reddit

One of the steps that we take before a stable release is to push the latest build into our own production servers, and see what happens. By far, this has been a pretty pleasant process, and mostly served to increase our confidence that we can go to production with that version. But sometimes it does what it is supposed to do and finds the sort of bugs that are very hard to run in production.

In this case, after several hours (8 to 12, I'm guessing), we discovered that we would start getting errors such as EsentOutOfSessionsException on some of our sites. Esent sessions are the main way we access Esent, and we are pretty careful about managing them. Previously, there wasn’t really any way that you could get this error, indeed, that is pretty much the first time we saw that outside of the lab. The difference in 2.5 is that we allowed detached sessions to be used along with DTC calls. This gave us the ability to have a commit pending between the Prepare and Commit phases of the operation.

Reviewing the code, I found some places where we weren’t properly disposing the sessions, which could explain that. So I fixed that and pushed a new version out. It took a bit longer this time, but the same error happened.  Sad smileSad smileSad smile

The good thing about having this happen on our production servers is that I have full access there. Of course, it is in production, so not outright debugging it out, but taking a dump and transferring that to my machine was easy enough.

Now, open it with WinDBG, run “.loadby sos clr” and start investigating.

First command, as always, is !threads. And there I could see several threads marked with Microsoft.Isam.Esent.Interop.EsentOutOfSessionsException. That was interesting, and said that we caught the problem as it was happening, which was great.

Next, it was time to look a bit at the actual memory. I ran a !DumpHeap type session

image

My reaction was, "Huh!!! There is absolutely zero justification for that."

Now, the only question is why. So I decided to look at the class that holds the transaction state, assuming that this is probably what is holding onto all those sessions. I ran a !DumpHeap -type EsentTransactionContext:

image

And that tells me quite a lot. There appear to be now a total of 317 in-flight DTC transactions. Considering that I know what our usage is, that is a huge number, and it tells me that something isn’t right here. This is especially true when you consider that we don’t have that many open databases holding in-flight transactions: !DumpHeap -type EsentInFlightTransactionalState –stat

image

In other words, we have 8 loaded DBs, each of them holding their in-flight transactional state, and we have 317 opened transactions and 35 thousand sessions. That is pretty ridiculous. Especially given that I know that we are supposed to have a max of single-digit concurrent DTC transactions at any one time. So somehow we are leaking transactions and sessions, but I am still very unhappy with just “we are leaking sessions.” That is something that I knew before we started debugging everything.

I can already tell that we probably need to add a more robust way of expiring transactions, and I added that, but the numbers don’t add up to me. Since this is pretty much all I can do with WinDBG, I decided to use another tool, MemProfiler. This gives me the ability to import the dump file and then analyze it in a much nicer manner. Doing so, I quickly found this out:

image

Huh?!

Sessions are finalizable, sure, but I am very careful about making sure to dispose of them, especially after the previous code change. There should be just 317 undisposed sessions, and having that many items in the finalizer queue can certainly explain things. But I don’t know how they got there, and the numbers don’t match, either. We are missing about 7K items from the WinDBG numbers.

OK, next, I pulled ClrMD and wrote the following:

 var dt = DataTarget.LoadCrashDump(@"C:\Users\Ayende\Downloads\w3wp\w3wp.dmp");
 var moduleInfo = dt.ClrVersions[0].TryGetDacLocation();
 var rt = dt.CreateRuntime(moduleInfo);
  
 var clrHeap = rt.GetHeap();
  
 var finalized = new HashSet<ulong>();
  
 int cnt = 0;
 foreach (var ptr in rt.EnumerateFinalizerQueue())
 {
     var type = clrHeap.GetObjectType(ptr);
     if (type.Name == "Microsoft.Isam.Esent.Interop.Session")
     {
         finalized.Add(ptr);
     }
 }
 Console.WriteLine(finalized.Count);
  
 var live = new HashSet<ulong>();
 foreach (var ptr in clrHeap.EnumerateObjects())
 {
     var type = clrHeap.GetObjectType(ptr);
     if (type.Name == "Microsoft.Isam.Esent.Interop.Session")
     {
         if (finalized.Contains(ptr) == false)
             live.Add(ptr);
     }
  }
 Console.WriteLine(live.Count);

This gave me 28,112 sessions in the finalizer queue and 7,547 session that are still live. So something is creating a lot of instances, but not using or referencing them?

I did a code review over everything once again, and I think that I got it. The culprit is this guy:

image

Where createContext is defined as:

image

Now, what I think is going on is that the concurrent dictionary, which is what transactionContexts might be calling the createContext multiple times inside the GetOrAdd method, because those create values that have to be disposed…  Now, in the normal course of things, the worst-case scenario would be that we would have them in the finalizer queue and they would be disposed of in due time. However, under load, we actually gather quite a few of them, and we run out of available sessions to operate with.

At least, this is my current theory. I changed the code to be like this:

image

So, if my value wasn’t created, I’ll properly dispose of it. I’ll be pushing this to production in a bit and seeing what is happening. Note that there isn’t locking, but we might be generating multiple sessions. That is fine, as long as only one of them survives.



Published at DZone with permission of Ayende Rahien, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)