The single biggest cause of Lotus Notes client crashes and how to avoid it
While reviewing an environment with about 3000 users, I discovered an extremely high number of fault reports: between 100 and 200 every day. Some users were crashing every single day. Clearly this pointed to a systemic problem, probably a software conflict or configuration issue common across the organization. Yet for all these crashes, the users were not reporting any problems. Even so, repeated crashes were likely to lead to bigger problems from file corruption, if they hadn’t already. I needed to find the cause. One catch though: I had limited access to the computers and limited contact with the users, which makes troubleshooting very difficult.
The first step was to examine the data submitted in the Fault Reports database. Unfortunately, the crash reports contained little or no useful data, and the attached .NSD files were only partial. Fewer than 10% of the crashes even reported a version, but those that did were all either Release 8.5.2 or 8.5.3 with various Fix Packs. Although we were only about halfway through an upgrade from 7.0.x to 8.5.3, none of the crashes reported a version of 7.x. If all the crashes were on 8.5.x, that made the fault rate even worse: about 10% per day (100 to 200 faults across roughly fifteen hundred 8.5.x users)! Yet no one was reporting any problems. Quite the mystery.
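As a quick check of that arithmetic, here is a minimal Python sketch using the figures quoted above. The even split of users between 7.0.x and 8.5.x is an approximation taken from "about halfway through an upgrade", not an exact count.

```python
# Figures quoted in the post; the 50% upgrade split is an approximation.
total_users = 3000
upgraded_users = total_users // 2  # roughly half were already on 8.5.x
daily_faults_low, daily_faults_high = 100, 200


def daily_fault_rate(faults: int, users: int) -> float:
    """Return faults per user per day, as a percentage."""
    return 100.0 * faults / users


low = daily_fault_rate(daily_faults_low, upgraded_users)
high = daily_fault_rate(daily_faults_high, upgraded_users)
print(f"{low:.1f}% to {high:.1f}% of 8.5.x users per day")
```

The range brackets the "about 10%" figure, which is why the crash volume looked so alarming once the 7.x users were excluded.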
The next logical step was to run Fault Analyzer against the Fault Reports database to look for trends in the fault reports, and to examine whatever was available in the .NSD files for clues. However, the .NSD files were mostly empty, and Fault Analyzer proved useless because there wasn’t enough data in the fault reports. Manually examining the crashes that did report some data, I found a common thread among some of them:
Host Name : LAPTOP1234
User Name : SYSTEM
Date : Thu Oct 11 10:33:24 2012
Windows Dir : C:\Windows
Arguments : "C:\Program Files (x86)\IBM\Lotus\Notes\nsd.exe" -dumpandkill -termstatus 1 -dlgopts showwait -wctpid 5292 -wctexitcode 1073807364 -panicdirect -crashpid 3940 -crashtid 516 -runtime 300 -ini "C:\Program Files (x86)\IBM\Lotus\Notes\notes.ini" -svcreq 128
NSD Version : 220.127.116.112 (Release 8.5.2FP3)
OS Version : Windows/7 6.1 [64-bit] (Build 7601), PlatID=2, Service Pack 1 (8 Processors)
Running as 32-bit Windows application on 64-bit Windows
Build time : Mon Jul 11 03:15:18 2011
Latest file mod : Fri May 13 09:03:31 2011
Notes Version : (32-bit client)
ERROR (79): the directory () does not exist - (22) Invalid argument
ERROR (44): unable to open file 'C:\Program Files (x86)\IBM\Lotus\Notes\Data\formats.ini' - (2) No such file or directory
This is an odd error, but searching the web I found others who reported a similar problem and solved it by copying the formats.ini file from a good installation onto their computer. Could it be that our customized installation kit was missing this file? If so, it would be a straightforward fix, though it would have to be applied to all computers already upgraded. However, an inspection of one of the computers that had been crashing revealed the file was right where it should be. This was a dead end.
Finally I was able to work with one user on the issue. She had been crashing several times a week for the past few months, though she never noticed. The crash reports were timestamped fairly consistently at around 7:30 AM, which matched the time she came in to work. The user did not report any unusual behavior when she started her computer, though occasionally Lotus Notes did “take a long time to start”. So one morning I watched her go through her morning routine of starting up and logging in. There was nothing unusual, and no crash report was posted either. Time to do more trend analysis.
I created several views in the Fault Reports database trying to identify any other trends using different categorized sorts: by date, by user, by hour of the day. When categorized by the hour of day, the crashes revealed a trend. The majority of crashes were in the afternoon between 1:00 PM and 5:00 PM (hours 13 – 16).
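The same hour-of-day categorization can be sketched outside Notes. This minimal Python example assumes the crash times have been exported as text; the timestamps below are hypothetical sample data, not values from the actual Fault Reports database.

```python
from collections import Counter
from datetime import datetime

# Hypothetical crash timestamps; real values would come from an export
# of the Fault Reports database.
crash_times = [
    "2012-10-08 09:12", "2012-10-08 13:05", "2012-10-08 15:40",
    "2012-10-08 17:00", "2012-10-09 14:22", "2012-10-09 15:55",
    "2012-10-09 16:31", "2012-10-10 15:03",
]


def crashes_by_hour(timestamps):
    """Count crashes per hour of day (0-23)."""
    hours = (datetime.strptime(t, "%Y-%m-%d %H:%M").hour for t in timestamps)
    return Counter(hours)


counts = crashes_by_hour(crash_times)

# Text histogram, one row per hour that had at least one crash.
for hour in sorted(counts):
    print(f"{hour:02d}:00  {'#' * counts[hour]}")

# Crashes falling in hours 13 through 16, the afternoon band noted above.
afternoon = sum(counts[h] for h in range(13, 17))
print(f"{afternoon} of {len(crash_times)} crashes in hours 13-16")
```

A categorized view in Notes gives the same result; the point is only that bucketing by hour makes a clustering visible that a flat list of reports hides.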
I sorted this view further by user. From this I noticed that, while the crashes were scattered throughout the afternoon, any given person was usually crashing in the same hour almost every time. I re-sorted the view so it was categorized first by user and then by hour, and added a column with the exact time of the crash. Now I could see all the crashes for one person grouped together and categorized by hour. Then, scanning through the users with very high crash counts, I found the final clue: one user was crashing at precisely 5:00 PM every single day, and her crash reports were being submitted consistently at 8:02 AM the next day.
This person happened to be the receptionist. Her work hours are precisely 8 to 5. Looking more closely at the other users, I could see that each person’s crashes were typically occurring about 8 hours after their previous crash report was submitted. It is important to note here that a crash report is submitted (its Creation date/time) at the next restart of Notes, not at the time of the crash. In other words, Notes was crashing at the end of each user’s day, and the report was not submitted until they restarted Notes the next morning.
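The per-user pattern described above can be sketched as follows. This is an illustration under stated assumptions: the records and user names are hypothetical, and each record pairs a crash time with the creation time of the report submitted at the next Notes start.

```python
from collections import Counter, defaultdict
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Hypothetical records: (user, crash time, report creation time).
records = [
    ("receptionist", "2012-10-08 17:00", "2012-10-09 08:02"),
    ("receptionist", "2012-10-09 17:00", "2012-10-10 08:02"),
    ("receptionist", "2012-10-10 17:00", "2012-10-11 08:02"),
    ("analyst", "2012-10-08 16:40", "2012-10-09 08:31"),
    ("analyst", "2012-10-09 16:55", "2012-10-10 08:29"),
    ("analyst", "2012-10-10 16:20", "2012-10-11 08:35"),
]


def profile(records):
    """Per user: the usual crash hour, and the hours between each report
    submission (the next Notes start) and that user's following crash."""
    by_user = defaultdict(list)
    for user, crashed, reported in records:
        by_user[user].append(
            (datetime.strptime(crashed, FMT), datetime.strptime(reported, FMT))
        )
    result = {}
    for user, rows in by_user.items():
        rows.sort()  # chronological by crash time
        usual_hour, hits = Counter(c.hour for c, _ in rows).most_common(1)[0]
        sessions = [
            (nxt[0] - prev[1]).total_seconds() / 3600
            for prev, nxt in zip(rows, rows[1:])
        ]
        result[user] = {
            "usual_hour": usual_hour,
            "share": hits / len(rows),
            "session_hours": sessions,
        }
    return result


for user, info in profile(records).items():
    avg = sum(info["session_hours"]) / len(info["session_hours"])
    print(f"{user}: usually hour {info['usual_hour']}, "
          f"about {avg:.1f} h after the previous report")
```

A gap close to a full working day between report submission and the next crash is the signature of a crash at shutdown rather than during use.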
I called the receptionist and asked how she shuts down her computer at the end of the day. I expected to hear her say she just hits the power button, but that was not the case. It turns out she clicks the X in the top right corner of Notes to close the window, then clicks Log Off on the Start menu immediately after. Apparently Notes 8.5.x takes longer to close than 7.x, and it was not able to finish closing before the OS dumped it from memory during shutdown, so it did not close cleanly.
With a bit of user training, this problem has been resolved. Users were told to give Notes an extra minute to shut down before logging out, or to just lock or hibernate the computer instead of logging off.
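For anyone who would rather script the wait than rely on training, here is a rough Python sketch of the idea. It was not part of the fix described here. The process names are my assumption for a Notes 8.5.x Standard client and may differ on your install, and the function names are hypothetical. It polls the Windows `tasklist` command until no Notes process remains or a timeout passes.

```python
import csv
import subprocess
import time

# Assumed image names for a Notes 8.5.x Standard client; adjust as needed.
NOTES_PROCESSES = {"notes.exe", "notes2.exe", "nlnotes.exe", "ntaskldr.exe"}


def running_process_names():
    """Return lowercase image names parsed from `tasklist` CSV output."""
    out = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {row[0].lower() for row in csv.reader(out.splitlines()) if row}


def wait_for_notes_exit(list_names=running_process_names,
                        timeout=60.0, poll=2.0, sleep=time.sleep):
    """Poll until no Notes process is running or the timeout elapses.

    Returns True if Notes has exited, False if it was still running
    when the timeout was reached.
    """
    waited = 0.0
    while True:
        if not (NOTES_PROCESSES & list_names()):
            return True
        if waited >= timeout:
            return False
        sleep(poll)
        waited += poll
```

A logoff helper could call `wait_for_notes_exit()` after Notes is closed and only proceed with the logoff once it returns True. The 60 second default mirrors the "extra minute" advice above.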
I think this is a flaw in the interaction between the OS and Notes, but until that is fixed, this is a clean, simple work-around. What are your ideas and experiences with this?