I just solved a really tricky problem with a lot of help from posts on this forum, so I thought I would post my story here as a way to say thank you, and to help others in the future...
(tl;dr: got BSoD's on a 24 hour cycle. It was a driver memory leak)
My laptop (an MSI GX700) was built by AVADirect in early 2008, and I figured it would be a good idea to upgrade the hard drive before the old one failed (everything was fine, really).
I bought a new 500GB Western Digital Scorpio Blue and an external enclosure for the drive clone operation. I purchased "Acronis True Image Home 2011" to perform the drive clone. Installation day was July 1st. I put the new drive in the external enclosure and cloned the old drive to it. It seemed to go very well, but as I learned later, that's not the best way to do it - I should have performed a "Reverse Clone", from external to internal... doing it wrong might mess up your disk formatting, it seems ...okay, right.
And because I wanted to record game videos with Fraps, and Fraps works best with a dedicated hard drive, and my laptop only has one internal drive and it's USB 2.0 only, I ordered an ExpressCard USB 3.0 adapter, a USB 3.0 enclosure, and another Western Digital Scorpio - "Black" (7200 RPM) this time.
Got my first BSoD before the new external arrived; didn't worry about it too much. The BSoD's got more frequent after the external USB 3.0 setup was in place. Then I started to worry. The USB 3.0 setup was none too reliable anyway, so I suspected it as the source of the problem. Ran Spinrite and Chkdsk - no problems found.
BSoD's started occurring daily by August. I gave up using the external HDD, and later uninstalled the USB 3.0 driver.
All BSoD's were like, "Stop: 0x000000F4 (0x00000003, ####, ####, ####)"
In Event Viewer, found events like these:
Event Source: Ftdisk
Event ID: 57
Description: The system failed to flush data to the transaction log.
Corruption may occur.
Event Source: Srv
Event ID: 2019
Description: The server was unable to allocate from the system nonpaged
pool because the pool was empty.
I ran System File Checker, Windows Memory Diagnostic, memtest86+, and WD Data Lifeguard - no errors found. I updated all drivers. Still the BSoD's continued.
I learned that if I rebooted every 24 hours, I could avoid the blue screen. If I waited 25 or 26 hours though, the crash would happen for sure. What in the world would do this? The answer lay in the regularity, but I didn't discover the true cause for quite a long while...
I wondered if I had a drive compatibility problem (too large? OEM SATA I vs. new SATA II?), so I posted a query to the Western Digital support forum, and another query to the MSI support forum. No problem, I was told.
I ran the WD Align utility, which fixed a gradual slowing of my CrystalDiskMark HDD benchmark scores, due to the new Advanced Format of these drives.
Entering October, I reinstalled the OEM drive to eliminate the compatibility question (unfortunately I had wiped the drive, so I cloned back again, still not suspecting the cloning process itself). My blue screens continued as before. I reinstalled my new drive, since it was better in every way.
Around this time is when I learned about reverse cloning:
I posted a query to the Acronis support forum, where I was told the cloning method I used was probably not a problem, and was advised to run memtest86, which I did. I also ran System File Checker (sfc /scannow) and CheckDisk (chkdsk /f). Then I tried a reverse clone from the "Black" Scorpio to the "Blue" one, in case the HDD itself was bad. My blue screens continued as before. Of course, if Windows was corrupted by the first clone operation, I was not going to get rid of it by cloning again - however, if it was a disk format issue it might have helped.
Now it's mid-November - after 3+ months of daily reboot-or-die. I still really did not want to do a totally fresh install, because of the work that would be involved (tracking down program disks, licenses etc). So I looked into alternatives and ended up creating an ASR backup of the current (flawed) setup - I had no existing ASR backup, damn and blast - then doing an ASR recovery to the same disk, after fully reformatting. This required a fresh install of XP Pro SP2, and when this was done - but before the recovery - I did an ASR backup so I could go back in case the recovery went wrong.
Well, there was no change - I still had the blue screens - but at least I had eliminated any disk formatting issues.
Here's another typical error event:
Event Source: Disk
Description: An error was detected on device \Device\Harddisk1\D during a
BTW - what does "\Device\Harddisk1\D" mean? Well, here is the answer:
"How to Distinguish a Physical Disk Device from an Event Message"
"On Windows XP...the DeviceName may be truncated because of the
size limitation of the event log entry"
ExceptionCode: c0000006 (In-page I/O error)
Inpage operation failed at 7c99c3d8, due to I/O error c000009a
Event Source: Disk
Description: An error was detected on device \Device\Harddisk2\D during a
I updated every possible driver, especially SATA, USB, NIC & WiFi -- no joy.
At this point I'm searching far and wide for any kind of related problem - that's when I found hardwareanalysis.com
There are a couple very long threads there, filled with helpful - and crazy - ideas:
"Delayed Write Failed / Error"
"Event ID - 51 - An error was detected on device \Device\Harddisk0\D..."
Finally - I forget how it happened exactly, but it was thanks to an idea posted on one of the above threads - I started to suspect low system "Page Table Entries", and began tracking that and other metrics using Windows Performance System Monitor. Nothing looked out of place, except - what's this? - the chart line for "Pool Nonpaged Bytes" kept creeping up and up, never going down even when all applications were closed. Then I read an excellent article explaining the nonpaged pool & much more, by Mark (Sysinternals) Russinovich: "Pushing the Limits of Windows: Paged and Nonpaged Pool"
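For anyone who wants to repeat this kind of tracking, here is a minimal Python sketch that reads a counter log and checks whether the value ever goes down. It assumes the log was exported as a two-column CSV (timestamp, value) with one header row and timestamps like 12/01/2010 10:00:00.000. That layout, the function names and the sample numbers are all illustrative assumptions, not data from my machine.

```python
import csv
import io
from datetime import datetime

# Assumed timestamp layout in the exported log (adjust to match your own file).
TIME_FORMAT = "%m/%d/%Y %H:%M:%S.%f"

def read_pool_samples(csv_text):
    """Return a list of (timestamp, bytes) pairs from a two-column counter log."""
    rows = csv.reader(io.StringIO(csv_text))
    next(rows)  # skip the header row that names the columns
    samples = []
    for row in rows:
        if len(row) < 2 or not row[1].strip():
            continue  # skip blank or incomplete samples
        when = datetime.strptime(row[0], TIME_FORMAT)
        samples.append((when, float(row[1])))
    return samples

def never_decreases(samples):
    """True if the counter never drops between consecutive samples."""
    values = [value for _, value in samples]
    return all(later >= earlier for earlier, later in zip(values, values[1:]))

# Made-up example log: the counter climbs by 1 MB every 10 minutes.
EXAMPLE_LOG = '''"Time","Pool Nonpaged Bytes"
"12/01/2010 10:00:00.000","52428800.000000"
"12/01/2010 10:10:00.000","53477376.000000"
"12/01/2010 10:20:00.000","54525952.000000"
'''

if __name__ == "__main__":
    samples = read_pool_samples(EXAMPLE_LOG)
    print("samples read:", len(samples))
    print("never decreases:", never_decreases(samples))
```

A counter that never decreases over many hours, even with all applications closed, is the pattern that pointed to a leak here.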
...and suddenly it became clear: I had a memory leak in a driver somewhere! When "Pool Nonpaged Bytes" reaches the 32-bit hard limit (normally 256 MB), a blue screen is the certain result. This is the reason for the clockwork regularity: the "Pool Nonpaged Bytes" increased at a slow and steady pace until all nonpaged memory was exhausted - a process that always took about 24 hours (I probably "slept" the machine for about the same amount of time each day).
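The arithmetic behind the 24-hour cycle can be sketched in a few lines of Python: fit a straight line to some "Pool Nonpaged Bytes" readings and project when it hits the limit. The 256 MB limit is the figure quoted above. The sample readings (a 40 MB starting point growing 9 MB per hour) and the function names are made up for illustration, and the sketch assumes the leak really is linear.

```python
MB = 1024 * 1024
NONPAGED_LIMIT_BYTES = 256 * MB  # the limit quoted in the post

def leak_rate(samples):
    """Least-squares slope, in bytes per hour, of (hours, bytes) samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_b = sum(b for _, b in samples) / n
    num = sum((t - mean_t) * (b - mean_b) for t, b in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den

def hours_until_exhausted(samples, limit=NONPAGED_LIMIT_BYTES):
    """Hours from the last sample until the pool reaches the limit, or None if not growing."""
    rate = leak_rate(samples)
    if rate <= 0:
        return None
    _, latest = samples[-1]
    return (limit - latest) / rate

# Hypothetical readings taken once an hour after a reboot.
EXAMPLE_SAMPLES = [(0, 40 * MB), (1, 49 * MB), (2, 58 * MB), (3, 67 * MB)]

if __name__ == "__main__":
    rate = leak_rate(EXAMPLE_SAMPLES)
    remaining = hours_until_exhausted(EXAMPLE_SAMPLES)
    print("leak rate: %.1f MB/hour" % (rate / MB))
    print("hours left after last sample: %.1f" % remaining)
```

With these made-up numbers the pool fills 21 hours after the last reading, which is 24 hours after the reboot. A leak of only a few KB per second is enough to produce a daily crash.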
Using Windows Support Tools "poolmon" and "driver verifier", all I could determine was that the leak was happening in ntoskrnl.exe, which didn't help a lot (it's the kernel, for crying out loud). I made sure the file was up to date, etc.
Running Sysinternals Process Explorer, I saw a lot of registry activity revolving around the nonexistent value "HKLM\System\CurrentControlSet\Control\UsbFlags\FlushPortPowerIrpsFlag". Something was not right in USB-land. Could this be my old USB 3.0 adapter, still not fully uninstalled? Or something else?
Skipping some details here, I tried shutting down all noncritical applications and services and disabling all noncritical devices. This took several tries, as it's hard to know what's critical before you disable it. Anyway, by carefully watching the rise of "Pool Nonpaged Bytes" in System Monitor, and disabling one application, service or device at a time (at first I expected a huge drop in nonpaged bytes when the offender was terminated, but that never happened - so I started over, looking for the chart line to stop rising, which required really zooming in), I finally zeroed in on the Intel ICH8 USB Universal Host Controllers - of which there are seven installed. I found that disabling one or two of them allowed all my devices to function, while plugging the memory leak. SUCCESS!!
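The "look for the line to stop rising" test can be written down as a small Python sketch. Already-leaked memory is not given back when the offender is disabled, so the check compares growth rates rather than absolute levels. The candidate names, the sample windows and the 10% threshold are hypothetical values chosen for illustration, not measurements from my laptop.

```python
def slope(window):
    """Average growth in bytes per minute across a window of (minute, bytes) samples."""
    (t0, b0), (t1, b1) = window[0], window[-1]
    return (b1 - b0) / (t1 - t0)

def find_culprit(baseline, trials, fraction=0.1):
    """Return the first candidate whose window grows slower than `fraction` of the baseline.

    `baseline` is a window recorded with everything enabled.
    `trials` is a list of (name, window) pairs, each window recorded with
    only that one candidate disabled.
    """
    base = slope(baseline)
    for name, window in trials:
        if slope(window) < fraction * base:
            return name
    return None

# Hypothetical data: roughly 150 KB per minute of growth with everything enabled.
BASELINE = [(0, 60_000_000), (10, 61_500_000), (20, 63_000_000)]
TRIALS = [
    ("service A", [(0, 63_000_000), (10, 64_500_000), (20, 66_000_000)]),
    ("network adapter", [(0, 66_000_000), (10, 67_480_000), (20, 68_990_000)]),
    ("USB host controller 3", [(0, 69_000_000), (10, 69_004_000), (20, 69_009_000)]),
]

if __name__ == "__main__":
    print("growth stops when disabling:", find_culprit(BASELINE, TRIALS))
```

Note that the level in the last window is higher than in any earlier one. Only the slope shows that the leak has stopped, which is why zooming in on the chart was necessary.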
So, do I have a hardware error on my Intel ICH8 chipset? Did my sketchy (no names please) external USB adapter zap it? Or did the driver installation fail - for one module only? Right now I'm too tired to worry about it...I just want to enjoy my laptop again. It's a horrible feeling, having your life running on a machine that's not reliable. I'm just glad it's over - at least I hope it's over.
(Edited for a little more clarity, 02-Dec. If something is still not clear, please ask)
(Edit 06-Dec: I see now that I had disk errors in addition to the memory leak, which is why the error messages were so confusing. See "reverse clone" above)
PS - the "Love Story" in the title is there just because I wanted to make it sound more "cheerful". I guess I could have said, "Blue Screen (STOP: 0x000000F4) - SOLVED!!", but this was funny to me at the time (I was tired). It's like some of those old TV movies: "Cancer - a Love Story"; "Herpes - a Love Story"; "Wristcutters - a Love Story" (that last one is for real, the others might be real, I don't know)