Thanks for responding.
Unfortunately, the Log reader problem has progressed far beyond a simple timeout. We have documented that the timeout failure actually leaves SQL threads running, and this has locked up one of our production servers twice in the past two days.
Here's the entire scenario:
It appears that when the Log Recovery process runs, it times out, but then leaves one or two "orphaned" SPIDs that cannot be killed.
So what we think happened is this:
We responded to Help Desk ticket #62475 about AT&T Contract Renewal not working.
After running several traces, we found that the SPs were failing because someone in the Philippines had deleted codes from a table.
We tried to use the Log Rescue tool to recover these, but the tool kept failing (which undoubtedly left the SPIDs running and chewing up CPU and resources).
Just as happened today, we started getting SQL errors in the Application log. At the time, we didn't know why this was occurring. We had a false message from a controller, but that now looks to be unimportant.
Don tried to just restart SQL (which didn't work), so the server was rebooted.
Everything came back up, so it seemed as if it was related to the ongoing firmware issues on the HP servers.
I then attempted to use the Redgate utility locally to find out who removed the records on ATT_CR.
The exact same thing happened there as when Sandy and I tried it from DB10-000. However, this time I discovered the renegade threads and tried to kill/roll back the SPIDs. This seemed to be making progress when I left last night. When I checked later, it showed 100% rollback; however, the SPIDs never did die. I found them in the same state this morning.
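For anyone following this thread, here is a sketch of the kind of commands we used to find and try to kill the orphaned sessions (the SPID number is an example, not the actual one from our server):

```sql
-- Look for sessions stuck in a killed/rollback state
SELECT spid, status, cmd, cpu, waittime
FROM sys.sysprocesses
WHERE cmd = 'KILLED/ROLLBACK' OR status = 'rollback';

-- Attempt to kill a renegade SPID (87 is an example number)
KILL 87;

-- Report rollback progress; this is what showed 100% complete
-- even though the SPID never went away
KILL 87 WITH STATUSONLY;
```

Even after STATUSONLY reported the rollback as 100% complete, the sessions remained visible in sysprocesses.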
This morning (ticket #62493):
Found that the renegade SPIDs were still active.
Monitored them; they didn't seem to be growing much, and the systems were OK.
Planned with Don to reboot the server Wednesday night and to add the –g switch to help with the memory leak.
Lockup occurred at around 11:30 a.m.
We rebooted the server and added the –g parameter to help stabilize the memory leak.
We also removed .NET 1.1/2.0 and Log Rescue.
Restarted the server, and it seemed OK at that point.
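For reference, the –g switch is a SQL Server startup parameter that reserves additional "memory-to-leave" address space outside the buffer pool for external allocations. A sketch of how it is applied (the 512 MB value here is illustrative, not the value we chose):

```
REM Illustration only: starting SQL Server with the -g switch.
REM In practice we added -g to the startup parameters via the
REM service configuration rather than the command line.
sqlservr.exe -g512
```

This was intended to give the leaking allocations more headroom, not to fix the leak itself.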
So, the Log Rescue failure is aggravating/exploiting the SQL Server memory leaks.
We can no longer trust the Log Rescue product at this point and have removed it from our systems. I have contacted our executive staff and let them know that the product is unacceptable as it stands, and that we will be looking at other vendors unless Redgate fixes this issue this month.
Robert, I do not take this action lightly. I have seen with my own eyes the Redgate product behaving exactly as I described. We put a lot of faith in your products when we first started looking at Redgate, but this product is very, very disappointing!!
This posting will be forwarded today to my bosses, to our Redgate representative, and to the Redgate executive staff.
Eric (Rick) Sheeley, Sr. SQL/Oracle DBA
Sacramento, CA Cell: 602.540.6750
\"Those are my principles, and if you don't like them... well, I have others.\" - Groucho Marx