Monday, August 24, 2009

Surviving the third prolonged Software crash

I am going to remember this day for my entire life. On a festival day like this, I worked from 9:30 in the morning to 5:00 in the evening. This is the third time the vendor just could not fix the problem and I had to take the initiative, as nobody was willing to make decisions. One person helped me a lot (I have never met him, though; he is in London). The first crash was the worst; the second one did not last even an hour.
This was the worst downtime I have ever seen: more than 40 hours! I would have felt ashamed of it, but it was not my fault. One of my colleagues asked what I was doing so late, and I answered, "Bureaucracy kills efficiency."

I still remember the first downtime.
It was a huge box (Sun v880), and it crashed during a restore. Not just once: it crashed three times, with no errors!
I called Veritas and was on the phone with them for 3 hours, 1 hour of which went into describing the issue. That was the first time I had spoken to a white guy for such a long time.

After almost three months, I was able to find a kernel parameter that triggers a flush of cache to disk: /etc/system -> priority_paging=1
This fixed the issue permanently. While resolving it, I came to understand the importance of caching, I/O tuning, and a lot of low-level OS configuration.

The second time it was an Oracle database, and once again the vendor did not help. An Oracle employee helped us out of the situation. Thanks to him, we were saved in the nick of time.

That time it was the best-known Oracle parameter, one that creates a mess and was eventually removed in Oracle 10g. I am sure Oracle received a lot of queries because of that stupid parameter, which is what made them drop it in the next version.

This time (the third) it is VMware. They tried repairing the software, which did not work. We had to reinstall everything and redo all the configuration manually. The entire process took 12 hours of work from many people.

We wasted half the time on coordination and chasing directors for approvals. By now everybody in the entire organization must know my name.

I certainly need a break after this three-day system recovery operation. The funny thing is that I still feel it might crash at any moment and my phone might ring again.

Sometimes senior managers start shouting in such situations, but I have learnt to deal with people who make unnecessary noise and start a blame game instead of helping. I think the best approach is to answer all their questions then and there; that cools them down a bit.

Well, time will tell whether the system is stable or not. I am going to wait a week before I make any comment.
