[D2lnews] Info regarding May 21 errors/outage



We apologize for the outages some folks experienced yesterday midday.
Below is an explanation from Learn@UW about what happened and what they
are implementing to ensure it doesn't happen again.

AnnMarie

Problem:
The backup (TSM) log file on the file server quorum disk filled up
around 2:45am. The quorum disk is what the servers in the cluster check
to confirm that things are OK. When the log file filled up, the quorum
disk became unavailable; the cluster assumed there was a problem and
shut down, which made files on the file server unavailable.

Impact:
From 6:30am to 1:14pm, there were 128 file-server-related errors
logged in the Learn@UW error table. Customers experienced the following
problems:
* NAV bar and images in Learn@UW courses not loading
* Some users saw "unexpected error" messages in place of the
content; for others, this area was blank.
 
Resolution:
The cluster database file had become corrupted and needed to be
re-created. Once this was done, the cluster resources were able to come
back online and were available around 1:15pm.

Long-term solution:
We will add a new disk to the file server and start writing our logs
there so the cluster will not be affected by disk space issues.