Resolved -
This incident has been resolved.
Dec 15, 16:29 NZDT
Monitoring -
WEKA support have managed to restore service and the original storage cluster version upgrade is now continuing in the background. Access to Mahuika/HPC3 services is now restored and most running jobs appear to have survived the storage outage. Some jobs may have completed in a failed state in Slurm, so users should review outputs before rerunning failed jobs.
We expect continued intermittent performance issues while the upgrade completes and are monitoring this closely.
Apologies for any disruption this caused to your work this afternoon!
Dec 9, 16:58 NZDT
Update -
The storage system is slowly being recovered. Some filesystem access may be available, but performance will be very degraded. We hope to have this fully resolved in about an hour, but it could take longer.
Dec 9, 16:34 NZDT
Update -
We are continuing to work with our storage vendors to recover the filesystems, but we do not yet have an ETA.
Dec 9, 14:51 NZDT
Identified -
Vendor support is actively engaged and working on this now.
We expect all IO to be hanging at the moment, so access to the systems and OnDemand will also be impacted. Currently running jobs will likely block when attempting IO and may continue once service is restored.
Dec 9, 13:42 NZDT
Investigating -
We have encountered an issue during an upgrade of the shared filesystem. We are working on a fix and will post updates here, including an ETA, as soon as possible.
Dec 9, 13:35 NZDT