Major incident - Weka Filesystem Not Available

Incident Report for REANNZ Advanced Computing and Data Services

Resolved

This incident has been resolved.

Posted Dec 15, 2025 - 16:29 NZDT

Monitoring

WEKA support have managed to restore service and the original storage cluster version upgrade is now continuing in the background. Access to Mahuika/HPC3 services is now restored and most running jobs appear to have survived the storage outage. Some jobs may have completed in a failed state in Slurm. Users should review job outputs before rerunning failed jobs.
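For users who want a starting point for that review, the following is a minimal sketch, not an official tool. It assumes Slurm's `sacct` command is available on the login node and that accounting data is enabled. The set of states in `REVIEW_STATES`, the function names, and the suggested time window are illustrative assumptions, not part of this report.

```python
import subprocess

# Assumption: these Slurm job states are the ones worth a manual look.
# Adjust the set to suit your own workflows.
REVIEW_STATES = {"FAILED", "NODE_FAIL", "TIMEOUT", "CANCELLED", "OUT_OF_MEMORY"}

# JobName is placed last so that a "|" inside a job name cannot shift other fields.
SACCT_FORMAT = "JobID,State,ExitCode,End,JobName"


def fetch_sacct(start, end):
    """Return raw sacct output for your own jobs active between start and end."""
    cmd = [
        "sacct", "-X", "--parsable2", "--noheader",
        "--starttime", start, "--endtime", end,
        "--format", SACCT_FORMAT,
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def jobs_to_review(sacct_text):
    """Pick out jobs whose final state is in REVIEW_STATES."""
    jobs = []
    for line in sacct_text.splitlines():
        parts = line.split("|", 4)
        if len(parts) != 5:
            continue  # skip blank or unexpected lines
        job_id, state, exit_code, end, name = parts
        # sacct can report e.g. "CANCELLED by 1234"; keep only the first word.
        words = state.split()
        if words and words[0] in REVIEW_STATES:
            jobs.append({
                "job_id": job_id,
                "state": words[0],
                "exit_code": exit_code,
                "end": end,
                "name": name,
            })
    return jobs
```

A possible call would be `jobs_to_review(fetch_sacct("2025-12-09T13:00", "2025-12-09T17:00"))`, where the window is an assumption based on the timestamps of the updates in this report. A job that appears in the result is only a candidate for review: check its output files before deciding whether to rerun it.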

We expect continued intermittent performance issues while the upgrade completes and are monitoring this closely.

Apologies for any disruption this caused to your work this afternoon!

Posted Dec 09, 2025 - 16:58 NZDT

Update

The storage system is slowly being recovered. Some filesystem access may be available, but performance will be severely degraded. We hope to have service fully restored in about an hour, but it could take longer.

Posted Dec 09, 2025 - 16:34 NZDT

Update

We are continuing to work with our storage vendors to recover the filesystems, but we do not yet have an ETA.

Posted Dec 09, 2025 - 14:51 NZDT

Identified

Vendor support is actively engaged and working on this now.

We expect that all I/O is currently hanging, so access to the systems and OnDemand will also be affected. Currently running jobs will likely block when attempting I/O and may continue once service is restored.

Posted Dec 09, 2025 - 13:42 NZDT

Investigating

We have encountered an issue during an upgrade of the shared filesystem. We are working on a fix and will post updates here, including an ETA, as soon as possible.

Posted Dec 09, 2025 - 13:35 NZDT

This incident affected: Data Transfer, Submit new HPC Jobs, Jobs running on HPC, NeSI OnDemand, and HPC Storage.