Monitoring - Kia ora all, here is a belated update on this incident.

From approximately 2200 NZST on Thursday 4 September, tenant network and associated floating IP connectivity to all FlexiHPC VM instances started going offline. A subset of instances also went into the SHUTOFF state. This was the result of a config and automation regression in our OpenStack infrastructure. A config fix was rolled out at approximately 0400 on Friday 5 September, which resolved network connectivity issues for impacted instances.

However, a subset of instances that were SHUTOFF by the original issue were additionally impacted by a serious corner case in the OpenStack deployment tooling that we use. This resulted in duplicate VM instances being launched, which in turn meant that some instance root drives and attached volumes were inadvertently multi-attached, a situation that could lead to instance availability problems and potential data corruption. Our team have since worked tirelessly with impacted instance owners to address the follow-on issues and recover services. If you are still experiencing any issues, please contact support. We apologise for the disruption to service and are making adjustments to our processes to minimise the possibility of similar problems in future.
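For instance owners who want to check their own project, the sketch below lists any volumes attached to more than one server. It assumes the openstacksdk Python client and a clouds.yaml entry we have named "flexihpc" purely for illustration.

    # Minimal sketch: list volumes in your FlexiHPC project that are attached to
    # more than one server. Assumes the openstacksdk Python client and a
    # clouds.yaml entry named "flexihpc" (the cloud name is a placeholder).
    import openstack

    conn = openstack.connect(cloud="flexihpc")

    for volume in conn.block_storage.volumes(details=True):
        attachments = volume.attachments or []
        if len(attachments) > 1:
            servers = [a.get("server_id") for a in attachments]
            print(f"{volume.name or volume.id} is attached to multiple servers: {servers}")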

Sep 11, 2025 - 22:27 NZST
Identified - All the impacted production services that we are currently aware of have been restored. On the order of 100 instances are impacted by duplicate virtual machine processes, though many of these are dev/test machines. These require operator intervention to attempt restoration of service. If you have an active instance that you can no longer log on to, or an instance that is in the SHUTOFF state and will not start up, please open a support ticket.
Sep 05, 2025 - 15:24 NZST
Investigating - We had some serious issues from about 10pm last night that interrupted the lander node, OnDemand, and numerous OpenStack instances. If you have VMs that were shut down, please try restarting them; if this is not successful, please log a support ticket.
We are currently assessing which services may still be down. Slurm is available and jobs are running. OnDemand is available.
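If you prefer to script the restarts rather than using the dashboard, a minimal sketch with the openstacksdk Python client is below (the clouds.yaml entry name "flexihpc" is a placeholder).

    # Minimal sketch: find your instances that are still in the SHUTOFF state and
    # try to start them. Assumes openstacksdk and a clouds.yaml entry named
    # "flexihpc" (placeholder); the same can be done per instance from the dashboard.
    import openstack

    conn = openstack.connect(cloud="flexihpc")

    for server in conn.compute.servers():
        if server.status == "SHUTOFF":
            print(f"Starting {server.name} ({server.id})")
            conn.compute.start_server(server)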

Sep 05, 2025 - 08:46 NZST
Update - The filesystem has been stable today; however, several users have reported a degraded interactive I/O experience. We expect this is caused by ongoing heavy metadata load from the continuing background integrity check. Based on current progress, we unfortunately expect this to continue into next week.

There has been no major impact on jobs, though some workloads paused when trying to write to the filesystem during the incident, and as a result a few jobs have timed out. If you see this and need help resolving it, please contact support.

Aug 28, 2025 - 16:08 NZST
Update - We are continuing to monitor for any further issues.
Aug 28, 2025 - 00:16 NZST
Monitoring - Full filesystem functionality was restored at approximately 11pm NZST. The issue appears to have been triggered by a brief backend network disruption; WEKA support are investigating why the filesystem didn't recover automatically. Ongoing data integrity checks may impact I/O performance for a while longer.

Thankfully there seem to be no widespread job impacts; however, we will check this more thoroughly in the morning and contact any users who may have had work impacted. Apologies again for the disruption (and goodnight)!

Aug 28, 2025 - 00:16 NZST
Investigating - We have identified an ongoing issue with our high performance filesystems. This is impacting scratch/nobackup, project, and home, and is likely impacting any new logins to the HPC and OnDemand services. At present, existing jobs are continuing to run and complete; however, we anticipate there may be job failures as a result of this problem. We are currently awaiting urgent vendor support. Apologies for the inconvenience and disruption; we'll update as soon as we know more.
Aug 27, 2025 - 21:19 NZST
Investigating - We are investigating intermittent errors when using Globus with the new HPC platform. Failures often resolve themselves on retry, but we are working to improve the stability of this service.
Jul 02, 2025 - 11:32 NZST

About This Site

New Zealand eScience Infrastructure High Performance Compute and Storage Service Status

Apply for Access: Operational
Data Transfer: Degraded Performance
Submit new HPC Jobs: Operational
Jobs running on HPC: Operational
NeSI OnDemand: Operational (99.95 % uptime over the past 90 days)
HPC Storage: Degraded Performance
User Support System: Operational
Flexible High Performance Cloud: Operational
Long-term Storage (Freezer): Operational (99.92 % uptime over the past 90 days)
Flexible High Performance Cloud Services: Operational (100.0 % uptime over the past 90 days)
Virtual Compute Service: Operational
Bare Metal Compute Service: Operational
FlexiHPC Dashboard (web interface): Operational (100.0 % uptime over the past 90 days)
FlexiHPC CLI interface: Operational (100.0 % uptime over the past 90 days)
Public API of the FlexiHPC Service: Operational (100.0 % uptime over the past 90 days)

Scheduled Maintenance

WEKA filesystem and compute core changes - Announcement Sep 8, 2025 09:00-17:00 NZST

To get the best possible performance from our new WEKA filesystems, we need to dedicate some cores on each compute node for exclusive use by WEKA. On Milan nodes, Slurm jobs will only be able to use 126 cores per node rather than 128, and on Genoa nodes 166 rather than 168.
This change has already been applied to all of the Milan nodes and the majority of the Genoa nodes. We expect to complete the last of the Genoa nodes on September 8th.
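As a rough illustration of what this means for full-node requests (the node-type labels below are assumptions; the per-node figures come from the numbers above):

    # Rough illustration: check a single-node core request against the usable core
    # counts after the WEKA core reservation. The node-type labels are assumptions;
    # the per-node figures (126 Milan, 166 Genoa) come from the announcement above.
    USABLE_CORES = {"milan": 126, "genoa": 166}

    def fits_on_one_node(node_type: str, cores_requested: int) -> bool:
        """Return True if a single-node job still fits after the WEKA reservation."""
        return cores_requested <= USABLE_CORES[node_type.lower()]

    # A job that previously asked for all 128 Milan cores now needs adjusting:
    print(fits_on_one_node("milan", 128))  # False -> request at most 126
    print(fits_on_one_node("genoa", 166))  # True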

Posted on Sep 04, 2025 - 12:10 NZST

Freezer.nesi.org.nz: renaming of buckets with underscores to dashes Sep 22, 2025 13:30-14:30 NZST

Update - We will be undergoing scheduled maintenance during this time.
Sep 08, 2025 - 16:24 NZST
Scheduled - We will be undergoing scheduled maintenance during this time to update the system.
To align with Amazon S3 standards, we need to make a change to the bucket names for many Freezer allocations.
What will be changing:
Previously we had a naming policy of bucketname_uniqueidentifier, where you could set the name as part of the allocation process. We need to remove the underscore and replace it with a hyphen.
Your Freezer allocation is now bucketname-uniqueidentifier.
You can access Freezer exactly the same as before, but you will need to use the new bucket name(s). If you use the old underscore name, the request will fail.
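As an illustration only, a minimal boto3 sketch using the new hyphenated name is below. The endpoint URL is an assumption based on the service name above, credentials are expected to come from your usual S3 configuration, and the bucket name is a placeholder.

    # Illustration only: listing objects in a renamed Freezer bucket with boto3.
    # The endpoint URL is an assumption based on the service name above, and
    # "bucketname-uniqueidentifier" is a placeholder for your actual bucket.
    import boto3

    s3 = boto3.client("s3", endpoint_url="https://freezer.nesi.org.nz")

    # The old underscore form (bucketname_uniqueidentifier) will no longer work.
    response = s3.list_objects_v2(Bucket="bucketname-uniqueidentifier")
    for obj in response.get("Contents", []):
        print(obj["Key"])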

Sep 08, 2025 - 16:19 NZST
Sep 16, 2025

No incidents reported today.

Sep 15, 2025
Resolved - This incident has been resolved.
Sep 15, 16:53 NZST
Monitoring - The Slurm database is now available and sacct commands are working.

We are analysing the hypervisor to determine the root cause.

Sep 15, 15:30 NZST
Investigating - The hypervisor that hosts the Slurm database instance has died; we are working on rebooting it now.
Slurm jobs will continue to run, but the sacct command will not work.
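For context, squeue is answered by the Slurm controller and keeps working, while sacct queries the accounting database (slurmdbd), which is what lives on the affected hypervisor. A small illustrative check (standard Slurm command options; the error handling is illustrative only):

    # Illustrative check only: squeue is served by the Slurm controller, so it
    # keeps working, while sacct queries the accounting database (slurmdbd) and
    # will error or hang until the database host is back.
    import getpass
    import subprocess

    def run(cmd):
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
            status = "ok" if result.returncode == 0 else result.stderr.strip()
            print(" ".join(cmd), "->", status)
        except subprocess.TimeoutExpired:
            print(" ".join(cmd), "-> timed out")

    run(["squeue", "-u", getpass.getuser()])      # controller query: still works
    run(["sacct", "-X", "--starttime", "today"])  # accounting DB query: fails during the outage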

Sep 15, 15:21 NZST
Sep 14, 2025

No incidents reported.

Sep 13, 2025

No incidents reported.

Sep 12, 2025

No incidents reported.

Sep 11, 2025

Unresolved incident: Overnight problems.

Sep 10, 2025

No incidents reported.

Sep 9, 2025

No incidents reported.

Sep 8, 2025

No incidents reported.

Sep 7, 2025

No incidents reported.

Sep 6, 2025

No incidents reported.

Sep 5, 2025
Completed - The scheduled maintenance has been completed.
Sep 5, 15:00 NZST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Sep 5, 14:25 NZST
Scheduled - We are working on an issue impacting the Slurm database. There may be some disruption, e.g. the sacct command not working, while we back up the DB and attempt to resolve the situation.
Sep 5, 14:25 NZST
Sep 4, 2025
Completed - The scheduled maintenance has been completed.
Sep 4, 20:00 NZST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Sep 4, 19:00 NZST
Scheduled - At this time we are planning maintenance on the main login node for the HPC as part of our work to improve filesystem performance.

During this time the node will be rebooted, and any connections made to the HPC cluster during this period will land on the smaller login nodes.

We will also be moving new connections onto the smaller login nodes from 5:00 pm, to reduce the number of users connected during the main reboot at 7:00 pm.

Running jobs will not be affected and you will still be able to access the HPC during this time.

We apologise for any inconvenience caused during this time.

Sep 4, 11:28 NZST
Completed - The scheduled maintenance has been completed.
Sep 4, 10:15 NZST
Verifying - Verification is currently underway for the maintenance items.
Sep 3, 12:41 NZST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Sep 3, 11:00 NZST
Scheduled - Changes to our Keycloak functionality will prevent all new logins to OnDemand and SSH to Mahuika (new) for 30 minutes to 2 hours from 1100hrs on Wednesday 3 September.
my.nesi will also be inaccessible throughout the maintenance period.

Sep 2, 09:11 NZST
Sep 3, 2025
Sep 2, 2025

No incidents reported.