Coeus HPC cluster switches in degraded state
Incident Report for Portland State University
Resolved
The Coeus HPC cluster leaf switch has been replaced and the affected compute
nodes have been brought back online. The system is now running normally.
Posted Aug 06, 2018 - 15:27 PDT
Update
The replacement OPA leaf switch for the Coeus HPC cluster should arrive soon. The cluster should be fully operational later today.
Posted Aug 06, 2018 - 09:59 PDT
Update
The replacement OPA leaf switch for the Coeus HPC cluster should arrive soon. The cluster should be fully operational later today.
Posted Aug 06, 2018 - 09:14 PDT
Update
The Coeus HPC cluster will continue to run with 30 compute nodes down (long and interactive partitions) while we wait for a replacement switch. Users are still able to run jobs on the medium, phi, and himem partitions.
Posted Aug 03, 2018 - 10:35 PDT
Monitoring
The OPA leaf switch failure will continue to make file services on compute
nodes 97-128 unavailable. We are working with the system vendor and Intel
to get a replacement switch.
Posted Aug 02, 2018 - 19:24 PDT