Resolved
We have monitored the system since applying the fixes, and the incident is now resolved.
Monitoring
All affected customers have been re-distributed back to their original shard (ss2). Execution delays will continue to decrease over the next hour or so. Execution for users originally on queue shard ss2 should now be caught up. Some functions may have executed out of order during this period.
Monitoring
We are actively re-distributing affected users back to the fixed shard. Temporarily moving these users to another shard resulted in degraded performance. In parallel, we are continuing our investigation into the root cause and adding measures to harden the system.
Monitoring
The system is processing work from the queue shard that was originally affected (ss2). Runs that were affected are now executing and catching up. Some of these runs will execute out of order because new runs were shifted to another queue shard (queue6).
Identified
We have shifted all affected users over to another queue shard. Runs that were active when the queue stopped at 18:15 UTC may be temporarily stuck; we are working to recover those stuck runs and resume processing them.
Investigating
We are actively investigating an issue with function execution on one of our queue shards. We will provide further updates as we identify the cause and resolve the issue.