Resolved
System backlogs have been caught up and the system is stable with increased capacity. The root cause has been determined, and several performance and reliability improvements will be rolled out this week as follow-ups.
Monitoring
All affected users should be in a stable state while the system catches up on executions that were delayed during the migration off the affected queue shard. As things stabilize, we'll migrate these users to a new dedicated queue shard with additional capacity.
Identified
All users were migrated off of the ss2 shard an hour ago to ensure all functions are processed and to mitigate further issues. All users should now be on stable queue shards.
Identified
Our latest attempt to add more replicas to our queue did not succeed, so queue workers have been brought back online. We are now working to shift all accounts off of the ss2 queue temporarily to stabilize it before re-distributing accounts.
Identified
We are actively working to stabilize the ss2 queue. During this process, function execution on this shard may be reduced for 10-20 minutes.
Monitoring
We have scaled up workers for our ss2 queue shard to work through the backlog while we also put system hardening measures in place.
Monitoring
We've fixed the ss2 queue shard, and function execution should begin to resume around 17:58 UTC.
Identified
We have identified the cause of the issue. We're actively working on implementing a fix to restore the ss2 queue shard.
Investigating
We are actively investigating an issue with one of our queue shards (ss2) that handles function execution for a subset of our customers. We will provide further updates as we identify the cause and resolve the issue.