Resolved
System latency for function execution has returned to normal levels for the affected users. The incident has been resolved.
The incident was caused by increased load, which led to congestion. We applied changes to the system to reduce congestion, which increased throughput. We also redistributed some affected users to mitigate the impact.
Our team had planned to roll out new infrastructure in the coming weeks and is accelerating that plan, aiming to roll it out later this week to increase overall capacity.
Monitoring
The configuration change made earlier has increased throughput and reduced latency for affected users. This change takes up to an hour to roll out. Our internal metrics show p75 and p90 execution latency returning to normal levels, with some anomalies remaining in p95 and p99 execution latency, though these are generally closer to normal. We continue to monitor and are investigating long-term mitigations.
Investigating
We have made a configuration change in the system to unlock additional throughput in an attempt to reduce the bottleneck. System throughput is increasing in some affected parts of the system.
Investigating
We're working to mitigate the slowness by re-allocating workloads across our queue shards. Additionally, we're provisioning more capacity for workloads to alleviate pressure on the system queues.
Investigating
We are actively investigating an issue with one of our queue shards, which is experiencing higher-than-usual delays in function execution. We will provide further updates as we identify the cause and resolve the issue.