Resolved
After 18 hours of monitoring, we've confirmed that the fix deployed yesterday resolved the incident.
Monitoring
A fix has been deployed, and we're monitoring to ensure the system is fully operational.
Identified
The team has deployed several changes to improve system performance. We've measured a clear positive impact, but we're continuing to investigate and profile the system to surface and address additional bottlenecks.
Identified
One of our state databases, which stores system state such as which functions should be cancelled, along with other data, is seeing increased load. This is causing broad performance issues in function execution.
The team is preparing a change that optimizes queries to the state database in order to relieve pressure on it and improve performance. We aim to release this change within 30 minutes.
Identified
We have identified the cause of degraded function execution performance and are actively working on a fix to restore normal operation.
Based on our review of system data, performance issues began around 09:00 UTC and have affected a growing number of users since then.