The incident is now resolved and the system is full operational.
The Event API partial downtime occurred between 06:26 UTC and 07:58 UTC with degraded performance continuing until 09:20 UTC (see incident component timeline).
During this time window, events may have failed to have been sent successfully.
Resolved
The incident is now resolved and the system is full operational.
The Event API partial downtime occurred between 06:26 UTC and 07:58 UTC with degraded performance continuing until 09:20 UTC (see incident component timeline).
During this time window, events may have failed to have been sent successfully.
Monitoring
We’re still addressing complications with some brokers that remain in a broken state and are actively working to restore them.
We’re also observing increased execution lag, partially caused by the current system pressure. We expect latency to start decreasing once the issue is resolved and will continue to closely monitor the situation.
Monitoring
We have spun up two additional Kafka brokers and are working to rebalance the cluster to reduce pressure on the existing brokers.
We’re closely monitoring the system to ensure it remains fully operational.
Identified
We’ve identified the root cause of the issue: two of our Kafka brokers are experiencing disk pressure, which is preventing our Event API from publishing to Kafka. We are actively working to restore the affected brokers as quickly as possible.
Investigating
We are actively investigating an issue with increased error rates with our Event API and a portion of our data pipeline that handles the function run status deliveries.
Investigation is showing errors started appearing ~06:26 UTC and started increasing from there.
We will provide further updates as we identify the cause and resolve the issue.