After an extended monitoring period, function execution has returned to normal rates across all queue shards. This issue affected only a subset of users on part of our infrastructure. Our infrastructure team is rolling out additional system capacity to prevent recurrence.
Runs, traces, and events data have all caught up from their temporary backlog. Dashboard metrics are being processed with no backlog.
The fix has been running stably for 30 minutes, and all system metrics are healthy. Run traces and metrics ingestion should be near real time again.
The incident is now resolved and the system is fully operational.
During this incident, the system experienced a backlog in scheduling new runs, and event data ingestion for the Inngest dashboard was delayed.
The root cause was an issue with our internal event stream brokers, which power internal event ingestion, run trace data ingestion, and metrics ingestion. This also caused delays in other dashboard data.