The incident is now resolved and the system is fully operational. The root cause was an issue in the component of the system that powers the debounce feature. The internal event backlog is fully caught up, and the two mitigations we deployed have addressed the issue. The team is preparing a post-mortem to ensure this issue does not recur.
After an extended monitoring period, we are resolving this incident. The system is fully operational.
After an extended monitoring period, function execution has returned to normal rates across all queue shards. During this issue, only a subset of users on part of our infrastructure were affected. Our infrastructure team is in the process of rolling out additional system capacity.
Runs, traces, and events data have all caught up from their temporary backlog. Dashboard metrics are being processed with no backlog.
The fix has been running stably for 30 minutes and all system metrics are healthy. Run trace and metrics ingestion should be near real-time again.
The incident is now resolved and the system is fully operational.
During this incident, the system experienced a backlog in scheduling new runs, and event data ingestion for the Inngest dashboard was delayed.
The root cause was an issue with our internal event stream brokers, which power internal event ingestion, run trace data ingestion, and metrics ingestion. This also caused delays for other data in the dashboard.