Resolved
The incident is now resolved and the system is fully operational. The issue started at 16:56:16 UTC and ended at 16:57:31 UTC.
The Event API caches event keys for a period, so not every request to send events would have been affected, but some may have been. Our SDKs automatically retry up to 5 times with backoff, but because the outage lasted longer than 60 seconds, many of those retries may have failed as well.
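To illustrate why retries could still fail: a typical exponential-backoff schedule with 5 retries spans roughly 30 seconds, which is shorter than this outage. The sketch below is a generic retry loop, not our SDKs' actual implementation; the function names and delay values are assumptions for illustration.

```python
import time

def send_with_retry(send, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Generic retry loop with exponential backoff (illustrative only,
    not our SDKs' actual code). Retries `send` up to `max_retries` times,
    doubling the wait between attempts."""
    for attempt in range(max_retries + 1):
        try:
            return send()
        except ConnectionError:
            if attempt == max_retries:
                raise  # all retries exhausted
            sleep(base_delay * 2 ** attempt)
```

With a 1-second base delay, the five waits are 1 + 2 + 4 + 8 + 16 = 31 seconds in total, so requests whose first attempt landed early in a 75-second outage would exhaust all retries before connectivity returned.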
Our REST API and dashboards recovered once the connections were re-established.
This incident was caused by a networking issue, and we are working on follow-up mitigations to prevent it from happening again.
Identified
The connection to our Postgres database from one of our Kubernetes clusters is failing due to a networking issue. Services running within this cluster, including our APIs, all temporarily failed to handle requests.