Increased Latency for customers on certain deployments from 9:00-9:40 AM EST
Incident Report for Duo
Postmortem

Summary

On January 31, 2024, at around 09:00 EST, Duo's Engineering Team was alerted by monitoring that authentication latency had increased for certain deployments. The root cause of the authentication latency was increased logging demand coupled with insufficient processing capacity to keep up with that demand.

Our Engineering Teams resolved the issue within 40 minutes.

Deployments Impacted

  • DUO9, DUO22, DUO39, DUO40, DUO49, DUO50, DUO55, DUO56, DUO58, DUO62, DUO63, DUO64, DUO65, DUO72, DUO73

Timeline of Events (EST)

2024-01-31 09:02 - Duo Site Reliability Engineering (SRE) receives an alert from monitoring systems about increased authentication latency.

2024-01-31 09:05 - SRE correlates the authentication latencies with latencies found in the logging infrastructure.

2024-01-31 09:08 - SRE alerts the Subject Matter Experts (SMEs) of our logging infrastructure.

2024-01-31 09:24 - SRE confirms the secondary log queueing mechanism is working as expected, indicating no data loss.

2024-01-31 09:35 - SMEs determine that insufficient capacity is the root cause of the issue and begin adding infrastructure and capacity to handle the increased logging demand.

2024-01-31 09:42 - Authentication latency metric begins to improve. 

2024-01-31 09:52 - Engineering team determines that no authentications failed as a result of the latency.

2024-01-31 10:02 - SRE Engineers monitor the authentication latency and work on improvement proposals.

2024-01-31 10:32 - SRE Engineers continue to monitor the authentication latency.

2024-01-31 11:00 - Incident is resolved as authentication latency has been stable for over an hour.

2024-01-31 11:30 - Additional infrastructure is added to handle the increased logging demand.

2024-01-31 11:35 - Incident is marked resolved on the Duo status page.

Details

On January 31, 2024, an alert about increased authentication latency was triggered and acknowledged by the Site Reliability Engineering (SRE) team. SRE engineers correlated the authentication latencies with latencies in the logging infrastructure. For security, Duo will not approve an authentication request until a corresponding log record is written to permanent storage. During this incident, the increased load slowed down the writing of these log records, delaying authentication approvals.
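
To illustrate this dependency, here is a minimal, hypothetical sketch (not Duo's actual code) of an authentication handler that waits for a durable log write before approving a request. The `durable_log_store` interface and field names are assumptions for illustration only.

```python
import time

class LogWriteTimeout(Exception):
    """Raised when the durable log write does not complete in time."""

def approve_authentication(request, durable_log_store, timeout_s=5.0):
    """Approve an authentication only after its log record is durably stored.

    `durable_log_store` is an assumed interface whose `append()` returns
    True once the record has been written to permanent storage.
    """
    record = {
        "user": request["user"],
        "factor": request["factor"],
        "timestamp": time.time(),
    }
    deadline = time.time() + timeout_s
    # If the logging tier is slow, the time spent waiting here is what the
    # user experiences as authentication latency.
    while not durable_log_store.append(record):
        if time.time() > deadline:
            raise LogWriteTimeout("log record was not durably written in time")
        time.sleep(0.05)
    return {"status": "approved", "user": request["user"]}
```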

The Subject Matter Experts (SMEs) for the logging infrastructure were brought in to assist and attributed the behavior to insufficient capacity provisioning. Duo Engineering had recently started collecting new logs on the logging infrastructure. Unable to handle the increased load, some instances of the logging infrastructure were marked as unhealthy, which shifted their load onto the remaining instances and caused a snowball effect. These issues also affected passwordless authentications.
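
As a rough illustration of the snowball effect (the figures below are hypothetical, not measurements from the incident): when overloaded instances are marked unhealthy and removed from rotation, the same total write volume is spread over fewer instances, pushing the survivors even further past capacity.

```python
def per_instance_load(total_writes_per_s, healthy_instances):
    """Write volume each remaining healthy instance must absorb."""
    return total_writes_per_s / healthy_instances

# Hypothetical figures for illustration only.
total = 110_000         # log writes per second across the deployment
capacity = 10_000       # writes per second one instance can sustain
instances = 10

while instances > 0 and per_instance_load(total, instances) > capacity:
    # Each overloaded instance that fails its health check drops out of
    # rotation, raising the load on every instance that remains.
    print(f"{instances} instances -> "
          f"{per_instance_load(total, instances):,.0f} writes/s each (over capacity)")
    instances -= 1
```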

Any logs not accepted by the logging infrastructure were sent to safe backup storage, ensuring no data loss. The team determined that insufficient capacity, driven by the addition of the new logs, was the root cause of the issue and began adding infrastructure and capacity to handle the increased logging demand.
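
A minimal sketch of the kind of fallback described above, assuming hypothetical `primary` and `backup` store interfaces; it is illustrative only and not Duo's implementation.

```python
def write_log(record, primary, backup):
    """Try the primary logging infrastructure first; on failure, queue the
    record in backup storage so it is not lost and can be replayed later."""
    try:
        primary.append(record)
        return "primary"
    except Exception:
        # The record is preserved even while the primary tier is overloaded.
        backup.enqueue(record)
        return "backup"
```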

Authentication latency improved as the system load decreased, and there was no significant increase in failed authentications during the incident. By 11:00 AM EST, the incident was considered over, as authentication latency had been stable for an hour. The additional infrastructure was in place by 11:30 AM, and the incident was marked resolved on the status page at 11:35 AM.

The incident was resolved by adding additional infrastructure to accommodate increased load during US peak hours.

In total, 7,085 customers across the impacted deployments were affected.

What is Duo doing to prevent this in the future?

Duo Engineering has scaled up the piece of logging infrastructure that failed during this incident across all of our deployments to ensure there is more than enough capacity. In addition, in the coming quarters we will implement automated scaling for this infrastructure so it can respond appropriately to peaks and drops in traffic. Our authentication teams are also looking into ways to reduce the authentication latency a request experiences in the event of a similar incident in the future.
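
As an illustration of what such automated scaling could look like (a sketch under assumed thresholds and limits, not a description of Duo's actual scaling policy), a reactive rule might adjust logging capacity based on write latency and queue depth:

```python
def desired_instance_count(current, write_latency_ms, queue_depth,
                           latency_target_ms=50, depth_target=1_000,
                           min_instances=4, max_instances=64):
    """Rough reactive scaling rule: scale out when the logging tier is
    behind its targets, scale in slowly when it has sustained headroom.
    All thresholds here are hypothetical."""
    if write_latency_ms > latency_target_ms or queue_depth > depth_target:
        desired = current + max(1, current // 4)   # scale out by ~25%
    elif write_latency_ms < latency_target_ms / 2 and queue_depth < depth_target / 2:
        desired = current - 1                       # scale in one at a time
    else:
        desired = current
    return max(min_instances, min(max_instances, desired))
```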

Posted Feb 09, 2024 - 09:36 EST

Resolved
We experienced an issue causing increased latency for our customers on certain deployments in the US-West-2 region. This could have caused delayed or timed-out authentications and failover to password authentication instead of Passwordless.

The issue was resolved after 40 minutes, and everything has returned to normal.
Posted Jan 31, 2024 - 09:00 EST