Service disruption on June 4, 2020

peter · June 4, 2020, 3:40pm

On June 4, 2020, deSEC DNS operations experienced a partial service disruption. The issue was caused by incomplete replication of zones after their DNSSEC signatures had been renewed, leading to DNSSEC validation failures for certain DNS queries. The incident started at 12am UTC and was resolved around 7am UTC.

Impact

No data was lost or compromised: our services were not under attack, nor did a physical failure occur.

Signatures on domains that had not been updated since May 28, 2020, were not correctly refreshed (see explanation below). As a consequence, DNSSEC-aware clients and resolvers rejected responses carrying those expired signatures. Unfortunately, this also affected queries for the names ns1.desec.io and ns2.desec.org, which have to be resolved before other names can subsequently be queried, effectively blocking DNS resolution. Domains whose contents had been updated more recently were provisioned correctly on the frontend DNS servers.

DNS clients or resolvers that do not perform DNSSEC validation were not affected and were able to query all domains at any time.

Explanation

The root cause of the issue was a change in our replication mechanism that we deployed on May 26, 2020. It introduced a bug that caused DNS frontend servers to receive zone updates only if a DNS zone’s content was changed via the deSEC API (including the dedyn.io update endpoints). Zones without changes were not replicated to the frontend DNS servers.

However, zones without changes still need to be updated periodically on the frontend servers. This is because DNSSEC signatures in PowerDNS (our backend nameserver) are valid until the second Thursday after they are created (12am UTC), at which point they expire. When the bug was introduced on May 26, 2020, the signatures were therefore valid until June 4, 2020, 12am UTC.
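To make the expiry rule above concrete, here is a minimal sketch of the date arithmetic. It is not PowerDNS code; the function name `signature_expiry` is made up for illustration, and it assumes that "the second Thursday" means two weeks after the most recent Thursday at 12am UTC.

```python
from datetime import datetime, timedelta, timezone

THURSDAY = 3  # datetime.weekday(): Monday is 0, so Thursday is 3


def signature_expiry(now: datetime) -> datetime:
    """Return the expiry of a signature created at `now` (timezone-aware).

    Assumption: expiry is two weeks after the most recent Thursday
    at 12am UTC, as described in the paragraph above.
    """
    now = now.astimezone(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    days_since_thursday = (midnight.weekday() - THURSDAY) % 7
    last_thursday = midnight - timedelta(days=days_since_thursday)
    return last_thursday + timedelta(weeks=2)


# Signature created on the day the bug was deployed (Tuesday, May 26, 2020)
deployed = datetime(2020, 5, 26, 12, 0, tzinfo=timezone.utc)
print(signature_expiry(deployed).isoformat())
```

For a signature created on May 26, 2020, this yields June 4, 2020, 12am UTC, which matches the start of the incident.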

Although new signatures continued to be computed weekly, including on Thursday, May 28, 2020, they were no longer replicated to the frontend servers automatically. Replication only happened if zone contents were updated by other means (via our API), in which case replication was triggered correctly (including the new signatures).

The issue was fixed by manually triggering ad-hoc replication for all domains immediately after it was discovered. Although visible in our monitoring systems, the problem went unnoticed for several hours because it happened at night. We had previously started work on an alerting system that can wake up an on-call deSEC administrator, but alas, it is not yet fully in place.

Next steps

We will fix our replication method to correctly propagate signature updates to all frontend DNS servers.

Perhaps more importantly, we are working to put the alerting system into operation with the highest priority, so that any service disruptions – should they happen in the future – will get noticed much more quickly.
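As an illustration only (this is not our actual alerting code), a check of this kind could classify the expiration time of a signature as served by a frontend server. The sketch assumes the expiration is given in the RRSIG presentation format (YYYYMMDDHHmmSS, UTC); the function names and the 3-day margin are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_rrsig_time(value: str) -> datetime:
    """Parse an RRSIG timestamp in YYYYMMDDHHmmSS presentation format (UTC)."""
    return datetime.strptime(value, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def signature_status(
    expiration: str,
    now: datetime,
    margin: timedelta = timedelta(days=3),  # hypothetical alert threshold
) -> str:
    """Classify a signature as 'expired', 'expiring', or 'ok'."""
    expires_at = parse_rrsig_time(expiration)
    if now >= expires_at:
        return "expired"
    if expires_at - now <= margin:
        return "expiring"  # renewal should have been replicated by now
    return "ok"


# Two days before the stale signatures expired on June 4, 2020
now = datetime(2020, 6, 2, 0, 0, tzinfo=timezone.utc)
print(signature_status("20200604000000", now))
```

With weekly renewals and a two-week validity period, a correctly replicated signature should always have roughly a week of validity left, so a remaining lifetime below the margin would point to missed replication well before validation starts failing.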

How can I help?

We apologize, especially for the delay in detecting the incident, and hope at the same time that it may serve as a reminder that deSEC is ultimately a community-driven project.

We rely on active contributions from our users, and while we’d love to establish a reliable 24/7 incident response mechanism, we cannot do so without your sustained support. If you like what we’re doing, please consider a contribution!

Stay secure,
The deSEC team