I switched to deSEC recently. Using ddclient I had trouble updating IPv6, so I implemented my own update mechanism as a small Python script triggered via cron every 5 minutes.
This script erratically reports 502 Bad Gateway responses. There’s no obvious pattern; at times it’s worse, at times it’s OK.
Some numbers: yesterday, I saw 19 of these responses, the day before 9 (for half a day), today in half a day (until noon) I saw 4 of them. Over 2 days, that is a rate of occurrence of >10%.
Is this normal (expected) behaviour? Does anyone else see these 502 responses/rates?
According to https://desec-status.net everything is running fine. But I don’t know what that site measures/tests exactly.
Same here. I’ve been using curl for my DynDNS needs for years and also get a fair share of 502 responses. Sometimes more, sometimes less.
It never bothered me much, though. Since the IP does not change that often and only one update needs to succeed, the hostname usually points to the correct IP when I need it.
Could you all please share the IP address you are connecting to, what your HTTP request looks like (headers sent, etc.), and whether you’re using TLS?
Seeing a 502 every now and then doesn’t sound normal; we shall look into it.
Best regards,
Nils
Hi Nils,
thank you for looking into the issue. My curl command looks like this:
curl --silent --header 'Authorization: Token ...' 'https://update.dedyn.io/update?hostname=example.dedyn.io'
I don’t have debug logs, so I don’t know to which address update.dedyn.io resolves when the error occurs. It’s an IPv4 address for sure, as the system I use this on does not have IPv6 connectivity.
I’ll try reproducing the issue with some logging and provide the IP address and anything else that seems noteworthy.
I have set up a test script that uses curl(1) to update a hostname and log the results including headers. It caught four 502s in the last ≈10 hours, running every 5 minutes.
Note: I deliberately used HTTP/1.1, not HTTP/2, here on the assumption that many DDNS clients might not support HTTP/2 yet. I don’t know yet whether that makes a difference.
The curl command used was:
curl --url "$URL" --header "$AUTH_HEADER" --show-headers --silent -v --http1.1 >> "$LOGFILE" 2>&1
The result looked like this (anonymised):
* Host update.dedyn.io:443 was resolved.
* IPv6: 2a01:4f8:10a:1044:deec:642:ac10:80
* IPv4: 88.99.64.5
* Trying [2a01:4f8:10a:1044:deec:642:ac10:80]:443...
* ALPN: curl offers http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [307 bytes data]
* CAfile: /etc/ssl/cert.pem
* CApath: none
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* TLSv1.3 (IN), TLS handshake, Unknown (8):
{ [47 bytes data]
* TLSv1.3 (IN), TLS handshake, Certificate (11):
{ [2657 bytes data]
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
{ [264 bytes data]
* TLSv1.3 (IN), TLS handshake, Finished (20):
{ [52 bytes data]
* TLSv1.3 (OUT), TLS handshake, Finished (20):
} [52 bytes data]
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384 / [blank] / UNDEF
* ALPN: server accepted http/1.1
* Server certificate:
* subject: CN=dedyn.io
* start date: Mar 26 19:05:13 2025 GMT
* expire date: Jun 24 19:05:12 2025 GMT
* subjectAltName: host "update.dedyn.io" matched cert's "update.dedyn.io"
* issuer: C=US; O=Let's Encrypt; CN=R10
* SSL certificate verify ok.
* Certificate level 0: Public key type ? (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
* Certificate level 1: Public key type ? (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
* Certificate level 2: Public key type ? (4096/128 Bits/secBits), signed using sha256WithRSAEncryption
* Connected to update.dedyn.io (2a01:4f8:10a:1044:deec:642:ac10:80) port 443
* using HTTP/1.x
> GET /?hostname=myhost.example.com&myipv4=192.51.0.50&myipv6=2001:db8::192:51:0:50 HTTP/1.1
> Host: update.dedyn.io
> User-Agent: curl/8.12.0
> Accept: */*
> Authorization: Token …mytoken…
>
* Request completely sent off
< HTTP/1.1 502 Bad Gateway
< Server: nginx
< Date: Sat, 05 Apr 2025 02:50:02 GMT
< Content-Type: text/html
< Content-Length: 150
< Connection: keep-alive
< Strict-Transport-Security: max-age=31536000
<
{ [150 bytes data]
* Connection #0 to host update.dedyn.io left intact
HTTP/1.1 502 Bad Gateway
Server: nginx
Date: Sat, 05 Apr 2025 02:50:02 GMT
Content-Type: text/html
Content-Length: 150
Connection: keep-alive
Strict-Transport-Security: max-age=31536000
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx</center>
</body>
</html>
I can try this with HTTP/2 or with IPv4 as well if needed. Let me know.
HTH
fiwswe
This was redundant, sorry.
And I have now also tested with HTTP/2 and still got the occasional 502.
IPv4 also triggers the occasional 502.
I have published the test script on GitHub.
Thanks @fiwswe for the thorough investigation. @nils let us know if you need additional information.
After running my curl command with additional logging (similar to @fiwswe’s setup) every five minutes for almost 24 hours now, I can report not a single 502 response. This is a bit of a surprise. I expected a few errors in that interval.
All responses came from 88.99.64.5, if that is any help.
I’ll keep it running like this some more and report back if I catch one of those 502 again.
I did another test using IPv4 (the same IP @black noted) and HTTP/2. In ≈7 hours there were four 502 responses.
$ grep -A 2 '< HTTP/2 5' desec-test-ddns.log
< HTTP/2 502
< server: nginx
< date: Sat, 05 Apr 2025 22:50:02 GMT
--
< HTTP/2 502
< server: nginx
< date: Sat, 05 Apr 2025 23:50:01 GMT
--
< HTTP/2 502
< server: nginx
< date: Sun, 06 Apr 2025 00:00:02 GMT
--
< HTTP/2 502
< server: nginx
< date: Sun, 06 Apr 2025 01:20:01 GMT
$
I finally got a 502 response. Full output of curl’s stderr follows. Not sure if that adds any substantial information, though.
curl stderr
* Connected to update.dedyn.io (88.99.64.5) port 443 (#0)
* ALPN: offers h2,http/1.1
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* CAfile: /etc/ssl/certs/ca-certificates.crt
* CApath: /etc/ssl/certs
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
{ [41 bytes data]
* TLSv1.3 (IN), TLS handshake, Certificate (11):
{ [2657 bytes data]
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
{ [264 bytes data]
* TLSv1.3 (IN), TLS handshake, Finished (20):
{ [52 bytes data]
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.3 (OUT), TLS handshake, Finished (20):
} [52 bytes data]
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN: server accepted h2
* Server certificate:
* subject: CN=dedyn.io
* start date: Mar 26 19:05:13 2025 GMT
* expire date: Jun 24 19:05:12 2025 GMT
* subjectAltName: host "update.dedyn.io" matched cert's "update.dedyn.io"
* issuer: C=US; O=Let's Encrypt; CN=R10
* SSL certificate verify ok.
} [5 bytes data]
* using HTTP/2
* h2h3 [:method: GET]
* h2h3 [:path: /update?hostname=example.dedyn.io]
* h2h3 [:scheme: https]
* h2h3 [:authority: update.dedyn.io]
* h2h3 [user-agent: curl/7.88.1]
* h2h3 [accept: */*]
* h2h3 [authorization: Token ...]
* Using Stream ID: 1 (easy handle 0x203db88)
} [5 bytes data]
> GET /update?hostname=example.dedyn.io HTTP/2
> Host: update.dedyn.io
> user-agent: curl/7.88.1
> accept: */*
> authorization: Token ...
>
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [281 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [265 bytes data]
* old SSL session ID is stale, removing
{ [5 bytes data]
< HTTP/2 502
< server: nginx
< date: Sun, 06 Apr 2025 06:20:02 GMT
< content-type: text/html
< content-length: 150
< strict-transport-security: max-age=31536000
<
{ [150 bytes data]
* Connection #0 to host update.dedyn.io left intact
On the assumption that the 502 responses might be caused by a temporary overload of the (Python?) backend, I have modified my test to still run every 5 minutes, but with a delay of 15 seconds.
This has been running for almost 6 hours now without any 502s.
My thinking is that many DDNS clients are triggered by cron(1), and cron(1) runs its jobs at exact minutes (barring slight differences in the local clock and the processing time needed to decide whether to call the API). So the probability of many clients firing at full minutes is very high, and probably even higher at 5 minute intervals. Delaying by 15 seconds should make the script run at a time when fewer other clients are calling the API, so the backend is less likely to be overloaded.
@Nils I don’t know if you do any internal logging that could prove or disprove this theory? Do the Nginx logs show any clustering around full minutes or full 5 minute intervals?
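If the test is triggered from cron, the modified schedule can be expressed roughly like this (the script name and path are placeholders):
# keep the 5-minute schedule, but shift execution 15 seconds past the full minute
*/5 * * * * sleep 15; /path/to/desec-test-ddns.sh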
My own tests did not use cron but a simple shell loop with a 5 minute delay (plus some jitter due to the time curl took). So I essentially already avoided the full minutes. That might explain why only 1 of 439 requests failed during my test. And that one happened to be at 06:19:59.
My production setup, on the other hand, uses cron (*/15) and has a higher perceived error rate.
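Such a loop could look roughly like the sketch below (token elided as before); since each iteration also includes however long curl takes, the schedule drifts away from the full minutes on its own:
while true; do
    curl --silent --header 'Authorization: Token ...' \
        'https://update.dedyn.io/update?hostname=example.dedyn.io' >> update.log
    sleep 300   # 5 minutes, on top of the time the request itself took
done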
Thanks everyone for reaching out and putting together a detailed report!
This error is due to high workload at the server. We currently serve up to 64 requests concurrently; that number is exceeded during ‘rush hour’. Typical times are at the top of the hour, at xx:15hrs, xx:30hrs, xx:45hrs; at the beginning of each minute; etc.
Everyone, please be a good user: schedule your cron jobs at randomly picked moments within the hour and minute, avoid the rush hours, and don’t send no-op requests.
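For instance, with a bash-capable cron, a crontab entry along these lines (the script path is a placeholder) stays off the popular minutes and adds a random sub-minute offset:
SHELL=/bin/bash
# run at minutes 7, 22, 37 and 52, then wait a random 0-239 seconds before updating
7-59/15 * * * * sleep $((RANDOM % 240)); /usr/local/bin/dedyn-update.sh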
We will discuss on GitHub whether we can increase the number of requests we can process in parallel, but this directly translates into memory usage, which directly translates into server cost (Euros).
Perhaps we need to further limit the number of requests a user can make, so that people stop sending us no-op requests that just result in busy work (read: unnecessary cost) at the server.
The best solution would be if someone wrote an update client that deserves the name: one that reduces the number of no-op requests and randomizes query times.
Best,
Nils
Reasonable. However, I don’t think many users will see this advice here. I suggest adding it to the docs. And maybe it would make sense for you to customize the 502 error response to include information on the problem and how users can solve it for themselves and for others.
I wonder what kind of limit would help. Limiting to one update in 5 or 15 minutes would probably just make people schedule their cron jobs to */5 or */15, contributing to the rush hour problem. Something like once per hour might get people to strive for more frequent checks without no-op updates. But it seems like a rather harsh limit.
ddclient does avoid no-ops. When running in daemon mode, it seems to work at a fixed interval, but starting from the point in time when the daemon was started, which should be essentially random.
Systemd timers support randomized delays out of the box. With cron you’d have to get a little creative, but it’s doable.
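For example, a timer along these lines spreads the load automatically (the unit name is made up; the matching dedyn-update.service would run the curl command):
[Unit]
Description=Periodic deSEC dynamic DNS update

[Timer]
# Nominally every 15 minutes, shifted by a random delay of up to 5 minutes per activation
OnCalendar=*:0/15
RandomizedDelaySec=300

[Install]
WantedBy=timers.target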
It’s all there, but some people don’t use it. Why? Some speculation:
Personally, I did not want to bother with the complexity of ddclient and thought a simple curl request would be much easier. And it actually is. For me, that is, but not for you. I was not aware of the implications of my approach. A note on the cost of no-ops in the docs would have made me consider better options.
@ole mentioned issues with ddclient and IPv6 as the motivation for rolling their own Python script. If the docs provided a configuration example for ddclient that better covers the most common use cases, fewer people would look for home-brew alternatives.
Adding a few examples on how to randomize the timers and cron jobs to the docs might help with the rush hour problem.
I realize that documenting how to use basic system software like cron and systemd timers is not really in the scope of your API documentation. But I think it may be worth it. If you want people to behave in a certain way, make it as easy as possible for them to do so.
Other dynamic DNS services simply penalize no-op requests. If the client has lost local state, it is still possible to query the DNS and see whether any changes are necessary, so no-ops should technically never be necessary. Requiring updates to avoid expiration of inactive domains, however, incentivizes useless churn.
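For illustration, such a check might look like the sketch below. It assumes deSEC’s checkipv4.dedyn.io address-reporting service and the ns1.desec.io authoritative nameserver (querying the authoritative server avoids stale cached answers); hostname and token are placeholders.
HOSTNAME=example.dedyn.io
current="$(curl -fsS https://checkipv4.dedyn.io/)"        # my public IPv4 right now
published="$(dig +short "$HOSTNAME" A @ns1.desec.io)"     # what the DNS currently says
if [ -n "$current" ] && [ "$current" != "$published" ]; then
    # only now is an update actually needed
    curl -fsS --header "Authorization: Token $TOKEN" \
        "https://update.dedyn.io/update?hostname=$HOSTNAME&myipv4=$current"
fi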
Just a quick update: The mitigations put in place by the deSEC team yesterday have lowered the frequency of the 502 responses to ≈1/day (down from ≈18/day) in my testing. Tests were performed using cron(1) */5 without any additional delays.
I have some suggestions to further improve this on the client side:
Premise: Dynamic IPs typically change once a day or less. The exception might be after an outage or when the Internet router is restarted.
Depending on your client, either only react to changes (as would likely be the case for built-in clients on Internet routers) or use a client that polls less frequently, updates only when a public IP actually changed and randomises the exact update time.
Detailed suggestions
The details of how to accomplish this vary depending on your operating system, so the next points have to stay somewhat vague.
- Polling every 5 minutes (or less often) is probably sufficient in most cases. If you know that your public IPs will probably change during the night, i.e. at a time when you are probably asleep, then use a longer interval, because you probably don’t care about an outage while asleep.
- If you use cron(1) to trigger the checks, randomise the time. ~/5 * * * * would choose a random 5 minute interval, for example. So it might run at …:07:00, …:12:00, … instead of …:05:00, …:10:00, etc. See your crontab(5) man page for details. Even better, use something like ~/11 * * * * to get a randomised and shifting longer interval.
- Additionally, add a random number of seconds of delay when using cron(1). cron(1) executes its jobs more or less exactly on the full minute, and since most hosts use NTP servers to synchronise their clocks, this translates to high server load at these times. Use sleep(1) with a random integer in the range of, say, 10-50 seconds, generated using jot(1), shuf(1), awk(1), or whatever method is available on your platform. Some examples:
  sleep $(awk -v min=10 -v max=50 'BEGIN{srand(); print int(min+rand()*(max-min+1))}')
  sleep $(jot -r 1 10 50)
  sleep $(shuf -i 10-50 -n 1)
  And if you are using something other than a *NIX shell (Python, PHP, Perl, …) you have even more options.
- Then have your client compare the current public IPs to the ones currently set in the DNS records. Only call the IP Update API if there actually was a change. And verify that your method to determine a public IP actually yielded a valid IP before using the result. The same goes for the DNS query. Don’t trigger an update based on faulty or incomplete data!
- Make sure you check the result of the update request. If the response is not good or the HTTP status was not 200, something went wrong and you need to retry the update. Don’t retry too often! Use the same judicious approach to retries that you used for the original update attempt. Ideally, back off using growing delays between retries (up to a maximum delay); see the sketch after this list.
- It takes time for the results of a successful update request to propagate. There is a delay for the update server to modify the authoritative nameservers, and then the DNS TTL of 60 seconds allows caches to keep serving stale data. So after a successful update, don’t check again for a while. A 2 minute delay would typically be sufficient, but a 10 minute delay would not cause much grief either. The longer the delay, the less traffic and workload you generate.
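To make the retry and propagation points concrete, here is a minimal sketch (not my published test script). HOSTNAME, TOKEN and new_ip are placeholders, the attempt count and backoff cap are arbitrary, and the success check relies on the good body / HTTP 200 convention described above.
update_ip() {
    # -f makes curl exit non-zero on HTTP error statuses such as 502
    curl -fsS --max-time 30 --header "Authorization: Token $TOKEN" \
        "https://update.dedyn.io/update?hostname=$HOSTNAME&myipv4=$1"
}

delay=60
for attempt in 1 2 3 4 5; do
    body="$(update_ip "$new_ip")" || body=""
    case "$body" in
        good*)          # success per the "good" convention mentioned above
            sleep 120   # give the change time to propagate (server processing + 60 s TTL)
            break
            ;;
    esac
    sleep "$delay"
    delay=$((delay * 2))
    [ "$delay" -gt 900 ] && delay=900   # cap the backoff at 15 minutes
done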