I’m working on a home lab project to create a highly available service across 3 IPs… The service A record will be assigned all of these IPs, so clients are answered in round-robin fashion.
Healthy server assumptions:
will return an SSL certificate matching the domain
will respond with HTTP Status codes in the 200-300 range
will contain valid sanity check content, either:
A keyword/string in the contents of the page that should appear if content is rendering correctly, e.g. "Welcome to example.com"
OR, in the case of an app, a healthy status from an endpoint, such as Spring Actuator returning {"status": "UP"} from /actuator/health once the app is fully started (a rough check along these lines is sketched right after this list)
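Something like this is what I have in mind for the probe itself — just a sketch with placeholder values (the keyword, paths and temp-file handling are examples, not the final script):
#!/bin/sh
# hypothetical probe sketch - pass the instance hostname, e.g. www2.example.com
HOST="$1"
TMP=$(mktemp)
# curl validates the TLS certificate against $HOST by default; -w captures the status code
STATUS=$(curl -sS --max-time 10 -o "$TMP" -w '%{http_code}' "https://$HOST/") || { rm -f "$TMP"; exit 1; }
# healthy = HTTP status in the 200-300 range
{ [ "$STATUS" -ge 200 ] && [ "$STATUS" -le 300 ]; } || { rm -f "$TMP"; exit 1; }
# sanity-check the content: a keyword that should be on the page...
grep -q "Welcome to example.com" "$TMP" || { rm -f "$TMP"; exit 1; }
rm -f "$TMP"
# ...or, for an app, check the actuator endpoint instead:
# curl -sS --max-time 10 "https://$HOST/actuator/health" | grep -q '"status" *: *"UP"'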
I’m hoping to achieve the following:
Have each server create an instance A record for itself (www1.example.com) - others will use this for health-checking
On start, on IP update, or at a scheduled interval, the server will perform a health check on all servers that have A records associated with the service name (www.example.com); if any of the server records are missing, or if the servers whose IPs are listed don’t respond healthy (as defined above), it will update www.example.com’s A record with the remaining healthy set of server IPs (roughly as sketched below).
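One run of that reconcile step could look like the following — again only a sketch: health-probe.sh stands in for the check outlined above, and the actual update.dedyndns.io call is deliberately omitted:
# reconcile sketch (not the real script): rebuild the healthy IP set for the service record
SERVICE="www.example.com"
INSTANCES="www1.example.com www2.example.com www3.example.com"
HEALTHY=""
for i in $INSTANCES; do
    # probe each instance record (health-probe.sh is the hypothetical check sketched earlier)
    if ./health-probe.sh "$i"; then
        HEALTHY="$HEALTHY $(dig +short A "$i")"
    fi
done
CURRENT=$(dig +short A "$SERVICE" | sort)
WANTED=$(echo $HEALTHY | tr ' ' '\n' | sort)
if [ -n "$WANTED" ] && [ "$CURRENT" != "$WANTED" ]; then
    # push the new healthy set for www.example.com via the dynDNS update API
    # (the real update.dedyndns.io call would go here)
    echo "would update $SERVICE -> $WANTED"
fi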
For this to work we need A records with relatively short TTLs, so that if a server’s IP changes or a server goes offline we can replace the service A record quickly and avoid directing a portion of our traffic at the old/dead server’s IP.
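For illustration (made-up IPs, with a 60-second TTL picked purely as an example), the service record’s answer would then look something like this:
$ dig +noall +answer A www.example.com
www.example.com.   60   IN   A   203.0.113.10
www.example.com.   60   IN   A   203.0.113.20
www.example.com.   60   IN   A   203.0.113.30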
Since the 3 servers will not know what the others are doing with respect to health checks, we’ll need to space out the health checks and update operations so that no server runs more often than every $NUMBER_OF_SERVERS * $NUMBER_OF_API_UPDATES minutes (with 3 servers and 2 API updates each, that’s every 6 minutes)… This should allow enough time for each health check to run and make its 2 updates spaced out: first to the server’s own instance record (www1) and then to the main service record (www.).
We should be able to achieve this with a set of cron jobs, one on each server:
# Cronjob for server #1 - checks @ :00, :06, :12, ...
*/6 * * * * ~/health-check-and-update.sh www1
# Cronjob for server #2 - checks @ :02, :08, :14, ...
2-59/6 * * * * ~/health-check-and-update.sh www2
# Cronjob for server #3 - checks @ :04, :10, :16, ...
4-59/6 * * * * ~/health-check-and-update.sh www3
Hopefully with this setup, any traffic being routed to a dead node sees under ~2 minutes of outage.
Are the health checks and dynDNS updates done from the same webservers that answer the web requests? I think giving your webservers update credentials for the DNS is a security issue.
Why is there a throttling issue with health checks? I’d assume that as long as the health checks are OK, no update is needed; only when an update is actually required might the throttling come into play.
Maybe I’m not understanding the inner workings correctly. Can you publish a (redacted) version of health-check-and-update.sh?
The way I envision it, the health checks run from some machine at a site against the www servers at the other sites. The checking machine is in the same location as its own www server, but not necessarily the same machine.
In my case I plan on having a Raspberry Pi 4 (which is running Home Assistant) in the same location as www1 (running on a Docker host) do the health checks against www2 and www3.
The health checks need to be spaced out because if one fails, the checking server will need to call the update.dedyndns.io API twice, and calls to that endpoint are only allowed once per minute per the API’s request-throttling documentation.
So, if the IP at site1 changes, site1 will push an update to www1 and then need to wait 1 minute before pushing a second update to www with the new IP for www1 and the existing healthy IPs for www2 and www3.
If all sites were allowed to run health checks every minute, site2 (where www2 is hosted) might update www and drop www1 (which only fails its health check because it hasn’t published its new IP yet) at the same moment site1 is trying to publish the new IP for www1… Whichever operation hits the API first will succeed, and the others will fail with a message saying they must wait 60 seconds before invoking the API again.
Essentially, it would be impossible to predict what will happen… spacing them out gives them each their own time to execute without hitting the API rate limit.
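To make that concrete, the failover path at site1 when its IP changes would look roughly like this; update_dns is just a placeholder for the real update.dedyndns.io call, and the IP-detection line is only one of several ways to do it:
# sketch of site1's two updates, respecting the one-call-per-minute limit
NEW_IP=$(curl -fsS https://ifconfig.me)   # or however the new public IP is discovered
update_dns www1.example.com "$NEW_IP"     # 1st call: repoint this server's own instance record
sleep 60                                  # the API only allows one call per minute
# 2nd call: republish the service record with the new www1 IP plus the still-healthy peers
update_dns www.example.com "$NEW_IP $(dig +short A www2.example.com) $(dig +short A www3.example.com)"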
Still writing it, but sure I can do that.
Peter has also suggested I look at the HTTPS DNS record type to see if that could help… it may not apply because one use case would be connecting together a CockroachDB cluster, but we’ll see I suppose.