
Random Kubernetes Failures and Node Size

I had a report of Traefik ingress failing randomly during a test job. Since the same setup worked on a local workstation, the belief was that it had to be a problem with Traefik itself. I looked at the logs, asked the team to spin up bigger nodes for the test, and that fixed things.

The reasons the beefier machines come into play are varied, but they ultimately come down to "no free lunch". It all started with these log entries in Traefik:

Error while setting deadline: set tcp 10.244.3.2:44070: use of closed network connection
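If you want to check whether your own Traefik pods are logging the same thing, a quick search works. This is a minimal sketch; it assumes the standard Helm chart label, the same one used by the port-forward command later in this post:

kubectl logs --selector "app.kubernetes.io/name=traefik" --tail=500 | grep "closed network connection"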

The short version: on undersized nodes, CPU and I/O contention can starve Traefik long enough that connection deadlines are missed and sockets get closed underneath it, which surfaces as intermittent errors like the one above. So anytime there are intermittent failures, node size may be the culprit (it is also not a bad idea to check whether there is expensive work scheduled on the master).
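To see what is actually scheduled on a node (the master included) and whether it is under resource pressure, commands along these lines do the job; <node-name> is a placeholder, and kubectl top needs metrics-server installed in the cluster:

kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> --output wide
kubectl top nodes
kubectl describe node <node-name>

The describe output includes the "Allocated resources" section, which makes it easy to spot a node that is already overcommitted.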

Final pro tip (note that you need to change the namespace in two places):

kubectl port-forward --namespace <namespace> $(kubectl get pods --namespace <namespace> --selector "app.kubernetes.io/name=traefik" --output=name) 9000:9000

This opens up a port for you to browse the Traefik dashboard and make sure it is configured correctly and everything is happy. Just browse to http://127.0.0.1:9000/dashboard/#/ and enjoy.
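If you would rather stay on the command line, the same port-forward also exposes Traefik's API (assuming the API is enabled alongside the dashboard, which it normally is), so a quick sanity check looks like:

curl http://127.0.0.1:9000/api/overview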