I have a small Kubernetes cluster in the loft. For about two years it advertised its load-balancer addresses with the usual layer 2 trick: MetalLB in ARP mode, one node answering for each VIP, failover by gratuitous ARP when a node falls over. It works. It mostly works. The "mostly" is the bit that wore me down.
The failure mode is that ARP failover is only as fast as stale ARP cache entries expire, and switches and clients disagree about how long that should take. A node reboots, the VIP moves, and for some unpredictable window a chunk of my LAN is still pointing at the dead node. Shrugging off the odd dropped connection is fine for a hobby. It is less fine when the thing behind the VIP is the DNS resolver the whole house depends on, and someone is mid-call.
So I did the thing I'd been circling for a year. I turned on BGP.
Why BGP at home is not as daft as it sounds
The pitch is simple. Instead of one node pretending to own a VIP via ARP, every node that can serve a VIP tells the router "I have a route to this /32, via me". The router gets the same /32 from several next-hops and does ECMP across them. When a node dies, the BGP session drops, the route is withdrawn, and traffic stops going there in seconds rather than whenever ARP feels like it. No gratuitous anything. The control plane does the failover for you, and it does it the way the entire internet does it.
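To make the ECMP part concrete, here is a toy model in Python. It is a sketch of the idea only, not what the router actually runs: real routers hash flows in the kernel or in hardware, and the VIP address 10.0.50.53 and the client addresses are made up for illustration. The three next-hops are the node addresses from the router config in this post.

```python
import hashlib


def pick_next_hop(flow, next_hops):
    """Pick a next-hop for a flow by hashing the flow tuple.

    Toy model: one hash, modulo the number of live next-hops.
    """
    if not next_hops:
        raise LookupError("no route: every next-hop has been withdrawn")
    hops = sorted(next_hops)
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return hops[int.from_bytes(digest[:4], "big") % len(hops)]


# The same /32 learned from three nodes.
routes = {"10.0.10.11", "10.0.10.12", "10.0.10.13"}

# Hypothetical flows: (src ip, src port, dst ip, dst port, protocol).
flows = [("10.0.20.%d" % i, 40000 + i, "10.0.50.53", 53, "udp") for i in range(1, 7)]

before = {flow: pick_next_hop(flow, routes) for flow in flows}

# A node dies: its session drops and its route is withdrawn.
routes.discard("10.0.10.12")
after = {flow: pick_next_hop(flow, routes) for flow in flows}

for flow in flows:
    print(flow[0], before[flow], "->", after[flow])
```

After the withdrawal no flow can land on 10.0.10.12, because it is simply no longer a candidate. With plain modulo hashing like this, some flows that were on a healthy node may move too, which is harmless for DNS but worth knowing about for long-lived connections.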
You need two halves. A router that speaks BGP, and something on the cluster side to peer with it. My router runs OpenWrt, and OpenWrt has FRR in the package feeds, so both halves are FRR in the end. MetalLB has a BGP mode that does the cluster side natively, which is the part that makes this tractable rather than a weekend of hand-rolled config.
The actual config
Pick a private ASN. The 16-bit private range is 64512–65534, which is plenty. I gave the router 64512 and the cluster 64513. Keep it boring.
On the router, the FRR side is small:
router bgp 64512
 bgp router-id 10.0.0.1
 no bgp ebgp-requires-policy
 neighbor cluster peer-group
 neighbor cluster remote-as 64513
 neighbor 10.0.10.11 peer-group cluster
 neighbor 10.0.10.12 peer-group cluster
 neighbor 10.0.10.13 peer-group cluster
 address-family ipv4 unicast
  neighbor cluster activate
  maximum-paths 4
 exit-address-family
The two lines that matter, and that you will forget, are no bgp ebgp-requires-policy and maximum-paths 4. The first is there because modern FRR refuses to accept or advertise any routes across an eBGP session without an explicit policy, and silently gives you nothing. The second is there because without it the router installs exactly one of the next-hops and you have built a more complicated single point of failure.
The MetalLB side is a config object rather than a daemon config, which I appreciate:
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: router
  namespace: metallb-system
spec:
  myASN: 64513
  peerASN: 64512
  peerAddress: 10.0.0.1
Plus a BGPAdvertisement and an IPAddressPool for the VIP range, which I will not paste in full because they are exactly what the docs say and nothing interesting happens there.
The part the tutorials skip
This came up immediately, and nobody warns you about it. By default the BGP session uses the node's primary IP, and the route's next-hop is that same IP. So far so good. But if your nodes have multiple interfaces, or you run the peering over a different subnet than you expected, the next-hop the router learns can be an address it cannot actually reach. The session establishes, the routes appear, and the traffic blackholes. The session being "up" tells you the TCP connection works, not that the data path does.
The fix was to pin which source address MetalLB peers from (the sourceAddress field on the BGPPeer), and to sanity-check with vtysh -c 'show ip route' on the router that the next-hops were addresses on the management subnet, not some pod or service network that only exists inside the cluster. Once the next-hops were addresses the router had a real route to, it just worked.
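The sanity check itself is simple enough to write down. A minimal sketch in Python, assuming you have already copied the next-hops out of the router's routing table by hand: the function name is hypothetical, the subnets are my guess at what "directly connected" would look like on a network like this one, and 10.244.1.7 stands in for a pod-network address.

```python
import ipaddress


def unreachable_next_hops(next_hops, connected_subnets):
    """Return the next-hops that fall outside every connected subnet."""
    nets = [ipaddress.ip_network(s) for s in connected_subnets]
    return [
        hop
        for hop in next_hops
        if not any(ipaddress.ip_address(hop) in net for net in nets)
    ]


# Assumed: the subnets the router has an interface on.
connected = ["10.0.0.0/24", "10.0.10.0/24"]

# Assumed: next-hops read off the router for the VIP route.
learned = ["10.0.10.11", "10.0.10.12", "10.244.1.7"]

print(unreachable_next_hops(learned, connected))
```

Anything this returns is a next-hop the router would have to reach through some other route, which on a flat home network usually means it cannot reach it at all.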
The other thing: do not advertise a VIP that overlaps your actual LAN, and do not let the cluster advertise a default route by accident. One misconfigured redistribute and you have handed routing of your entire house to three Raspberry Pis. Ask me how I know to check twice.
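Both of those rules are mechanical, so they can be checked before anything is advertised. The following is a sketch of the check only, not of how FRR or MetalLB would enforce it; the function name, the LAN subnets and the VIP addresses are all assumptions for illustration.

```python
import ipaddress


def advert_problems(prefix, lan_subnets):
    """List the reasons a prefix should not be advertised (empty if fine)."""
    net = ipaddress.ip_network(prefix)
    problems = []
    if net.prefixlen == 0:
        problems.append("default route")
    for lan in lan_subnets:
        if net.overlaps(ipaddress.ip_network(lan)):
            problems.append("overlaps " + lan)
    return problems


# Assumed LAN subnets; substitute your own.
lan = ["10.0.0.0/24", "10.0.10.0/24"]

for candidate in ["10.0.50.53/32", "10.0.10.200/32", "0.0.0.0/0"]:
    print(candidate, advert_problems(candidate, lan) or "ok")
```

A VIP in its own unused range passes, a VIP carved out of the node subnet is flagged, and a default route is flagged twice over because it overlaps everything.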
Was it worth it
For the DNS VIP, unambiguously yes. Node reboots are now invisible. I drain a node for maintenance, the route withdraws cleanly, nothing notices. The failover that used to be "somewhere between instant and forty seconds depending on which device you ask" is now sub-second and the same every time, because it is a control-plane event rather than a cache expiry.
Is it overkill for a loft? Of course it is. But I now have a tiny, honest BGP setup I can poke at, watch converge, and break on purpose, which has taught me more about how the protocol actually behaves than any amount of reading about it for work. The homelab earns its keep when it lets you make real mistakes at a scale where nobody pages you. This one did exactly that, and the house DNS is better for it.