Ramblings of an aging IT geek
gamedev

The Unreal RPC reliability flag that quietly ate my events

How Unreal's reliable and unreliable RPCs actually behave under load, the saturation trap that drops events, and the rules I now follow to stay out of it.

I spent the better part of a Saturday convinced I'd found a bug in Unreal's networking. Players were occasionally not seeing a gameplay event fire: a door that should have opened for everyone opened for some. It was intermittent, it only happened under load, and it had every hallmark of a genuine engine fault. It was not an engine fault. It was me, reaching for Reliable on every RPC like it was free, and discovering the hard way that it absolutely is not.

If you write multiplayer in Unreal, the reliability flag on your RPCs is the single setting most likely to ruin a day, so it's worth understanding properly rather than cargo-culting.

What the flags actually mean

An RPC in Unreal is a function marked to execute on a different machine than the one that called it. You annotate it with where it runs and how it's delivered:

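// Client asks the server to open a door: a rare event that must arrive.
// WithValidation requires a matching ServerOpenDoor_Validate implementation.
UFUNCTION(Server, Reliable, WithValidation)
void ServerOpenDoor(AActor* Door);

// Server tells every client to play a footstep: frequent and cosmetic.
UFUNCTION(NetMulticast, Unreliable)
void MulticastPlayFootstep(FVector Location);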

Reliable means the engine guarantees ordered, eventual delivery: it'll resend until the other end acknowledges. Unreliable means fire-and-forget: the call is sent once and never resent, so if the packet is lost, it's gone. Both kinds travel over UDP; the reliability is a layer Unreal builds on top. The naive reading is "reliable is safer, use it everywhere". That reading is a footgun, and the door bug was me pulling the trigger.
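To make the difference concrete, here is a toy model in plain C++, not engine code. LossyLink, SendUnreliable and SendReliable are names I made up for the sketch, and the real engine resends over time when acknowledgements go missing rather than in a tight loop, so treat it as an illustration of the cost and nothing more.

```cpp
#include <cassert>

// Toy lossy link: drops every Nth packet it is asked to carry.
// Hypothetical model for illustration; nothing here is an Unreal type.
class LossyLink {
public:
    explicit LossyLink(int dropEveryNth) : dropEveryNth_(dropEveryNth) {
        assert(dropEveryNth >= 2); // 1 would drop everything, so nothing could ever arrive
    }

    // Carries one packet. Returns true if it got through.
    bool Transmit() {
        ++sent_;
        return (sent_ % dropEveryNth_) != 0;
    }

    int PacketsSent() const { return sent_; }

private:
    int dropEveryNth_;
    int sent_ = 0;
};

// Unreliable: one attempt, no retry. Returns whether it arrived.
inline bool SendUnreliable(LossyLink& link) {
    return link.Transmit();
}

// Reliable: keep resending until one copy gets through.
// Returns how many transmissions that cost.
inline int SendReliable(LossyLink& link) {
    int attempts = 1;
    while (!link.Transmit()) {
        ++attempts;
    }
    return attempts;
}
```

On a link that drops every second packet, an unreliable send either arrives or it doesn't, while a reliable send always arrives but sometimes costs two transmissions. That extra work is what the rest of this post is about.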

The reliable buffer is finite

Here's the part the documentation mentions and everyone skims past. Reliable RPCs go into a per-connection buffer, and that buffer has a fixed size. The engine keeps unacknowledged reliable messages around so it can resend them. If you generate reliable RPCs faster than the client can acknowledge them, that buffer fills. And when it overflows, Unreal does not gracefully degrade. Depending on engine version and settings, it either disconnects the client or silently drops messages. Either way, your "guaranteed" event is now gone and your player is having a bad time.

So the guarantee is conditional. Reliable means "delivered, provided you didn't saturate the channel", and under load, saturating the channel is exactly what an over-eager developer does. My door event was reliable, which was correct. But it shared a connection with a pile of other RPCs I'd marked reliable out of laziness, and under load those crowded the buffer and the engine started shedding.
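A rough way to see the trap is to model it. The sketch below is plain C++ rather than Unreal code: ReliableBuffer and FirstOverflowTick are hypothetical names, and the capacity you pass in is a number to experiment with, not a claim about the engine's real limit. It only assumes what the paragraphs above describe: unacknowledged messages pile up in a fixed-size buffer, and overflow is fatal.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of a fixed-size buffer of unacknowledged reliable messages.
class ReliableBuffer {
public:
    explicit ReliableBuffer(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Queue one reliable message. Returns false once the buffer has overflowed;
    // after that the connection is treated as dead and every send fails.
    bool Send() {
        if (overflowed_ || pending_ == capacity_) {
            overflowed_ = true;
            return false;
        }
        ++pending_;
        return true;
    }

    // The peer acknowledged up to `count` of the oldest pending messages.
    void Acknowledge(std::size_t count) {
        pending_ -= (count < pending_) ? count : pending_;
    }

    std::size_t Pending() const { return pending_; }
    bool Overflowed() const { return overflowed_; }

private:
    std::size_t capacity_;
    std::size_t pending_ = 0;
    bool overflowed_ = false;
};

// Simulates `ticks` frames. Each frame tries to send `sendsPerTick` reliable
// messages, then receives `acksPerTick` acknowledgements.
// Returns the tick on which the buffer overflowed, or -1 if it never did.
inline int FirstOverflowTick(std::size_t capacity, std::size_t sendsPerTick,
                             std::size_t acksPerTick, int ticks) {
    ReliableBuffer buffer(capacity);
    for (int tick = 0; tick < ticks; ++tick) {
        for (std::size_t i = 0; i < sendsPerTick; ++i) {
            if (!buffer.Send()) {
                return tick;
            }
        }
        buffer.Acknowledge(acksPerTick);
    }
    return -1;
}
```

With a capacity of 256, ten sends per tick and eight acknowledgements per tick, the backlog grows by two every tick and the model overflows on tick 124. Drop the send rate to two per tick and it never fills. The gap between send rate and acknowledgement rate doesn't need to be large, it just needs to persist.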

The rule I follow now

The discipline is simple to state and takes effort to hold to. Reliable is for state that must arrive and arrives rarely. Unreliable is for state that arrives constantly and where the latest value supersedes the last.

Concretely:

  • A door opening, a round starting, an item being picked up: reliable. These are discrete, they're infrequent, and missing one breaks the game.
  • Position updates, rotation, a footstep sound, cosmetic effects: unreliable. They fire constantly, and if one is lost the next one is along in 33 milliseconds with fresher data anyway. Making these reliable is how you flood the buffer.

The thing that should arrive constantly should almost never be reliable, because constant plus reliable is the recipe for saturation. And the thing that must arrive should be small and infrequent enough that the buffer never notices it.
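If it helps to see the rule as a decision, here is one way to write it down. This is a sketch in plain C++, not anything Unreal provides: MessageProfile, Delivery and ChooseDelivery are hypothetical names, and the default limit of one message per second for "rare" is an arbitrary assumption you would tune for your own game.

```cpp
#include <cassert>

// Possible answers. Rethink means "must arrive, but fires too often to be
// a comfortable reliable RPC", which usually points at a design problem.
enum class Delivery { ReliableRpc, UnreliableRpc, Rethink };

struct MessageProfile {
    bool mustArrive;   // does the game break if one of these is lost?
    double perSecond;  // rough worst-case send rate
};

// Applies the rule of thumb from the text. The threshold for "rare" is an
// assumption, not an engine constant.
inline Delivery ChooseDelivery(const MessageProfile& message,
                               double rareLimitPerSecond = 1.0) {
    assert(message.perSecond >= 0.0 && rareLimitPerSecond > 0.0);
    if (!message.mustArrive) {
        return Delivery::UnreliableRpc;  // the next one supersedes a lost one
    }
    if (message.perSecond <= rareLimitPerSecond) {
        return Delivery::ReliableRpc;    // discrete, infrequent, must arrive
    }
    return Delivery::Rethink;            // constant plus reliable: saturation risk
}
```

A door opening a few times a minute comes out as a reliable RPC, a 30-per-second position update comes out as unreliable, and anything that claims to be both critical and constant gets flagged for a second look.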

Don't fight the buffer; avoid replicating the noise

The deeper fix, once I'd recategorised everything, was to stop sending so much in the first place. A lot of what I was pushing through reliable RPCs didn't belong in RPCs at all. Replicated properties, the UPROPERTY(Replicated) sort, handle continuous state far better: the engine sends deltas, coalesces updates, and only the latest value, not the history, reaches the client. For genuine continuous state, a replicated property with a RepNotify callback (declared with ReplicatedUsing) is almost always the right tool, and the RPC is the wrong one.
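The coalescing is the whole point, so here is a toy comparison in plain C++. It is a model of the idea, not of Unreal's replication code: RpcStream, ReplicatedValue, WireCost and CompareOneUpdate are all hypothetical names, and I'm assuming a single integer value and a single network update.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// RPC-style: every call becomes its own message on the wire.
class RpcStream {
public:
    void Call(int value) { pending_.push_back(value); }

    // Hands over everything queued since the last network update.
    std::vector<int> Flush() {
        std::vector<int> out;
        out.swap(pending_);
        return out;
    }

private:
    std::vector<int> pending_;
};

// Property-style: only the current value is kept, and it is compared with
// the last value sent when the network update happens.
class ReplicatedValue {
public:
    void Set(int value) { value_ = value; }

    // Returns true and writes the value to `out` if it differs from what
    // was last sent; otherwise nothing needs to go on the wire.
    bool Flush(int& out) {
        if (value_ == lastSent_) {
            return false;
        }
        lastSent_ = value_;
        out = value_;
        return true;
    }

private:
    int value_ = 0;
    int lastSent_ = 0;
};

struct WireCost {
    std::size_t rpcMessages;
    std::size_t propertyMessages;
    int lastValueSeen;
};

// Writes the values 1..writes between two network updates and reports what
// each approach would send.
inline WireCost CompareOneUpdate(int writes) {
    assert(writes >= 0);
    RpcStream rpc;
    ReplicatedValue property;
    for (int value = 1; value <= writes; ++value) {
        rpc.Call(value);
        property.Set(value);
    }
    int latest = 0;
    const std::size_t propertyMessages = property.Flush(latest) ? 1 : 0;
    return WireCost{rpc.Flush().size(), propertyMessages, latest};
}
```

Five writes between updates cost five messages as RPCs and one as a property, and the one message carries the value the client actually cares about. The intermediate values never leave the server.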

That left RPCs for what they're genuinely good at: events. Discrete things that happen at a moment. Once I framed it as "RPCs are events, replicated properties are state", the reliability choice mostly made itself, because events are usually rare and reliable, and state is usually continuous and better off as a property.

The fix, and the lesson

The actual code change was small. I moved a dozen RPCs to unreliable, converted three to replicated properties with notifies, and left the genuinely-must-arrive events as reliable. The door has opened for everyone, every time, since. Under the same load test that used to break it within a minute, it ran for an hour clean.

The lesson is older than Unreal. "Reliable" in a distributed system is never free and never unconditional. It's a promise with a cost attached, and if you don't pay attention to the cost, the promise breaks at exactly the moment you most needed it: under load. Unreal makes the flag a single word in a macro, which makes it very easy to forget there's a buffer behind it with your name on it.