Ramblings of an aging IT geek
gamedev

the rpc that worked on my machine and nowhere else

A walk through Unreal's RPC reliability model and the ordering assumptions that quietly broke our multiplayer prototype.

We had a co-op prototype where picking up an item sometimes left the client holding nothing, even though the server clearly knew you had it. On my machine it never happened. On a colleague's machine across the office it happened maybe one time in twenty. Over a real connection to a playtester at home it happened constantly. Classic. The bug scaled with latency, which meant it was never a logic bug. It was an ordering bug, and the culprit was how we'd wired up our RPCs.

If you've not done Unreal networking, here's the shape of it. A UFUNCTION can be marked Server, Client or NetMulticast, and either Reliable or Unreliable. The reliability flag is the one everybody gets wrong first, and it's worth being precise about what it actually promises, because it does not promise what most people assume.

what reliable actually means

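// Client -> server. WithValidation means you also implement
// ServerPickupItem_Validate alongside ServerPickupItem_Implementation.
UFUNCTION(Server, Reliable, WithValidation)
void ServerPickupItem(AItem* Item);

// Server -> all clients. Cosmetic only, so it is allowed to be dropped.
UFUNCTION(NetMulticast, Unreliable)
void MulticastPlayPickupEffect(FVector Location);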

A Reliable RPC will be delivered for as long as the connection survives. It is retransmitted if a packet drops, so it arrives eventually. An Unreliable RPC might be dropped entirely and never resent, which is exactly what you want for cosmetic things like a sound or a particle burst, where a missed one doesn't matter and you'd rather not spend bandwidth guaranteeing it.

So far so reasonable. The footgun is the bit nobody tells you up front. Reliability guarantees delivery, and it guarantees ordering relative to other reliable RPCs on the same channel, which in practice means on the same actor. It gives you nothing across the reliable/unreliable boundary and, crucially, nothing about ordering relative to property replication.

That last clause is the one that bit us.
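To make the "ordered amongst reliables" half of that concrete, here's a toy model in plain C++. It is not engine code: ToyReliableReceiver and everything in it are names I've made up, and it assumes each reliable message carries a sequence number and the receiver holds back anything that turns up early. That's one common way to get in-order delivery, and I'm not claiming it mirrors Unreal's internals line for line.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy model only: an in-order receiver for "reliable" messages.
// All names are hypothetical and nothing here is Unreal API.
struct ToyReliableReceiver {
    std::uint32_t NextExpected = 0;                 // next sequence number to hand to game code
    std::map<std::uint32_t, std::string> HeldBack;  // arrived early, waiting for a gap to fill
    std::vector<std::string> Delivered;             // what game code actually saw, in order

    // Called when a reliable message comes off the wire, in any order.
    void Receive(std::uint32_t Sequence, const std::string& Payload) {
        if (Sequence < NextExpected) {
            return; // already delivered, so this is a stale retransmit
        }
        HeldBack.emplace(Sequence, Payload);

        // Release every message that is now contiguous with what was delivered.
        auto It = HeldBack.find(NextExpected);
        while (It != HeldBack.end()) {
            Delivered.push_back(It->second);
            HeldBack.erase(It);
            ++NextExpected;
            It = HeldBack.find(NextExpected);
        }
    }
};

// Wire order 1, 0, 2 still reaches game code as 0, 1, 2.
inline std::vector<std::string> DemoOutOfOrderArrival() {
    ToyReliableReceiver Receiver;
    Receiver.Receive(1, "ClientConfirmPickup");
    Receiver.Receive(0, "ClientOpenChest");
    Receiver.Receive(2, "ClientCloseChest");
    return Receiver.Delivered;
}
```

The point to notice is what the model leaves out. A replicated property update has no place in this sequence at all, so nothing in the hold-back logic can make an RPC wait for it.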

A diagram of client and server message flow

the actual bug

Our pickup flow looked sensible on paper. Client calls ServerPickupItem. Server validates, adds the item to the player's inventory (a replicated array), and then fires a Client RPC back to confirm and update the UI. The inventory is a UPROPERTY(Replicated). The confirmation was a reliable client RPC.

The assumption baked in there is that the replicated inventory and the RPC arrive together, in order. They don't. Property replication and RPCs travel through related but distinct mechanisms, and an RPC can land on the client before the property update that it logically depends on. So the client received "you picked up the sword", ran its UI code, read the replicated inventory, found the sword wasn't there yet, and drew an empty slot. A frame or two later the inventory replicated in, but nothing re-ran the UI, so the slot stayed empty until something else dirtied it.

On a fast local connection the property and the RPC arrived in the same tick often enough that I never saw it. Add 80ms of real latency and the gap widened until we lost the race almost every time.
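The race is small enough to reproduce in a few lines of plain C++. This is a toy, not engine code: ToyClient, ToyArrival and the arrival times are all invented for illustration, and ties are resolved in the property's favour, which is an assumption of the model rather than anything Unreal promises.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Toy model only: hypothetical names, no Unreal API.
enum class EToyEvent { InventoryReplicated, ConfirmPickupRpc };

struct ToyArrival {
    int TimeMs;       // when the event reaches the client
    EToyEvent Kind;
};

struct ToyClient {
    std::vector<std::string> Inventory; // stand-in for the replicated array
    std::string SlotText = "<empty>";   // what the UI slot currently shows

    void RedrawSlot() {
        SlotText = Inventory.empty() ? "<empty>" : Inventory.back();
    }

    void Handle(const ToyArrival& Arrival) {
        if (Arrival.Kind == EToyEvent::InventoryReplicated) {
            Inventory.push_back("Sword"); // data lands, but nothing redraws
        } else {
            RedrawSlot();                 // UI driven by the RPC reads whatever is there
        }
    }
};

// Plays both messages in arrival order and returns the final slot text.
inline std::string SlotAfterPickup(int PropertyArrivalMs, int RpcArrivalMs) {
    std::vector<ToyArrival> Arrivals = {
        {PropertyArrivalMs, EToyEvent::InventoryReplicated},
        {RpcArrivalMs, EToyEvent::ConfirmPickupRpc},
    };
    // stable_sort keeps the property first when both arrive at the same time.
    std::stable_sort(Arrivals.begin(), Arrivals.end(),
                     [](const ToyArrival& A, const ToyArrival& B) {
                         return A.TimeMs < B.TimeMs;
                     });

    ToyClient Client;
    for (const ToyArrival& Arrival : Arrivals) {
        Client.Handle(Arrival);
    }
    return Client.SlotText;
}
```

In this model SlotAfterPickup(10, 10) ends on "Sword", which is my desk. SlotAfterPickup(95, 80) ends on "<empty>" and stays there, which is the playtester at home.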

what we changed

Two things. First, we stopped using an RPC as a signal that replicated state had already arrived. If the client needs to know "your inventory changed, redraw it", the right tool is a RepNotify, not an RPC:

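UPROPERTY(ReplicatedUsing = OnRep_Inventory)
TArray<FInventoryItem> Inventory;

// RepNotify handlers must be UFUNCTIONs so the engine can find them by name.
// Runs on the client after a new Inventory value has been applied.
UFUNCTION()
void OnRep_Inventory();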

OnRep_Inventory fires on the client after the new value has landed. By definition the data is there when the callback runs. That single change made the empty-slot bug vanish, because the UI update was now driven by the data arriving rather than by a separate message racing it.
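Here's the same idea as a plain C++ toy, to show why the ordering problem goes away. It only mimics the shape of a RepNotify: ToyReplicatedInventory, ApplyReplicatedValue and ToyInventoryUi are hypothetical names, and the real engine does a good deal more than store a value and call a function.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy model only: mimics the shape of a RepNotify, not the engine's implementation.
struct ToyReplicatedInventory {
    std::vector<std::string> Items;
    std::function<void()> OnRep; // hypothetical stand-in for OnRep_Inventory

    // Called when a replicated update arrives: store the value first, then notify.
    void ApplyReplicatedValue(std::vector<std::string> NewItems) {
        Items = std::move(NewItems);
        if (OnRep) {
            OnRep();
        }
    }
};

struct ToyInventoryUi {
    std::string SlotText = "<empty>";
    int RedrawCount = 0;

    void Redraw(const std::vector<std::string>& Items) {
        SlotText = Items.empty() ? "<empty>" : Items.back();
        ++RedrawCount;
    }
};

// The redraw is wired to the data landing, so it cannot run before the data exists.
inline ToyInventoryUi PickupWithRepNotify() {
    ToyReplicatedInventory Inventory;
    ToyInventoryUi Ui;
    Inventory.OnRep = [&Inventory, &Ui]() { Ui.Redraw(Inventory.Items); };

    // A confirmation RPC could arrive at this point. It may play a sound,
    // but it no longer reads Inventory, so its timing stops mattering.
    Inventory.ApplyReplicatedValue({"Sword"});
    return Ui;
}
```

However early or late any other message turns up, the slot is drawn exactly once, from data that is already there.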

Second, we got disciplined about which RPCs were reliable and why. We'd reflexively marked nearly everything Reliable because "reliable is safer", which is the instinct that quietly fills your reliable buffer with cosmetic noise. Reliable RPCs share a finite, ordered channel. Flood it with particle and sound calls and you can stall genuinely important messages behind retransmissions, and in the worst case overflow the reliable buffer and get a disconnect. Effects went Unreliable. State-changing actions stayed Reliable. The split is almost always: does the game break if this is lost? If no, unreliable.
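A rough sketch of the overflow failure, again as a plain C++ toy rather than engine code. ToyReliableChannel is a made-up name and the capacity of 8 is a made-up number chosen to keep the example small. The only thing it assumes is what's described above: reliable messages have to be held until they're acknowledged, and the space for holding them is finite.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Toy model only: a bounded queue of unacknowledged reliable messages.
// The capacity is a hypothetical number, not the engine's real limit.
struct ToyReliableChannel {
    std::size_t Capacity;
    std::deque<std::string> Unacked; // sent but not yet acknowledged, in order
    bool bDisconnected = false;

    explicit ToyReliableChannel(std::size_t InCapacity) : Capacity(InCapacity) {
        assert(Capacity > 0);
    }

    // Reliable sends must be kept until acked, so they can overflow the buffer.
    void SendReliable(const std::string& Name) {
        if (bDisconnected) {
            return;
        }
        if (Unacked.size() >= Capacity) {
            bDisconnected = true; // nowhere to keep it and not allowed to drop it
            return;
        }
        Unacked.push_back(Name);
    }

    // Unreliable sends are fire-and-forget: nothing is retained.
    void SendUnreliable(const std::string& /*Name*/) {}

    // The oldest outstanding message is acknowledged and leaves the buffer.
    void AckOldest() {
        if (!Unacked.empty()) {
            Unacked.pop_front();
        }
    }
};

// Fires a burst of cosmetic calls while no acks are coming back,
// then one gameplay call, and reports whether the connection survived.
inline bool SurvivesEffectBurst(bool bEffectsReliable, int EffectCount) {
    ToyReliableChannel Channel(8); // hypothetical capacity
    for (int i = 0; i < EffectCount; ++i) {
        if (bEffectsReliable) {
            Channel.SendReliable("MulticastPlayPickupEffect");
        } else {
            Channel.SendUnreliable("MulticastPlayPickupEffect");
        }
    }
    Channel.SendReliable("ServerPickupItem");
    return !Channel.bDisconnected;
}
```

With the effects marked unreliable, a burst of a hundred costs the channel nothing and the pickup goes through. Mark them reliable and the same burst fills the buffer before the one message that mattered gets a turn.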

the rule I'd give my past self

Reliable does not mean ordered-with-everything, it means delivered-and-ordered-amongst-reliables. Replicated properties are not RPCs and don't share their ordering. If a client action depends on replicated state, react to the state with a RepNotify, don't fire an RPC and hope the data beat it. And mark something Reliable because losing it breaks the game, not because reliable sounds responsible.

The thing that still nags me is how cleanly it hid. Every test on a single machine, or two machines on the same switch, passed. The model was wrong from day one and the network was just fast enough to forgive it. If you're building anything networked in Unreal, put a couple of hundred milliseconds of artificial latency in your editor's network emulation settings and leave it there. The bugs you find with the cable plugged in are not the bugs your players will find.