Ramblings of an aging IT geek
gamedev

the unreal rpc that fired most of the time and ruined an afternoon

A debugging tale about Unreal's reliable and unreliable RPCs, why an event was occasionally going missing, and the reliability footgun hiding in the replication model.


I lost most of a Sunday to an event in my Unreal project that fired correctly perhaps nine times out of ten and silently vanished the tenth. It was a networked RPC, and the bug was the most embarrassing kind: not a crash, not an error, just an action that occasionally didn't happen, with nothing in the log to say why. The cause was a misunderstanding about Unreal's reliability model that I should have known better than to make. Here's the footgun, so you can avoid standing on it.

The feature was simple. A client does a thing, a button press effectively, and the server needs to know so it can apply the result authoritatively and replicate it back out. Standard client-to-server RPC. In Unreal you mark a function with UFUNCTION(Server) and it gets sent to the server when the client calls it. I had this working, mostly, which is the worst state for anything to be in.

reliable versus unreliable

The crucial detail, and the one I'd glossed over, is that a Server RPC is Unreliable unless you say otherwise. Unreliable RPCs are sent best-effort. If the packet gets dropped, whether lost in the noise of a busy frame, squeezed out by a bandwidth limit, or whatever else, it's simply gone. There's no retransmission, no acknowledgement, no error. The engine does not consider this a failure, because for the things unreliable RPCs are designed for, frequent updates where the next one's along in a moment anyway, dropping one is completely fine. Think movement and cosmetic effects, the sort of thing where a missed update is invisible because it's superseded almost immediately.

// what I had: silently unreliable
UFUNCTION(Server, WithValidation)
void ServerDoTheThing();

// what I needed: reliable
UFUNCTION(Server, Reliable, WithValidation)
void ServerDoTheThing();

My event was none of those things. It was a discrete, one-off action that mattered exactly once. There was no "next one along in a moment" to paper over a drop. When the packet went missing, the action just didn't happen, and because the drops came from packet loss on a real link, the failure showed up intermittently on a busy connection and never on my nice quiet local test. The single best way to make a bug invisible during development is to make it depend on packet loss, because your loopback never loses any.

The RPC declaration in the header, before and after
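For anyone who wants to see the loopback effect in miniature, here's a small standalone C++ sketch. It isn't Unreal code and it says nothing about how the engine implements its networking: LossyLink, sendUnreliable, sendReliable and countDelivered are names invented for illustration, loss is assumed to be independent per packet, and "reliable" is modelled crudely as resending until one copy gets through.

```cpp
#include <cassert>
#include <random>

// Hypothetical stand-in for a connection that drops a fraction of packets.
struct LossyLink {
    std::mt19937 rng;
    double lossRate;  // chance, from 0.0 to 1.0, that any single packet is lost

    LossyLink(unsigned seed, double loss) : rng(seed), lossRate(loss) {
        assert(loss >= 0.0 && loss <= 1.0);
    }

    // Sends one packet; returns true if it arrived.
    bool transmit() {
        std::bernoulli_distribution dropped(lossRate);
        return !dropped(rng);
    }
};

// Best effort: one attempt, no acknowledgement, no retry.
bool sendUnreliable(LossyLink& link) {
    return link.transmit();
}

// Crude model of a reliable send: keep resending until a copy arrives.
// The attempt cap only stops the sketch looping forever on a dead link.
bool sendReliable(LossyLink& link) {
    const int maxAttempts = 64;
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        if (link.transmit()) {
            return true;
        }
    }
    return false;
}

// Fires `events` one-shot messages through `send` and counts arrivals.
template <typename SendFn>
int countDelivered(LossyLink& link, int events, SendFn send) {
    int delivered = 0;
    for (int i = 0; i < events; ++i) {
        if (send(link)) {
            ++delivered;
        }
    }
    return delivered;
}
```

On a link built with a loss rate of 0.0, which is what loopback amounts to, countDelivered(link, 1000, sendUnreliable) comes back as 1000 and the two send functions are indistinguishable. Only once the loss rate is above zero, say 0.10, does the unreliable count fall short while the reliable one still delivers everything.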

chasing a ghost

Of course I didn't suspect the RPC for hours, because RPCs are "just function calls", right? They mostly behave like them. So I went looking everywhere else first. I added logging on the server side and confirmed the function genuinely wasn't being entered on the bad runs, which at least ruled out a logic error inside it. I checked I wasn't accidentally calling it on the wrong actor, or from a client that didn't own the actor (a Server RPC called on an actor the client doesn't own is dropped), or before the connection was fully up. I checked the WithValidation function wasn't rejecting the call, although a failed validation disconnects the offending client rather than quietly swallowing one call, so it was never a good match for the symptom. All clean. The call was being made on the client and simply not arriving on the server, sometimes.

The moment it clicked was when I finally read the generated boilerplate and saw the call going out on an unreliable channel. Of course it was. I'd never added Reliable, and the default had quietly decided this important one-shot event was as disposable as a movement update. One keyword. The whole afternoon, one missing keyword.

In hindsight the symptom had been telling me this all along and I wasn't listening. The failures clustered when I tested over a real connection to a remote machine and never once on the same box, which I'd lazily filed under "networking is flaky" rather than treating as the actual diagnostic signal it was. Intermittent-only-over-the-network is practically a fingerprint for an unreliable-delivery problem. If the function were never registered, or called on the wrong actor, it would fail every time, deterministically. If validation were rejecting it, it would fail for specific inputs. A failure that depends purely on the physical quality of the link, and disappears entirely on loopback, is the network dropping something the engine was never asked to guarantee. I had all the evidence from the very first remote test and spent the afternoon suspecting my own logic instead.

The packet flow, reliable channel versus best-effort

the footgun, named

Adding Reliable fixed it instantly and completely, and I've not seen the event go missing since, even under deliberately awful simulated network conditions. Unreal has a built-in network emulation setting precisely for this, letting you inject packet loss and latency, and I now run a chunk of my testing with a few percent loss dialled in specifically to flush out exactly this class of bug. If a feature breaks under simulated loss, it was relying on reliability it never asked for.
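A rough bit of arithmetic shows why a few percent is enough. Assuming each drop is independent, which real links only approximate, the chance that at least one of N best-effort one-shot events goes missing is 1 - (1 - p)^N for a per-send loss rate p. The helper below, chanceOfAtLeastOneMiss, is a hypothetical name for illustration and nothing from the engine.

```cpp
#include <cassert>
#include <cmath>

// Probability that at least one of `events` independent best-effort sends
// is lost, given a per-send loss probability `lossRate` in [0, 1].
// Assumes drops are independent, which is a simplification.
double chanceOfAtLeastOneMiss(double lossRate, int events) {
    assert(lossRate >= 0.0 && lossRate <= 1.0);
    assert(events >= 0);
    return 1.0 - std::pow(1.0 - lossRate, events);
}
```

With 3% loss and 50 one-shot events in a test session, that comes out at roughly 0.78, so most sessions would hit the bug at least once. At 0% loss it is exactly zero, which is the loopback case all over again.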

The footgun is this: the default is unreliable, the failure mode is silence, and your development environment hides it perfectly. Three things lining up to let you ship something that works on your machine and flakes in the wild. The mental model that bit me is treating RPCs like local function calls, where a call always runs. They aren't. They're messages across a lossy network, and the contract is whatever reliability flag you set, or whatever the default sets for you when you forget.

So the rule I've written on the metaphorical wall: every RPC gets an explicit reliability decision, on purpose, in the declaration. If it's a discrete event that has to land, it's Reliable, no exceptions. If it's a frequent update where the next one covers a miss, it's Unreliable, and I've thought about why. The thing I will not do again is leave it to the default and find out the hard way, one dropped packet at a time, on a Sunday I'd rather have spent doing almost anything else.
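That rule is easy enough to check mechanically. As a minimal sketch, assuming you've already pulled out the text between the parentheses of each UFUNCTION(...), something like this flags RPC declarations that carry no explicit reliability keyword. splitSpecifiers, auditSpecifiers and RpcAudit are hypothetical names, the matching is deliberately naive (exact case only, and it would trip over commas inside a meta=(...) block), and it isn't part of any Unreal tooling.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Splits a comma-separated specifier list into trimmed tokens,
// e.g. "Server, Reliable" becomes {"Server", "Reliable"}.
std::vector<std::string> splitSpecifiers(const std::string& specifierList) {
    std::vector<std::string> tokens;
    std::stringstream stream(specifierList);
    std::string token;
    while (std::getline(stream, token, ',')) {
        const auto first = token.find_first_not_of(" \t");
        if (first == std::string::npos) {
            continue;  // skip empty or whitespace-only entries
        }
        const auto last = token.find_last_not_of(" \t");
        tokens.push_back(token.substr(first, last - first + 1));
    }
    return tokens;
}

enum class RpcAudit { NotAnRpc, Explicit, MissingReliability };

// Classifies one specifier list: is it an RPC, and if so, does it state
// Reliable or Unreliable on purpose?
RpcAudit auditSpecifiers(const std::string& specifierList) {
    bool isRpc = false;
    bool hasReliability = false;
    for (const std::string& spec : splitSpecifiers(specifierList)) {
        if (spec == "Server" || spec == "Client" || spec == "NetMulticast") {
            isRpc = true;
        }
        if (spec == "Reliable" || spec == "Unreliable") {
            hasReliability = true;
        }
    }
    if (!isRpc) {
        return RpcAudit::NotAnRpc;
    }
    return hasReliability ? RpcAudit::Explicit : RpcAudit::MissingReliability;
}
```

Fed the two declarations from earlier, "Server, WithValidation" comes back as MissingReliability and "Server, Reliable, WithValidation" as Explicit. Note that it treats Unreliable as a perfectly good answer: the point is that somebody chose, not which way they chose.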