Networking Woes: MTU and Packet Loss

Our game, Gravitas: The Arena, is using a custom high-level networking solution built on top of Lidgren.Network.  We've had a few lingering networking bugs for a while now that have gotten buried under other priorities.  Today I sat down to try to solve all of them, and luckily enough managed to do so.

Aside from a few minor bugs, there were two remaining major issues that I wanted to address today:

  1. Intermittent, extreme packet loss over LAN
  2. Packets not arriving until after a notable delay over the internet when using UDP hole punching (but not when using Hamachi)

Packet Loss Over LAN

During one of our playtests recently, we noticed that all transforms seemed to intermittently freeze for the client. "Odd," we thought, and continued about our other development work, quietly hoping that the issue would resolve itself somehow.  Unfortunately, it did not, and today was the day I would figure out the problem once and for all.

At first, I thought the issue had to do with my snapshot interpolation implementation.  Each NetworkedTransform uses a ring buffer to store the last X transform states, where X is big enough so that if we miss a few packets, we still have something to display.  Since the ring buffer is indexed by targetSnapshotID % bufferLength, if something was wrong with snapshot insertion, all networked objects would be reading invalid snapshots at the same time.  That would nicely explain why freezes were intermittent and seemingly predictable, so I disabled snapshot interpolation to hopefully verify this was the problem.
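
(For the curious, the indexing scheme looks roughly like the C# sketch below - the type and member names are illustrative, not our actual code.)

    // Minimal sketch of the snapshot ring buffer described above.
    public struct TransformSnapshot
    {
        public int SnapshotId;
        public float X, Y, Z;        // position
        public float Qx, Qy, Qz, Qw; // rotation
    }

    public class SnapshotBuffer
    {
        private readonly TransformSnapshot[] _buffer;

        public SnapshotBuffer(int bufferLength)
        {
            _buffer = new TransformSnapshot[bufferLength];
        }

        // A snapshot's slot is derived from its id, so older entries are
        // overwritten once the buffer wraps around.
        public void Insert(TransformSnapshot snapshot)
        {
            _buffer[snapshot.SnapshotId % _buffer.Length] = snapshot;
        }

        // Read the snapshot we want to interpolate toward.  If the id in the
        // slot doesn't match, that snapshot either never arrived or has
        // already been overwritten.
        public bool TryGet(int targetSnapshotId, out TransformSnapshot snapshot)
        {
            snapshot = _buffer[targetSnapshotId % _buffer.Length];
            return snapshot.SnapshotId == targetSnapshotId;
        }
    }

And...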

Everything still froze.

Okay, I thought, so this problem just got a lot more complicated.  If there's nothing wrong with the snapshot interpolation code, there's something much more sinister going on.  So I started throwing breakpoints and print statements all over the place hoping to find some unexpected behavior.  Before too long, I found the source of the problem: none of the transform packets were arriving at all.

Huh.

I went over to the machine I was using as the host and started plopping down breakpoints on it, thinking that perhaps it was an issue with the host rather than the client.  Nope - packets were getting happily sent off.  They were just getting lost somewhere in the mix.

This felt really odd.  If this were happening over a spotty internet connection, I would have just attributed it to some really weird packet loss and continued to put off investigation until later, hoping optimization elsewhere would solve the problem.  But this wasn't over the internet; it was over LAN (WiFi, mind you, but with two machines sitting within 15 feet of a fairly beefy router).  Surely something else had to be wrong here...

I remembered this article from a few years ago and thought my problems might have something to do with MTU.  I was on WiFi after all, and one of my machines has a less-than-stellar WiFi card.  Lidgren's default MTU is 1500, which seemed a bit high.  I dropped it down to 1200 and - voila! - like magic, all of the transform freezes were gone.  Phew, all that worry for nothing!

I will play with this value a bit more in the future (1200 feels a little bit low for 2018), but since our only packets that could ever possibly go over that are for transform synchronization, 1200 seemed like a fine, if slightly conservative, number to choose.
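
For reference, here's roughly how that change looks with Lidgren's peer configuration in C# (the application identifier string below is just a placeholder):

    using Lidgren.Network;

    var config = new NetPeerConfiguration("GravitasTheArena")
    {
        // Keep every datagram comfortably under the path MTU of typical home
        // networks instead of relying on the library default.
        MaximumTransmissionUnit = 1200
    };

    var client = new NetClient(config);
    client.Start();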

With the LAN issues taken care of, it was time to address our hole punching bugs.

Hole Punching Packet Loss

For now, Gravitas primarily relies on listen servers for multiplayer gameplay.  Since we don't want players to have to mess with port forwarding any time they want to play with their friends (and since we want to have matchmaking), we implemented UDP hole punching with a simple master server running on an AWS EC2 instance.
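
Lidgren ships NAT introduction helpers that cover this kind of flow; the C# sketch below shows the general shape of it, though our actual master server does more bookkeeping than this and the token string is made up.

    using System.Net;
    using Lidgren.Network;

    static class NatPunchSketch
    {
        // On the master server: both peers have already registered their
        // internal/external endpoints with us.  Introduce() tells each peer
        // to start firing punch packets at the other.
        public static void IntroducePeers(NetPeer masterServer,
            IPEndPoint hostInternal, IPEndPoint hostExternal,
            IPEndPoint clientInternal, IPEndPoint clientExternal)
        {
            masterServer.Introduce(hostInternal, hostExternal,
                                   clientInternal, clientExternal,
                                   "made-up-token"); // echoed back to both peers
        }

        // On the game peers: opt in to the introduction-success message...
        public static void ConfigurePeer(NetPeerConfiguration config)
        {
            config.EnableMessageType(NetIncomingMessageType.NatIntroductionSuccess);
        }

        // ...and connect directly once a punch packet has made it through.
        public static void HandleMessage(NetClient client, NetIncomingMessage msg)
        {
            if (msg.MessageType == NetIncomingMessageType.NatIntroductionSuccess)
                client.Connect(msg.SenderEndPoint); // the endpoint that worked
        }
    }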

Connections worked well enough - as long as packets were streaming from user to user continuously, we didn't have any random disconnects or anything.  However, we noticed that it sometimes took a while (up to 30 seconds) before the client started to receive data from the host.  Furthermore, it only seemed to happen during some games - other times, it worked perfectly fine.  Playing via Hamachi was also flawless.  Since we've only done controlled playtesting so far and it has been easy enough to have players connect by IP address, this issue has been more or less on hold for a couple of months now.  The last time any work went into it, we didn't have any particle effects or sounds in place!

Today was the first time I'd tried to investigate the issue since it had first cropped up.  The first thing I did was connect via the server list (which initiates the whole NAT traversal flow).  Initial connection worked fine, I readied up on both sides and...  I was in the game!  Transform synchronization was working, both players could shoot at and damage each other...  Everything seemed fine!  Could the MTU changes have solved the issue?

I started another match to verify that we were in the clear, and... everything was frozen for the client.  D'oh!

Okay, I thought.  This is exactly what was happening before, I'll just wait thirty seconds or so and transforms will start to synchronize again, then I can analyze the game state.  So I turned on the networking stats and waited...

And waited....

And waited....

Nothing was happening.  Odd, since before, everything started working after a bit of time passed.  But what was stranger was that particle effects and sounds were playing, despite the fact that no transform synchronization was happening.  That, in particular, was really odd, since before it had seemed like no packets at all were getting through during that broken stretch of time (remember when I said we didn't have sound then?).  So I started to think about the differences between the transform packets and the effects packets:

  • Transform packets are sent unreliably, whereas particle and sound spawns are sent reliably, as static RPCs (our networking API has both per-object RPCs and global RPCs, which we call static RPCs)
  • Transform packets are significantly larger than particle and sound spawns

Now, it is important to note that UDP is inherently an unreliable protocol.  Any library that implements reliable UDP does so by building its own reliability layer on top of unreliable packets, generally by attaching a unique identifier to each reliable packet and having the other side acknowledge when it has received it.  If no acknowledgement arrives, the sender sends the message again.  That's to say that even though particles and sounds were sent reliably, the underlying protocol didn't treat them any differently.  So that didn't immediately seem like the problem.
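
For concreteness, here's roughly how those two kinds of traffic map onto Lidgren's delivery methods in C#.  The payloads are simplified to byte arrays, and the specific reliable mode shown is an assumption - any of Lidgren's reliable variants behaves the same for the purposes of this discussion.

    using Lidgren.Network;

    static class DeliverySketch
    {
        // Transform snapshots: fire-and-forget.  A lost packet is simply
        // gone; the next snapshot supersedes it anyway.
        public static void SendTransformSnapshot(NetPeer peer, NetConnection connection, byte[] snapshotBytes)
        {
            NetOutgoingMessage msg = peer.CreateMessage();
            msg.Write(snapshotBytes);
            peer.SendMessage(msg, connection, NetDeliveryMethod.Unreliable);
        }

        // Static RPCs (particle and sound spawns): "reliable" is Lidgren's
        // own ack/resend layer; the wire underneath is still plain UDP.
        public static void SendStaticRpc(NetPeer peer, NetConnection connection, byte[] rpcBytes)
        {
            NetOutgoingMessage msg = peer.CreateMessage();
            msg.Write(rpcBytes);
            peer.SendMessage(msg, connection, NetDeliveryMethod.ReliableOrdered);
        }
    }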

The size, however, seemed like it could be more of a problem.  Our largest transform packets - when all transforms are moving - sometimes hover right around our old MTU of 1500.  Since I had just dropped the MTU to 1200, could it be that the transform packets were now always over MTU and always getting dropped?  But even if transform packets were going over MTU, shouldn't Lidgren be fragmenting them automatically (since it handles fragmentation and reassembly of large packets)?

Then I started to think about how fragmentation works in general.  If a packet is sent unreliably and only half of its fragments arrive on the other side, it can't possibly be reassembled.  In that case, it makes sense for unreliable packets over MTU to simply be dropped whenever any fragment goes missing.  To test this theory, I swapped transform sync over to being sent reliably.

And everything worked.  Flawlessly.

Fairly frustrated with my own stupidity for not thinking about the impracticality of reassembling unreliable packets, I refactored our transform synchronization code to do manual fragmentation.  Since objects track their own snapshot history, a little bit of modification to our delta compression let us handle partial snapshots perfectly fine.  With that in place, everything was good.
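
The core idea is simple enough to sketch in C#.  The names, sizes, and serialization below are illustrative rather than our actual implementation; the point is just that each chunk fits under the MTU and stands on its own, so a lost chunk only costs those objects a single snapshot.

    using System;
    using System.Collections.Generic;
    using Lidgren.Network;

    static class SnapshotFragmentationSketch
    {
        // Hypothetical per-object state; WriteTo serializes it into the message.
        public struct TransformState
        {
            public int NetworkId;
            public float X, Y, Z;        // position
            public float Qx, Qy, Qz, Qw; // rotation

            public void WriteTo(NetOutgoingMessage msg)
            {
                msg.Write(NetworkId);
                msg.Write(X); msg.Write(Y); msg.Write(Z);
                msg.Write(Qx); msg.Write(Qy); msg.Write(Qz); msg.Write(Qw);
            }
        }

        const int MaxPayloadBytes = 1000;  // comfortably below the 1200-byte MTU
        const int BytesPerTransform = 32;  // 4-byte id + 7 floats
        const int TransformsPerPacket = MaxPayloadBytes / BytesPerTransform;

        // Split the full snapshot into chunks that each fit under the MTU.
        // Every chunk carries the snapshot id and its own object states, so
        // losing one chunk only means those objects skip this snapshot.
        public static void SendSnapshot(NetPeer peer, NetConnection connection,
                                        int snapshotId, IReadOnlyList<TransformState> states)
        {
            for (int start = 0; start < states.Count; start += TransformsPerPacket)
            {
                NetOutgoingMessage msg = peer.CreateMessage();
                msg.Write(snapshotId);

                int count = Math.Min(TransformsPerPacket, states.Count - start);
                msg.Write(count);
                for (int i = 0; i < count; i++)
                    states[start + i].WriteTo(msg);

                peer.SendMessage(msg, connection, NetDeliveryMethod.Unreliable);
            }
        }
    }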

It works!  (I hope...)

While it hasn't been super extensively tested yet, it seems like the manual fragmentation (really just batching transforms into smaller packets) and the MTU size drop have solved the problems entirely, even if they do incur slightly higher overall network usage.  I really wish Lidgren threw a warning or error if you sent an unreliable packet over MTU, at least in debug mode.  Unless, of course, my conclusions were completely incorrect, in which case I would love for someone to let me know what was actually going wrong.

Regardless, thanks for taking the time to read this wall of text.  I promise, more networking posts are coming - as well as more info about Gravitas!

Until next time...

2 Replies to “Networking Woes: MTU and Packet Loss”

  1. Hello!
    Thank you for the interesting post! I was recently googling for a good starting MTU value and found it.
    I'm the programmer of CryoFall https://store.steampowered.com/app/829590/CryoFall/ and have been using the Lidgren network library in our studio's games since 2013 🙂

    I found your post very interesting. We had some reports recently from players who were unable to connect to the game servers, and I was surprised to find out that Lidgren uses a default MTU value of 1500. I asked the affected players to send me a screenshot of their network report (http://www.speedguide.net:8080), and indeed their MTU was often around 1320-1380.
    Reducing it to 1200 indeed helped to resolve the issue and stay on the safe side!

    BTW, there is an option to enable MTU auto-discovery (NetPeerConfiguration.AutoExpandMTU). Have you tried it? Seems to work fine with our game.

    Regards!

    1. Sorry for the very late response, but I’m glad you enjoyed the post! It’s good to see another happy user of Lidgren.Network haha

      Thanks for sharing speedguide – never heard of it before, and it certainly would have been handy when we were fighting these bugs. Glad that lowering the MTU solved your problems. When I posted this on /r/gamedev, there was an AAA dev who said he always kept his packets under 1024 bytes for exactly these reasons, so that's the rule of thumb I'll likely follow going forward. And realistically, if you're going higher than that for unreliable packets, you're probably doing something wrong.

      I remember having some problems with AutoExpandMTU, but to be honest, I can't remember exactly what they were anymore since that was quite a while ago. I feel like it was making the stuttering issue described in this post worse, though thinking back on it, we turned it off before fixing that bug, so it might be fine now. That being said, I don't think we would get much from having it on, and I can see it potentially causing issues depending on how it's implemented, so we might as well be safe and just leave the static value of 1200.

      Your game looks neat by the way! Hope your early access is going well.

      Thanks for responding!
