Dev Blog: StacklessIO or: how we reduced lag

NAT Mav

TW Racing++

StacklessIO

For the past two years we have been developing new technology, called StacklessIO, to increase the performance of our network communication infrastructure in EVE. This new network layer reduces network latency and improves performance in high-volume situations, e.g., in fleet-fights and market hubs such as Jita.
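
The blog does not describe StacklessIO's internals, but EVE runs on Stackless Python and the name presumably refers to integrating network IO with Stackless tasklets. Purely as an illustration of the kind of primitive such a layer can build on, and not CCP's actual implementation, here is a minimal Stackless Python sketch in which a stand-in "network" producer hands completed IO to a game-logic tasklet over a channel, so waiting blocks only the receiving tasklet rather than everything else:

    import stackless

    # Illustrative only: a channel connecting a stand-in "network" producer to a
    # game-logic consumer. Requires Stackless Python (the `stackless` module is
    # not part of standard CPython). This shows the general channel pattern as
    # an assumption, not CCP's actual StacklessIO code.

    packets = stackless.channel()

    def network_layer():
        # Pretend these payloads just arrived from the network.
        for payload in (b"request-1", b"request-2", b"request-3"):
            packets.send(payload)      # blocks only this tasklet until a receiver is ready

    def game_logic():
        # Waits on the channel; other tasklets keep running while it waits.
        for _ in range(3):
            payload = packets.receive()
            print("handled", payload)

    stackless.tasklet(network_layer)()
    stackless.tasklet(game_logic)()
    stackless.run()                    # run the cooperative scheduler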

On 16 September we successfully deployed StacklessIO to Tranquility. We noticed an astounding, yet expected, measurable difference.

Normally Jita reaches a maximum of about 800-900 pilots on Sundays. On the Friday following the deployment of StacklessIO there were close to 1,000 concurrent pilots in Jita and on Saturday the maximum number reached 1,400. This is more than have ever been in Jita at the same time. Jita could become rather unresponsive at 800-900 pilots but on Sunday it was quite playable and very responsive with 800 pilots. It should continue to be snappier and more responsive in the future.

http://staff.ccpgames.com/explorer/devblogs/images/playerpopinjita.png

The Measurements

This spring we saw the fruits of our R&D work when we deployed StacklessIO to Singularity and began measuring the difference.

Confirming a suspicion we had had for a long time, the Core Server Group team, led by CCP porkbelly, proved that StacklessIO vastly outperformed the old network technology. They also demonstrated that, under extreme lab conditions, the old technology could sometimes delay network packets arbitrarily for a significant amount of time.

Later, CCP Atlas of the EVE Software Group showed that those symptoms also occurred in the wild with the old technology, although on a smaller scale: network responses to client requests could in some cases be delayed for a few minutes on highly loaded nodes in the cluster. In particular, we measured client network communication to the node that hosts Jita.

Since the client and server clocks are synchronised, we called a remote service on the server from the client; the server responded with the global time, and we measured the server and client deltas. We also called that same service directly on the server node to measure the service call's processing time, which turned out to be negligible.
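
As a rough sketch of that measurement, assuming synchronised clocks (the remote time service, its name and the call_remote helper below are hypothetical stand-ins; the blog does not show the actual instrumentation):

    import time

    # Hypothetical sketch of the delta measurement described above. Assumes the
    # client and server clocks are synchronised and that `call_remote` invokes a
    # remote service returning the server's current wall-clock time in seconds;
    # both names are stand-ins, not EVE's actual API.

    def measure_deltas(call_remote):
        t_sent = time.time()                     # client clock: request sent
        t_server = call_remote("GetGlobalTime")  # server clock: request processed
        t_received = time.time()                 # client clock: response received

        server_delta = t_server - t_sent         # inbound leg: client -> server-side service
        client_delta = t_received - t_sent       # full round trip as seen by the client
        return server_delta, client_delta

    # If server_delta is roughly equal to client_delta, nearly all of the delay
    # happened before the request reached the server-side service, which is what
    # the graphs below showed for the old network layer.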

What we discovered in our tests is that the server delta was almost identical to the client's received delta, so the delay was due to the remote service call taking a long time to reach the server-side service, most likely somewhere in the network layer on the server. The values on the graphs below are in seconds.

http://staff.ccpgames.com/explorer/devblogs/images/delta_jetbyte.png

This is a Sunday profile and is very specific to Jita. This was one of the primary reasons why Jita could sometimes become fairly unresponsive on Sunday evenings. It was not uncommon for client requests to take up to 1-2 minutes to reach the service layer on the server, and the requests would be delayed seemingly at random: of two requests sent in succession, the first could be delayed for minutes while the second got a response almost immediately. From a player's perspective this manifested itself as lag and strange client behaviour, as requests were delayed and completed by the server far out of order.

By comparison, here is Jita with approximately the same number of players, around 800 pilots in local, after the deployment of StacklessIO.

http://staff.ccpgames.com/explorer/devblogs/images/delta_stacklessIO.png

It's very apparent that StacklessIO does not exhibit any of the earlier issues. There is only one small spike and two small bumps, but we must keep in mind that such isolated occurrences could be caused by general network issues on the internet. Since client/server network communication has to travel across the internet, some delays are to be expected, depending on general internet health and the particular ISP.

The systemic issues seen with the old network technology are gone, and StacklessIO provides all-around superior performance.

Another measurement we did was to ping all nodes in the cluster from a single node, to measure network latency within the server cluster. The values in the tables below are in seconds.

Ping Pre-StacklessIO
Time Minimum Maximum Average Stddev
16:00 0.00065 3.22 0.042 0.032
21:00 0.00064 4.36 0.068 0.056
22:00 0.00065 1.21 0.027 0.027
23:00 0.00064 4.36 0.027 0.028
00:00 0.00065 1.01 0.020 0.017

Ping StacklessIO
Time Minimum Maximum Average Stddev
16:00 0.00064 2.00 0.014 0.021
21:00 0.00064 1.02 0.014 0.018
22:00 0.00064 0.25 0.009 0.011
23:00 0.00064 1.93 0.014 0.021
00:00 0.00064 1.06 0.010 0.014

From the tables we notice that the minimum values are the same before and after. The lowest maximum is approximately the same, but overall the maximum values are lower with StacklessIO by approximately a factor of 2, and they are more consistent.

The average values are lower overall with StacklessIO by a factor of 3 and the standard deviation is lower by a factor of 2. Below is a visual representation of the average values.

http://staff.ccpgames.com/explorer/devblogs/images/ping_avg.png
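
For reference, a rough sketch of how such intra-cluster ping statistics could be gathered (the node names are hypothetical and Linux iputils ping output is assumed; the actual tooling used on Tranquility is not described in the blog):

    import statistics
    import subprocess

    NODES = ["node01", "node02", "node03"]  # hypothetical host names

    def ping_once(host, timeout=5):
        """Return one round-trip time to `host` in seconds, or None on failure."""
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout), host],
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            return None
        # Linux iputils ping prints e.g. "time=0.64 ms"; grab that value.
        for token in result.stdout.split():
            if token.startswith("time="):
                return float(token[len("time="):]) / 1000.0  # ms -> seconds
        return None

    def summarise(samples):
        """Minimum, maximum, average and standard deviation, as in the tables above."""
        return {
            "min": min(samples),
            "max": max(samples),
            "avg": statistics.mean(samples),
            "stddev": statistics.stdev(samples),
        }

    samples = [rtt for host in NODES if (rtt := ping_once(host)) is not None]
    if len(samples) >= 2:                    # stdev needs at least two samples
        print(summarise(samples))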

At 1,400 pilots on Saturday, the node hosting Jita ran out of memory and crashed. As crazy as it may sound, this was very exciting, since we had never before been in a position to have that problem. We immediately turned our attention to solving that challenge and are making significant progress. I will provide information on that specific effort in a later dev blog.

But we have already made good progress on memory optimisation as part of the StacklessIO technology effort; for example, memory usage on the proxy servers in the cluster has been reduced significantly.

http://staff.ccpgames.com/explorer/devblogs/images/proxy.png

The two tall peaks are memory issues we encountered in the first days after deploying StacklessIO. A task force was put into action and it reduced the memory usage by 50% compared to pre-StacklessIO values.

The graphs and measurements above show primarily statistics for Jita but the benefits of StacklessIO apply everywhere. We measured Jita in particular because we could rely on activity and regular load in Jita for measurements. StacklessIO should have a positive impact on your playing experience, no matter where you are in the EVE universe and no matter what you are doing.
 
so is this like how they spent billions of krona on some supermegacomputercluster that was supposed to end lag back a few years ago
 
Much like in Tribes: I could load an area with thousands of beacons, and my old computer would crash because the software couldn't handle it.
I got a new computer, probably about 50x faster than my first rig, and I ran into the same issue.
If the software cannot handle the information, it doesn't matter how much hardware you have.
 
I don't play eve but this is cool. Could be significant progress... hopefully they start licensing the technology.
 
According to some posts on EVE-O, it has been deployed across the entire cluster. However, some large battles in Tribute yesterday resulted in several node crashes, before CCP moved those systems to a newer, dedicated 64-bit server and it improved dramatically. So while the new code looks promising, as they found in the Jita testing, the bottleneck has just moved elsewhere, so it's not a magic dagger. They still have a lot more work to do, but they are at least starting to hit the right areas.
 

Yeah, it pretty much clears up one problem area, allowing the load to pass to the next choke point.
But unless you get rid of the core problems, you won't ever see much improvement.
 
Yeah, but we're talking about 1,000 players in a single system before it crashed, and it was pretty fluid up until that point. That's definitely moving in the right direction.
 
When it comes to performance you're always going to be bottlenecked somewhere :shrug:.

True. Obvious, and true.
But that isn't a good attitude if you ever want this lag cleared up.
I, for one, am all for them trying to increase the speed of the game.
So they fixed one major problem. Now they have discovered that hardware is an issue. So they upgrade that, and then they get this fiber core setup, and it's back to the hardware. But at least we as gamers have gained from it.
I don't know about you, but I for one want to be part of 300 x 300 fleet battles without a 4-minute lag.
Hell, a 5-second lag would be amazing. You would see actual tactics and strategy.
Sniper fleets in amazing numbers coming in, or fleets of stealth bombers.
I'm ranting now. G'night.
 
fleets of stealth bombers.
I'm ranting now

made me laugh as this was an ongoing meme when I played.

Good for them. It's just too bad they started with the whole 'need for speed' and 'buy more hardware' approach before they decided to look into this. The analysis looks like something an engineer could do in a couple of weeks; I'm not sure why someone didn't go down this alley earlier.
 