On behalf of the Operations Team, and with blog writing music blasting on full (Skálmöld), I am privileged to be telling you about a huge project called Tranquility Tech III, which we plan to complete by early 2016.
The project is called TQ Tech III (TQ is short for “Tranquility”, the live server for EVE Online) because this is the third time the EVE infrastructure has been physically moved. CCP is now also making a significant investment in all-new hardware (Network, Storage, and Servers) whilst moving it all to a new hosting facility within London.
We’ve accomplished such a herculean task a few times in the past. Here’s a quick trip down devblog memory lane to remind you:
In the continued commitment to EVE Forever, it is time for TQ to level up again, with new methods and technology having emerged to help inform us in our approach to running the super-complicated service and game and universe that is New Eden.
TQ Tech III has many highlights, which I’ll go over and enlist fellow devs to explain.
Warning: It’s about to get super tech-freaky up in here!
TQ’s storage is mirrored and redundant. The storage array has always been redundant but now we’ve added more failsafes.
There will be a full SAN mirror so that we can both maintain TQ and failover live, replicating a copy of the TQ Database across the ocean to Iceland, land of fire and ice.
From the storage side this is how TQ will look once we’re done:
What you see here are 2x IBM SAN Volume Controllers, which govern and control 2x IBM V5000 controllers. These store all the data, with 3x expansion shelves that house 9x 800GB SSDs and a grand total of 83x 1.2TB 10K SAS disks.
Remember this is all mirrored so double these disk numbers for the full picture!
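For a rough feel of what "double these disk numbers" means, here is a minimal back-of-the-envelope sketch. It assumes the counts quoted above (9 SSDs and 83 SAS disks) are totals for one side of the mirror, and it only adds up raw capacity: RAID levels, hot spares and formatting overhead are not stated in this blog and are ignored. The function name is made up for illustration.

```python
# Raw (pre-RAID) capacity sketch.
# Assumption: the disk counts below are totals for ONE side of the mirror.
SSD_COUNT, SSD_TB = 9, 0.8      # 9x 800GB SSDs
SAS_COUNT, SAS_TB = 83, 1.2     # 83x 1.2TB 10K SAS disks
MIRROR_SIDES = 2                # everything is mirrored


def raw_capacity_tb(sides: int = 1) -> float:
    """Raw capacity in TB for the given number of mirror sides."""
    per_side = SSD_COUNT * SSD_TB + SAS_COUNT * SAS_TB
    return round(per_side * sides, 1)


def disk_count(sides: int = 1) -> int:
    """Number of physical disks for the given number of mirror sides."""
    return (SSD_COUNT + SAS_COUNT) * sides


print(f"one side: {disk_count(1)} disks, {raw_capacity_tb(1)} TB raw")
print(f"mirrored: {disk_count(MIRROR_SIDES)} disks, "
      f"{raw_capacity_tb(MIRROR_SIDES)} TB raw")
```

Usable capacity will be lower than the raw number once redundancy inside each array is taken into account.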
All this lightning fast and redundant storage has to talk to servers…
The New Servers
Since day one, EVE has been operated by IBM blade servers. For the upgrade we chose the next generation of IBM servers called IBM FLEX.
The above picture demonstrates 1x FLEX chassis’ connections to its storage.
The FLEX concept is similar to the blade centers in that there is a chassis which has power and cooling and you can have up to 14x nodes in each chassis.
In comparison, the current TQ blade centers run on 4x 1Gbit network connections, and each of the 14x blades has access to 2x 1Gbit as they have 2x network cards.
The new IBM Flex chassis will have 4x 10Gbit network connections, giving each Flex node access to a total of 2x 10Gbit throughput.
For the way EVE currently runs this is overkill, but having it in place allows our engineers to test out interesting new ways to scale TQ and improve the performance of its architecture. That will take time, of course, but the immediate benefit is that our deployment times will speed up a lot!
This will also drastically improve our virtual server environment, for example when we live migrate Virtual Machines across physical hosts.
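To put the old and new network numbers side by side, here is a small sketch that uses only the link counts and speeds quoted above. The `Chassis` class and its field names are invented for this illustration, and the old per-blade figure is read as 2x 1Gbit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chassis:
    name: str
    uplinks: int        # network connections per chassis
    uplink_gbit: int    # speed of each chassis connection
    nics_per_node: int  # network cards per blade / node
    nic_gbit: int       # speed of each card

    @property
    def chassis_gbit(self) -> int:
        """Total chassis bandwidth in Gbit."""
        return self.uplinks * self.uplink_gbit

    @property
    def node_gbit(self) -> int:
        """Bandwidth available to a single blade / node in Gbit."""
        return self.nics_per_node * self.nic_gbit


current = Chassis("blade center", uplinks=4, uplink_gbit=1,
                  nics_per_node=2, nic_gbit=1)
flex = Chassis("Flex", uplinks=4, uplink_gbit=10,
               nics_per_node=2, nic_gbit=10)

for c in (current, flex):
    print(f"{c.name}: {c.chassis_gbit} Gbit per chassis, "
          f"{c.node_gbit} Gbit per node")
print(f"per-node speedup: {flex.node_gbit // current.node_gbit}x")
```

These are link speeds, not measured throughput, so treat the ratio as an upper bound.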
Oh yeah, we will have 6x chassis btw :D
Isn’t she a beauty?
You can see a lot of redundancy in the internal components, which is planned so we can maintain TQ by swapping a whole chassis out of rotation while EVE players continue to battle, trade, chat, manufacture, explore, and scam on the other 5.
The servers are connected to the storage via 2498-F48 IBM SAN 16Gbit Switches with everything cross-connected so there is no single point of failure.
The Mahālangūr Himāl
Today we have one server which we call the “Everest” node. It is assigned to all the most prolific high-load EVE situations, typically the biggest fleet battles in all of gaming.
With TQ Tech III there will be 6x Everest Nodes.
That leaves a lot of potential for a lot of spaceships to explode at the same time. It also means that certain alliances can forget to pay several crucial bills at one time if they so choose!
The Sexy TQ Database Machine
Let’s take a deeper look at the cluster, starting with TQ’s database machines.
The 4x Microsoft SQL Database machines will each have a whopping 768GB of RAM running at 1866MHz. They have 2x Intel E7-8893 v3 3.2GHz CPUs with 4 cores each (8 hyper-threaded) and 45MB cache, which are ideal for database-intensive workloads.
To go into a little bit more detail on the DB side, here are some notes from the Database Administration team.
The Database Clusters
Currently we have three main production DB clusters:
- TQ (2x cpu w/ 32 Hyper Threaded cores)
- Web (2x cpu w/ 24 Hyper Threaded cores)
- Account Management and Payment (2x cpu w/ 24 Hyper Threaded cores)
All three are on very different types of hardware, spanning multiple generations of architecture, and mostly held together with Minmatar duct tape, soulful Amarr prayers, the naïve and hopeful Gallente spirit, and low-grade mass-produced Caldari chicken wire from upgrades over the years.
With our New TQ cluster, we’re looking to consolidate and free up physical space so we’ll be combining Web and Account Management and Payment clusters while we keep TQ separate.
For CPUs we are upgrading from ancient 5-year-old X7560s @ 2.26 GHz to brand spanking new E7-8893 v3s @ 3.2 GHz. That’s a roughly 42% increase in clock speed alone, and on top of that we get a huge 75% increase in memory bus speed by going from 1066 to 1866 MHz!! Make no mistake, we need all that extra memory speed as we go from 672GB of RAM to 1.5TB. Yes, you read that right – 1.5 TERA bytes of RAM! That’s how we roll these days: we count our TQ hardware memory in terabytes!
Keep in mind these numbers are just from active nodes, so the New TQ DB cluster total is double this when we factor in the secondary / passive nodes. 3 TB of RAM for our 2 production DB Clusters – mmmmm, tasty!
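The percentages and RAM totals are easy to sanity-check. Here is a minimal sketch that uses only the clock speeds, memory speeds and per-machine RAM quoted in this section. The helper `pct_increase` is just an illustrative name, and "one active node per cluster" is how the 1.5TB figure is read here.

```python
def pct_increase(old: float, new: float) -> float:
    """Percentage increase going from old to new."""
    return (new - old) / old * 100


clock_pct = pct_increase(2.26, 3.2)      # X7560 -> E7-8893 v3, in GHz
mem_bus_pct = pct_increase(1066, 1866)   # memory speed, in MHz

RAM_PER_MACHINE_GB = 768
ACTIVE_NODES = 2                         # assumed: one active node per cluster
active_ram_gb = RAM_PER_MACHINE_GB * ACTIVE_NODES
total_ram_gb = active_ram_gb * 2         # add the secondary / passive nodes

print(f"clock speed: +{clock_pct:.1f}%")
print(f"memory bus:  +{mem_bus_pct:.1f}%")
print(f"active RAM:  {active_ram_gb / 1024} TB")
print(f"total RAM:   {total_ram_gb / 1024} TB")
```

Clock speed alone says little about real database performance across CPU generations, so this only checks the arithmetic, not the speedup players will see.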
While we had discussed having a single two-node Active-Active cluster, we decided against this for various reasons. For example, with one cluster.exe crash the whole shebang could go down. Having the TQ DB cluster isolated gives us great peace of mind across the whole system.
With that said, we have 4 amazingly powerful DB machines to host our two clusters and have come up with a very interesting plan in an attempt to maximize redundancy.
Virtualize All The Things!
Before we go further, keep in mind that this is a proof of concept that we have yet to test. It is entirely possible we’ll ditch this and just go with plain normal clusters (that still happen to be running on insanely UltraMegaShiny hardware from the heavens).
Our plan is to create a 4-node ESXi cluster farm with the 4 monster nodes. On top of the hypervisor we’ll build both of our SQL Server clusters with one cluster node per ESXi server – as if they were physical. No real change there!
The real benefit comes into play when (or if, but more likely when) one of these physical hosts needs hardware maintenance or needs to be taken offline for some reason. At this point, in a typical two-node physical cluster we would be forced to run on only one cluster node and have fingers crossed that our now-single-point-of-failure does not fail. Lots of soulful Amarrian prayers would be required.
With our virtual direction, we could simply vMotion the passive cluster node from its dedicated ESXi host to another ESXi host (the one hosting the 2nd cluster’s passive node)… and Bob’s yer uncle! Sure, that one host with two passive nodes will now be over-allocated, but we would have to lose two more hosts before that becomes an issue!
This means that not only are we redundant on the SQL Instance level by using Windows Failover Clustering, but we’ll now be able to survive more than one hardware failure as well! We have a lot of testing to do with this but for the most part this is all proven tech, so really, what could possibly go wrong?!
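The maintenance scenario above can be modelled in a few lines. This is a toy sketch of the placement logic only, not of anything in VMware or Windows Failover Clustering: the host names (`esxi1` to `esxi4`), VM names, and the `evacuate` and `is_redundant` helpers are all hypothetical, and the rule "prefer a host that only runs passive nodes" is an assumption drawn from the description above.

```python
from typing import Dict, Tuple

HOSTS = ["esxi1", "esxi2", "esxi3", "esxi4"]   # hypothetical host names

# VM name -> (cluster, role, host). All names are made up for illustration.
Placement = Dict[str, Tuple[str, str, str]]


def initial_placement() -> Placement:
    """One cluster node per ESXi host, as if they were physical."""
    return {
        "tq-a":  ("tq",  "active",  "esxi1"),
        "tq-b":  ("tq",  "passive", "esxi2"),
        "web-a": ("web", "active",  "esxi3"),
        "web-b": ("web", "passive", "esxi4"),
    }


def evacuate(placement: Placement, host: str) -> Placement:
    """Move every VM off `host`, modelling a vMotion before maintenance."""
    result = dict(placement)
    for vm, (cluster, role, vm_host) in placement.items():
        if vm_host != host:
            continue
        # Never co-locate two nodes of the same cluster on one host.
        same_cluster_hosts = {h for c, _, h in result.values() if c == cluster}
        # Rank the remaining hosts: those without an active node sort first.
        ranked = sorted(
            (any(r == "active" for _, r, h in result.values() if h == t), t)
            for t in HOSTS
            if t != host and t not in same_cluster_hosts
        )
        if not ranked:
            raise RuntimeError(f"no safe host left for {vm}")
        result[vm] = (cluster, role, ranked[0][1])
    return result


def is_redundant(placement: Placement) -> bool:
    """True if no cluster has two of its nodes on the same host."""
    hosts_by_cluster: Dict[str, list] = {}
    for cluster, _, host in placement.values():
        hosts_by_cluster.setdefault(cluster, []).append(host)
    return all(len(set(h)) == len(h) for h in hosts_by_cluster.values())


after = evacuate(initial_placement(), "esxi2")   # take TQ's passive host down
for vm, (cluster, role, host) in sorted(after.items()):
    print(f"{vm}: {role} node of {cluster} on {host}")
print("both clusters still two-node and host-redundant:", is_redundant(after))
```

In this model the TQ passive node lands next to the other cluster's passive node, so both SQL Server clusters keep two nodes on separate hosts while one machine is out for maintenance.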
CCP DeNormalized, CCP Hunter, CCP Stephanie, and CCP Jolin