Well, the last 24 hours have been hectic for the developers. The servers were down, there was a new publish, customer support was not working, and I'm sure there were a load of other things going on behind the scenes. Blair made a post explaining the process each publish goes through and why it takes so long, and answering two very good questions.
Greetings...
I want to spend a few moments with you talking about everything that happens when we do a monthly update, as well as what has happened in the last 24 hours.
The first thing I would like to do is define a couple of terms. The first is a hotfix; the second is a monthly update.
You see hot fixes every few days. I usually try to post a "What do we have on the plate for tomorrow" post containing what's coming up with the maintenance the next day. Hot fixes usually contain updates that are low risk and high bang for the buck. We might fix a server crash, fix a client crash, fix an exploit, or correct a big problem that we have recently solved. Once we have a set of finished code and data, we have to propagate the information to all the servers. This process takes many hours. There is a great deal of data that needs to get pushed across both sides of the country. Servers come down in the morning, and servers come up running the new version.
Monthly updates are a bit different. We start by moving all the current code, scripts, and data onto an internal QA server for testing. After a week, it moves to testcenter where it usually stays for 2 weeks. The last step in this process is moving the monthly update to live clusters.
When we do a monthly update, the first step is the same as with a hot fix: we must push the code and data to each cluster. Another critical step in the process is the database update. DB updates prepare the database for the next version of code and data. This takes many hours because we can't make all the database changes simultaneously. Once the database changes are made, we bring the servers up for an initial test run. Yesterday we ran into the series of problems outlined below:
1. Customer service problem. Customers were not able to submit tickets.
2. Commodities market crash. The server that controls the bazaar and vendors had a problem and continuously crashed.
3. If the commodities market did run, it quickly ran out of memory.
4. Factory crash. We quickly found two factory problems that would cause a gameserver to crash.
5. Gameserver out of memory. We found a problem where a gameserver would run out of memory, but we didn't have enough information in order to track the problem down. We had to build a new gameserver, push it to testcenter to get more information on the crash, make a fix, push it to testcenter to ensure it worked, then push it to live.
At each step, we have to build, test, push out to testcenter, wait, push out to live, find the next problem, wash, rinse, repeat. Unfortunately, debugging problems must be done serially. We push something out, a problem is found, it gets fixed, we push it out, we find the next problem, etc. Eventually we get through all the problems, and the server is up, stable, and you are having fun. Unfortunately, this process took us most of the day for more than half the servers.
With the rancor yesterday, there were two really good questions asked that I want to address. They are: "why don't you undo the monthly update?" and "why don't you roll out the update to one server to do an initial test run?" I'd like to take the time to answer both of these.
Why don't you undo the monthly update?
Most of the time, every server has 2 versions of the game on it: The version it is running, and the version before the one it is currently running. This way if we ever run into a problem, we can bring the servers down, change a config file, and bring the servers back up using the older version with a minimum of downtime. We've done this once since launch.
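To make the idea concrete, here is a minimal sketch of a config-driven rollback. It is an illustration only: the JSON format and the `active_version` and `previous_version` keys are assumptions made for the example, not the actual server configuration.

```python
import json


def rollback(config_text: str) -> str:
    """Return config text with the active and previous versions swapped.

    Hypothetical format: a JSON object with "active_version" and
    "previous_version" keys. Both builds are assumed to be installed
    on the server already, so only the config needs to change.
    """
    config = json.loads(config_text)
    if not config.get("previous_version"):
        raise ValueError("no previous version installed to roll back to")
    config["active_version"], config["previous_version"] = (
        config["previous_version"],
        config["active_version"],
    )
    return json.dumps(config, sort_keys=True)


if __name__ == "__main__":
    before = '{"active_version": "1.2", "previous_version": "1.1"}'
    print(rollback(before))
```

The servers would be brought down, the config rewritten this way, and the servers brought back up on the older build.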
When we do a monthly update, things are a bit different. Monthly updates often require a lot of data to change in the database. Indeed, the first step of rolling out the monthly update is to update the database with the latest data so that it will work with the newest version of server and client code. This process takes multiple hours. Once this is done, we are in the unique situation where the last version of the game won't work with the data currently in the database. We have the ability to run an undo on the database, but it takes many, many hours. It would have taken longer than performing the initial DB updates, and longer than solving the existing problem. We want you to enjoy the game. Please understand that if we could have simply clicked "undo", we would have.
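The difference between the two rollback situations can be sketched as follows. This is a simplified model under stated assumptions: the `Build` type, the integer schema versions, and the step descriptions are hypothetical and only illustrate why a config switch alone stops being enough once the database has been updated.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Build:
    name: str
    required_schema: int  # hypothetical: schema version this build expects


def can_start(build: Build, db_schema: int) -> bool:
    """A build only works against the database layout it was written for."""
    return build.required_schema == db_schema


def rollback_plan(previous: Build, db_schema: int) -> List[str]:
    """List the steps needed to get back onto the previous build."""
    steps = ["stop servers"]
    if not can_start(previous, db_schema):
        # After a monthly update the database no longer matches the old
        # build, so the slow database undo has to run first.
        steps.append(
            f"undo database changes (schema {db_schema} -> {previous.required_schema})"
        )
    steps.append(f"switch config to {previous.name}")
    steps.append("start servers")
    return steps


if __name__ == "__main__":
    old_build = Build("previous-build", required_schema=7)
    print(rollback_plan(old_build, db_schema=7))  # hot fix case
    print(rollback_plan(old_build, db_schema=8))  # monthly update case
```

In the hot fix case the plan is just the config switch; in the monthly update case the extra database step is the one that takes many hours.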
Why don't you roll out the update to one server to do an initial test run?
Testcenter is the server that handles this. The next question becomes: why don't we roll it out to a single live cluster? When there is a monthly update, there is a new client exe and new client data files. Right now, the launchpad has no knowledge of the characters that you have, and no knowledge of which character you want to play. How does the launchpad know which version of the game to patch to you, the new version or the old? What happens when you log in to a new server with old data? What happens when you log in to an old server with new data? The answer in both cases is bad things. Of course these questions can be answered and the problems solved, and this may be the direction we head long term. It's not as quick and easy as we would like it to be.
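The launchpad problem can be sketched in a few lines. Everything here is hypothetical (the cluster names, the version labels, and the function itself); the point is only that once clusters run different versions, the patcher needs to know which cluster the player is heading to, and that is information the launchpad does not have today.

```python
from typing import Dict, Optional


def client_version_to_patch(
    cluster_versions: Dict[str, str], chosen_cluster: Optional[str]
) -> str:
    """Pick the client version to patch down to the player.

    cluster_versions maps a (hypothetical) cluster name to the version
    that cluster is running.
    """
    versions = set(cluster_versions.values())
    if len(versions) == 1:
        # Every cluster runs the same version, so there is only one answer.
        return next(iter(versions))
    if chosen_cluster is None:
        # Mixed versions and no idea where the player is going.
        raise LookupError("cannot choose a client version without knowing the cluster")
    return cluster_versions[chosen_cluster]


if __name__ == "__main__":
    uniform = {"cluster-a": "new", "cluster-b": "new"}
    mixed = {"cluster-a": "new", "cluster-b": "old"}
    print(client_version_to_patch(uniform, None))
    print(client_version_to_patch(mixed, "cluster-b"))
```

With every cluster on the same version the choice is trivial, which is the situation a full rollout preserves.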
To say we had some problems yesterday is putting it mildly. We are working on a process review internally to determine exactly what is going to be different for the next update. The first question for us is why the problems listed at the top of the page weren't identified on test center. For the moment, we don't know.
The next update gets exciting for us here, so watch the forums for more details. We are working hard and have a talented team of professionals trying to make the best game possible. We love the game and are dedicated to making your experience the best it can be. Making an MMO is the most complicated game you can make, and we are going to make mistakes. But we learn, and do better the next time around. Thanks for your patience with us.
Blair
Associate Producer
