Folks,
Tiggs and I wanted to give you more information about today's downtime. It was caused by two things:
1) We had an error with the patching process. This happens infrequently, and we try to avoid it at all costs. Whenever it does happen, we always put a procedure in place to minimize the likelihood of it happening again. This was responsible for approximately 1.5 hours of the unexpected downtime.
Once we diagnosed and corrected the first issue, we ran into a new and unexpected issue:
2) We added some new logging to our Commodities server to improve its function and the ability of our customer service people to assist customers. This logging ended up overloading the Commodities server at load time and causing an out-of-memory crash. Once we diagnosed the new issue, we had to do a new build and push out the new Commodities server to all clusters. This accounted for the other 4.5 hours of unexpected downtime.
We did not catch this on Test Center because the TC database is not as large as the databases on our production servers. We have used production galaxy backups in the past for testing, and going forward our policy will be to run a copy of a large production server on TC for at least a day.
Again, please understand that we are extremely sorry about this downtime, and we will work as hard as possible to minimize similar incidents in the future.
We hope you like the additions we put into the game for Publish 12.1 and the ones we will announce soon for the next publish.
Best,
- g
