CCP Fallout
Posted - 2010.06.30 13:35:00 -
[1]
As you know, CCP moved the Tranquility servers to a much larger and cooler server room and added new switches in the process. The downtime took longer than expected. CCP Yokai's newest dev blog fills us in on the events of the day.
CCP Fallout, Associate Community Manager, EVE Online
CCP Yokai
Posted - 2010.06.30 15:50:00 -
[2]
Edited by: CCP Yokai on 30/06/2010 15:53:25
PICS!!!
Overhead before cable
Cables
Connecting and testing each and every one of the Ethernet ports
Cleaned up
Overhead view of the cabinets with a look at the air containment transparent tiles
We blew out a shoe
The team for this trip (not the move team, but one of two prep trips): CCP cNOC, CCP Mindstar, CCP Yokai, CCP Zirnitra
This is all we have for now... all the pre-move prep work. We'll post the post-move pics next month, after we finish migrating the non-TQ items.
CCP Yokai
|
Posted - 2010.07.01 10:11:00 -
[3]
Originally by: Libin Herobi Looking forward to not receiving any answers in this thread.
Patch day today.
I'll start grinding through responses after downtime today.
Thanks!
CCP Yokai
|
Posted - 2010.07.01 11:32:00 -
[4]
Originally by: T'ealk O'Neil Edited by: T'ealk O'Neil on 30/06/2010 15:26:17
Originally by: Commander Azrael Apart from a DB backup being massive, they did back it up. If you read the dev blog, they chose the lengthier option of fixing the corrupted entries instead of rolling back. Which do you prefer: an extended downtime, or logging in to find the ISK from the missions you ran gone and that shiny ship you bought no longer there?
I suggest you re-read. They had A backup, but if they had taken a backup as the first step before starting any work, then no ISK would have been lost, as nobody would have been logged in between those times.
This is correct; we did have our normal backup, which was a few hours out of date. And yes, having the backup run right up to the start of downtime is the right way to do it... and it's what we are doing every time now. For today's client patch we did this too. So from this point forward we should have a full copy of the DB at a point where no transactions need to be replayed.
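To illustrate the idea, here is a minimal sketch, assuming a Microsoft SQL Server back end (the mention of checkDB later in this thread points that way); the server, database, and file names are invented for illustration and this is not our actual tooling.

# Sketch: kick off a full backup right at the start of downtime so the
# resulting copy needs no transactions replayed. Names are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=tq-db;DATABASE=master;"
    "Trusted_Connection=yes",
    autocommit=True,  # BACKUP cannot run inside a user transaction
)
cur = conn.cursor()
cur.execute(
    r"BACKUP DATABASE [eve] TO DISK = N'D:\backups\eve_downtime.bak' "
    r"WITH CHECKSUM, STATS = 10"
)
# Drain the informational result sets so the backup runs to completion
# before the script exits (a common pyodbc precaution with BACKUP).
while cur.nextset():
    pass
print("full backup complete")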
CCP Yokai
|
Posted - 2010.07.01 11:34:00 -
[5]
Originally by: T'ealk O'Neil Would it not be an idea, in future, when doing any patching/moving, to take a backup of the database as it stands before starting? That way a recovery is simple, rather than trying to repair everything, which takes forever.
Bonus points for being the first to say it... again. Dead right, and that's the process now on everything even remotely risky.
CCP Yokai
|
Posted - 2010.07.01 11:36:00 -
[6]
Originally by: Camios Edited by: Camios on 30/06/2010 15:59:58 You were grinning; that means this photo was taken before the mess.
cool pics btw
This was one of the last prep moves before the TQ move. Mainly just making sure the Ethernet systems were in good order.
CCP Yokai
|
Posted - 2010.07.01 11:45:00 -
[7]
Originally by: Amida Ta So the big question remains unanswered in the blog:
"after finding the root cause"
So what was the root cause?
I waited on posting the exact details until I'd had quite a few of our vendor experts chime in, to make certain we had the root cause.
One of the links to our RAM SAN storage was corrupting data being written to the storage device.
The exact bit that failed cannot be identified because, frankly, rather than tinkering with every link, transceiver, and switch port on the route, I just nuked it from orbit. We replaced the fiber, moved to a new pair of transceivers, new ports on the patch panels, and even new ports on the switches.
Once we did this... the errors on the SAN went away and storage was back to normal.
I hope that helps clarify where the root of the issue that caused the corruption was and what we found to be the problem. The nuking-from-orbit bit is not my preferred method of troubleshooting, but again... a choice of getting TQ online faster or satisfying my desire for empirical data... I chose TQ.
CCP Yokai
|
Posted - 2010.07.01 13:26:00 -
[8]
Edited by: CCP Yokai on 01/07/2010 13:27:26 Edited by: CCP Yokai on 01/07/2010 13:27:07
Originally by: Nofonno Edited by: Nofonno on 01/07/2010 13:02:02 Edited by: Nofonno on 01/07/2010 13:01:25 EDIT: I fail at posting from work...
Originally by: CCP Yokai
The exact bit that failed cannot be identified because, frankly, rather than tinkering with every link, transceiver, and switch port on the route, I just nuked it from orbit. We replaced the fiber, moved to a new pair of transceivers, new ports on the patch panels, and even new ports on the switches.
Once we did this... the errors on the SAN went away and storage was back to normal.
This made me really goggle-eyed. I've never seen or even heard of this being possible. Layer 1 errors are usually handled easily by the OS driver; data corruption shouldn't be possible. Even if FC didn't detect the errors (which it theoretically could miss), it encapsulates SCSI, which has rigorous error checking.
I'm stumped.
It is pretty complicated, which is why it wasn't in the initial posting while we verified. But rack up hundreds of thousands of both in-frame and out-of-frame errors without CRC errors on the corrupted payload, and you have the problem we had.
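To give a feel for what that signature looks like from the switch side, here is a minimal sketch, assuming Brocade-style porterrshow output (the switches are identified as Brocade later in the thread); the column layout, thresholds, and sample numbers are assumptions for illustration, not our actual monitoring.

# Sketch: flag FC ports whose counters match the pattern described above --
# large numbers of encoding errors in and out of frames, but few or no CRC
# errors. Column order and thresholds are assumptions, not real tooling.
ENC_THRESHOLD = 10_000   # "hundreds of thousands" in the real incident
CRC_THRESHOLD = 10

def suspect_ports(porterrshow_lines):
    """Yield (port, enc_in, enc_out, crc_err) for links that look like the
    silent-corruption signature: encoding errors without CRC errors."""
    for line in porterrshow_lines:
        fields = line.split()
        if not fields or not fields[0].rstrip(':').isdigit():
            continue  # skip the header row and blank lines
        port = int(fields[0].rstrip(':'))
        # Assumed layout: port, frames_tx, frames_rx, enc_in, crc_err,
        # too_short, too_long, enc_out, disc_c3
        enc_in, crc_err, enc_out = int(fields[3]), int(fields[4]), int(fields[7])
        if enc_in + enc_out > ENC_THRESHOLD and crc_err < CRC_THRESHOLD:
            yield port, enc_in, enc_out, crc_err

sample = [
    "port frames_tx frames_rx enc_in crc_err too_short too_long enc_out disc_c3",
    "0:   1200400   1100300   0      0       0         0        2       0",
    "1:   900210    870450    153004 1       0         0        240881  0",
]
for port, enc_in, enc_out, crc in suspect_ports(sample):
    print(f"port {port}: enc_in={enc_in} enc_out={enc_out} crc_err={crc} -> swap fiber/SFP, check switch port")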
CCP Yokai
|
Posted - 2010.07.01 13:34:00 -
[9]
Edited by: CCP Yokai on 01/07/2010 13:34:57 Edited by: CCP Yokai on 01/07/2010 13:34:35
Originally by: Commander Azrael
Originally by: T'ealk O'Neil Edited by: T'ealk O'Neil on 30/06/2010 15:26:17
Originally by: Commander Azrael Apart from a DB backup being massive, they did back it up. If you read the dev blog, they chose the lengthier option of fixing the corrupted entries instead of rolling back. Which do you prefer: an extended downtime, or logging in to find the ISK from the missions you ran gone and that shiny ship you bought no longer there?
I suggest you re-read. They had A backup, but if they had taken a backup as the first step before starting any work, then no ISK would have been lost, as nobody would have been logged in between those times.
Perhaps a Dev could let us know how long it takes to do a full TQ DB backup, out of curiosity.
Right now... about 2 hours. But we are working on making this much shorter... if not moving to snapshot replication outright. Part of the redesign project mentioned.
CCP Yokai
|
Posted - 2010.07.02 11:09:00 -
[10]
Just wanted to throw out a quick "Thank you" to everyone for posting, replying, helping out. I do appreciate the comments and plan to keep delivering more info on future projects as soon as possible.
I'll be monitoring this thread a bit less obsessively than hitting F5 every few hours now... so apologies if I'm delayed on future replies.
Thanks again CCP Yokai
CCP Yokai
|
Posted - 2010.07.05 16:17:00 -
[11]
"Kick out users -> take back-ups -> check integrity of backups -> upgrade -> test -> Rollback to CURRENT backup if needed / or online upgraded cluster."
That sounds like the right thing to do on a small or simple DB that can be down a lot longer. On TQ (and again with the current hardware/design)... let me give you the picture of what this would do.
Kick out users -> 11:00 GMT
Take back-ups -> 13:00 GMT complete (as stated previously, this takes 2 hours)
Check integrity of backups -> 18:00 GMT (checkDB on an uncorrupted DB in our case takes 5 hours; it takes up to 24 hours on a heavily corrupted DB like the one we dealt with during the outage)
Test -> 19:00 GMT (giving QA some time to make sure it works... hard to check without doing this)
So even on a flawless run we'd have 8 hours of downtime each day.
Again, not a bad method in some cases... but given the size, complexity, and demand for the availability of this DB it really needs more of a live replicated disaster recovery solution instead. Once we have the solutions worked out I'll start a new thread on that. Thanks again for the input from everyone... just thought I'd give a bit of feedback on suggestions as well.
CCP Yokai
|
Posted - 2010.07.05 16:21:00 -
[12]
Originally by: Libin Herobi Some things take longer to process than others.
It looks like you did a 2-hour backup and applied the Tyrannis 1.0.3 patch in a 1.5-hour downtime on July 1st. That is truly remarkable. Some would even say it's impossible...
The backup was started at 09:00 GMT, and the kind of backup we are doing runs live, adding changes to the backup as they happen. It was completed at about 11:05 GMT. Backups do not have to start after everyone leaves the game. So, yes, we did do a complete backup before the Tyrannis 1.0.3 patch.
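For anyone who wants to check this sort of thing themselves, here is a minimal sketch, assuming SQL Server, of reading the backup history the server keeps in msdb; the server and database names are invented for illustration.

# Sketch: confirm when a full backup started and finished. msdb.dbo.backupset
# is standard SQL Server; the connection details and database name are made up.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=tq-db;DATABASE=msdb;"
    "Trusted_Connection=yes"
)
cur = conn.cursor()
cur.execute(
    """
    SELECT TOP 5 database_name, backup_start_date, backup_finish_date
    FROM msdb.dbo.backupset
    WHERE database_name = 'eve' AND type = 'D'   -- 'D' = full database backup
    ORDER BY backup_start_date DESC
    """
)
for name, start, finish in cur.fetchall():
    print(f"{name}: started {start}, finished {finish}")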
I hope that helps clarify.
CCP Yokai
|
Posted - 2010.07.06 12:34:00 -
[13]
Originally by: IngarNT Errors on the SAN? It's Brocade, right? (YES)
It will be CRC errors, it always is, or a gammy SFP - in which case you need to set the tin up to auto-block a port over a certain threshold of out-of-frame errors and cut an alert - dropping a (redundant) link is a lot better than letting it vomit broken data into the downstream switch.
If it's CRC errors I have a Perl script which polls and reports on them pretty well. You're welcome to it, as free is considerably cheaper than the mountain of cash you need to license DCFM.
Alas, unless your ops travel at the speed of light, even if you can spot a CRC error you won't be able to intercept it before it hits disk - so yeah, better database structure to reduce recovery time is probably the best fix path.
Bonus points for getting it :) Hard stuff... corrupted before you can do anything about it. Best plan... better recovery...