
CCP Fallout

|
Posted - 2010.06.30 13:35:00 -
[1]
As you know, CCP moved the Tranquility servers to a much larger and cooler server room and added new switches in the process. The downtime took longer than expected. CCP Yokai's newest dev blog fills us in on the events of the day.
Fallout, Associate Community Manager, CCP Hf, EVE Online |
|

XXSketchxx
Gallente Remote Soviet Industries Important Internet Spaceship League
|
Posted - 2010.06.30 13:42:00 -
[2]
First.
|

Chiana Torrou
|
Posted - 2010.06.30 13:46:00 -
[3]
Contrary to many others who post on the forums I still think you all did a really good job in the face of very difficult circumstances.
Thanks for all the hard work - and the free skill points
|

Brolly
Caldari Icarus' Wings
|
Posted - 2010.06.30 13:46:00 -
[4]
Fantastic stuff, nice to be kept in the loop.
I would have been surprised if the game was up in 6 hours tbh as we all know how funky computers can be. Great job though, kudos to all involved with the move.
|

Ban Doga
|
Posted - 2010.06.30 13:48:00 -
[5]
Not sure if I missed anything, but all that failed was the new storage area network for the database? Since that happened while TQ was offline, why was losing transactions something you wanted to avoid? Which transactions could have been lost?
|

Baeryn
22nd Black Rise Defensive Unit
|
Posted - 2010.06.30 13:51:00 -
[6]
Edited by: Baeryn on 30/06/2010 13:51:23
Quote: ...we began recovering the corrupted transaction logs, and replaying them to fill in any missing data...
This happens in almost every MSSQL emergency recovery I've ever been involved in. MySQL, on the other hand, usually goes much more smoothly.
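The replay pattern, for anyone who hasn't had the pleasure, looks roughly like this in T-SQL (a sketch only; the database name, paths and STOPAT time are all made up):
    -- Illustrative names/paths only
    -- Restore the last full backup, leaving the database ready for more restores
    RESTORE DATABASE TQ FROM DISK = N'D:\backup\TQ_full.bak' WITH NORECOVERY;
    -- Replay each surviving transaction log backup in sequence
    RESTORE LOG TQ FROM DISK = N'D:\backup\TQ_log_1.trn' WITH NORECOVERY;
    RESTORE LOG TQ FROM DISK = N'D:\backup\TQ_log_2.trn' WITH NORECOVERY;
    -- Stop at a known-good point in time and bring the database online
    RESTORE LOG TQ FROM DISK = N'D:\backup\TQ_log_3.trn'
        WITH RECOVERY, STOPAT = '2010-06-29 10:00:00';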
Thanks for busting ass to get it back up for us, though! Role Playing Games by RolePlayGateway |

schwar2ss
Caldari
|
Posted - 2010.06.30 13:52:00 -
[7]
Edited by: schwar2ss on 30/06/2010 13:53:18 Thanks for the feedback. Obviously you didn't feed the DB-hamsters very well. On a serious note: how can a db be brought offline (and back online later) in a messy state? These procedures are meant to finish all transactions, save the logs, disconnect all users and detach the files from the DBMS. Did these errors occur while replaying the logs when shutting the db down?
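For reference, the clean-shutdown sequence I mean is roughly this; a sketch, not necessarily CCP's actual procedure:
    -- Illustrative database name
    -- Disconnect all users and roll back their open transactions
    ALTER DATABASE TQ SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    -- Take the database offline with all committed work flushed to disk
    ALTER DATABASE TQ SET OFFLINE;
    -- ...or, instead of going offline, detach it so the files can be moved
    -- EXEC sp_detach_db @dbname = N'TQ';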
|

Devin Maximus
|
Posted - 2010.06.30 13:52:00 -
[8]
First of all, thanks for the transparency on the issue. Having worked on providing internet-based services before (on a MUCH smaller scale ofc) and having been bitten by bad data, I appreciate the time spent making sure my shiny internet ships were all still in their hangar. I'd rather see a day lost than be missing a pretty ship I had just bought.
And to all you whiners out there why don't you step outside, take in the sky and perhaps go for a hike? Me thinks you've been stuck in a pod too long. ;)
|

Tanjia Guileless
|
Posted - 2010.06.30 13:54:00 -
[9]
"What are we doing to prevent this?"
Migrating to a serious database product?
|

FingerThief
Gallente
|
Posted - 2010.06.30 14:01:00 -
[10]
Originally by: Tanjia Guileless "What are we doing to prevent this?"
Migrating to a serious database product?
Define serious, just so I can get a few more laughs out of your post. Fighting like Don Quixote, one windmill at a time. |
|

Mashie Saldana
Red Federation
|
Posted - 2010.06.30 14:06:00 -
[11]
So what caused the problem on the SAN in the first place: Broken hardware or misconfiguration?
|

Grez
M. Corp Daisho Syndicate
|
Posted - 2010.06.30 14:11:00 -
[12]
Edited by: Grez on 30/06/2010 14:12:45
Originally by: Tanjia Guileless "What are we doing to prevent this?"
Migrating to a serious database product?
This has been done to death...
Seems you don't know much about DBs. Oracle would be too slow for what CCP need, hampering performance. MySQL is not robust enough compared to other DBs and has retention issues. MSSQL, on the other hand, is in all honesty perfect for what CCP need. Don't even bother mentioning other DBs like DB2, etc. They're not even in the same category as Oracle and MSSQL.
There is a reason CCP chose MSSQL, and I have enough faith in them to believe they have tested all available DBs and know which is best for TQ.
This is probably one of the many posts that will end up quoting yours and laughing, waiting for your 30+ internet years of professional internet lawyer-ism and database management to aid CCP in their not-so-srs bidniz. ---
|

Raquel Smith
Caldari Freedom-Technologies Eych Four Eks Zero Ahr
|
Posted - 2010.06.30 14:19:00 -
[13]
Originally by: CCP Fallout As you know, CCP moved the Tranquility servers to a much larger and cooler server room and added new switches in the process. The downtime took longer than expected. CCP Yokai's newest dev blog fills us in on the events of the day.
I work in IT and COMPLETELY UNDERSTAND unforeseen happenings as a result of maintenance!
I've had a routine security update corrupt an entire LDAP database; it caused a week of instability and hassle.
Thanks for the blog post.
-- Creator of The Ruby API Library |

Batolemaeus
Caldari Vauryndar Dalharil
|
Posted - 2010.06.30 14:20:00 -
[14]
Originally by: Tanjia Guileless "What are we doing to prevent this?"
Migrating to a serious database product?
CouchDB?
Nice blog btw., and thanks for explaining what went wrong. We're still missing a few pictures and fancy graphs though. 
|

wapko
The Ankou Systematic-Chaos
|
Posted - 2010.06.30 14:21:00 -
[15]
that burrito analogy.... srsly ...
you did good work. it is much appreciated.. now get some pics and show us the shineys :)
|

Paknac Queltel
Caldari Provisions
|
Posted - 2010.06.30 14:26:00 -
[16]
I, for one, am glad my Raven is still here. 
Originally by: Tanjia Guileless "What are we doing to prevent this?"
Migrating to a serious database product?
To say MSSQL is not a serious database product is somewhat unfair.
Especially since the corruption mentioned here happened below the database engine.
Do you expect a car to keep driving in a straight line after the ground underneath it disappears? - Paknac Queltel
|

Shintai
Gallente Arx Io Orbital Factories Arx Io
|
Posted - 2010.06.30 14:34:00 -
[17]
Originally by: Tanjia Guileless "What are we doing to prevent this?"
Migrating to a serious database product?
Troll detected. Or just someone who has never worked with a DB, or at least not any serious DB.
Hey my DB with 500 entries and nothing else around works all the time. CCP must suck!!  --------------------------------------
Abstraction and Transcendence: Nature, Shintai, and Geometry |

Amy Garzan
Gallente The Warp Rats Intrepid Crossing
|
Posted - 2010.06.30 14:38:00 -
[18]
Pictures? -------------------------------------------------- 101010 The Answer to Life, The Universe, and Everything |

Regat Kozovv
Caldari Alcothology
|
Posted - 2010.06.30 14:42:00 -
[19]
Big Iron is just that. Big. And Hard. "Stuff" happens and you're up till 2AM trying to figure out what happened to your meticulous planning.
I'm sure I can speak for many not posting here in that we appreciate the hard work done.
Also, thanks for the SP reimbursement. I thought it was a perfect way to compensate and a class act.
Thanks for the dev blog and hope you guys learned some new stuff from it. 
Originally by: CCP Atropos THIS IS WHY WE CAN'T HAVE NICE THINGS.
|

Dusty Meg
Shock-Wave Industrys Astro Lux Aedificatiae
|
Posted - 2010.06.30 14:47:00 -
[20]
Great job guys. You chose to go the way that no other game has gone and got the problems that come with it. It's still a magical thing that you can keep the TQ server running (somewhat smoothly )
|
|
|

Chribba
Otherworld Enterprises Otherworld Empire
|
Posted - 2010.06.30 14:49:00 -
[21]
Edited by: Chribba on 30/06/2010 14:50:07 Expected hardware photos to come too, left disappointed. But now to read the text... 
Secure 3rd party service | my in-game channel 'Holy Veldspar' |
|

T'ealk O'Neil
|
Posted - 2010.06.30 14:52:00 -
[22]
Would it not be an idea, in future when doing any patching or moving, to take a backup of the database as it stands before starting? That way a recovery is simple, rather than trying to repair everything, which takes forever.
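Something as simple as this, run once the cluster is down, would do it (a sketch only; the path is made up, and COPY_ONLY needs SQL Server 2005 or later):
    -- Illustrative name/path; one-off full backup, kept outside the regular chain
    BACKUP DATABASE TQ
        TO DISK = N'D:\backup\TQ_premaintenance.bak'
        WITH COPY_ONLY,   -- don't disturb the scheduled backup sequence
             CHECKSUM,    -- verify page checksums while writing
             STATS = 10;  -- progress report every 10 percent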
|

Alexa Lanxia
|
Posted - 2010.06.30 15:04:00 -
[23]
You asked for questions so here we go. I've never seen putting "actual load on the storage area network" cause corrupt database tables. You might get routing problems, zoning problems, reservations issues - I've seen many strange problems in the past. But you usually either can access your target LUNs or you can't so I'm not sure what to make of that, care to elaborate? Was it human error or did the actual hardware have a problem?
(What's your switch-vendor anyway, Brocade, Cisco or something more obscure if you don't mind me asking?)
|

Louis deGuerre
Gallente Amicus Morte Shock an Awe
|
Posted - 2010.06.30 15:06:00 -
[24]
Needs more pictures of cabling, hamsters and elephants fighting CCP Soundwave 
Nice blog  Sol: A microwarp drive? In a battleship? Are you insane? They aren't built for this! Clear Skies - The Movie
|

Mynxee
|
Posted - 2010.06.30 15:06:00 -
[25]
Thanks for that summary of what happened. I imagine things were extremely stressful on many fronts throughout the entire effort.
Life In Low Sec |

Mabrick
Mabrick Mining and Manufacturing
|
Posted - 2010.06.30 15:21:00 -
[26]
Pesky SANs. If the IT god of thunder had meant for contiguous data to be broken up and spread across hell's half acre of magnetic storage, the IT god of thunder would not have created contiguous data! I've always shaken my head at the fact that we create monstrous databases to organize our data and then scatter the bits across so many platters with nothing more to cover our backsides than a few thin mathematical algorithms. The vision of nice, clean, logically related tables spread willy-nilly everywhere just makes me shudder.
CCP did it right. Good job and many thanks! 
|

Commander Azrael
Red Federation
|
Posted - 2010.06.30 15:22:00 -
[27]
Edited by: Commander Azrael on 30/06/2010 15:25:53
Originally by: T'ealk O'Neil Would it not be an idea in future when doing any patching / moving to take a backup of the database as it stands before starting - that way a recovery is simple, rather than trying to repair everything, which takes forever
Apart from a DB backup being massive, they did back it up. If you read the dev blog, they chose the lengthier option of fixing the corrupted entries instead of rolling back. Which do you prefer: an extended downtime, or logging in to find ISK missing from the missions you ran and that shiny ship you bought no longer there?
Originally by: Alexa Lanxia (What's your switch-vendor anyway, Brocade, Cisco or something more obscure if you don't mind me asking?)
http://www.eveonline.com/devblog.asp?a=blog&bid=769
Primarily Cisco.
|

T'ealk O'Neil
|
Posted - 2010.06.30 15:26:00 -
[28]
Edited by: T'ealk O'Neil on 30/06/2010 15:26:17
Originally by: Commander Azrael Apart from a DB backup being massive, they did back it up. If you read the dev blog, they chose the lengthier option of fixing the corrupted entries instead of rolling back. Which do you prefer: an extended downtime, or logging in to find ISK missing from the missions you ran and that shiny ship you bought no longer there?
I suggest you re-read. They had A backup, but if they had taken a backup as the first step before starting any work then no isk would have been lost as nobody would have been logged in between those times.
|

Laconis Dax
Children of Armok
|
Posted - 2010.06.30 15:27:00 -
[29]
You're doing fine work, CCP. I know that couldn't have been fun or easy.
And thanks for sharing the details with us. Always nice to get my fix of infrastructure ****. 
|

Lolion Reglo
Death Incarnate INC
|
Posted - 2010.06.30 15:31:00 -
[30]
Well, thank you for telling us all that happened that day. I was one of the patient ones who said do the job right so it doesn't happen again, so I found the XP bonus to be a nice surprise and a nice gift. Took 2 days off Logistics V for me .
However, I still don't think you guys deserved half the stuff on the Facebook page that you got. It's one thing to be harsh towards you, say get the server up NOW, and hold you accountable for the service you provide, but an entirely different thing to verbally abuse you guys.
|
|

Commander Azrael
Red Federation
|
Posted - 2010.06.30 15:32:00 -
[31]
Originally by: T'ealk O'Neil Edited by: T'ealk O'Neil on 30/06/2010 15:26:17
Originally by: Commander Azrael Apart from a DB backup being massive, they did back it up. If you read the dev blog, they chose the lengthier option of fixing the corrupted entries instead of rolling back. Which do you prefer: an extended downtime, or logging in to find ISK missing from the missions you ran and that shiny ship you bought no longer there?
I suggest you re-read. They had A backup, but if they had taken a backup as the first step before starting any work then no isk would have been lost as nobody would have been logged in between those times.
Perhaps a Dev could let us know how long it takes to do a full TQ DB backup, out of curiosity.
|

Camios
Minmatar Insurgent New Eden Tribe
|
Posted - 2010.06.30 15:33:00 -
[32]
Next time be more cautious!
Keep up the good work.
|

perix
Minmatar Sane Industries Inc. Initiative Mercenaries
|
Posted - 2010.06.30 15:40:00 -
[33]
Thank you for the information and update. I find it interesting to read about an environment as complex as Tranquility.
|
|

CCP Yokai

|
Posted - 2010.06.30 15:50:00 -
[34]
Edited by: CCP Yokai on 30/06/2010 15:53:25 PICS!!!
Overhead before cable

Cables

Connecting and testing each and every one of the Ethernet ports

Cleaned up

Overhead view of the cabinets with a look at the air containment transparent tiles

We blew out a shoe

The team for this trip... (not the move team, but one of two prep trips): CCP cNOC, CCP Mindstar, CCP Yokai, CCP Zirnitra

This is all we have for now... all the pre-move prep work. The post-move pics we'll do next month after we finish migrating the non-TQ items.
|
|

Camios
Minmatar Insurgent New Eden Tribe
|
Posted - 2010.06.30 15:55:00 -
[35]
You were grinning; that means this photo was taken before the mess.
|

teko82
Caldari Mark Of Chaos
|
Posted - 2010.06.30 16:06:00 -
[36]
I keep wondering about the whine every time there is a big change and a longer DT than expected; by now most players should know that you always put on a long skill for days like these. And I do hope that people have other things in their lives they can do during these DTs... most PCs have solitaire 
The company I work at never has issues this big taking that long to fix, but with around 200K employees they have a lot of backup stuff to secure every change. On the other hand, every time there is a larger change in any of the systems we use, it will normally be running as intended around 2 months after the launch; in EVE it is like a week, max . So this is nothing in my opinion ;) And nice pics, feels like my PCs need to be upgraded  All I say is at the account of others! |

Ix Forres
Caldari Vanguard Frontiers Intrepid Crossing
|
Posted - 2010.06.30 16:21:00 -
[37]
I'd just like to echo those congratulating you; I can only imagine how nasty that night/day must've been for you all personally, and I think everyone appreciates the dedicated work along with your diligence in putting the integrity of the game first, and not rushing ahead.
On an entirely unrelated note, how about a complete list of hardware and network topology? I know a lot of people would be very interested - I know the basics are out there, but some of us are truly geeky and love to hear about the architecture, infrastructure (right down to power management and cooling systems) and so on! -- Ix Forres EVE Application Developer EVE Metrics | accVIEW | I Tweet |

Batolemaeus
Caldari Vauryndar Dalharil
|
Posted - 2010.06.30 16:21:00 -
[38]
Originally by: CCP Yokai
PICS!!!
Thank you, the thread is now complete.
|

Pwnzorator
|
Posted - 2010.06.30 16:26:00 -
[39]
Originally by: CCP Yokai
We blew out a shoe
I've seen holes like that before! Someone stood on one of the sticky-up floor bolts in a rack?
Overall, that's a really nice neat install job. Better than most of the cable monsters I've had the misfortune to work with
|

Commander Azrael
Red Federation
|
Posted - 2010.06.30 16:29:00 -
[40]
That's awesome, love the pics :). Kudos to CCP for getting the cluster back up and running after a full suite move, which is a royal pain in the ass! We consolidated 3 of our suites about a year ago into 1 big suite with 120 racks and that was an absolute nightmare! So I feel your pain :)
Moar pics! 
</geek mode>
|
|

Cinori Aluben
Minmatar Gladiators of Rage Honourable Templum of Alcedonia
|
Posted - 2010.06.30 16:30:00 -
[41]
Edited by: Cinori Aluben on 30/06/2010 16:31:02 CCP Yokai, you are the man. I'm continuing to like you more and more. And nice pics! (Btw how did that shoe get blown out? And was the foot inside unharmed? lol)
Couple things I appreciated from your blog:
Quote: Despite rumors and criticisms to the contrary, our plan included a significant time buffer for the work.
Glad to know this, even though it unfortunately took even longer. I was called unprofessional for suggesting a buffer, but you showed wisdom in figuring in a time buffer.
Quote: VIP Mode is when Tranquility is up, but accessible to CCP staff only (many of you noticed and were curious about why 30+ others were on TQ while you couldn't log in).
Hook a brotha up with VIP invite yo! 
Quote: We all really appreciate the understanding and kind words... and even the harsh ones we needed to hear.
You are a humble, good dev. Keep listening and keep learning; you can never stop learning. Many at CCP could learn from your approach here, and I hope you get credit for it.
I look forward to your responses to all the great IT questions in these comments, and to future server improvements I know you've got coming up the pipe :)
|

Dismas Ofstedal
Minmatar Dead Pilots Society
|
Posted - 2010.06.30 16:40:00 -
[42]
Personal anecdote: Waaay back, when I was a lowly hamster keeping the reel-to-reel tape drives loaded I watched as the techies tried to debug a hardware issue or two on a new mainframe gadget - they had circuit boards strewn all over the floor. They went home to sleep and look at the problem with fresh eyes in the morning. Sometime during the night, a janitor picked up all the cards and they got put through the garbage crusher. 
Seriously, this outage was like the aftermath of a burrito binge. You think the world is ending at the time, but a few days later, it's just a dim memory and you're jonesing for more burritos.
It was a minor inconvenience, just a tiny part of the bigger adventure, and well compensated. I've had poorer service from my former bank.
Thanks CCP.
|

Amida Ta
German Mining and Manufacture Corp.
|
Posted - 2010.06.30 16:56:00 -
[43]
So the big question remains unanswered in the blog:
"after finding the root cause"
So what was the root cause? _________________________ EveAI.Live - The EVE-Online API/class library for .Net, C# and VB.Net |

Tsabrock
Gallente Circle of Friends
|
Posted - 2010.06.30 17:04:00 -
[44]
I had planned for an extended downtime, as being a small-time techie myself I know how much longer things can take to fix. Maybe the Mr. Scott method of time estimation is in order?
I am greatly curious, how long do backups of the TQ database take? How many GB of data are ye dealing with? I know it's vastly more than anything I do with my own service & repair, and have always wondered. --- If you've read something I posted and want to contact me, EVE-Mail me, or contact me via EVE Gate. |

RedClaws
Amarr Black Serpent Technologies R.A.G.E
|
Posted - 2010.06.30 17:16:00 -
[45]
Originally by: Amida Ta So the big question remains unanswered in the blog:
"after finding the root cause"
So what was the root cause?
root beer 
Nice pictures, we do love pictures of geeky computery stuff
|

Dead Cheese
Gallente Construction Cabal Ishuk-Raata Enforcement Directive
|
Posted - 2010.06.30 17:22:00 -
[46]
Edited by: Dead Cheese on 30/06/2010 17:25:00 As a professional geek myself, I too am a little perplexed at the seeming lack of a special non-scheduled backup. A full DB dump should have been done after the app cluster was shut down, but before the DB cluster was shut down. Transaction logs are only there to save you in the event of an unplanned outage - to catch you up to your current dataset from last night's backup. We shouldn't even be discussing transaction logs with a major planned change such as this.
Furthermore, what make/model of SAN are you running? All major SAN vendors have a snapshot capability. If a snapshot of the DB volumes had been taken just before the suggested DB backup, the DB would have been back up in five minutes. Then the only real downtime would be due to your lengthy testing procedures (which is a superior move, BTW).
I'm so confused. More details for the geeks please!
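On the snapshot point: SAN snapshots are vendor-specific, but SQL Server itself (2005 Enterprise and later) has database snapshots with the same quick-revert property. A rough sketch with invented names, assuming TQ_data is the logical data file name:
    -- Illustrative names/paths only
    -- Create a sparse-file snapshot just before the risky work begins
    CREATE DATABASE TQ_premove_snap ON
        (NAME = TQ_data, FILENAME = N'E:\snaps\TQ_data.ss')
    AS SNAPSHOT OF TQ;
    -- If the move goes wrong, revert the whole database to that point
    RESTORE DATABASE TQ FROM DATABASE_SNAPSHOT = 'TQ_premove_snap';
(Granted, a snapshot living on the same SAN that is corrupting writes wouldn't have saved them, which is an argument for the off-array backup as well.)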
|

Jowen Datloran
Caldari Science and Trade Institute
|
Posted - 2010.06.30 17:59:00 -
[47]
Nice blog and nice pictures.
I actually spent much of the time when Tranquility was down playing EVE... on Singularity.
Apparently few realized that Sisi was up and running for most of the Tranq downtime. I went and explored some places in 0.0 and wormhole space that I otherwise probably would never get to see.
---------------- Mr. Science & Trade Institute - EVE Online Lorebook
|

Yalluto
Gallente Ascenda Group
|
Posted - 2010.06.30 18:05:00 -
[48]
I think the point brought up by a few, that transaction logs should be irrelevant as a matter of course here and that instead a full backup should have been performed as a non-restarting downtime job, is spot on. I would have thought that a problem with the SAN on its initial production run/move causing data corruption would have been as simple as isolating and fixing the cause, and then wiping and slapping on the backup.
Beyond that, what everybody is missing here is the number of integrity checks that were mentioned and how long they can take. Everybody needs to acknowledge that this step, which is paramount to us all, is a time-consuming process. And the better the integrity checks (sounds like CCP has put a lot of effort into integrity check procedures), the better job they will do.
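For a sense of scale, those checks are presumably something along the lines of DBCC CHECKDB, which on a database this size can run for hours (the thorough form versus the quicker physical-only pass):
    -- Illustrative database name
    -- Full logical and physical consistency check: thorough but slow
    DBCC CHECKDB (TQ) WITH NO_INFOMSGS, ALL_ERRORMSGS;
    -- Physical-only pass: much faster, catches torn pages and checksum errors
    DBCC CHECKDB (TQ) WITH PHYSICAL_ONLY;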
Ah, and as for the query: I'll bet CCP loves the setup they have of imaging the nodes for bringing them back up, which prevents a lot of individual machine teching around. |

RentableMuffin
|
Posted - 2010.06.30 18:33:00 -
[49]
Burrito buying spree.... STOP SPYING ON ME!!!!
|

Wynteryth Fett
|
Posted - 2010.06.30 18:42:00 -
[50]
I'd like to thank CCP Yokai for the detailed account of what happened during the downtime.
In the other comments thread, I'd asked these 4 questions.
Quote: 1) Why wasn't the new equipment and one system set up weeks in advance to prevent this extended down time? 2) Has CCP taken the precautions necessary to ensure the "database errors" *wink wink* don't re-occur going forward? 3) Does the new server location have the infrastructure to ensure that the system doesn't go down due to something minor like a bad PDU or short-term loss of power? 4) Are the servers with the new equipment set up in at least a 2N redundancy so something as simple as hardware failure doesn't shut the game down?
As I mentioned in that other thread, I have a lot of background on this from a Disaster recovery standpoint, but know that the concepts still hold true..
It's good to know that you all had much of the networking items set up ahead of time. I'd like to ask if the servers are set up in a 2N redundant system. If they are, why wasn't only one of the systems moved ahead of time? If the servers aren't in a 2N redundant system, why not?
Has the cause of the Database errors been determined? Could they have been found ahead of time with a 2N redundant system?
Also, is there one GIANT database for everything, or are there multiple ones (one for planets/stations, one for items, one for player info)? Without giving away any potentially proprietary CCP information, could you tell us if there is one giant database or several smaller ones that link to it?
|
|

Jackie Fisher
Syrkos Technologies Joint Venture Conglomerate
|
Posted - 2010.06.30 19:19:00 -
[51]
Edited by: Jackie Fisher on 30/06/2010 19:21:15
Originally by: CCP Yokai
Connecting and testing each and every one of the Ethernet ports

Solved your problem for you - there is a tramp living in your kit. Don't buy any Big Issues from him or he'll never leave.
|

Dillon Arklight
Universal Army Ushra'Khan
|
Posted - 2010.06.30 19:25:00 -
[52]
Originally by: Jackie Fisher Edited by: Jackie Fisher on 30/06/2010 19:21:15
Originally by: CCP Yokai
Connecting and testing each and every one of the Ethernet ports
Solved your problem for you - there is a tramp living in your kit. Don't buy any Big Issues from him or he'll never leave.
  
Thanks for the AAR, just goes to show how much work goes into keeping TQ healthy.
|

Yuda Mann
|
Posted - 2010.06.30 19:45:00 -
[53]
Originally by: Tanjia Guileless "What are we doing to prevent this?"
Migrating to a serious database product?
Leave that poor dead horse alone. If you think MSSQL isn't a serious database product then I invite you to come back to this universe instead of the one you're floating in. At the time the devs were looking at DBs, MySQL sucked for massive projects and couldn't do the things they needed. At that time as well, MSSQL was the best option.
The devs have already explained why they'd never switch as well. You might notice that it's pretty much basic common sense too. Then again, you might not.
http://www.eveonline.com/iNgameboard.asp?a=topic&threadID=1095044&page=5#123
The answer to this question is simple, the cost of redevelopment is huge. You can't imagine the amount of code we have in T-SQL. It will always be cheaper to buy more hardware than reinvest in MySQL, an investment that may or may not give you some performance. You can't know if it will give you performance benefits until you have been able to make EVE work on it.
That is why we have not even considered changing our database platform. That's as simple as that. We are also very happy with SQL Server.
As I said before I don't like platform debates, so I will probably not post more in this thread, as I don't want to be pulled in. ---- Senior Virtual World Database Administrator Operations department CCP Games HI! |

Sturmwolke
|
Posted - 2010.06.30 20:45:00 -
[54]
Edited by: Sturmwolke on 30/06/2010 20:45:38
Originally by: Dead Cheese As a professional geek myself, I too am a little perplexed at the seeming lack of a special non-scheduled backup. A full DB dump should have been done after the app cluster was shut down, but before the DB cluster was shut down.
This, tbh. A failed transfer/corruption scenario should have been anticipated and included in a failover/recovery plan for this move. That's assuming you guys had such plans ready before initiating the move. It's almost as if you were doing it on-the-fly and then crossing your fingers that nothing would screw up.
|

Xornicon Altair
Woopatang Primary.
|
Posted - 2010.06.30 21:09:00 -
[55]
Stop using Microsoft to run your database. Not only will you have fewer issues with the database, but performance will increase by not requiring the ridiculous overhead that Microsoft operating systems demand. Just my opinion, but there are so many better options out there than MS. ----- CCP FAILS AGAIN! WHERE ARE THE ALLIANCE LOGOS??? |

Dalilus
|
Posted - 2010.06.30 21:18:00 -
[56]
Let's use a fishing metaphor... you and your friends plan a fishing trip months in advance. Since you know you will be raked over burning coals, tarred and feathered, not to mention severely ridiculed, if your boat does not work properly, a mechanic is hired to go over the engines, transmissions, etc., making sure everything is shipshape. That faulty RPM gauge is replaced, battery charger and batteries checked, fuel lines and tanks cleaned, oil and air filters changed, toilet flushed a few times to make sure it works, paperwork and fishing permit triple-checked, safety equipment and radio onboard and working. You are set.
Ideally you would go fishing with friends, spending 8-10 hours on the water and getting a ton of fish, the beginnings of a sunburn, hours of video, tons of photographs, chugging cases of beer, emptying many bladders, eating sushi, sandwiches and ceviche, swimming with the fish; all in all a great fishing day. Back at the dock you or your boat boy would clean the boat and tackle, loading fuel and bait for the next day's fishing, checking the hot engines to make sure everything is in order, and finally catching up with your friends to wash up and party all night at the local drinking hole.
Instead you wake up early to go and check oil and water levels before your friends show up. When you try to fire up the engines, you find out that you and your friends cannot go out because the manifold on one of the engines is cracked and it takes a day or two to get it repaired/replaced. Let the roasting begin as you scramble to find a rental, on your nickel, while your pride and joy is pulled out of the water to sit on the hill while it is repaired.
IMO CCP did good.
|

Adonais Templar
Minmatar DemSal Corporation DemSal Unlimited
|
Posted - 2010.06.30 22:18:00 -
[57]
Having worked in the industry, I have a good idea what it took to make the move. Must say good job; I personally thought the planned downtime was an underestimation. Kudos on fixing the db problem so fast, databases are probably the worst problem to fix. The bonuses from the longer-than-expected downtime were worth it. Planning any long downtimes again? ;-)
|

Knaar
|
Posted - 2010.06.30 22:53:00 -
[58]
Edited by: Knaar on 30/06/2010 22:53:40
Originally by: Xornicon Altair Stop using Microsoft to run your database. Not only will you have less issues with the database, but, performance will increase by not requiring the ridiculous overhead that Microsoft Operating Systems demand. Just my opinion, but, there are so many better options out there than MS.
http://en.wikipedia.org/wiki/Vendor_lock-in
|

Glowstix
Minmatar Haters Gonna Hate
|
Posted - 2010.06.30 22:53:00 -
[59]
Thanks for another top notch job, CCP. You guys put up with a TON of (mostly undeserved) abuse, and still came through with a fix, and then shared what happened and what you did with us in detail right after. Not only do you go further than other companies, who would just leave it at "downtime took longer than expected due to technical issues", but you also explained how you chose to do a much more detailed and meticulous fix as opposed to taking the quicker and easier way of just rolling back. Even when people are screaming to the heavens, some being quite childish, you guys are patient and try to work out the best solution for us.
You guys rock.
<3
|

Knaar
|
Posted - 2010.06.30 23:14:00 -
[60]
I just wanted to say that you guys are doing an awesome job. Despite coming up against Finagle's Law you made the right decisions.
One thing you should do is realize that every whiner is actually a hopelessly addicted customer that needs his/her fix and gets super grumpy without it. We wouldn't be hopelessly addicted unless you all were doing something severely wonderful. So in reality every whine is just an admission of your magnificence in disguise.
|
|

Zathi Shaitan
Illiteracy Combatants
|
Posted - 2010.06.30 23:34:00 -
[61]
MSSQL always was a fail cascade, is still a fail cascade, and will continue being a fail cascade.
---- http://loseloose.com/
http://youryoure.com/
|

Swidgen
|
Posted - 2010.06.30 23:59:00 -
[62]
Originally by: Amida Ta So the big question remains unanswered in the blog:
"after finding the root cause"
So what was the root cause?
LOL, you were expecting actual information? This is one of the least informative dev postings in a long time, and that's quite an accomplishment. |

Itseban Tvi
|
Posted - 2010.07.01 00:17:00 -
[63]
Appreciate all the hard work, and the swift explanation and skill repayment. I seriously winced for you guys when I heard about what was happening. I have been there.
|

Comnitus Ultima
|
Posted - 2010.07.01 00:20:00 -
[64]
Originally by: Knaar I just wanted to say that you guys are doing an awesome job. Despite coming up against Finagle's Law you made the right decisions.
One thing you should do is realize that every whiner is actually a hopelessly addicted customer that needs his/her fix and gets super grumpy without it. We wouldn't be hopelessly addicted unless you all were doing something severely wonderful. So in reality every whine is just an admission of your magnificence in disguise.
Well... yeah, but... eh... but I mean... err... ahh...
Damn, he's right.
|

cBOLTSON
Caldari Shadow Legion. Talos Coalition
|
Posted - 2010.07.01 00:27:00 -
[65]
Edited by: cBOLTSON on 01/07/2010 00:28:53 I guess sometimes **** happens and you have to do the best you can. Thanks for taking the time to explain what happened in this blog :)
EDIT: I too would like to know what the root cause of the problem was (-_o)
|

Daan Sai
OHiTech
|
Posted - 2010.07.01 01:33:00 -
[66]
Good recovery under pressure. You can never have too many backups! Hope the root cause wasn't an unplugged node :)
--------------------------------- Internet Submarines is Serious Business ---------------------------------
|

VaL Iscariot
|
Posted - 2010.07.01 01:34:00 -
[67]
To be honest it made me sick to read all the people whining and complaining about how long it was taking. Everyone was given a full week's warning that this downtime was going to happen, and to be prepared for it. Instead, the so-called 'mature' player base of EVE Online was found to be no more than a bunch of World of Warcraft drop-out whiners. It was only a day and a half and people were up in arms about "HOW DARE CCP TRY TO UPGRADE THE GAME I PLAY EVEN THOUGH THEY'RE DOING THE EXACT THING I WHINE ABOUT AND WISH THEY'D DO ON A DAILY BASIS!! I WANT FREE NAO!!1!" The worst part being that CCP obliged them, thus giving the minority that post on the forums more voice than they deserve.
Next time CCP, dig through the f*cktards and just ignore them. Giving in to them only makes it worse. Also, don't be insulted by them either. They don't know what it takes. (though I'm sure you're going to find a so-called 'game developer for Call of Duty 4' and a few 'Blizzard programmers' in here too, to tell you just how it's done )
Thanks for all the fish VaL
|

MC187
|
Posted - 2010.07.01 02:22:00 -
[68]
http://i3.photobucket.com/albums/y67/nl37tgt/automotivator.jpg
hehe
|

Syekuda
State Protectorate
|
Posted - 2010.07.01 02:31:00 -
[69]
Originally by: MC187 http://i3.photobucket.com/albums/y67/nl37tgt/automotivator.jpg
hehe

on a serious note, I hate to say this but it must have been a very difficult decision. I think you didn't expect that much hatred. I guess that's proof of love for this game
just a small request: we're all adults (well, some of us are anyway), so please tell us the real time it takes, and don't update the news to say it's going live in the next 30 minutes when it doesn't go live and you're not sure. If it takes another 6 hours, fine, just be straight about it, be honest. We can take it.
|

Lillandra Peregrine
|
Posted - 2010.07.01 03:05:00 -
[70]
nice work ccp. and thanks for explaining what happened, appreciate the transparency. :)
|
|

Asperath Fernandez
|
Posted - 2010.07.01 03:08:00 -
[71]
Originally by: T'ealk O'Neil Edited by: T'ealk O'Neil on 30/06/2010 15:26:17
Originally by: Commander Azrael Apart from a DB backup being massive, they did back it up. If you read the dev blog, they chose the lengthier option of fixing the corrupted entries instead of rolling back. Which do you prefer: an extended downtime, or logging in to find ISK missing from the missions you ran and that shiny ship you bought no longer there?
I suggest you re-read. They had A backup, but if they had taken a backup as the first step before starting any work then no isk would have been lost as nobody would have been logged in between those times.
Not understanding the context of the word "transaction" in this case, ftw. 
|

Hiyoshi Maru
|
Posted - 2010.07.01 03:20:00 -
[72]
I have worked in IT for over 15 years now and seen a lot of projects and rollouts in a lot of places (NZ, Oz, UK, Netherlands, Germany, Singapore to name a few). It is extremely rare that things go smoothly; there will always be an issue.
It is excellent to see that not only did you take the time to sort the issue correctly rather than hammer a fix in and deal with it later, but you have also publicly stated the essence of what happened. This is extremely rare and very commendable.
Thanks for the hard work you do, the efforts made by all the unseen people in the background who make this work for all of us, and thanks for your honesty. It is this kind of relationship that CCP tries to engender that makes EVE the brilliant game that it is.
|

Rhok Relztem
Caldari CGMA Synergist Syndicate
|
Posted - 2010.07.01 04:53:00 -
[73]
All I have to say is...
CCP and everyone involved from top to bottom - one very big class act.
I sincerely hope some of the other game developers out there are taking notes.
|

Niccolado Starwalker
Gallente Shadow Templars
|
Posted - 2010.07.01 05:41:00 -
[74]
Originally by: CCP Fallout As you know, CCP moved the Tranquility servers to a much larger and cooler server room and added new switches in the process. The downtime took longer than expected. CCP Yokai's newest dev blog fills us in on the events of the day.
As always, good work CCP!
Originally by: Dianabolic Your tears are absolutely divine, like a fine fine wine, rolling down your cheeks until they flow down the river of LOL.
|

Monkey M3n
The Collective Against ALL Authorities
|
Posted - 2010.07.01 05:47:00 -
[75]
stupid IT noobs
tl;dr You suck
|

Nofonno
Amarr Aperture Harmonics K162
|
Posted - 2010.07.01 07:37:00 -
[76]
After several years in EAO (enterprise applications operations) at a major multinational corp, I've also had my share of failed moves and transitions.
Also, I've read, and also composed, many after action reports that smoothed over the actual ****-up we made for the customer, so no-one would get too ****ed and we'd keep playing together for years to come.
I'm a UNIX guy and know squat about the M$ enterprise software range, but I know my share about SANs. This all smells rather of a human error than a technical one (as it usually is in IT): something must've happened during the trip to the actual disk array drives; most probably one or more were mishandled and produced unrecoverable data errors.
Who knows... I, as a paying customer, don't mind too much, since all I had is in its place and I'm not under time pressure. It could've ended much worse, so, kudos to CCP.
Better luck next time  ---
A scientist must be an optimist at heart - to have the strength to rally against a chorus of voices saying "it cannot be done". |

Apaximander
|
Posted - 2010.07.01 07:42:00 -
[77]
MMO players really need to stop whining about outages. It's the nature of a game like this that there will be unexpected downtime; it happens. It's not as if we were locked out for a week or anything, either.
|

sjw7
Caldari CompleXion Industries
|
Posted - 2010.07.01 08:08:00 -
[78]
Like some others who have posted here, I have spent many years working in enterprise IT, from support to design and implementation of all types of projects similar to the one CCP have been telling us about.
It wasn't clear from the blogs, but it seems that the move to the new room included some new kit as well as the reuse of some of the old kit. This is always a pain, but good planning can minimise configuration issues. The explanation of the database corruption seems odd. A poorly configured SAN will either slow an application down or stop it working completely. The only time I have come across any kind of database problem caused by the SAN is in the quorum disk of an MS cluster when using storage-level replication. It was a very specific event that caused the problem, and the corruption was with the cluster and not the actual data itself. It definitely seems that someone didn't follow the golden rule of system upgrades, which is 'Take a backup first.' As EVE was shut down, a backup taken at that point would have meant there was no need to replay the SQL transaction logs to get data back. Someone should really hold their hand up and say that they messed this bit up rather than blame the hardware.
For those saying that it's just a game and people should quit complaining, I will point out that EVE is a paid-for service. Just because it's a game doesn't change this one bit. If your TV company stopped sending you a signal for a day and a half you would probably complain; it's the same with a whole host of other services you pay for as well.
Also, for the Microsoft haters out there, you need to realise that MySQL is not an alternative to MSSQL in an enterprise environment. Firstly it doesn't scale, and secondly, when things have gone horribly wrong you can call Microsoft (as long as you have the support agreement) and you will get the problem fixed. I have had to do this in the past and, after dealing with many vendors' support departments, I can assure you that MS is one of the very best when you have a premier support agreement. You will get nothing like the same level of support with MySQL, which is best suited to running forums and small installations. The other big player is Oracle, but once you pick one you don't change, as it's a hell of a lot of work.
|

Temai
Gallente
|
Posted - 2010.07.01 08:54:00 -
[79]
Thanks for working so hard to get the servers back up. I work in IT myself, and I know how it feels when a network dies... and lots of unhappy people who know where the server is wait for you to arrive to "ask" you to fix it... remind me to recharge my tazer
On another note, blowing out shoes is proof that you're doing something right... you can't build good stuff without melting something... or setting fire to someone.. ^^
- Temai Row Row Fight The Power - Libera Me
|

Fujiko MaXjolt
Caldari Templar Republic
|
Posted - 2010.07.01 09:23:00 -
[80]
To put things in perspective, "that other" MMO out there had a patch day yesterday that was supposed to be 12 hours, but got extended three times to over 18 hours, with no explanation/information at all...
My wife was not a happy Panda; at least with EVE we get info and a complete debriefing after the fact ;-)
Oh, and also we got a nice gift :-D
|
|

Libin Herobi
|
Posted - 2010.07.01 10:00:00 -
[81]
Looking forward to not receiving any answers in this thread.
|
|

CCP Yokai

|
Posted - 2010.07.01 10:11:00 -
[82]
Originally by: Libin Herobi Looking forward to not receiving any answers in this thread.
Patch day today.
I'll start grinding through responses after downtime today.
Thanks!
|
|

Gallosek
|
Posted - 2010.07.01 11:14:00 -
[83]
Originally by: sjw7 Also, for the Microsoft haters out there, you need to realise that MySQL is not an alternative to MSSQL in an enterprise environment. Firstly it doesn't scale, and secondly, when things have gone horribly wrong you can call Microsoft (as long as you have the support agreement) and you will get the problem fixed. I have had to do this in the past and, after dealing with many vendors' support departments, I can assure you that MS is one of the very best when you have a premier support agreement. You will get nothing like the same level of support with MySQL, which is best suited to running forums and small installations. The other big player is Oracle, but once you pick one you don't change, as it's a hell of a lot of work.
I would like to point out that MySQL is owned by Oracle, and was owned by Sun Microsystems before that. Either of them would provide the support you commend Microsoft for, and yes, configured correctly, MySQL scales nicely.
Not that I am suggesting CCP should migrate to MySQL. The pain caused by the incumbent solution (of any description) usually has to be very high to justify the enormous costs of migration (for any job you do, work out what it costs based on the number of hours it takes against your hourly salary; for laughs, see how much income you waste by needing sleep).
|
|

CCP Yokai

|
Posted - 2010.07.01 11:32:00 -
[84]
Originally by: T'ealk O'Neil Edited by: T'ealk O'Neil on 30/06/2010 15:26:17
Originally by: Commander Azrael Apart from a DB backup being massive, they did back it up. If you read the dev blog, they chose the lengthier option of fixing the corrupted entries instead of rolling back. Which do you prefer: an extended downtime, or logging in to find ISK missing from the missions you ran and that shiny ship you bought no longer there?
I suggest you re-read. They had A backup, but if they had taken a backup as the first step before starting any work then no isk would have been lost as nobody would have been logged in between those times.
This is correct, we did have our normal backup, which was a few hours out of date. And yes, having the backup run once downtime has started is the right way to do it... and is what we are doing every time now. For today's client patch we did this too. So from this point forward we should have a full copy of the DB at a point where no transactions need to be replayed.
|
|
|

CCP Yokai

|
Posted - 2010.07.01 11:34:00 -
[85]
Originally by: T'ealk O'Neil Would it not be an idea in future when doing any patching / moving to take a backup of the database as it stands before starting - that way a recovery is simple, rather than trying to repair everything, which takes forever
Bonus points for being the first to say it... again. Dead right, and it's now the process on everything even remotely risky.
|
|
|

CCP Yokai

|
Posted - 2010.07.01 11:36:00 -
[86]
Originally by: Camios Edited by: Camios on 30/06/2010 15:59:58 You were grinning; that means this photo was taken before the mess.
cool pics btw
This was one of the last prep moves before the TQ move. Mainly just making sure the Ethernet systems were in good order.
|
|
|

CCP Yokai

|
Posted - 2010.07.01 11:45:00 -
[87]
Originally by: Amida Ta So the big question remains unanswered in the blog:
"after finding the root cause"
So what was the root cause?
I waited on posting the exact details until after I had quite a few of our vendor experts chime in to make certain we had the root.
One of the links to our RamSan storage corrupted data being written to the storage device.
The exact bit that failed cannot be identified because, frankly, rather than tinkering with every link, transceiver, and switch port on the route, I just nuked it from orbit. We replaced the fiber, moved to a new pair of transceivers, a new port on the patch panels, and even a new port on the switches.
Once we did this... the errors went away on the SAN and we had storage normalized.
I hope that helps clarify where the root of the issue that caused the corruption was and what we found to be the problem. The nuking-from-orbit bit is not my preferred method of troubleshooting, but again... given the choice of getting TQ online faster or satisfying my desire for empirical data, I chose TQ.
|
|

Sturmwolke
|
Posted - 2010.07.01 11:56:00 -
[88]
Lol, as suspected. However, I greatly appreciate the candidness in replies from CCP Yokai. If anything, thumbs up for showing accountability. |

Hack Harrison
Caldari
|
Posted - 2010.07.01 12:11:00 -
[89]
Originally by: CCP Yokai
Originally by: Amida Ta So the big question remains unanswered in the blog:
"after finding the root cause"
So what was the root cause?
I waited on posting the exact details until after I had quite a few of our vendor experts chime in to make certain we had the root.
One of the links to our RamSan storage corrupted data being written to the storage device.
The exact bit that failed cannot be identified because, frankly, rather than tinkering with every link, transceiver, and switch port on the route, I just nuked it from orbit. We replaced the fiber, moved to a new pair of transceivers, a new port on the patch panels, and even a new port on the switches.
Once we did this... the errors went away on the SAN and we had storage normalized.
I hope that helps clarify where the root of the issue that caused the corruption was and what we found to be the problem. The nuking-from-orbit bit is not my preferred method of troubleshooting, but again... given the choice of getting TQ online faster or satisfying my desire for empirical data, I chose TQ.
It's the only way to be sure!!!
|

Moraguth
Amarr Dynaverse Corporation Sodalitas XX
|
Posted - 2010.07.01 12:12:00 -
[90]
Thank you, CCP Yokai. I know the urge to find the exact problem must have been almost overwhelming. Thank you for the details too.
Originally by: CCP Yokai
Originally by: Amida Ta So the big question remains unanswered in the blog:
"after finding the root cause"
So what was the root cause?
I waited on posting the exact details until after I had quite a few of our vendor experts chime in to make certain we had the root.
One of the links to our RamSan storage corrupted data being written to the storage device.
The exact bit that failed cannot be identified because, frankly, rather than tinkering with every link, transceiver, and switch port on the route, I just nuked it from orbit. We replaced the fiber, moved to a new pair of transceivers, a new port on the patch panels, and even a new port on the switches.
Once we did this... the errors went away on the SAN and we had storage normalized.
I hope that helps clarify where the root of the issue that caused the corruption was and what we found to be the problem. The nuking-from-orbit bit is not my preferred method of troubleshooting, but again... given the choice of getting TQ online faster or satisfying my desire for empirical data, I chose TQ.
good game
Hoc filum tradit - This thread delivers.
|
|

Vultaras
|
Posted - 2010.07.01 12:14:00 -
[91]
Originally by: VaL Iscariot To be honest it made me sick to read all the people whining and complaining about how long it was taking. Everyone was given a full week's warning that this downtime was going to happen, and to be prepared for it. Instead, the so-called 'mature' player base of EVE Online was found to be no more than a bunch of World of Warcraft drop-out whiners. It was only a day and a half and people were up in arms about "HOW DARE CCP TRY TO UPGRADE THE GAME I PLAY EVEN THOUGH THEY'RE DOING THE EXACT THING I WHINE ABOUT AND WISH THEY'D DO ON A DAILY BASIS!! I WANT FREE NAO!!1!" The worst part being that CCP obliged them, thus giving the minority that post on the forums more voice than they deserve.
Next time CCP, dig through the f*cktards and just ignore them. Giving in to them only makes it worse. Also, don't be insulted by them either. They don't know what it takes. (though I'm sure you're going to find a so-called 'game developer for Call of Duty 4' and a few 'Blizzard programmers' in here too, to tell you just how it's done )
Thanks for all the fish VaL
Thank you, I cannot say it better. You are my hero 
Damn you childish whiners
|

Libin Herobi
|
Posted - 2010.07.01 12:36:00 -
[92]
Originally by: CCP Yokai
Originally by: Libin Herobi Looking forward to not receiving any answers in this thread.
Patch day today.
I'll start grinding through responses after downtime today.
Thanks!
Well done! I can post this more often if you think it helps with getting answers... 
|

Nofonno
Amarr Aperture Harmonics K162
|
Posted - 2010.07.01 12:58:00 -
[93]
Edited by: Nofonno on 01/07/2010 13:02:02 Edited by: Nofonno on 01/07/2010 13:01:25 EDIT: I fail at posting from work...
Originally by: CCP Yokai
The exact bit that failed cannot be identified because, frankly, rather than tinkering with every link, transceiver, and switch port on the route, I just nuked it from orbit. We replaced the fiber, moved to a new pair of transceivers, a new port on the patch panels, and even a new port on the switches.
Once we did this... the errors went away on the SAN and we had storage normalized.
This made me really goggle-eyed. I have never seen or even heard of this being possible. Layer 1 errors are usually handled easily by the OS driver; there shouldn't be data corruption possible. Even if FC didn't detect the errors (which it theoretically could miss), it encapsulates SCSI, which has rigorous error checking.
I'm stumped.  ---
A scientist must be an optimist at heart - to have the strength to rally against a chorus of voices saying "it cannot be done". |

Ban Doga
|
Posted - 2010.07.01 13:02:00 -
[94]
Edited by: Ban Doga on 01/07/2010 13:02:21
Originally by: CCP Yokai
Originally by: T'ealk O'Neil Edited by: T'ealk O'Neil on 30/06/2010 15:26:17
Originally by: Commander Azrael Apart from a DB backup being massive, they did back it up. If you read the dev blog, they chose the lengthier option of fixing the corrupted entries instead of rolling back. Which do you prefer: an extended downtime, or logging in to find ISK missing from the missions you ran and that shiny ship you bought no longer there?
I suggest you re-read. They had A backup, but if they had taken a backup as the first step before starting any work then no isk would have been lost as nobody would have been logged in between those times.
This is correct, we did have our normal backup, which was a few hours out of date. And yes, having the backup run once downtime has started is the right way to do it... and is what we are doing every time now. For today's client patch we did this too. So from this point forward we should have a full copy of the DB at a point where no transactions need to be replayed.
What in the world made you believe you should start such an undertaking without having an up-to-date backup of your data?! Sorry to say this, but I guess a large part of the additional downtime last week was necessary because you guys did not have one, and were thus unable to simply revert the DB.
If you do a once-every-other-year procedure, your first question before touching anything should always be "Do we have a brand-new backup?" But it's great to see you've started to adopt this widespread technique in your daily business.
|
|

CCP Yokai

|
Posted - 2010.07.01 13:26:00 -
[95]
Edited by: CCP Yokai on 01/07/2010 13:27:26 Edited by: CCP Yokai on 01/07/2010 13:27:07
Originally by: Nofonno Edited by: Nofonno on 01/07/2010 13:02:02 Edited by: Nofonno on 01/07/2010 13:01:25 EDIT: I fail at posting from work...
Originally by: CCP Yokai
The exact bit that failed cannot be identified because, frankly, rather than tinkering with every link, transceiver, and switch port on the route, I just nuked it from orbit. We replaced the fiber, moved to a new pair of transceivers, a new port on the patch panels, and even a new port on the switches.
Once we did this... the errors went away on the SAN and we had storage normalized.
This made me really goggle-eyed. I have never seen or even heard of this being possible. Layer 1 errors are usually handled easily by the OS driver; there shouldn't be data corruption possible. Even if FC didn't detect the errors (which it theoretically could miss), it encapsulates SCSI, which has rigorous error checking.
I'm stumped. 
It is pretty complicated, which is why it wasn't in the initial posting while we verified. But get hundreds of thousands of both in- and out-of-frame errors, without CRC errors on the corrupted payload, and you have the problem we had.
|
|
|

CCP Yokai

|
Posted - 2010.07.01 13:34:00 -
[96]
Edited by: CCP Yokai on 01/07/2010 13:34:57 Edited by: CCP Yokai on 01/07/2010 13:34:35
Originally by: Commander Azrael
Originally by: T'ealk O'Neil Edited by: T'ealk O'Neil on 30/06/2010 15:26:17
Originally by: Commander Azrael Apart from a DB backup being massive, they did back it up. If you read the dev blog, they chose the lengthier option of fixing the corrupted entries instead of rolling back. Which do you prefer: an extended downtime, or logging in to find ISK missing from the missions you ran and that shiny ship you bought no longer there?
I suggest you re-read. They had A backup, but if they had taken a backup as the first step before starting any work then no isk would have been lost as nobody would have been logged in between those times.
Perhaps a Dev could let us know how long it takes to do a full TQ DB backup, out of curiosity.
Right now... about 2 hours. But we are working on making this much shorter... if not snap-replicated. Part of the redesign project mentioned.
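For the curious, the stock SQL Server levers for shortening a full backup are striping it across several backup targets and, on 2008 Enterprise and later, compression; a sketch with invented paths, not necessarily what CCP are doing:
    -- Illustrative paths only
    -- Stripe the backup across multiple targets to parallelize the I/O
    BACKUP DATABASE TQ
        TO DISK = N'E:\bak\TQ_stripe1.bak',
           DISK = N'F:\bak\TQ_stripe2.bak',
           DISK = N'G:\bak\TQ_stripe3.bak'
        WITH COMPRESSION,  -- needs SQL Server 2008 Enterprise or later
             STATS = 10;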
|
|

Rashmika Clavain
Gallente
|
Posted - 2010.07.01 13:43:00 -
[97]
Well the main thing is you're adopting new or adapting existing processes to help mitigate future related problems when doing any form of planned outage.
|

Dokten Ral
|
Posted - 2010.07.01 14:05:00 -
[98]
Just wanted to say: THIS kind of great community service and open, honest caring about the community is one of the things that I love most about all you guys at CCP, and is the kind of thing I tell anyone who's interested in EVE about. Not only is it fun and interesting to get a glimpse behind the scenes, but it's great to see that CCP really does care about providing us with great service. CCP certainly doesn't have to provide us with the kind of quality support they do, but I have yet to see anyone else do anywhere near as good a job.
|

Lyonsban
Gallente Wreakage R Us
|
Posted - 2010.07.01 16:36:00 -
[99]
Thanks for the SP. I didn't expect it, wouldn't have missed it, but appreciate it.
I'm glad to see you had, and used, a fallback plan to "nuke from orbit". You'd be amazed how many projects don't use decent risk-management tools. You have my appreciation for your professionalism.
MSSQL? Bah, over-developed, under-planned desktop database. It's like using a successor to Windows 95 on your mainframe. Oh, wait...
|

Beauregard Jackson
Minmatar Old Bastards Club
|
Posted - 2010.07.01 17:14:00 -
[100]
I'll pile on with the kudos. Many thanks to CCP for being forthcoming and honest about what went wrong, and for the extreme effort that went into fixing it. Compare that with Google: "Yeah, GMail barfed. It's back now. Carry on."
I know from experience that all-nighters in the server room are not much fun.
|
|

Darek Castigatus
Immortalis Inc. Shadow Cartel
|
Posted - 2010.07.01 17:35:00 -
[101]
I understood pretty much nothing of what Yokai has said so far, but I appreciate him saying it anyway; now to go get my resident IT nerd friend to translate it for me. On a slightly more serious note, I don't think I've ever seen any other MMO company be this open about their internal operations. Major props to all CCPeople (hur hur, I made a funny )
Hopefully now it'll shut up those endless 'CCP never tells us anything' whiners.
http://desusig.crumplecorn.com/sig.php |

Akita T
Caldari Caldari Navy Volunteer Task Force
|
Posted - 2010.07.02 03:34:00 -
[102]
Recovering from such a massive (and mostly hardware, by the looks of it) failure in less than a day (time from discovery of the problem to complete fix, with thorough enough testing) is actually pretty decent. Getting new procedures in place to make sure that if such an unlikely event happens again it will be fixed even faster is even better. The nicer new home for TQ is awesome either way.
Can't say I was very pleased the server was down for so long, but all things considered, thumbs up for an overall good job.
_
Beginner's ISK making guide | Manufacturer's helper | All about reacting _
|

Paladin Taggart
|
Posted - 2010.07.02 04:16:00 -
[103]
Wow. First of all thanks for doing such a great job on such a massive system.
Also I want to praise your honesty. It would have been very easy to stick with a vague answer to all the questions. But instead you admitted that you had a large human error coupled with an ill-timed hardware failure. It takes a great amount of courage to tell us the truth.
<golf_clap>Thank you!</golf_clap>
|

Alain Kinsella
Minmatar
|
Posted - 2010.07.02 06:42:00 -
[104]
Thanks Yokai, much appreciated, especially by an SL migrant. 
Quote:
Right now... about 2 hours. But we are working on making this much shorter... if not snap-replicated now.
I find this interesting, since (as a RamSan user) I did not think they had snap features (though we use the SSDs, not their Flash products). Are you using something else in-between?
BTW, we found a product that blew away the RamSan in speed - FusionIO. Any chance you'd drop those into the MS-SQL boxes, if only to enable better local caching? Or do the internal RamSan devices work OK in that respect (per last year's fanfest tech/vendor video)?
|

Pitty Hammerfist
|
Posted - 2010.07.02 07:38:00 -
[105]
Awesome work; it must've been a very tense 48 hours.
I know the feeling of watching a network screw up, with people getting angry and wondering why, but not to the extent of 400k people screaming :)
|

kKayron Jarvis
Caldari 5th Front enterprises Chain of Chaos
|
Posted - 2010.07.02 08:06:00 -
[106]
On 6/25/2010 I opened a petition. Subject: lost 165 mechanical parts.
It looks like I have lost 165 or so mechanical parts from ****, and it looks like my extractors that were doing base metals have changed into heavy metal extractors.
It looks like it comes down to this, I think.
|
|

CCP Yokai

|
Posted - 2010.07.02 11:09:00 -
[107]
Just wanted to throw out a quick "Thank you" to everyone for posting, replying, helping out. I do appreciate the comments and plan to keep delivering more info on future projects as soon as possible.
I'll be monitoring this thread a bit less than F5 every few hours now... so apologies if I'm delayed on future replies.
Thanks again, CCP Yokai
|
|

Lykouleon
Trust Doesn't Rust
|
Posted - 2010.07.03 05:52:00 -
[108]
I'm surprised no one else has asked...
WHAT HAPPENED TO THE SHOE?!?!?! 
Quote: CCP Mindstar > Sorry - I've completely messed all that up. lets try again
|

Esrevid Nekkeg
Justified and Ancient
|
Posted - 2010.07.03 14:07:00 -
[109]
Originally by: Lykouleon I'm surprised no one else has asked...
WHAT HAPPENED TO THE SHOE?!?!?! 
Didn't you read? It blew out. It probably got overheated too much....
And to the people at CCP who were responsible for this job: I'm very, very pleased to see that you are honest and willing to learn from the mistakes made. And above all, are not afraid to admit it! **takes hat off and bows**
|

Silicon Buddha
Amarr Capital Construction Research Pioneer Alliance
|
Posted - 2010.07.03 22:44:00 -
[110]
Originally by: CCP Yokai Right now... about 2 hours. But we are working on making this much shorter... if not snap-replicated now. Part of the redesign project mentioned.
Snaps are the way to go. We snap our huge production DBs in a 7x24 enterprise NOC every 4 hours. We don't even bother with tlog backups anymore, as we have the fulls. SQL just freezes for a few seconds so that all the data can be committed to disk; then the snapshot is taken and the DB unfrozen. All told it takes about 3-5 seconds of "freezing", and all our apps are at least that resilient.
Restores are quick as heck as well (fortunately only needed them a few times).
The downside to snaps is the enormous amount of disk space you need to store all those blocks, but it does work very well.
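For the curious, one pass of that cycle looks roughly like this - a minimal Python sketch of the freeze/snap/thaw idea, where Database and SanController are illustrative stand-ins, not any vendor's actual API:

    import time

    # Illustrative stand-ins: a real setup would use VSS/VDI hooks on the
    # SQL side and the storage vendor's snapshot API on the SAN side.
    class Database:
        def freeze_io(self):
            print("flushing dirty pages, suspending writes")

        def thaw_io(self):
            print("resuming writes")

    class SanController:
        def create_snapshot(self, volume):
            print("snapshotting %s" % volume)
            return "snap-0001"

    def snapshot_backup(db, san, volume, max_freeze_s=10.0):
        # Freeze -> snapshot -> thaw: the DB is only paused for seconds.
        db.freeze_io()
        started = time.time()
        try:
            snap_id = san.create_snapshot(volume)
        finally:
            db.thaw_io()  # always thaw, even if the snapshot call fails
        if time.time() - started > max_freeze_s:
            raise RuntimeError("freeze window exceeded; check before trusting this snap")
        return snap_id

    snapshot_backup(Database(), SanController(), "tq_db_volume")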
I'm sure any of us who love the enterprise geek stuff would LOVE to help out our favorite game if you ever want/need to bounce anything off people. _________________________________________________________ Click here for Fly Reckless Podcast
|
|

Koragoni SkyKnight
|
Posted - 2010.07.04 00:28:00 -
[111]
Originally by: CCP Yokai Edited by: CCP Yokai on 01/07/2010 13:34:57 Edited by: CCP Yokai on 01/07/2010 13:34:35
Right now... about 2 hours. But we are working on making this much shorter... if not snap-replicated now. Part of the redesign project mentioned.
This made me raise an eyebrow... as a fellow IT professional, one who has worked in this industry my entire career. Balancing that against the fact that I work with small businesses almost exclusively, and this type of hardware is beyond me... I can't help but think you have a massive procedural problem that needs to be addressed.
You never, and I mean NEVER, migrate a system without a full backup. I don't care if it takes 2 hours to copy that database; a copy should have been made before the new rack was brought online. If I made the mistake you guys made, costing any of my clients the cost of the repair... my employment would be terminated faster than you can say "oops".
I'm glad you guys got it sorted out, and the repair itself is a testament to your technical abilities. So I'm left confused as to how such a well-put-together crew could have fallen victim to such a basic mistake. Still, you appear to be working on learning from the mistake, and that is all anyone can really ask.
|

Rip Minner
Gallente ARMITAGE Logistics Salvage and Industries
|
Posted - 2010.07.04 01:53:00 -
[112]
Originally by: Chiana Torrou Contrary to many others who post on the forums I still think you all did a really good job in the face of very difficult circumstances.
Thanks for all the hard work - and the free skill points
This covers it. Is it a rock? Point a Lazer at it and profit. Is it a ship? Point a Lazer at it and profit. I don't really see any differences here. |

Zwaliebaba
|
Posted - 2010.07.05 08:13:00 -
[113]
Originally by: Raquel Smith
Originally by: CCP Fallout As you know, CCP moved the Tranquility servers to a much larger and cooler server room and added new switches in the process. The downtime took longer than expected. CCP Yokai's newest dev blog fills us in on the events of the day.
I work in IT and COMPLETELY UNDERSTAND unforeseen happenings as a result of maintenance!
I've had a routine security update corrupt an entire LDAP database; it caused a week of instability and hassle.
Thanks for the blog post.
Yeah, but then you changed the application logic, which was not the case here. I was always told to make an offline backup before you change anything, to prevent a roll-forward...
|

Salyan
|
Posted - 2010.07.05 09:20:00 -
[114]
Originally by: Koragoni SkyKnight
Originally by: CCP Yokai Edited by: CCP Yokai on 01/07/2010 13:34:57 Edited by: CCP Yokai on 01/07/2010 13:34:35
Right now... about 2 hours.
You never, and I mean NEVER, migrate a system without a full backup.
This whole thing reminded me of the trouble Microsoft had with losing customer data on their "sidekick" phones. The following link describes (in rumors only) what happened to them:
http://www.linuxtoday.com/high_performance/2009101901035NWMS
At least CCP had a better plan than Microsoft!
|

Libin Herobi
|
Posted - 2010.07.05 11:48:00 -
[115]
Some things take longer to process than others.
Originally by: CCP Yokai
... And yes, having the backup run until after Down Time is the right way to do it... and is what we are doing every time now. For today's client patch we did this too. So from this point forward we should have a full copy of the DB at a point where no transactions need to be run.
Originally by: CCP Yokai Edited by: CCP Yokai on 01/07/2010 13:34:57 Edited by: CCP Yokai on 01/07/2010 13:34:35
Originally by: Commander Azrael
Perhaps a Dev could let us know how long it takes to do a full TQ DB backup, out of curiosity.
Right now... about 2 hours. But we are working on making this much shorter... if not snap-replicated now. Part of the redesign project mentioned.
Originally by: CCP Fallout EVE Online: Tyrannis 1.0.3 will be deployed on Thursday, July 1, 2010. Deployment will start at 11:00 UTC and is expected to be completed at 12:30 UTC. Patch notes are available for review.
Originally by: FailSafe Kari Online and no extensions ^_^ I'm impressed CCP good Work
It looks like you did a 2-hour backup and applied the Tyrannis 1.0.3 patch in a 1.5-hour downtime on July 1st. That is truly remarkable. Some would even say it's impossible... 
|

wizard87
|
Posted - 2010.07.05 14:21:00 -
[116]
Edited by: wizard87 on 05/07/2010 14:25:17 New architecture: have a failover SAN that also updates during DT, for additional data redundancy?
New procedure (I'd be amazed if it's not roughly your normal procedure already): Kick out users -> take back-ups -> check integrity of backups -> upgrade -> test -> Rollback to CURRENT backup if needed / or online upgraded cluster.
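As a sketch, that runbook amounts to the following (every step function here is a hypothetical placeholder, not anyone's real tooling):

    # Hypothetical runbook skeleton; each step is a stub.
    def kick_out_users():
        print("users disconnected")

    def take_backup():
        print("full backup taken")
        return "backup-001"

    def verify_integrity(backup):
        print("verifying %s" % backup)
        return True

    def upgrade():
        print("upgrade applied")

    def smoke_test():
        print("QA checks passed")

    def restore(backup):
        print("restored from %s" % backup)

    def maintenance_window():
        kick_out_users()
        backup = take_backup()
        if not verify_integrity(backup):
            raise RuntimeError("bad backup; abort before touching anything")
        try:
            upgrade()
            smoke_test()
        except Exception:
            restore(backup)  # roll back to the known-good backup
            raise

    maintenance_window()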
The "burritos" analogy suggests your backup was not up to date (or not checked for data integrity) - that would be pretty criminal for a system with hundreds of thousands of users. Tell me it isn't so.
Or it sounds like you were one step from a failed disaster recovery and bye-bye EVE data. I wonder if that kind of mismanagement is how EVE will end one day?
PS - I was a server admin quite a few years back, and only for a couple of years, so I may not know what I'm on about with today's hardware.
|
|

CCP Yokai

|
Posted - 2010.07.05 16:17:00 -
[117]
"Kick out users -> take back-ups -> check integrity of backups -> upgrade -> test -> Rollback to CURRENT backup if needed / or online upgraded cluster."
That sounds like the right thing to do on a small or simple DB that can be down a lot longer. On TQ (and again, with the current hardware/design)... let me give you a picture of what this would do.
Kick out users -> 11:00 GMT
Take back-ups -> 13:00 GMT complete (as stated previously... this takes 2 hours)
Check integrity of backups -> 18:00 (CHECKDB on an uncorrupted DB in our case takes 5 hours... up to 24 hours on a heavily corrupted DB like the one we dealt with during the outage)
Test -> 19:00 (giving QA some time to make sure it works... hard to check without doing this)
So even on a flawless run we'd have 8 hours of downtime each day.
Again, not a bad method in some cases... but given the size, complexity, and demand for the availability of this DB, it really needs more of a live, replicated disaster recovery solution instead. Once we have the solutions worked out I'll start a new thread on that. Thanks again for the input from everyone... just thought I'd give a bit of feedback on the suggestions as well.
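To spell out the arithmetic (a trivial sketch, using only the durations quoted above):

    from datetime import timedelta

    # Flawless-run timeline: kick out users at 11:00, testing done by 19:00.
    steps = {
        "take backup": timedelta(hours=2),
        "CHECKDB on an uncorrupted DB": timedelta(hours=5),
        "QA verification": timedelta(hours=1),
    }
    total = sum(steps.values(), timedelta())
    print(total)  # 8:00:00 -> eight hours of downtime on a flawless run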
|
|
|

CCP Yokai

|
Posted - 2010.07.05 16:21:00 -
[118]
Originally by: Libin Herobi Some things take longer to process than others.
It looks like you did a 2-hour backup and applied the Tyrannis 1.0.3 patch in a 1.5-hour downtime on July 1st. That is truly remarkable. Some would even say it's impossible... 
The backup was started at 09:00 GMT, and the nature of the backup we are doing is that it runs live, applying changes to the backup as they happen. It was completed at about 11:05 GMT. Backups do not have to start after everyone leaves the game. So yes, we did a complete backup before the Tyrannis 1.0.3 patch.
I hope that helps clarify.
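For intuition, here is a toy Python model of why a live backup can overlap uptime: copy the pages while writes continue, then apply the changes captured during the copy. This is only the concept, not SQL Server's actual mechanism:

    # Toy model of an online ("fuzzy") full backup.
    def online_backup(pages, changes_during_copy):
        image = dict(pages)                # long-running copy; writes keep landing
        for page_id, new_data in changes_during_copy:
            image[page_id] = new_data      # replay what changed while we copied
        return image

    pages = {1: "a", 2: "b", 3: "c"}
    changes = [(2, "b2"), (3, "c2")]       # writes that happened mid-copy
    assert online_backup(pages, changes) == {1: "a", 2: "b2", 3: "c2"}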
|
|

wizard87
|
Posted - 2010.07.05 17:04:00 -
[119]
So normally you only do incremental backups during DT, you mean?
Sounds like some of the clients I used to work for - daily incrementals and weekly full backups - but I always hated those incrementals, because the data often (inevitably) turned out to be buggered when you needed it most.
Thanks for being so open about it anyway. I don't know enough about the solution you're using, really, so I'm basically a guy down the pub offering advice.
However, I know datacentre cabinets don't come cheap, so space may be an issue, but have you considered having redundancy in the form of a failover DB/back end you can flip to, so customers can still use the application and outages don't keep them waiting? You could then sync the data (or flip back to the primary DB/back end, assuming you're using a pretty standard 3-tier architecture) from the next backup/DT after the issues/upgrades/moves etc. are resolved.
|

Some Advisor
|
Posted - 2010.07.05 18:17:00 -
[120]
Originally by: Salyan This whole thing reminded me of the trouble Microsoft had with losing customer data on their "sidekick" phones. The following link describes (in rumors only) what happened to them:
http://www.linuxtoday.com/high_performance/2009101901035NWMS
At least CCP had a better plan than Microsoft!
Wow, that's quite deep and lengthy, and very interesting.
--- Donations, thankyou / hatemails always welcome :P if you want to "ragequit" or take a longer break: "can i have your stuff" ? :P i also like BPOs of any kind with the promise you get it back :) |
|

Koragoni SkyKnight
|
Posted - 2010.07.05 19:06:00 -
[121]
Originally by: CCP Yokai "Kick out users -> take back-ups -> check integrity of backups -> upgrade -> test -> Rollback to CURRENT backup if needed / or online upgraded cluster."
That sounds like the right thing to do on a small or simple DB that can be down a lot longer. On TQ (and again, with the current hardware/design)... let me give you a picture of what this would do.
Kick out users -> 11:00 GMT
Take back-ups -> 13:00 GMT complete (as stated previously... this takes 2 hours)
Check integrity of backups -> 18:00 (CHECKDB on an uncorrupted DB in our case takes 5 hours... up to 24 hours on a heavily corrupted DB like the one we dealt with during the outage)
Test -> 19:00 (giving QA some time to make sure it works... hard to check without doing this)
So even on a flawless run we'd have 8 hours of downtime each day.
Again, not a bad method in some cases... but given the size, complexity, and demand for the availability of this DB, it really needs more of a live, replicated disaster recovery solution instead. Once we have the solutions worked out I'll start a new thread on that. Thanks again for the input from everyone... just thought I'd give a bit of feedback on the suggestions as well.
I never suggested a live backup of the database for every maintenance cycle; that isn't realistic given the 1-hour window. The server migration, however: YOU defined the downtime requirements there. That migration should have started with a backup; there is no excuse or logic that can change that fact.
Also, if you're still doing hot backups to tape at this point, you need even more help. As you've indicated, validation of a backup on tape media takes too long. You need disk-based storage large enough to house a single copy of Tranquility; this storage is only needed in the case of a migration. Of course, if you're looking into a snap solution you're already moving in this direction.
Tape is only good for long-term archival use, and I submit it's useless for the game server. You need it for your financial records, not for the game's database. Unless you guys want to keep copies of the DB around for 10+ years for later study - not an entirely worthless exercise, especially given the unique nature of the Tranquility cluster.
|

IngarNT
Minmatar hirr Morsus Mihi
|
Posted - 2010.07.06 07:15:00 -
[122]
Errors on the SAN? It's Brocade, right?
It will be CRC errors - it always is - or a gammy SFP, in which case you need to set the tin up to auto-block a port over a certain threshold of out-of-frame errors and cut an alert. Dropping a (redundant) link is a lot better than letting it vomit broken data into the downstream switch.
If it's CRC errors, I have a perl script which polls and reports on them pretty well. You're welcome to it, as free is considerably cheaper than the mountain of cash you need to license DCFM.
Alas, unless your ops travel at the speed of light, even if you can spot a CRC error you won't be able to intercept it before it hits disk - so yeah, a better database structure to reduce recovery time is probably the best fix path
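In the same spirit as that perl script, a minimal polling loop might look like this (Python for illustration; read_port_counters and block_port are stand-ins for whatever switch CLI or SNMP access you have):

    import time

    ERROR_THRESHOLD = 1000  # counter delta per interval before we act
    POLL_INTERVAL_S = 60

    def read_port_counters(port):
        # Stand-in: in practice, scrape `porterrshow` output or poll the
        # switch's per-port error counters over SNMP.
        return {"crc_err": 0, "enc_out": 0}

    def block_port(port):
        # Stand-in for disabling a misbehaving (redundant) link and alerting.
        print("blocking port %d and raising an alert" % port)

    def watch(ports):
        last = dict((p, read_port_counters(p)) for p in ports)
        while True:
            time.sleep(POLL_INTERVAL_S)
            for p in ports:
                now = read_port_counters(p)
                if any(now[k] - last[p][k] > ERROR_THRESHOLD for k in now):
                    block_port(p)  # a dropped link beats corrupt data downstream
                last[p] = now
|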

Sjolus
Metafarmers MeatSausage EXPRESS
|
Posted - 2010.07.06 11:31:00 -
[123]
Yokai, I have nothing other to add than huge, HUGE props for delivering tasty technical tidbits and information such as this. Especially regarding specifics around prolonged downtimes. This is, at least for me, VERY satisfying to read.
Thank you for delivering actual information <3 |
|

CCP Yokai

|
Posted - 2010.07.06 12:34:00 -
[124]
Originally by: IngarNT Errors on the SAN? It's Brocade, right? (YES)
It will be CRC errors - it always is - or a gammy SFP, in which case you need to set the tin up to auto-block a port over a certain threshold of out-of-frame errors and cut an alert. Dropping a (redundant) link is a lot better than letting it vomit broken data into the downstream switch.
If it's CRC errors, I have a perl script which polls and reports on them pretty well. You're welcome to it, as free is considerably cheaper than the mountain of cash you need to license DCFM.
Alas, unless your ops travel at the speed of light, even if you can spot a CRC error you won't be able to intercept it before it hits disk - so yeah, a better database structure to reduce recovery time is probably the best fix path
Bonus points for getting it :) Hard stuff... corrupted before you can do anything about it. Best plan... better recovery... |
|

IngarNT
Minmatar hirr Morsus Mihi
|
Posted - 2010.07.06 18:47:00 -
[125]
Originally by: CCP Yokai
Bonus points for getting it :) Hard stuff... corrupted before you can do anything about it. Best plan... better recovery...
It may be worth asking RamSan if they will support Class 2 I/O traffic - not many people do (outside of FLOGI processes) - but with end-to-end credits and proper ACKs, the invalid frames *should* be rejected and retransmitted.
|

RC Denton
|
Posted - 2010.07.07 21:54:00 -
[126]
Originally by: CCP Yokai "Kick out users -> take back-ups -> check integrity of backups -> upgrade -> test -> Rollback to CURRENT backup if needed / or online upgraded cluster."
That sounds like the right thing to do on a small or simple DB that can be down a lot longer. On TQ (and again, with the current hardware/design)... let me give you a picture of what this would do.
Kick out users -> 11:00 GMT
Take back-ups -> 13:00 GMT complete (as stated previously... this takes 2 hours)
Check integrity of backups -> 18:00 (CHECKDB on an uncorrupted DB in our case takes 5 hours... up to 24 hours on a heavily corrupted DB like the one we dealt with during the outage)
Test -> 19:00 (giving QA some time to make sure it works... hard to check without doing this)
So even on a flawless run we'd have 8 hours of downtime each day.
Again, not a bad method in some cases... but given the size, complexity, and demand for the availability of this DB, it really needs more of a live, replicated disaster recovery solution instead. Once we have the solutions worked out I'll start a new thread on that. Thanks again for the input from everyone... just thought I'd give a bit of feedback on the suggestions as well.
Mirrored over to a segregated cluster on a different SAN with a witness?
|

Gnulpie
Minmatar Miner Tech
|
Posted - 2010.07.08 00:27:00 -
[127]
That is pretty awesome stuff here.
A commercial company talking in detail about how they failed, etc.
Very open, very clear. Really nice to read (though I don't understand all of the tech-talk) and stunning to see such an honest explanation given to the community.
Mind you, CCP could have just been silent, or could have released a short news item about it ("oops - something went wrong"). But no, they go into detail and explain things, with really good follow-up feedback in the thread.
That is AWESOME! |

Triana
Gallente Aliastra
|
Posted - 2010.07.08 21:24:00 -
[128]
I might be mistaken here, as I don't have very deep knowledge of the structure of the DB, but have you considered using a filesystem with a snapshot facility, like ZFS for example? Just an idea, because I don't know of anything similar in the Windows world (I'm a Solaris admin). Back in the DB world, I have worked with systems running Oracle, MySQL, Sybase, DB2, and MSSQL (though I'm not a DBA, so my knowledge of those is distant); however, I don't remember anybody at my various jobs dissing MSSQL more than the others. They all have their strengths, and being at the moment in a shop that runs MSSQL on W2K3 for some pretty massive DBs (up to 7.5TB), I have yet to hear anyone complain about them (although, as I said earlier, I'm on the Solaris side of things).
Anyway, I feel for you guys, having been through my fair share of move problems and such. Kudos for the effort. -- War is like any other bad relationship. Of course you want out, but at what price? And perhaps more importantly, once you get out, will you be any better off? |

Erik Legant
|
Posted - 2010.07.09 10:29:00 -
[129]
CCP, thanks for this dev blog.
You did good work recovering the database without doing a full restore.
That being said, next time I'd have nothing against more live news about what's going wrong.
Good luck ! -- Erik |

Elijaah Bailey
|
Posted - 2010.07.27 21:38:00 -
[130]
Originally by: Shintai
Originally by: Tanjia Guileless "What are we doing to prevent this?"
Migrating to a serious database product?
Troll detected. Or just someone who has never worked with a DB - or any serious DB, at least.
Hey, my DB with 500 entries and nothing else around works all the time. CCP must suck!! 
Maybe he means an enterprise-grade database server. The kind that MS-SQL is not...
|
|
|
|
|