Vultaras
|
Posted - 2010.07.01 12:14:00 -
[91]
Originally by: VaL Iscariot To be honest it made me sick to read all the people whining and complaining about how long it was taking. Everyone was given a full week's warning that this downtime was going to happen, and to be prepared for it. Instead, the so-called 'mature' player base of Eve Online was found to be no more than a bunch of World of Warcraft drop-out whiners. It was only a day and a half and people were up in arms about "HOW DARE CCP TRY TO UPGRADE THE GAME I PLAY EVEN THOUGH THEY'RE DOING THE EXACT THING I WHINE ABOUT AND WISH THEY'D DO ON A DAILY BASIS!! I WANT FREE NAO!!1!" The worst part being that CCP obliged them, thus giving the minority that post on the forums more voice than they deserve.
Next time CCP, dig through the f*cktards and just ignore them. Giving in to them only makes it worse. Also, don't be insulted by them either. They don't know what it takes. (Though I'm sure you're going to find a so-called 'game developer for Call of Duty 4' and a few 'Blizzard programmers' in here too to tell you just how it's done.)
Thanks for all the fish VaL
Thank you, I could not have said it better. You are my hero.
Damn you, childish whiners.
|
Libin Herobi
|
Posted - 2010.07.01 12:36:00 -
[92]
Originally by: CCP Yokai
Originally by: Libin Herobi Looking forward to not receiving any answers in this thread.
Patch day today.
I'll start grinding through responses after downtime today.
Thanks!
Well done! I can post this more often if you think it helps with getting answers...
|
Nofonno
Amarr Aperture Harmonics K162
|
Posted - 2010.07.01 12:58:00 -
[93]
Edited by: Nofonno on 01/07/2010 13:02:02 Edited by: Nofonno on 01/07/2010 13:01:25 EDIT: I fail at posting from work...
Originally by: CCP Yokai
The exact bit that failed cannot be identified because frankly, rather than tinkering with every link, transceiver, and switch port on the route, I just nuked it from orbit. We replaced the fiber, moved to a new pair of transceivers, a new port on the patch panels, and even a new port on the switches.
Once we did this... the errors went away on the SAN and we had storage normalized.
This made me really goggle-eyed. I have never seen or even heard of this being possible. Layer 1 errors are usually handled easily by the OS driver; data corruption shouldn't be possible. Even if FC didn't detect the errors (which it theoretically could miss), it encapsulates SCSI, which has rigorous error checking.
I'm stumped. ---
A scientist must be an optimist at heart - to have the strength to rally against a chorus of voices saying "it cannot be done". |
Ban Doga
|
Posted - 2010.07.01 13:02:00 -
[94]
Edited by: Ban Doga on 01/07/2010 13:02:21
Originally by: CCP Yokai
Originally by: T'ealk O'Neil Edited by: T'ealk O'Neil on 30/06/2010 15:26:17
Originally by: Commander Azrael Apart from a DB backup being massive, they did back it up. If you read the dev blog, they chose the lengthier option of fixing the corrupted entries instead of rolling back. Which do you prefer? An extended downtime, or logging in to find ISK missing from the missions you ran and that shiny ship you bought no longer there?
I suggest you re-read. They had A backup, but if they had taken a backup as the first step before starting any work, then no ISK would have been lost, as nobody would have been logged in between those times.
This is correct; we did have our normal backup, which was a few hours out of date. And yes, having the backup run until just after Down Time is the right way to do it... and it's what we are doing every time now. For today's client patch we did this too. So from this point forward we should have a full copy of the DB at a point where no transactions need to be replayed.
What in the world made you believe you should start such an undertaking without having an up-to-date backup of your data?! Sorry to say this, but I guess a large part of the additional downtime last week was necessary because you guys did not have that, and were thus unable to simply revert the DB.
If you do a once-every-other-year procedure, your first question before touching anything should always be "Do we have a brand-new backup?" But it's great to see you have started to adopt this widespread technique in your daily business.
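For the record, taking and proving that brand-new backup is only two statements on SQL Server; a minimal sketch with purely illustrative database and file names, obviously not CCP's actual setup:
-- Full backup taken immediately before any maintenance work
BACKUP DATABASE eve_db
    TO DISK = N'E:\backups\eve_db_pre_migration.bak'
    WITH CHECKSUM, STATS = 10;
-- Prove the backup file is actually restorable before touching anything
RESTORE VERIFYONLY
    FROM DISK = N'E:\backups\eve_db_pre_migration.bak'
    WITH CHECKSUM;
WITH CHECKSUM makes the backup re-validate page checksums as it reads them, so silent corruption gets caught at backup time rather than at restore time.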
|
|
CCP Yokai
|
Posted - 2010.07.01 13:26:00 -
[95]
Edited by: CCP Yokai on 01/07/2010 13:27:26 Edited by: CCP Yokai on 01/07/2010 13:27:07
Originally by: Nofonno Edited by: Nofonno on 01/07/2010 13:02:02 Edited by: Nofonno on 01/07/2010 13:01:25 EDIT: I fail at posting from work...
Originally by: CCP Yokai
The exact bit that failed cannot be identified because frankly, rather than tinkering with every link, transceiver, and switch port on the route, I just nuked it from orbit. We replaced the fiber, moved to a new pair of transceivers, a new port on the patch panels, and even a new port on the switches.
Once we did this... the errors went away on the SAN and we had storage normalized.
This made me really goggle-eyed. I have never seen or even heard of this being possible. Layer 1 errors are usually handled easily by the OS driver; data corruption shouldn't be possible. Even if FC didn't detect the errors (which it theoretically could miss), it encapsulates SCSI, which has rigorous error checking.
I'm stumped.
It is pretty complicated, which is why it wasn't in the initial posting while we verified. But with hundreds of thousands of both in-frame and out-of-frame errors, and no CRC errors on the corrupted payload, you have the problem we had.
|
|
|
CCP Yokai
|
Posted - 2010.07.01 13:34:00 -
[96]
Edited by: CCP Yokai on 01/07/2010 13:34:57 Edited by: CCP Yokai on 01/07/2010 13:34:35
Originally by: Commander Azrael
Originally by: T'ealk O'Neil Edited by: T'ealk O'Neil on 30/06/2010 15:26:17
Originally by: Commander Azrael Apart from a DB backup being massive, they did back it up. If you read the dev blog, they chose the lengthier option of fixing the corrupted entries instead of rolling back. Which do you prefer? An extended downtime, or logging in to find ISK missing from the missions you ran and that shiny ship you bought no longer there?
I suggest you re-read. They had A backup, but if they had taken a backup as the first step before starting any work, then no ISK would have been lost, as nobody would have been logged in between those times.
Perhaps a Dev could let us know how long it takes to do a full TQ DB backup, out of curiosity.
Right now... about 2 hours. But we are working on making this much shorter... if not snap replicated now. Part of the redesign project mentioned.
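One common way to shrink that window (not necessarily what we'll end up with) is to take the full backup less often and only a differential day to day; a rough sketch with illustrative names:
-- Periodic full backup (the slow ~2 hour run)
BACKUP DATABASE eve_db TO DISK = N'E:\backups\eve_db_full.bak' WITH CHECKSUM;
-- Daily differential: only extents changed since the last full, so far quicker
BACKUP DATABASE eve_db TO DISK = N'E:\backups\eve_db_diff.bak' WITH DIFFERENTIAL, CHECKSUM;
-- Restore path: latest full first (left non-operational), then the latest differential
RESTORE DATABASE eve_db FROM DISK = N'E:\backups\eve_db_full.bak' WITH NORECOVERY, REPLACE;
RESTORE DATABASE eve_db FROM DISK = N'E:\backups\eve_db_diff.bak' WITH RECOVERY;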
|
|
Rashmika Clavain
Gallente
|
Posted - 2010.07.01 13:43:00 -
[97]
Well the main thing is you're adopting new or adapting existing processes to help mitigate future related problems when doing any form of planned outage.
|
Dokten Ral
|
Posted - 2010.07.01 14:05:00 -
[98]
Just wanted to say: THIS kind of great community service and open, honest caring about the community is one of the things that I love most about all you guys at CCP, and is the kind of thing I tell anyone who's interested in EVE about. Not only is it fun and interesting to get a glimpse behind the scenes, but it's great to see that CCP really does care about providing us with great service. CCP certainly doesn't have to provide us with the kind of quality support they do, but I have yet to see anyone else do anywhere near as good a job.
|
Lyonsban
Gallente Wreakage R Us
|
Posted - 2010.07.01 16:36:00 -
[99]
Thanks for the SP. I didn't expect it, wouldn't have missed it, but appreciate it.
I'm glad to see you had and used a fallback plan to "nuke from orbit". You'd be amazed how many projects don't use decent risk management tools. You have my appreciation for your professionalism.
MSSQL? Bah, over-developed, under-planned desktop database. It's like using a successor to Windows 95 on your mainframe. Oh, wait...
|
Beauregard Jackson
Minmatar Old Bastards Club
|
Posted - 2010.07.01 17:14:00 -
[100]
I'll pile on with the kudos. Many thanks to CCP for being forthcoming and honest with what went wrong, and the extreme effort that went into fixing it. Compare that with Google, "Yeah, GMail barfed. It's back now. Carry on."
I know from experience that all-nighters in the server room are not much fun.
|
|
Darek Castigatus
Immortalis Inc. Shadow Cartel
|
Posted - 2010.07.01 17:35:00 -
[101]
I understood pretty much nothing that Yokai has said so far, but I appreciate him saying it anyway; now to go get my resident IT nerd friend to translate it for me. On a slightly more serious note, I don't think I've ever seen any other MMO company be this open about their internal operations, so major props to all CCPeople (hur hur, I made a funny )
Hopefully now it'll shut up those endless 'CCP never tells us anything' whiners
http://desusig.crumplecorn.com/sig.php |
Akita T
Caldari Caldari Navy Volunteer Task Force
|
Posted - 2010.07.02 03:34:00 -
[102]
Recovering from such a massive (and mostly hardware, from the looks of it) failure in less than a day (time from discovery of problem to complete fix with thorough enough testing, that is), that's actually pretty decent. Getting new procedures in place to make sure if such an unlikely event happens again, it will go even faster to fix it, that's even better. Nicer new home for TQ, awesome either way.
Can't say I was very pleased the server was down for so long, but all things considered, thumbs up for an overall good job.
_
Beginner's ISK making guide | Manufacturer's helper | All about reacting _
|
Paladin Taggart
|
Posted - 2010.07.02 04:16:00 -
[103]
Wow. First of all thanks for doing such a great job on such a massive system.
Also I want to praise your honesty. It would have been very easy to stick with a vague answer to all the questions. But instead you admitted that you had a large human error coupled with an ill-timed hardware failure. It takes a great amount of courage to tell us the truth.
<golf_clap>Thank you!</golf_clap>
|
Alain Kinsella
Minmatar
|
Posted - 2010.07.02 06:42:00 -
[104]
Thanks Yokai, very much appreciated, especially by an SL migrant.
Quote:
Right now... about 2 hours. But we are working on making this much shorter... if not snap replicated now.
I find this interesting, since (as a RamSan user) I did not think they had snap features (though we use the SSDs, not their Flash products). Are you using something else in-between?
BTW, we found a product that blew away the RamSan in speed - FusionIO. Any chance you'd drop those into the MS-SQL boxes, if only to enable better local caching? Or do the internal RamSan devices work OK in that respect (per last year's fanfest tech/vendor video)?
|
Pitty Hammerfist
|
Posted - 2010.07.02 07:38:00 -
[105]
Awesome work, must've been a very tense 48 hours.
I know the feeling of watching a network screw up with people getting angry and wondering why, lol, but not to the extent of 400k people screaming :)
|
kKayron Jarvis
Caldari 5th Front enterprises Chain of Chaos
|
Posted - 2010.07.02 08:06:00 -
[106]
On 6/25/2010 I opened a petition. Subject: lost 165 of mechanical parts.
It looks like I have lost 165 or so mechanical parts from ****. It looks like my extractors that were doing base metals have changed into heavy metals extractors.
It looks like it comes down to this, I think.
|
|
CCP Yokai
|
Posted - 2010.07.02 11:09:00 -
[107]
Just wanted to throw out a quick "Thank you" to everyone for posting, replying, helping out. I do appreciate the comments and plan to keep delivering more info on future projects as soon as possible.
I'll be monitoring this thread a bit less than F5 every few hours now... so apologies if I'm delayed on future replies.
Thanks again CCP Yokai
|
|
Lykouleon
Trust Doesn't Rust
|
Posted - 2010.07.03 05:52:00 -
[108]
I'm surprised no one else has asked...
WHAT HAPPENED TO THE SHOE?!?!?!
Quote: CCP Mindstar > Sorry - I've completely messed all that up. lets try again
|
Esrevid Nekkeg
Justified and Ancient
|
Posted - 2010.07.03 14:07:00 -
[109]
Originally by: Lykouleon I'm surprised no one else has asked...
WHAT HAPPENED TO THE SHOE?!?!?!
Didn't you read? It blew out. It probably got overheated too much....
And to the people at CCP who were responsible for this job: I'm very, very pleased to see that you are honest and willing to learn from mistakes made. And above all, are not afraid to admit it! **takes hat off and bows**
|
Silicon Buddha
Amarr Capital Construction Research Pioneer Alliance
|
Posted - 2010.07.03 22:44:00 -
[110]
Originally by: CCP Yokai Right now... about 2 hours. But we are working on making this much shorter... if not snap replicated now. Part of the redesign project mentioned.
Snaps are the way to go. We snap our huge production DBs in a 7x24 enterprise NOC every 4 hours. We don't even bother with tlog backups anymore as we have the fulls. SQL just freezes for a few seconds so that all the data can be committed to disk and then the snapshot is taken and the DB unfrozen. All told takes about 3-5 seconds of "freezing" and all our apps are at least that resilient.
Restores are quick as heck as well (fortunately only needed a few times).
The downside to snaps is the enormous amount of disk space you need to store all those blocks, but it does work very well.
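To be clear, our snaps are taken at the storage layer, but SQL Server can do something analogous on its own with database snapshots; a minimal sketch, with purely illustrative names:
-- Point-in-time, read-only, copy-on-write snapshot of the live database
-- (the logical NAME must match the source database's data file name)
CREATE DATABASE eve_db_snap_0400 ON
    ( NAME = eve_db_data, FILENAME = N'E:\snaps\eve_db_snap_0400.ss' )
AS SNAPSHOT OF eve_db;
-- Revert the whole database to that point in time if something goes wrong
RESTORE DATABASE eve_db FROM DATABASE_SNAPSHOT = 'eve_db_snap_0400';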
I'm sure any of us who love the enterprise geek stuff would LOVE to help out our favorite game if you ever want/need to bounce anything off people. _________________________________________________________ Click here for Fly Reckless Podcast
|
|
Koragoni SkyKnight
|
Posted - 2010.07.04 00:28:00 -
[111]
Originally by: CCP Yokai Edited by: CCP Yokai on 01/07/2010 13:34:57 Edited by: CCP Yokai on 01/07/2010 13:34:35
Right now... about 2 hours. But we are working on making this much shorter... if not snap replicated now. Part of the redesign project mentioned.
This made me raise an eyebrow... as a fellow IT professional, and one that has worked in this industry my entire professional career. Balancing against the fact that I work with small business almost exclusively and this type of hardware is beyond me... I can't help but think you have a massive procedural problem that needs to be addressed.
You never, and I mean NEVER, migrate a system without a full backup. I don't care if it takes 2 hours to copy that database; a copy should have been made before the new rack was brought online. If I had made the mistake you guys made, costing any of my clients the money this repair cost... my employment would have been terminated faster than you can say "oops".
I'm glad you guys got it sorted out, and the repair itself is a testament to your technical abilities. So I'm left confused as to how such a well-put-together crew of people could have fallen victim to such a basic mistake. Still, you appear to be working on ensuring you learn from the mistake, and that is all anyone can really ask.
|
Rip Minner
Gallente ARMITAGE Logistics Salvage and Industries
|
Posted - 2010.07.04 01:53:00 -
[112]
Originally by: Chiana Torrou Contrary to many others who post on the forums I still think you all did a really good job in the face of very difficult circumstances.
Thanks for all the hard work - and the free skill points
This covers it. Is it a rock? Point a Lazer at it and profit. Is it a ship? Point a Lazer at it and profit. I don't really see any difference here. |
Zwaliebaba
|
Posted - 2010.07.05 08:13:00 -
[113]
Originally by: Raquel Smith
Originally by: CCP Fallout As you know, CCP moved the Tranquility servers to a much larger and cooler server room and added new switches in the process. The downtime took longer than expected. CCP Yokai's newest dev blog fills us in on the events of the day.
I work in IT and COMPLETELY UNDERSTAND unforeseen happenings as a result of maintenance!
I've had a routine security update corrupt an entire LDAP database; it caused a week of instability and hassle.
Thanks for the blog post.
Yeah, but then you are changing the application logic, which was not the case here. I was always told to make an offline backup before you change anything, to avoid a roll-forward...
|
Salyan
|
Posted - 2010.07.05 09:20:00 -
[114]
Originally by: Koragoni SkyKnight
Originally by: CCP Yokai Edited by: CCP Yokai on 01/07/2010 13:34:57 Edited by: CCP Yokai on 01/07/2010 13:34:35
Right now... about 2 hours.
You never, and I mean NEVER, migrate a system without a full backup.
This whole thing reminded me of the trouble Microsoft had with losing customer data on their "sidekick" phones. The following link describes (in rumors only) what happened to them:
http://www.linuxtoday.com/high_performance/2009101901035NWMS
At least CCP had a better plan than Microsoft!
|
Libin Herobi
|
Posted - 2010.07.05 11:48:00 -
[115]
Some things take longer to process than others.
Originally by: CCP Yokai
... And yes, having the backup run until just after Down Time is the right way to do it... and it's what we are doing every time now. For today's client patch we did this too. So from this point forward we should have a full copy of the DB at a point where no transactions need to be replayed.
Originally by: CCP Yokai Edited by: CCP Yokai on 01/07/2010 13:34:57 Edited by: CCP Yokai on 01/07/2010 13:34:35
Originally by: Commander Azrael
Perhaps a Dev could let us know how long it takes to do a full TQ DB backup, out of curiosity.
Right now... about 2 hours. But we are working on making this much shorter... if not snap replicated now. Part of the redesign project mentioned.
Originally by: CCP Fallout EVE Online: Tyrannis 1.0.3 will be deployed on Thursday, July 1, 2010. Deployment will start at 11:00 UTC and is expected to be completed at 12:30 UTC. Patch notes are available for review.
Originally by: FailSafe Kari Online and no extensions ^_^ I'm impressed CCP good Work
It looks like you did a 2-hour backup and applied the Tyrannis 1.0.3 patch in a 1.5-hour downtime on July 1st. That is truly remarkable. Some would even say it's impossible...
|
wizard87
|
Posted - 2010.07.05 14:21:00 -
[116]
Edited by: wizard87 on 05/07/2010 14:25:17 New architecture: have a failover SAN that also updates during DT, for additional data redundancy?
New procedure (amazed if its not roughly your normal procedure): Kick out users -> take back-ups -> check integrity of backups -> upgrade -> test -> Rollback to CURRENT backup if needed / or online upgraded cluster.
The "burritos" analogy suggests your backup was not up to date (or checked for data integrity) - that would be pretty criminal for a system with hundreds of thousands of users. Tell me it isn't so.
Or it sounds like you were one step away from a failed disaster recovery and bye-bye EVE data. I wonder if that kind of mismanagement is how EVE will end one day?
Ps - I was a server admin quite a few years back for only a couple of years so may not know what I'm on about with today's hardware.
|
|
CCP Yokai
|
Posted - 2010.07.05 16:17:00 -
[117]
"Kick out users -> take back-ups -> check integrity of backups -> upgrade -> test -> Rollback to CURRENT backup if needed / or online upgraded cluster."
That sounds like the right thing to do on a small or simple DB that can be down a lot longer. On TQ (and again with the current hardware/design)... let me give you the picture of what this would do.
Kick out users -> 11:00 GMT
Take back-ups -> 13:00 GMT complete (as stated previously, this takes 2 hours)
Check integrity of backups -> 18:00 GMT (CHECKDB on an uncorrupted DB in our case takes 5 hours; it takes up to 24 hours on a heavily corrupted DB like the one we dealt with during the outage)
Test -> 19:00 GMT (giving QA some time to make sure it works... hard to check without doing this)
So even on a flawless run we'd have 8 hours of downtime each day.
Again, not a bad method in some cases... but given the size, complexity, and demand for the availability of this DB it really needs more of a live replicated disaster recovery solution instead. Once we have the solutions worked out I'll start a new thread on that. Thanks again for the input from everyone... just thought I'd give a bit of feedback on suggestions as well.
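For reference, the integrity check in that timeline is the standard SQL Server one; a sketch of the two common variants (illustrative names, not our exact maintenance scripts):
-- Full logical + physical consistency check: the ~5 hour run on an uncorrupted DB
DBCC CHECKDB (eve_db) WITH NO_INFOMSGS, ALL_ERRORMSGS;
-- Physical-only variant: reads every page and validates checksums but skips the
-- deeper logical checks, so it finishes considerably faster
DBCC CHECKDB (eve_db) WITH PHYSICAL_ONLY;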
|
|
|
CCP Yokai
|
Posted - 2010.07.05 16:21:00 -
[118]
Originally by: Libin Herobi Some things take longer to process than others.
It looks like you did a 2-hour backup and applied the Tyrannis 1.0.3 patch in a 1.5-hour downtime on July 1st. That is truly remarkable. Some would even say it's impossible...
The backup was started at 09:00 GMT, and the nature of the backup we are doing means changes are added to the backup live as it runs. It was completed at about 11:05 GMT. Backups do not have to start after everyone leaves the game. So, yes, we did a complete backup before the Tyrannis 1.0.3 patch.
I hope that helps clarify.
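In other words, a full backup can run while the cluster is live and still be consistent as of the moment it finishes; roughly (illustrative names, not our exact jobs):
-- Started at 09:00 while players are still online; the resulting backup is
-- consistent as of the time it completes (~11:05), because the portion of the
-- transaction log covering the backup window is included in the backup
BACKUP DATABASE eve_db TO DISK = N'E:\backups\eve_db_full.bak' WITH CHECKSUM;
-- Optional: once downtime starts, a quick log backup captures anything
-- committed between the end of the full backup and shutdown
BACKUP LOG eve_db TO DISK = N'E:\backups\eve_db_tail.trn' WITH CHECKSUM;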
|
|
wizard87
|
Posted - 2010.07.05 17:04:00 -
[119]
So you mean you normally only do incremental backups during DT?
Sounds like some of the clients I used to work for - daily incrementals and weekly full backups, but I always hated those incrementals because the data often (inevitably) turned out to be buggered when you needed it most.
Thanks for being so open about it anyway. I don't know enough about the solution you're using really, so I'm basically a guy down the pub offering advice.
However, I know datacentre cabinets don't come cheap, so space may be an issue, but have you considered redundancy (I think it used to be called) in the form of a failover DB/back end you can flip to, so customers can still use the application and outages don't keep them waiting? You can then sync the data (or flip back to the primary DB/back end, assuming you're using a pretty standard 3-tier architecture) from the next backup/DT after the issues/upgrades/moves etc. are resolved. Something like the sketch below is the kind of thing I mean.
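SQL Server's database mirroring is roughly that idea (a rough sketch only; server names are hypothetical, and I don't know whether it scales to a DB this size):
-- On the mirror server, after restoring the latest full + log backups WITH NORECOVERY:
ALTER DATABASE eve_db SET PARTNER = N'TCP://principal.example.local:5022';
-- On the principal (live) server:
ALTER DATABASE eve_db SET PARTNER = N'TCP://mirror.example.local:5022';
-- Planned maintenance: manually fail over to the mirror, then patch the old principal
ALTER DATABASE eve_db SET PARTNER FAILOVER;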
|
Some Advisor
|
Posted - 2010.07.05 18:17:00 -
[120]
Originally by: Salyan This whole thing reminded me of the trouble Microsoft had with losing customer data on their "sidekick" phones. The following link describes (in rumors only) what happened to them:
http://www.linuxtoday.com/high_performance/2009101901035NWMS
At least CCP had a better plan than Microsoft!
Wow, that's quite deep and lengthy, and very interesting.
--- Donations, thankyou / hatemails always welcome :P if you want to "ragequit" or take a longer break: "can i have your stuff" ? :P i also like BPOs of any kind with the promise you get it back :) |
|