Vultaras
|
Posted - 2010.07.01 12:14:00 -
[91]
Originally by: VaL Iscariot To be honest it made me sick to read all the people whining and complaining about how long it was taking. Everyone was given a full week's warning that this downtime was going to happen, and to be prepared for it. Instead, the so-called 'mature' player base of Eve Online was found to be no more than a bunch of World of Warcraft drop-out whiners. It was only a day and a half and people were up in arms about "HOW DARE CCP TRY TO UPGRADE THE GAME I PLAY EVEN THOUGH THEY'RE DOING THE EXACT THING I WHINE ABOUT AND WISH THEY'D DO ON A DAILY BASIS!! I WANT FREE NAO!!1!" The worst part being that CCP obliged them, thus giving the minority that post on the forums more voice than they deserve.
Next time CCP, dig through the f*cktards and just ignore them. Giving in to them only makes it worse. Also, don't be insulted by them either. They don't know what it takes. (Though I'm sure you're going to find a so-called 'game developer for Call of Duty 4' and a few 'Blizzard programmers' in here too to tell you just how it's done.)
Thanks for all the fish VaL
Thank you, I could not have said it better. You are my hero.
Damn you, childish whiners.
|
Libin Herobi
|
Posted - 2010.07.01 12:36:00 -
[92]
Originally by: CCP Yokai
Originally by: Libin Herobi Looking forward to not receiving any answers in this thread.
Patch day today.
I'll start grinding through responses after downtime today.
Thanks!
Well done! I can post this more often if you think it helps with getting answers...
|
Nofonno
Amarr Aperture Harmonics K162
|
Posted - 2010.07.01 12:58:00 -
[93]
Edited by: Nofonno on 01/07/2010 13:02:02 Edited by: Nofonno on 01/07/2010 13:01:25 EDIT: I fail at posting from work...
Originally by: CCP Yokai
The exact bit that failed cannot be identified because frankly, rather than tinkering with every link, transceiver, and switch port on the route, I just nuked it from orbit. We replaced the fiber, moved to a new pair of transceivers, a new port on the patch panels, and even a new port on the switches.
Once we did this... the errors went away on the SAN and we had storage normalized.
This made me really goggle-eyed. I have never seen or even heard of this being possible. Layer 1 errors are usually handled easily by the OS driver; data corruption shouldn't be possible. Even if FC didn't detect the errors (which it theoretically could miss), it encapsulates SCSI, which has rigorous error checking.
I'm stumped. ---
A scientist must be an optimist at heart - to have the strength to rally against a chorus of voices saying "it cannot be done". |
Ban Doga
|
Posted - 2010.07.01 13:02:00 -
[94]
Edited by: Ban Doga on 01/07/2010 13:02:21
Originally by: CCP Yokai
Originally by: T'ealk O'Neil Edited by: T'ealk O'Neil on 30/06/2010 15:26:17
Originally by: Commander Azrael Apart from a DB backup being massive, they did back it up. If you read the dev blog, they chose the lengthier option of fixing the corrupted entries instead of rolling back. Which do you prefer? An extended downtime, or logging in to find ISK missing from the missions you ran and that shiny ship you bought no longer there?
I suggest you re-read. They had A backup, but if they had taken a backup as the first step before starting any work, then no ISK would have been lost, as nobody would have been logged in between those times.
This is correct; we did have our normal backup, which was a few hours out of date. And yes, having the backup run until just after Down Time is the right way to do it... and it's what we are doing every time now. For today's client patch we did this too. So from this point forward we should have a full copy of the DB at a point where no transactions need to be replayed.
What in the world made you believe you should start such an undertaking without having an up-to-date backup of your data?! Sorry to say this, but I guess a large part of the additional downtime last week was necessary because you guys did not have that, and were thus unable to simply revert the DB.
If you do a once-every-other-year procedure, your first question before touching anything should always be "Do we have a brand-new backup?" But it's great to see you have started to adopt this widespread technique in your daily business.
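For the record, taking and proving that brand-new backup is only two statements on SQL Server; a minimal sketch with purely illustrative database and file names, obviously not CCP's actual setup:
-- Full backup taken immediately before any maintenance work
BACKUP DATABASE eve_db
    TO DISK = N'E:\backups\eve_db_pre_migration.bak'
    WITH CHECKSUM, STATS = 10;
-- Prove the backup file is actually restorable before touching anything
RESTORE VERIFYONLY
    FROM DISK = N'E:\backups\eve_db_pre_migration.bak'
    WITH CHECKSUM;
WITH CHECKSUM makes the backup re-validate page checksums as it reads them, so silent corruption gets caught at backup time rather than at restore time.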
|
|
CCP Yokai
|
Posted - 2010.07.01 13:26:00 -
[95]
Edited by: CCP Yokai on 01/07/2010 13:27:26 Edited by: CCP Yokai on 01/07/2010 13:27:07
Originally by: Nofonno Edited by: Nofonno on 01/07/2010 13:02:02 Edited by: Nofonno on 01/07/2010 13:01:25 EDIT: I fail at posting from work...
Originally by: CCP Yokai
The exact bit that failed cannot be identified because frankly, rather than tinkering with every link, transceiver, and switch port on the route, I just nuked it from orbit. We replaced the fiber, moved to a new pair of transceivers, a new port on the patch panels, and even a new port on the switches.
Once we did this... the errors went away on the SAN and we had storage normalized.
This made me really goggle-eyed. I have never seen or even heard of this being possible. Layer 1 errors are usually handled easily by the OS driver; data corruption shouldn't be possible. Even if FC didn't detect the errors (which it theoretically could miss), it encapsulates SCSI, which has rigorous error checking.
I'm stumped.
It is pretty complicated, which is why it wasn't in the initial posting while we verified. But with hundreds of thousands of both in-frame and out-of-frame errors, and no CRC errors on the corrupted payload, you have the problem we had.
|
|
|
CCP Yokai
|
Posted - 2010.07.01 13:34:00 -
[96]
Edited by: CCP Yokai on 01/07/2010 13:34:57 Edited by: CCP Yokai on 01/07/2010 13:34:35
Originally by: Commander Azrael
Originally by: T'ealk O'Neil Edited by: T'ealk O'Neil on 30/06/2010 15:26:17
Originally by: Commander Azrael Apart from a DB backup being massive, they did back it up. If you read the dev blog, they chose the lengthier option of fixing the corrupted entries instead of rolling back. Which do you prefer? An extended downtime, or logging in to find ISK missing from the missions you ran and that shiny ship you bought no longer there?
I suggest you re-read. They had A backup, but if they had taken a backup as the first step before starting any work, then no ISK would have been lost, as nobody would have been logged in between those times.
Perhaps a Dev could let us know how long it takes to do a full TQ DB backup, out of curiosity.
Right now... about 2 hours. But we are working on making this much shorter... if not snap replicated now. Part of the redesign project mentioned.
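One common way to shrink that window (not necessarily what we'll end up with) is to take the full backup less often and only a differential day to day; a rough sketch with illustrative names:
-- Periodic full backup (the slow ~2 hour run)
BACKUP DATABASE eve_db TO DISK = N'E:\backups\eve_db_full.bak' WITH CHECKSUM;
-- Daily differential: only extents changed since the last full, so far quicker
BACKUP DATABASE eve_db TO DISK = N'E:\backups\eve_db_diff.bak' WITH DIFFERENTIAL, CHECKSUM;
-- Restore path: latest full first (left non-operational), then the latest differential
RESTORE DATABASE eve_db FROM DISK = N'E:\backups\eve_db_full.bak' WITH NORECOVERY, REPLACE;
RESTORE DATABASE eve_db FROM DISK = N'E:\backups\eve_db_diff.bak' WITH RECOVERY;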
|
|
Rashmika Clavain
Gallente
|
Posted - 2010.07.01 13:43:00 -
[97]
Well the main thing is you're adopting new or adapting existing processes to help mitigate future related problems when doing any form of planned outage.
|
Dokten Ral
|
Posted - 2010.07.01 14:05:00 -
[98]
Just wanted to say: THIS kind of great community service and open, honest caring about the community is one of the things that I love most about all you guys at CCP, and is the kind of thing I tell anyone who's interested in EVE about. Not only is it fun and interesting to get a glimpse behind the scenes, but it's great to see that CCP really does care about providing us with great service. CCP certainly doesn't have to provide us with the kind of quality support they do, but I have yet to see anyone else do anywhere near as good a job.
|
Lyonsban
Gallente Wreakage R Us
|
Posted - 2010.07.01 16:36:00 -
[99]
Thanks for the SP. I didn't expect it, wouldn't have missed it, but appreciate it.
I'm glad to see you had and used a fallback plan to "nuke from orbit". You'd be amazed how many projects don't use decent risk management tools. You have my appreciation for your professionalism.
MSSQL? Bah, over-developed, under-planned desktop database. It's like using a successor to Windows 95 on your mainframe. Oh, wait...
|
Beauregard Jackson
Minmatar Old Bastards Club
|
Posted - 2010.07.01 17:14:00 -
[100]
I'll pile on with the kudos. Many thanks to CCP for being forthcoming and honest with what went wrong, and the extreme effort that went into fixing it. Compare that with Google, "Yeah, GMail barfed. It's back now. Carry on."
I know from experience that all-nighters in the server room are not much fun.
|
|
Darek Castigatus
Immortalis Inc. Shadow Cartel
|
Posted - 2010.07.01 17:35:00 -
[101]
I understood pretty much nothing that Yokai has said so far, but I appreciate him saying it anyway; now to go get my resident IT nerd friend to translate it for me. On a slightly more serious note, I don't think I've ever seen any other MMO company be this open about their internal operations, so major props to all CCPeople (hur hur, I made a funny )
Hopefully now it'll shut up those endless 'CCP never tells us anything' whiners
http://desusig.crumplecorn.com/sig.php |
Akita T
Caldari Caldari Navy Volunteer Task Force
|
Posted - 2010.07.02 03:34:00 -
[102]
Recovering from such a massive (and mostly hardware, from the looks of it) failure in less than a day (time from discovery of problem to complete fix with thorough enough testing, that is), that's actually pretty decent. Getting new procedures in place to make sure if such an unlikely event happens again, it will go even faster to fix it, that's even better. Nicer new home for TQ, awesome either way.
Can't say I was very pleased the server was down for so long, but all things considered, thumbs up for an overall good job.
_
Beginner's ISK making guide | Manufacturer's helper | All about reacting _
|
Paladin Taggart
|
Posted - 2010.07.02 04:16:00 -
[103]
Wow. First of all thanks for doing such a great job on such a massive system.
Also I want to praise your honesty. It would have been very easy to stick with a vague answer to all the questions. But instead you admitted that you had a large human error coupled with an ill-timed hardware failure. It takes a great amount of courage to tell us the truth.
<golf_clap>Thank you!</golf_clap>
|
Alain Kinsella
Minmatar
|
Posted - 2010.07.02 06:42:00 -
[104]
Thanks Yokai, very much appreciated, especially by an SL migrant.
Quote:
Right now... about 2 hours. But we are working on making this much shorter... if not snap replicated now.
I find this interesting, since (as a RamSan user) I did not think they had snap features (though we use the SSDs, not their Flash products). Are you using something else in-between?
BTW, we found a product that blew away the RamSan in speed - FusionIO. Any chance you'd drop those into the MS-SQL boxes, if only to enable better local caching? Or do the internal RamSan devices work OK in that respect (per last year's fanfest tech/vendor video)?
|
Pitty Hammerfist
|
Posted - 2010.07.02 07:38:00 -
[105]
Awesome work, must've been a very tense 48 hours.
I know the feeling of watching a network screw up with people getting angry and wondering why, lol, but not to the extent of 400k people screaming :)
|
kKayron Jarvis
Caldari 5th Front enterprises Chain of Chaos
|
Posted - 2010.07.02 08:06:00 -
[106]
On 6/25/2010 I opened a petition. Subject: lost 165 of mechanical parts.
It looks like I have lost 165 or so mechanical parts from ****. It looks like my extractors that were doing base metals have changed into heavy metals extractors.
It looks like it comes down to this, I think.
|
|
CCP Yokai
|
Posted - 2010.07.02 11:09:00 -
[107]
Just wanted to throw out a quick "Thank you" to everyone for posting, replying, helping out. I do appreciate the comments and plan to keep delivering more info on future projects as soon as possible.
I'll be monitoring this thread a bit less than F5 every few hours now... so apologies if I'm delayed on future replies.
Thanks again CCP Yokai
|
|
Lykouleon
Trust Doesn't Rust
|
Posted - 2010.07.03 05:52:00 -
[108]
I'm surprised no one else has asked...
WHAT HAPPENED TO THE SHOE?!?!?!
Quote: CCP Mindstar > Sorry - I've completely messed all that up. lets try again
|
Esrevid Nekkeg
Justified and Ancient
|
Posted - 2010.07.03 14:07:00 -
[109]
Originally by: Lykouleon I'm surprised no one else has asked...
WHAT HAPPENED TO THE SHOE?!?!?!
Didn't you read? It blew out. It probably got overheated too much....
And to the people at CCP who were responsible for this job: I'm very, very pleased to see that you are honest and willing to learn from mistakes made. And above all, are not afraid to admit it! **takes hat off and bows**
|
Silicon Buddha
Amarr Capital Construction Research Pioneer Alliance
|
Posted - 2010.07.03 22:44:00 -
[110]
Originally by: CCP Yokai Right now... about 2 hours. But we are working on making this much shorter... if not snap replicated now. Part of the redesign project mentioned.
Snaps are the way to go. We snap our huge production DBs in a 7x24 enterprise NOC every 4 hours. We don't even bother with tlog backups anymore as we have the fulls. SQL just freezes for a few seconds so that all the data can be committed to disk and then the snapshot is taken and the DB unfrozen. All told takes about 3-5 seconds of "freezing" and all our apps are at least that resilient.
Restores are quick as heck as well (fortunately only needed a few times).
The downside to snaps is the enormous amount of disk space you need to store all those blocks, but it does work very well.
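To be clear, our snaps are taken at the storage layer, but SQL Server can do something analogous on its own with database snapshots; a minimal sketch, with purely illustrative names:
-- Point-in-time, read-only, copy-on-write snapshot of the live database
-- (the logical NAME must match the source database's data file name)
CREATE DATABASE eve_db_snap_0400 ON
    ( NAME = eve_db_data, FILENAME = N'E:\snaps\eve_db_snap_0400.ss' )
AS SNAPSHOT OF eve_db;
-- Revert the whole database to that point in time if something goes wrong
RESTORE DATABASE eve_db FROM DATABASE_SNAPSHOT = 'eve_db_snap_0400';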
I'm sure any of us who love the enterprise geek stuff would LOVE to help out our favorite game if you ever want/need to bounce anything off people. _________________________________________________________ Click here for Fly Reckless Podcast
|
|
Koragoni SkyKnight
|
Posted - 2010.07.04 00:28:00 -
[111]
Originally by: CCP Yokai Edited by: CCP Yokai on 01/07/2010 13:34:57 Edited by: CCP Yokai on 01/07/2010 13:34:35
Right now... about 2 hours. But we are working on making this much shorter... if not snap replicated now. Part of the redesign project mentioned.
This made me raise an eyebrow... as a fellow IT professional, and one that has worked in this industry my entire professional career. Balancing against the fact that I work with small business almost exclusively and this type of hardware is beyond me... I can't help but think you have a massive procedural problem that needs to be addressed.
You never, and I mean NEVER, migrate a system without a full backup. I don't care if it takes 2 hours to copy that database; a copy should have been made before the new rack was brought online. If I had made the mistake you guys made, costing any of my clients the money this repair cost... my employment would have been terminated faster than you can say "oops".
I'm glad you guys got it sorted out, and the repair itself is a testament to your technical abilities. So I'm left confused as to how such a well-put-together crew of people could have fallen victim to such a basic mistake. Still, you appear to be working on ensuring you learn from the mistake, and that is all anyone can really ask.
|
Rip Minner
Gallente ARMITAGE Logistics Salvage and Industries
|
Posted - 2010.07.04 01:53:00 -
[112]
Originally by: Chiana Torrou Contrary to many others who post on the forums I still think you all did a really good job in the face of very difficult circumstances.
Thanks for all the hard work - and the free skill points
This covers it. Is it a rock? Point a Lazer at it and profit. Is it a ship? Point a Lazer at it and profit. I don't really see any difference here. |
Zwaliebaba
|
Posted - 2010.07.05 08:13:00 -
[113]
Originally by: Raquel Smith
Originally by: CCP Fallout As you know, CCP moved the Tranquility servers to a much larger and cooler server room and added new switches in the process. The downtime took longer than expected. CCP Yokai's newest dev blog fills us in on the events of the day.
I work in IT and COMPLETELY UNDERSTAND unforeseen happenings as a result of maintenance!
I've had a routine security update corrupt an entire LDAP database; it caused a week of instability and hassle.
Thanks for the blog post.
Yeah, but then you are changing the application logic, which was not the case here. I was always told to make an offline backup before you change anything, to avoid a roll-forward...
|
Salyan
|
Posted - 2010.07.05 09:20:00 -
[114]
Originally by: Koragoni SkyKnight
Originally by: CCP Yokai Edited by: CCP Yokai on 01/07/2010 13:34:57 Edited by: CCP Yokai on 01/07/2010 13:34:35
Right now... about 2 hours.
You never, and I mean NEVER, migrate a system without a full backup.
This whole thing reminded me of the trouble Microsoft had with losing customer data on their "sidekick" phones. The following link describes (in rumors only) what happened to them:
http://www.linuxtoday.com/high_performance/2009101901035NWMS
At least CCP had a better plan than Microsoft!
|
Libin Herobi
|
Posted - 2010.07.05 11:48:00 -
[115]
Some things take longer to process than others.
Originally by: CCP Yokai
... And yes, having the backup run until just after Down Time is the right way to do it... and it's what we are doing every time now. For today's client patch we did this too. So from this point forward we should have a full copy of the DB at a point where no transactions need to be replayed.
Originally by: CCP Yokai Edited by: CCP Yokai on 01/07/2010 13:34:57 Edited by: CCP Yokai on 01/07/2010 13:34:35
Originally by: Commander Azrael
Perhaps a Dev could let us know how long it takes to do a full TQ DB backup, out of curiosity.
Right now... about 2 hours. But we are working on making this much shorter... if not snap replicated now. Part of the redesign project mentioned.
Originally by: CCP Fallout EVE Online: Tyrannis 1.0.3 will be deployed on Thursday, July 1, 2010. Deployment will start at 11:00 UTC and is expected to be completed at 12:30 UTC. Patch notes are available for review.
Originally by: FailSafe Kari Online and no extensions ^_^ I'm impressed CCP good Work
It looks like you did a 2-hour backup and applied the Tyrannis 1.0.3 patch in a 1.5-hour downtime on July 1st. That is truly remarkable. Some would even say it's impossible...
|
wizard87
|
Posted - 2010.07.05 14:21:00 -
[116]
Edited by: wizard87 on 05/07/2010 14:25:17 New architecture: have a failover SAN that also updates during DT, for additional data redundancy?
New procedure (amazed if its not roughly your normal procedure): Kick out users -> take back-ups -> check integrity of backups -> upgrade -> test -> Rollback to CURRENT backup if needed / or online upgraded cluster.
The "burritos" analogy suggests your backup was not up to date (or checked for data integrity) - that would be pretty criminal for a system with hundreds of thousands of users. Tell me it isn't so.
Or it sounds like you were one step away from a failed disaster recovery and bye-bye EVE data. I wonder if that kind of mismanagement is how EVE will end one day?
Ps - I was a server admin quite a few years back for only a couple of years so may not know what I'm on about with today's hardware.
|
|
CCP Yokai
|
Posted - 2010.07.05 16:17:00 -
[117]
"Kick out users -> take back-ups -> check integrity of backups -> upgrade -> test -> Rollback to CURRENT backup if needed / or online upgraded cluster."
That sounds like the right thing to do on a small or simple DB that can be down a lot longer. On TQ (and again with the current hardware/design)... let me give you the picture of what this would do.
Kick out users -> 11:00 GMT
Take back-ups -> 13:00 GMT complete (as stated previously, this takes 2 hours)
Check integrity of backups -> 18:00 GMT (CHECKDB on an uncorrupted DB in our case takes 5 hours; it takes up to 24 hours on a heavily corrupted DB like the one we dealt with during the outage)
Test -> 19:00 GMT (giving QA some time to make sure it works... hard to check without doing this)
So even on a flawless run we'd have 8 hours of downtime each day.
Again, not a bad method in some cases... but given the size, complexity, and demand for the availability of this DB it really needs more of a live replicated disaster recovery solution instead. Once we have the solutions worked out I'll start a new thread on that. Thanks again for the input from everyone... just thought I'd give a bit of feedback on suggestions as well.
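For reference, the integrity check in that timeline is the standard SQL Server one; a sketch of the two common variants (illustrative names, not our exact maintenance scripts):
-- Full logical + physical consistency check: the ~5 hour run on an uncorrupted DB
DBCC CHECKDB (eve_db) WITH NO_INFOMSGS, ALL_ERRORMSGS;
-- Physical-only variant: reads every page and validates checksums but skips the
-- deeper logical checks, so it finishes considerably faster
DBCC CHECKDB (eve_db) WITH PHYSICAL_ONLY;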
|
|
|
CCP Yokai
|
Posted - 2010.07.05 16:21:00 -
[118]
Originally by: Libin Herobi Some things take longer to process than others.
It looks like you did a 2-hour backup and applied the Tyrannis 1.0.3 patch in a 1.5-hour downtime on July 1st. That is truly remarkable. Some would even say it's impossible...
The backup was started at 09:00 GMT, and the nature of the backup we are doing means changes are added to the backup live as it runs. It was completed at about 11:05 GMT. Backups do not have to start after everyone leaves the game. So, yes, we did a complete backup before the Tyrannis 1.0.3 patch.
I hope that helps clarify.
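In other words, a full backup can run while the cluster is live and still be consistent as of the moment it finishes; roughly (illustrative names, not our exact jobs):
-- Started at 09:00 while players are still online; the resulting backup is
-- consistent as of the time it completes (~11:05), because the portion of the
-- transaction log covering the backup window is included in the backup
BACKUP DATABASE eve_db TO DISK = N'E:\backups\eve_db_full.bak' WITH CHECKSUM;
-- Optional: once downtime starts, a quick log backup captures anything
-- committed between the end of the full backup and shutdown
BACKUP LOG eve_db TO DISK = N'E:\backups\eve_db_tail.trn' WITH CHECKSUM;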
|
|
wizard87
|
Posted - 2010.07.05 17:04:00 -
[119]
So you mean you normally only do incremental backups during DT?
Sounds like some of the clients I used to work for - daily incrementals and weekly full backups, but I always hated those incrementals because the data often (inevitably) turned out to be buggered when you needed it most.
Thanks for being so open about it anyway. I don't know enough about the solution you're using really, so I'm basically a guy down the pub offering advice.
However, I know datacentre cabinets don't come cheap, so space may be an issue, but have you considered redundancy (I think it used to be called) in the form of a failover DB/back end you can flip to, so customers can still use the application and outages don't keep them waiting? You can then sync the data (or flip back to the primary DB/back end, assuming you're using a pretty standard 3-tier architecture) from the next backup/DT after the issues/upgrades/moves etc. are resolved. Something like the sketch below is the kind of thing I mean.
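SQL Server's database mirroring is roughly that idea (a rough sketch only; server names are hypothetical, and I don't know whether it scales to a DB this size):
-- On the mirror server, after restoring the latest full + log backups WITH NORECOVERY:
ALTER DATABASE eve_db SET PARTNER = N'TCP://principal.example.local:5022';
-- On the principal (live) server:
ALTER DATABASE eve_db SET PARTNER = N'TCP://mirror.example.local:5022';
-- Planned maintenance: manually fail over to the mirror, then patch the old principal
ALTER DATABASE eve_db SET PARTNER FAILOVER;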
|
Some Advisor
|
Posted - 2010.07.05 18:17:00 -
[120]
Originally by: Salyan This whole thing reminded me of the trouble Microsoft had with losing customer data on their "sidekick" phones. The following link describes (in rumors only) what happened to them:
http://www.linuxtoday.com/high_performance/2009101901035NWMS
At least CCP had a better plan than Microsoft!
Wow, that's quite deep and lengthy, and very interesting.
--- Donations, thankyou / hatemails always welcome :P if you want to "ragequit" or take a longer break: "can i have your stuff" ? :P i also like BPOs of any kind with the promise you get it back :) |
|