
Koragoni SkyKnight
Posted - 2010.07.05 19:06:00 -
[121]
Originally by: CCP Yokai
"Kick out users -> take back-ups -> check integrity of backups -> upgrade -> test -> Rollback to CURRENT backup if needed / or online upgraded cluster."
That sounds like the right thing to do on a small or simple DB that can be down a lot longer. On TQ (and again with the current hardware/design)... let me give you the picture of what this would do.
Kick out users -> 11:00 GMT
Take back-ups -> 13:00 GMT complete (as stated previously, this takes 2 hours)
Check integrity of backups -> 18:00 (CHECKDB on an uncorrupted DB in our case takes 5 hours; it takes up to 24 hours on a heavily corrupted DB like the one we dealt with during the outage)
Test -> 19:00 (giving QA some time to make sure it works... hard to check without doing this)
So even on a flawless run we'd have 8 hours of downtime each day.
Again, not a bad method in some cases... but given the size, complexity, and demand for the availability of this DB it really needs more of a live replicated disaster recovery solution instead. Once we have the solutions worked out I'll start a new thread on that. Thanks again for the input from everyone... just thought I'd give a bit of feedback on suggestions as well.
I never suggested a live backup of the database for every maintenance cycle; that isn't realistic given the 1-hour window. The server migration, however, is different: YOU defined the downtime requirements. That migration should have started with a backup, and there is no excuse or logic that can change that fact.
Also, if you're still doing hot backups to tape at this point you need even more help. As you've indicated, validating a backup on tape media takes too long. You need hard-disk-based storage large enough to house a single copy of Tranquility; this storage is only needed in the case of a migration. Of course, if you're looking into a SNAP solution you're already moving in this direction.
Tape is only good for long-term archival use and, I submit, useless for the game server. You need that for your financial records, not for the game's database - unless you want to keep copies of the DB around for 10+ years for later study. Not an entirely worthless exercise, especially given the unique nature of the Tranquility cluster.
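For illustration, here is a minimal sketch of the back-up-and-verify sequence quoted above, written against a hypothetical SQL Server setup (the server name, database name, backup path and driver string are placeholders, not CCP's actual configuration). It simply times each step, so the arithmetic behind the 8-hour figure is easy to reproduce in any environment:

```python
# A minimal, illustrative sketch of the "back up -> check integrity" steps
# discussed above. The server name, database name and backup path are
# placeholders, not CCP's real configuration. Requires the pyodbc package
# and an ODBC driver for SQL Server.
import time
import pyodbc

# autocommit is required: BACKUP DATABASE and DBCC CHECKDB cannot run
# inside a user transaction.
conn = pyodbc.connect(
    "DRIVER={SQL Server Native Client 10.0};SERVER=db-host;"
    "DATABASE=master;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

steps = [
    ("backup",
     "BACKUP DATABASE [Tranquility] TO DISK = N'E:\\backup\\tq.bak' WITH CHECKSUM"),
    ("integrity check",
     "DBCC CHECKDB ([Tranquility]) WITH NO_INFOMSGS"),
]

for name, sql in steps:
    started = time.time()
    cur.execute(sql)
    while cur.nextset():   # drain informational result sets so the command finishes
        pass
    print(f"{name} took {(time.time() - started) / 3600.0:.1f} hours")
```

On a multi-terabyte database those two steps alone account for most of the quoted window, which is why a live replicated standby looks more attractive than a backup-verify-restore cycle for every maintenance.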

IngarNT
Minmatar hirr Morsus Mihi
Posted - 2010.07.06 07:15:00 -
[122]
Errors on the SAN? It's Brocade, right?
It will be CRC errors, it always is, or a gammy SFP - in which case you need to set the tin up to auto-block a port over a certain threshold of errors outside of frame and cut an alert - dropping a (redundant) link is a lot better than letting it vomit broken data into the downstream switch.
If it's CRC errors, I have a Perl script which polls and reports on them pretty well. You're welcome to it, as free is considerably cheaper than the mountain of cash you need to license DCFM.
Alas, unless your ops travel at the speed of light, even if you can spot a CRC error you won't be able to intercept it before it hits disk - so yeah, a better database structure to reduce recovery time is probably the best fix path.
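Not the Perl script mentioned above, but for anyone wanting to try something similar, a rough Python sketch of the idea follows: log in to a Brocade switch over SSH, run porterrshow, and flag ports whose CRC counter exceeds a threshold. The host, credentials and the column position of the CRC counter are assumptions and vary by Fabric OS version, so check them against your own switch output:

```python
# A rough sketch (not the Perl script mentioned above) of polling a Brocade
# switch for CRC errors via SSH and 'porterrshow'. Host, credentials and the
# column position of the CRC counter are assumptions and differ between
# Fabric OS versions, so verify against your own switch output.
import paramiko

SWITCH_HOST = "fc-switch-01"       # placeholder
USERNAME = "admin"                 # placeholder
PASSWORD = "changeme"              # placeholder
CRC_THRESHOLD = 10                 # alert when a port exceeds this count

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(SWITCH_HOST, username=USERNAME, password=PASSWORD)

stdin, stdout, stderr = client.exec_command("porterrshow")
for line in stdout.read().decode().splitlines():
    fields = line.split()
    # Data lines start with a port number like "0:"; skip the header rows.
    if not fields or not fields[0].rstrip(":").isdigit():
        continue
    port = fields[0].rstrip(":")
    try:
        crc_errors = int(fields[4])   # assumed column index for 'crc err'
    except (IndexError, ValueError):
        continue                      # counters with k/m suffixes are skipped here
    if crc_errors > CRC_THRESHOLD:
        print(f"ALERT: port {port} has {crc_errors} CRC errors")

client.close()
```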

Sjolus
Metafarmers MeatSausage EXPRESS
Posted - 2010.07.06 11:31:00 -
[123]
Yokai, I have nothing to add other than huge, HUGE props for delivering tasty technical tidbits and information such as this, especially regarding the specifics around prolonged downtimes. This is, at least for me, VERY satisfying to read.
Thank you for delivering actual information <3

CCP Yokai

Posted - 2010.07.06 12:34:00 -
[124]
Originally by: IngarNT
Errors on the SAN? It's Brocade, right? (YES)
It will be CRC errors, it always is, or a gammy SFP - in which case you need to set the tin up to auto-block a port over a certain threshold of errors outside of frame and cut an alert - dropping a (redundant) link is a lot better than letting it vomit broken data into the downstream switch.
If it's CRC errors, I have a Perl script which polls and reports on them pretty well. You're welcome to it, as free is considerably cheaper than the mountain of cash you need to license DCFM.
Alas, unless your ops travel at the speed of light, even if you can spot a CRC error you won't be able to intercept it before it hits disk - so yeah, a better database structure to reduce recovery time is probably the best fix path.
Bonus points for getting it :) Hard stuff... corrupted before you can do anything about it. Best plan... better recovery...

IngarNT
Minmatar hirr Morsus Mihi
Posted - 2010.07.06 18:47:00 -
[125]
Originally by: CCP Yokai
Bonus points for getting it :) Hard stuff... corrupted before you can do anything about it. Best plan... better recovery...
It may be worth asking RamSan if they will support Class 2 I/O traffic. Not many people do (outside of FLOGI processes), but with end-to-end credits and proper 'ACKs' the invalid frames *should* be rejected and retransmitted.

RC Denton
Posted - 2010.07.07 21:54:00 -
[126]
Originally by: CCP Yokai
"Kick out users -> take back-ups -> check integrity of backups -> upgrade -> test -> Rollback to CURRENT backup if needed / or online upgraded cluster."
That sounds like the right thing to do on a small or simple DB that can be down a lot longer. On TQ (and again with the current hardware/design)... let me give you the picture of what this would do.
Kick out users -> 11:00 GMT
Take back-ups -> 13:00 GMT complete (as stated previously, this takes 2 hours)
Check integrity of backups -> 18:00 (CHECKDB on an uncorrupted DB in our case takes 5 hours; it takes up to 24 hours on a heavily corrupted DB like the one we dealt with during the outage)
Test -> 19:00 (giving QA some time to make sure it works... hard to check without doing this)
So even on a flawless run we'd have 8 hours of downtime each day.
Again, not a bad method in some cases... but given the size, complexity, and demand for the availability of this DB it really needs more of a live replicated disaster recovery solution instead. Once we have the solutions worked out I'll start a new thread on that. Thanks again for the input from everyone... just thought I'd give a bit of feedback on suggestions as well.
Mirrored over to a segregated cluster on a different SAN with a witness?
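To make that suggestion concrete: SQL Server's database mirroring (with a witness server for automatic failover) reports its state through the sys.database_mirroring catalog view. A minimal monitoring sketch from Python, with placeholder server names and assuming pyodbc is available, might look like this:

```python
# A minimal sketch of checking SQL Server database mirroring state (the
# "mirrored to a segregated cluster with a witness" idea above), using the
# sys.database_mirroring catalog view. Server and database names are
# placeholders; requires pyodbc and an ODBC driver for SQL Server.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server Native Client 10.0};SERVER=principal-host;"
    "DATABASE=master;Trusted_Connection=yes;"
)
cur = conn.cursor()
cur.execute(
    """
    SELECT DB_NAME(database_id)         AS db,
           mirroring_role_desc          AS role,
           mirroring_state_desc         AS state,
           mirroring_witness_state_desc AS witness
    FROM sys.database_mirroring
    WHERE mirroring_guid IS NOT NULL
    """
)
for db, role, state, witness in cur.fetchall():
    # SYNCHRONIZED with a CONNECTED witness is what permits automatic failover.
    print(f"{db}: role={role}, state={state}, witness={witness}")
    if state != "SYNCHRONIZED" or witness != "CONNECTED":
        print(f"WARNING: {db} is not in a state that permits automatic failover")
```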

Gnulpie
Minmatar Miner Tech
Posted - 2010.07.08 00:27:00 -
[127]
That is pretty awesome stuff here.
A commercial company talking about how they failed in detail etc.
Very open, very clear. Really nice to read (though I don't understand all of the tech-talk) and stunning to see such honest explanation to the community.
Mind you, CCP could have just been silent, or could have released a short news item about it ("oops - something went wrong"). But no, they go into detail and explain things, with really good follow-up feedback in the thread.
That is AWESOME!

Triana
Gallente Aliastra
Posted - 2010.07.08 21:24:00 -
[128]
I might be mistaken here as I don't have very deep knowledge of the structure of the DB, but have you considered using a filesystem with a snapshot facility, like ZFS for example? Just an idea, because I don't know of anything similar in the Windows world (I'm a Solaris admin). Back in the DB world, I have worked with systems running Oracle, MySQL, Sybase, DB2 and MSSQL (though I'm not a DBA, so my knowledge of those is distant); however, I don't remember anybody at my various jobs dissing MSSQL more than the others. They all have their strengths, and being at the moment in a shop that runs MSSQL on W2K3 for some pretty massive DBs (up to 7.5TB), I have yet to hear anyone complain about them (although, as I said earlier, I'm on the Solaris side of things).
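For readers unfamiliar with the idea, here is a tiny sketch of what a ZFS snapshot restore point looks like, assuming (purely for illustration - TQ's database does not live on ZFS) that the DB files sat on a dataset called tank/tq-db and that the database is quiesced first:

```python
# Illustrative only: what a ZFS snapshot-based restore point might look like,
# assuming the DB files lived on a ZFS dataset named 'tank/tq-db' (a made-up
# name). The database should be quiesced or shut down first so the snapshot
# is consistent.
import subprocess
from datetime import datetime

DATASET = "tank/tq-db"                              # placeholder dataset name
snap = f"{DATASET}@pre-maintenance-{datetime.now():%Y%m%d-%H%M}"

# Create the snapshot (nearly instantaneous, copy-on-write).
subprocess.run(["zfs", "snapshot", snap], check=True)

# List existing snapshots for the dataset.
subprocess.run(["zfs", "list", "-t", "snapshot", "-r", DATASET], check=True)

# If the maintenance goes wrong, roll the dataset back to the snapshot:
# subprocess.run(["zfs", "rollback", snap], check=True)
```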
Anyway, I feel for you guys, having been through my fair share of move problems and such. Kudos for the effort. -- War is like any other bad relationship. Of course you want out, but at what price? And perhaps more importantly, once you get out, will you be any better off?

Erik Legant
Posted - 2010.07.09 10:29:00 -
[129]
CCP, thanks for this dev blog.
You did good work recovering the database without doing a full restore.
That being said, next time I'd have nothing against more live news about what's going wrong.
Good luck! -- Erik

Elijaah Bailey
Posted - 2010.07.27 21:38:00 -
[130]
Originally by: Shintai
Originally by: Tanjia Guileless "What are we doing to prevent this?"
Migrating to a serious database product?
Troll detected. Or just someone who has never worked with a DB, or any serious DB at least.
Hey, my DB with 500 entries and nothing else around works all the time. CCP must suck!!
Maybe he means an enterprise-grade database server. The kind that MS SQL is not...