
Koragoni SkyKnight
Posted - 2010.07.05 19:06:00 -
[121]
Originally by: CCP Yokai
"Kick out users -> take back-ups -> check integrity of backups -> upgrade -> test -> Rollback to CURRENT backup if needed / or online upgraded cluster."
That sounds like the right thing to do on a small or simple DB that can be down a lot longer. On TQ (and again with the current hardware/design)... let me give you the picture of what this would do.
Kick out users -> 11:00 GMT
Take back-ups -> 13:00 GMT complete (as stated previously, this takes 2 hours)
Check integrity of backups -> 18:00 (CHECKDB on an uncorrupted DB in our case takes 5 hours; it takes up to 24 hours on a heavily corrupted DB like the one we dealt with during the outage)
Test -> 19:00 (giving QA some time to make sure it works... hard to check without doing this)
So even on a flawless run we'd have 8 hours of downtime each day.
Again, not a bad method in some cases... but given the size, complexity, and demand for the availability of this DB it really needs more of a live replicated disaster recovery solution instead. Once we have the solutions worked out I'll start a new thread on that. Thanks again for the input from everyone... just thought I'd give a bit of feedback on suggestions as well.
I never suggested a live backup of the database for every maintenance cycle; that isn't realistic given the 1-hour window. The server migration, however, is different: YOU defined the downtime requirements. That migration should have started with a backup, and there is no excuse or logic that can change that fact.
Also, if you're still doing hot backups to tape at this point you need even more help. As you've indicated, validating a backup on tape media takes too long. You need hard-disk-based storage large enough to house a single copy of Tranquility; this storage is only needed in the case of a migration. Of course, if you're looking into a SNAP solution you're already moving in this direction.
Tape is only good for long-term archival use and, I submit, useless for the game server. You need that for your financial records, not for the game's database - unless you want to keep copies of the DB around for 10+ years for later study. Not an entirely worthless exercise, especially given the unique nature of the Tranquility cluster.
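For illustration, here is a minimal sketch of the back-up-and-verify sequence quoted above, written against a hypothetical SQL Server setup (the server name, database name, backup path and driver string are placeholders, not CCP's actual configuration). It simply times each step, so the arithmetic behind the 8-hour figure is easy to reproduce in any environment:

```python
# A minimal, illustrative sketch of the "back up -> check integrity" steps
# discussed above. The server name, database name and backup path are
# placeholders, not CCP's real configuration. Requires the pyodbc package
# and an ODBC driver for SQL Server.
import time
import pyodbc

# autocommit is required: BACKUP DATABASE and DBCC CHECKDB cannot run
# inside a user transaction.
conn = pyodbc.connect(
    "DRIVER={SQL Server Native Client 10.0};SERVER=db-host;"
    "DATABASE=master;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

steps = [
    ("backup",
     "BACKUP DATABASE [Tranquility] TO DISK = N'E:\\backup\\tq.bak' WITH CHECKSUM"),
    ("integrity check",
     "DBCC CHECKDB ([Tranquility]) WITH NO_INFOMSGS"),
]

for name, sql in steps:
    started = time.time()
    cur.execute(sql)
    while cur.nextset():   # drain informational result sets so the command finishes
        pass
    print(f"{name} took {(time.time() - started) / 3600.0:.1f} hours")
```

On a multi-terabyte database those two steps alone account for most of the quoted window, which is why a live replicated standby looks more attractive than a backup-verify-restore cycle for every maintenance.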

IngarNT
Minmatar hirr Morsus Mihi
Posted - 2010.07.06 07:15:00 -
[122]
Errors on the SAN? It's Brocade, right?
It will be CRC errors, it always is, or a gammy SFP - in which case you need to set the tin up to auto-block a port over a certain threshold of errors outside of frame and cut an alert - dropping a (redundant) link is a lot better than letting it vomit broken data into the downstream switch.
If it's CRC errors, I have a Perl script which polls and reports on them pretty well. You're welcome to it, as free is considerably cheaper than the mountain of cash you need to license DCFM.
Alas, unless your ops travel at the speed of light, even if you can spot a CRC error you won't be able to intercept it before it hits disk - so yeah, a better database structure to reduce recovery time is probably the best fix path.
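Not the Perl script mentioned above, but for anyone wanting to try something similar, a rough Python sketch of the idea follows: log in to a Brocade switch over SSH, run porterrshow, and flag ports whose CRC counter exceeds a threshold. The host, credentials and the column position of the CRC counter are assumptions and vary by Fabric OS version, so check them against your own switch output:

```python
# A rough sketch (not the Perl script mentioned above) of polling a Brocade
# switch for CRC errors via SSH and 'porterrshow'. Host, credentials and the
# column position of the CRC counter are assumptions and differ between
# Fabric OS versions, so verify against your own switch output.
import paramiko

SWITCH_HOST = "fc-switch-01"       # placeholder
USERNAME = "admin"                 # placeholder
PASSWORD = "changeme"              # placeholder
CRC_THRESHOLD = 10                 # alert when a port exceeds this count

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(SWITCH_HOST, username=USERNAME, password=PASSWORD)

stdin, stdout, stderr = client.exec_command("porterrshow")
for line in stdout.read().decode().splitlines():
    fields = line.split()
    # Data lines start with a port number like "0:"; skip the header rows.
    if not fields or not fields[0].rstrip(":").isdigit():
        continue
    port = fields[0].rstrip(":")
    try:
        crc_errors = int(fields[4])   # assumed column index for 'crc err'
    except (IndexError, ValueError):
        continue                      # counters with k/m suffixes are skipped here
    if crc_errors > CRC_THRESHOLD:
        print(f"ALERT: port {port} has {crc_errors} CRC errors")

client.close()
```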

Sjolus
Metafarmers MeatSausage EXPRESS
Posted - 2010.07.06 11:31:00 -
[123]
Yokai, I have nothing to add other than huge, HUGE props for delivering tasty technical tidbits and information such as this, especially regarding the specifics around prolonged downtimes. This is, at least for me, VERY satisfying to read.
Thank you for delivering actual information <3

CCP Yokai

Posted - 2010.07.06 12:34:00 -
[124]
Originally by: IngarNT
Errors on the SAN? It's Brocade, right? (YES)
It will be CRC errors, it always is, or a gammy SFP - in which case you need to set the tin up to auto-block a port over a certain threshold of errors outside of frame and cut an alert - dropping a (redundant) link is a lot better than letting it vomit broken data into the downstream switch.
If it's CRC errors, I have a Perl script which polls and reports on them pretty well. You're welcome to it, as free is considerably cheaper than the mountain of cash you need to license DCFM.
Alas, unless your ops travel at the speed of light, even if you can spot a CRC error you won't be able to intercept it before it hits disk - so yeah, a better database structure to reduce recovery time is probably the best fix path.
Bonus points for getting it :) Hard stuff... corrupted before you can do anything about it. Best plan... better recovery...

IngarNT
Minmatar hirr Morsus Mihi
Posted - 2010.07.06 18:47:00 -
[125]
Originally by: CCP Yokai
Bonus points for getting it :) Hard stuff... corrupted before you can do anything about it. Best plan... better recovery...
It may be worth asking RamSan if they will support Class 2 I/O traffic. Not many people do (outside of FLOGI processes), but with end-to-end credits and proper 'ACKs' the invalid frames *should* be rejected and retransmitted.

RC Denton
Posted - 2010.07.07 21:54:00 -
[126]
Originally by: CCP Yokai
"Kick out users -> take back-ups -> check integrity of backups -> upgrade -> test -> Rollback to CURRENT backup if needed / or online upgraded cluster."
That sounds like the right thing to do on a small or simple DB that can be down a lot longer. On TQ (and again with the current hardware/design)... let me give you the picture of what this would do.
Kick out users -> 11:00 GMT
Take back-ups -> 13:00 GMT complete (as stated previously, this takes 2 hours)
Check integrity of backups -> 18:00 (CHECKDB on an uncorrupted DB in our case takes 5 hours; it takes up to 24 hours on a heavily corrupted DB like the one we dealt with during the outage)
Test -> 19:00 (giving QA some time to make sure it works... hard to check without doing this)
So even on a flawless run we'd have 8 hours of downtime each day.
Again, not a bad method in some cases... but given the size, complexity, and demand for the availability of this DB it really needs more of a live replicated disaster recovery solution instead. Once we have the solutions worked out I'll start a new thread on that. Thanks again for the input from everyone... just thought I'd give a bit of feedback on suggestions as well.
Mirrored over to a segregated cluster on a different SAN with a witness?
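To make that suggestion concrete: SQL Server's database mirroring (with a witness server for automatic failover) reports its state through the sys.database_mirroring catalog view. A minimal monitoring sketch from Python, with placeholder server names and assuming pyodbc is available, might look like this:

```python
# A minimal sketch of checking SQL Server database mirroring state (the
# "mirrored to a segregated cluster with a witness" idea above), using the
# sys.database_mirroring catalog view. Server and database names are
# placeholders; requires pyodbc and an ODBC driver for SQL Server.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server Native Client 10.0};SERVER=principal-host;"
    "DATABASE=master;Trusted_Connection=yes;"
)
cur = conn.cursor()
cur.execute(
    """
    SELECT DB_NAME(database_id)         AS db,
           mirroring_role_desc          AS role,
           mirroring_state_desc         AS state,
           mirroring_witness_state_desc AS witness
    FROM sys.database_mirroring
    WHERE mirroring_guid IS NOT NULL
    """
)
for db, role, state, witness in cur.fetchall():
    # SYNCHRONIZED with a CONNECTED witness is what permits automatic failover.
    print(f"{db}: role={role}, state={state}, witness={witness}")
    if state != "SYNCHRONIZED" or witness != "CONNECTED":
        print(f"WARNING: {db} is not in a state that permits automatic failover")
```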

Gnulpie
Minmatar Miner Tech
Posted - 2010.07.08 00:27:00 -
[127]
That is pretty awesome stuff here.
A commercial company talking about how they failed in detail etc.
Very open, very clear. Really nice to read (though I don't understand all of the tech-talk) and stunning to see such honest explanation to the community.
Mind you, CCP could have just been silent, or could have released a short news item about it ("oops - something went wrong"). But no, they go into detail and explain things, with really good follow-up feedback in the thread.
That is AWESOME!

Triana
Gallente Aliastra
Posted - 2010.07.08 21:24:00 -
[128]
I might be mistaken here as I don't have very deep knowledge of the structure of the DB, but have you considered using a filesystem with a snapshot facility, like ZFS for example? Just an idea, because I don't know of anything similar in the Windows world (I'm a Solaris admin). Back in the DB world, I have worked with systems running Oracle, MySQL, Sybase, DB2 and MSSQL (though I'm not a DBA, so my knowledge of those is distant); however, I don't remember anybody at my various jobs dissing MSSQL more than the others. They all have their strengths, and being at the moment in a shop that runs MSSQL on W2K3 for some pretty massive DBs (up to 7.5TB), I have yet to hear anyone complain about them (although, as I said earlier, I'm on the Solaris side of things).
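For readers unfamiliar with the idea, here is a tiny sketch of what a ZFS snapshot restore point looks like, assuming (purely for illustration - TQ's database does not live on ZFS) that the DB files sat on a dataset called tank/tq-db and that the database is quiesced first:

```python
# Illustrative only: what a ZFS snapshot-based restore point might look like,
# assuming the DB files lived on a ZFS dataset named 'tank/tq-db' (a made-up
# name). The database should be quiesced or shut down first so the snapshot
# is consistent.
import subprocess
from datetime import datetime

DATASET = "tank/tq-db"                              # placeholder dataset name
snap = f"{DATASET}@pre-maintenance-{datetime.now():%Y%m%d-%H%M}"

# Create the snapshot (nearly instantaneous, copy-on-write).
subprocess.run(["zfs", "snapshot", snap], check=True)

# List existing snapshots for the dataset.
subprocess.run(["zfs", "list", "-t", "snapshot", "-r", DATASET], check=True)

# If the maintenance goes wrong, roll the dataset back to the snapshot:
# subprocess.run(["zfs", "rollback", snap], check=True)
```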
Anyway, I feel for you guys, having been through my fair share of move problems and such. Kudos for the effort. -- War is like any other bad relationship. Of course you want out, but at what price? And perhaps more importantly, once you get out, will you be any better off?

Erik Legant
Posted - 2010.07.09 10:29:00 -
[129]
CCP, thanks for this dev blog.
You did good work recovering the database without doing a full restore.
That being said, next time I'd have nothing against more live news about what's going wrong.
Good luck! -- Erik

Elijaah Bailey
Posted - 2010.07.27 21:38:00 -
[130]
Originally by: Shintai
Originally by: Tanjia Guileless "What are we doing to prevent this?"
Migrating to a serious database product?
Troll detected. Or just someone who has never worked with a DB, or any serious DB at least.
Hey, my DB with 500 entries and nothing else around works all the time. CCP must suck!!
Maybe he means an enterprise-grade database server. The kind that MS SQL is not...