
Indeterminacy
THORN Syndicate Controlled Chaos
Posted - 2010.08.17 17:26:00 -
[2]
Originally by: CCP Warlock
Originally by: Indeterminacy Edited by: Indeterminacy on 17/08/2010 15:15:47 I see that CCP has been unable to maintain a test server that reflects their production system, which would enable timely and effective problem solving.
I am not surprised. 'Test' servers are often given low priority. They are also difficult to maintain in a parallel computing environment.
You (CCP) have an advantage over many HPC shops. You have a single production environment (I hope): one compute node configuration you have to mimic, one network, etc.
Hopefully someone will be charged with maintaining an upgraded test infrastructure (both the hardware and human aspects).
edit:spelling
Realistic testing of large-scale, real-time, distributed systems has been a perennial problem for decades. The reality has always been that the live system is larger and more complex than any realistic test system could be. Putting an individual server under load is doable, but testing a network of servers and putting it under full capacity is much, much harder. To give the truly extreme example: where is the test system for the entire Internet?
Things have started to get a little better this decade with the hardware price drops and the data centers. The thin clients (which you'll be hearing about later this week) are a major step forward for us, and we're all really excited about the results we're going to get out of them. Even so, we have work in progress to improve our ability to handle and set up tests with large numbers of clients, in and of itself a non-trivial problem. For example, what sort of ship fitting should we set up for any given test? Then we cycle through a bunch of them with the same test, and compare the results across a pretty large set of collected data on a wide variety of system parameters, which in an ideal world should also be done completely automatically.
Testing these kinds of systems, especially in terms of scaling and load limits is a set of problems in and of itself.
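The fitting-cycle testing Warlock describes is basically a parameter sweep: run the same automated test for every combination of fittings and compare the collected metrics. A minimal sketch of that idea (all names and numbers here are invented for illustration; this is not CCP's actual harness, and the metrics are faked deterministically in place of real server measurements):

```python
import itertools

# Hypothetical fitting parameters to sweep -- illustrative only,
# not actual EVE ship attributes.
SHIP_HULLS = ["frigate", "cruiser", "battleship"]
WEAPON_FITS = ["lasers", "missiles"]

def run_load_test(hull, weapons, n_clients=100):
    """Stand-in for one automated test run. A real harness would drive
    n_clients thin clients against a test server and collect server-side
    metrics; here we just return a deterministic fake tick time (ms)."""
    base = {"frigate": 10.0, "cruiser": 15.0, "battleship": 25.0}[hull]
    penalty = 2.0 if weapons == "missiles" else 0.0
    return base + penalty + n_clients * 0.01

def sweep():
    """Cycle every fitting combination through the same test and
    gather the results for side-by-side comparison."""
    results = {}
    for hull, weapons in itertools.product(SHIP_HULLS, WEAPON_FITS):
        results[(hull, weapons)] = run_load_test(hull, weapons)
    return results

if __name__ == "__main__":
    results = sweep()
    worst = max(results, key=results.get)
    print(f"{len(results)} fittings tested; slowest: {worst} at {results[worst]:.2f} ms/tick")
```

The point of the structure is that the sweep loop and the comparison are fully automatic once the fitting list is defined, which is exactly the "completely automatically" part Warlock is after.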
Agreed. Are you able to virtualize multiple instances of the thin client on a piece of hardware? Or run multiple instances of the process on a single node?
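Packing multiple client processes onto one node is mechanically simple, at least: fan out N OS processes and collect what they report. A toy sketch of the process-per-client idea (the child here is a trivial Python one-liner standing in for a headless thin-client binary, which is purely my assumption about how such a setup might look):

```python
import subprocess
import sys

def launch_clients(n):
    """Launch n separate OS processes on one node. Each child stands in
    for a headless thin-client instance; a real setup would exec the
    client executable here instead of a Python one-liner."""
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", f"print('client {i} up')"],
            stdout=subprocess.PIPE,
            text=True,
        )
        for i in range(n)
    ]
    # Wait for each child and collect its (single line of) output.
    return [p.communicate()[0].strip() for p in procs]

if __name__ == "__main__":
    for line in launch_clients(3):
        print(line)
```

Whether this scales on real hardware depends on the client's per-process memory and CPU footprint, which is presumably the interesting number CCP would be measuring.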
Originally by: CCP Oveur
To be clear, the hardware gap we're addressing now isn't our main problem with reflecting our production environment. It's the 50,000 people playing at the same time, 1,000 of whom might be trying to shoot each other in the face in the same solar system. So compared to that crucial part of replicating what happens to TQ in a controlled test environment, the hardware maintenance is but one of very many factors.
Also, agreed.
You're now devoting time to an infrastructure which won't get you more customers next week. You can't write a sexah blog about it (geeks and nerds excepted), and it creates more work for those involved, with none of that instant gratification.
The complicating factors you've described are felt everywhere this is done. What parameters do I give my simulation? What file system [node] do I use for I/O? And so on.
From what I've seen, however, many HPC outfits also run multiple variations (hardware and software) of compute, I/O, and database nodes. Hopefully you do not have this specific problem, which would only further complicate an already tough one. I was kinda fishing for an answer to that, I guess.
But in the long run, once it's in place and given some maintenance, it's a huge benefit.
I'm not surprised about any of this as I've been on both ends of it. That is, the win and fail of test systems in a parallel / HPC environment.