EVE Search - Sticky:Dev blog: Tranquility Tech III

Disco Dancer Dancing wrote:

Being someone that builds complex, large datacenters for both private and public use on a rather period basis, I'm not that impressed on the path of physical architecture that you are looking at. For whatever reason, why are you looking at a traditional, silo-based solution with storage and compute in different silos and traversing a "slow" FC link to make it work when not hitting the cache in RAM. Any particular reason why you are not looking on a more modern, flexible and scalable platform then the one described in the blogpost?

Seeing that everything not in the RAM-cache have to traverse the FC switch we can quickly give a few numbers on the actual latency and round-trip on several different ways of accessing data on different locations

L1 cache reference 0.5 ns
Branch Mispredict 5 ns
L2 cache reference 7 ns 14x L1 cache
Mutex lock/unlock 25 ns
Main memory reference 100 ns 20x L2 cache, 200x L1 cache
Compress 1KB with Zippy 3,000 ns
Sent 1KB over 1Gbps network 10,000 ns 0.01 ms
Read 4K randomly from SSD 150,000 ns 0.15 ms
Read 1MB sequentially from memory250,000 ns 0.25 ms
Round trip within datacenter 500,000 ns 0.5 ms
Read 1MB sequentially from SSD 1,000,000 ns 1 ms, 4x memory
Disk seek 10,000,000 ns10 ms, 20x datacenter round trip
Read 1MB sequentially from disk 20,000,000 ns20 ms, 80x memory, 20x SSD
Send packet CA -> Netherlands -> CA150,000,000 ns150 ms

Looking at the figures, as soon as we start to traverse several layers we add up latency on the whole request, if we need to traverse the FC, to hit the storage nodes, then hit the disk and back, latency can rather quickly add up. Keeping the data as local as possible is the key, mainly in Memory, or as close to the node as possible without traversing the network (Sure, FC is stable, proven and gives a rather low latency, but from other standpoint you could argue that it is dead in the upcoming years as we are moving towards utilizing RDMA over Converged Fabrics or the like).

On another note, if we look on a mainstream enterprise SSD we can find a few figures:
500MB/s Read and 460MB/s Write
If we put these into the following calculation to see when we saturate a traditional storage network:
numSSD = ROUNDUP((numConnections * connBW (in GB/s))/ ssdBW (R or W))

We get the following table:
Network BW SSDs required to saturate network BW
Controller ConnectivityAvailable Network BWRead I/OWrite I/O
Dual 4Gb FC 8Gb == 1GB 2 3
Dual 8Gb FC 16Gb == 2GB 4 5
Dual 16Gb FC 32Gb == 4GB 8 9
Dual 1Gb ETH 2Gb == 0.25GB 1 1
Dual 10Gb ETH 20Gb == 2.5GB 5 6

This is without taking into account the roundtrip to access the data, and is counting with unlimited CPU power as this can also become saturated. We can see that we don't need that many SSD to saturate a network. Key point here, try and keep the data as local as possible, once again not traversing the network with the added latency and network limitation.

We can also do calculations on difference on hitting per say a local storage cache in the memory, or hitting a remote storage cache (Per say, SAN controller with caching). I know from the top of my head which are the fastest, key point, once again, keep the data as local as possible.

There are several technologies pin-pointing these issues that are seen in traditional silo datacenters, have you looked at any, and if so what is the reason that these do not fit your needs?