The Cerberus Project

Methods to Improve Performance

Cerberus underwent a wide variety of performance testing to ensure that as much performance as possible was being extracted from the available hardware.

As previously mentioned, the project started with a cluster-wide throughput of 4 frames per second. To improve this result, extensive testing got underway.

Hypervisor Implementation

While massive compute nodes are the backbone of the system, the code and images being processed could never take advantage of 100% of a compute node's resources. This led to virtualization, or splitting the compute resources of each bare-metal server among more than one operating system. Using VMware ESXi initially and later transitioning to oVirt, several operating system instances could be spun up on each compute node, making far fuller use of system resources. This change allowed 2 physical compute nodes to serve 8 VMs (virtual machines) each, providing a total of 16 virtual compute nodes for the cluster and yielding massive performance gains. When testing concluded, the project could process between 12 and 13 frames per second, a massive improvement over the original 4.
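To illustrate how VMs like these can be provisioned programmatically, the sketch below uses oVirt's Python SDK (ovirtsdk4) to clone eight compute VMs from a template; the engine URL, credentials, cluster, and template names are placeholders rather than the project's actual values.

    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    # Connect to the oVirt engine (URL and credentials are placeholders).
    connection = sdk.Connection(
        url='https://engine.example.local/ovirt-engine/api',
        username='admin@internal',
        password='secret',
        insecure=True,  # lab setup; skip TLS certificate verification
    )
    vms_service = connection.system_service().vms_service()

    # Spin up 8 VMs on a node from a prebuilt compute-node template.
    for i in range(8):
        vms_service.add(
            types.Vm(
                name='compute-vm-%02d' % i,
                cluster=types.Cluster(name='Default'),
                template=types.Template(name='compute-node-template'),
            )
        )

    connection.close()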

Network Efficiency

To maximize network throughput, some precautions typically taken with network copying were removed. While tools like rsync and scp copy data reliably with decent bandwidth, still more bandwidth was required. This led to a networking shift that prioritized raw performance over in-transfer reliability: since the nodes on both sides of a transfer already error check the data, the extra time spent error checking while copying is extraneous.
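A minimal sketch of this end-to-end check, assuming each side hashes its copy of the file after a fast, unverified transfer, is shown below; the paths and helper function are illustrative, not taken from the project's code.

    import hashlib

    def file_sha256(path, chunk_size=1 << 20):
        """Hash a file in 1 MiB chunks so large frames never sit fully in RAM."""
        digest = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                digest.update(chunk)
        return digest.hexdigest()

    # Each node computes a hash of its own copy; the two values are then
    # compared (e.g. over the network), and a mismatch triggers a re-copy.
    if file_sha256('/data/frame_0001.exr') != file_sha256('/mnt/dest/frame_0001.exr'):
        raise IOError('transfer corrupted; re-copy the frame')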

This led to the implementation of bbcp, an open-source copy utility that opens multiple parallel TCP streams to saturate the connection to the destination. Using it on every single node increased copy performance from 8 MB/s to speeds near 110 MB/s, effectively saturating the 1 GbE links. This increase exposed the true bottleneck: the 1 GbE connections at every point in the network.
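As a sketch of how such a transfer might be launched from a node, the snippet below shells out to bbcp with several parallel streams and an enlarged TCP window; the host name, paths, and tuning values are assumptions for illustration, not the project's recorded settings.

    import subprocess

    # -s sets the number of parallel TCP streams, -w the TCP window size,
    # and -P 2 prints progress every 2 seconds. All values and paths here
    # are illustrative placeholders.
    subprocess.run(
        [
            'bbcp',
            '-P', '2',
            '-s', '8',      # 8 parallel streams to fill the 1 GbE pipe
            '-w', '2M',     # 2 MiB TCP window per stream
            '/data/frames/frame_0001.exr',
            'render@node02:/data/frames/',
        ],
        check=True,
    )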