The Cerberus Project

Chapter 4.2: Provisioning the Cluster

Provisioning the cluster is an in-depth process that takes a brand-new machine (i.e., in its "bare metal" state) and brings it fully up to speed by installing an operating system along with all of the updates, software packages, programs, and kernels necessary for it to serve as a functioning node of any role in the cluster. This process is handled in several stages, described below.

Operating System Provisioning

Provisioning each server from its bare-metal state to booting with an operating system is a relatively straightforward task in CentOS. Using a dedicated software-repository headnode, known as the Logcat, each machine is configured to boot over the network.

This process, known as PXE boot, allows the server to contact the provisioning server directly and pull the correct OS image for booting. The process takes some time but uses a previously generated configuration file to automate the installation efficiently. Users, services, and drive configurations are all set up automatically, and even the hostname of the machine is set. Once all of this is confirmed, the machine is left to boot into CentOS on its own and then notifies the head admin when the OS configuration is complete.

This process allows full control over how each type of node is provisioned, and it lets either physical machines or VMs be provisioned quickly and easily. OS images exist for all node types, whether a new headnode, storage node, or compute node is entering the cluster. In addition, customized initialization scripts exist for each type of node and are run accordingly once the OS install is complete.

The process concludes with a manual installation of Ansible from a configuration file hosted within the image, in preparation for the updates provisioning that follows.

Updates, Packages and Other Software

This stage is an in-depth operation that uses the previously mentioned Ansible service to finalize each node, installing the proper packages and dependencies so the node can properly serve the Cerberus program.

Ansible, as previously mentioned, operates in both ad-hoc and playbook modes. This stage is driven entirely by previously configured playbooks, which execute a variety of operations and scripts to configure each node with all of the software it needs. The same general flow is maintained no matter what type of node is being provisioned.

It is important to note, however, that separate playbooks are maintained for each type of node being provisioned. For example, compute nodes require very different software packages than a new headnode coming into production. Playbooks are written to reflect these differences.
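
As a hedged illustration of how these differences might be expressed, the sketch below targets separate inventory groups with separate plays; the group names and role names are assumptions for illustration, not the actual Cerberus playbooks.

  # Illustrative only: hypothetical inventory groups and roles
  - name: Provision compute nodes
    hosts: compute
    become: true
    roles:
      - common          # software shared by every node type
      - compute_stack   # compute-specific packages and configuration

  - name: Provision headnodes
    hosts: headnodes
    become: true
    roles:
      - common
      - headnode_stack  # scheduler, repository and provisioning services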

Playbooks are written in YAML and allow a wide variety of tasks to be accomplished. For instance, a single playbook can test ping, verify the hostname, write the correct values to a log file, and then install packages with YUM or update other existing packages, all in one run.

Generally, the typical playbook flow is as follows (a sketch of such a playbook appears after the list):

  1. Check that the machine responds to ping
  2. Check that the hostname matches the expected hostname
  3. Update the system with ALL available CentOS updates
  4. Reboot
  5. Install all specified system programs
  6. Reboot
  7. Install all specified Cerberus dependencies
  8. Check for updates
  9. Echo success and notify the admin
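
A hedged sketch of a playbook following this flow is shown below. The inventory group, variable names, and module options are assumptions for illustration rather than the production Cerberus playbooks.

  - name: Provision a Cerberus node (illustrative sketch)
    hosts: new_nodes                  # hypothetical inventory group
    become: true
    tasks:
      - name: 1. Check that the machine responds
        ansible.builtin.ping:

      - name: 2. Verify the hostname matches the expected value
        ansible.builtin.assert:
          that: ansible_hostname == expected_hostname   # expected_hostname is a hypothetical variable

      - name: 3. Apply all available CentOS updates
        ansible.builtin.yum:
          name: '*'
          state: latest

      - name: 4. Reboot and wait for the node to return
        ansible.builtin.reboot:

      - name: 5. Install the specified system programs
        ansible.builtin.yum:
          name: "{{ system_packages }}"        # list defined per node type
          state: present

      - name: 6. Reboot again
        ansible.builtin.reboot:

      - name: 7. Install the Cerberus dependencies
        ansible.builtin.yum:
          name: "{{ cerberus_dependencies }}"  # hypothetical variable
          state: present

      - name: 8. Check for any remaining updates
        ansible.builtin.yum:
          name: '*'
          state: latest

      - name: 9. Echo success
        ansible.builtin.debug:
          msg: "{{ inventory_hostname }} provisioning complete"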

This flow makes sure that each round of system updates has a full cycle to settle and ensure system stability before moving on to the next round. It is important to note that this playbook runs after a full system run script that correctly sets the user and admin privileges. That script is handled outside of Ansible and is therefore only mentioned in passing here.

Once this process is completed, the system notifies the head admin that the node is ready to be considered production ready. This is handled through the Linux sendmail command and allows all updates to be monitored remotely.
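
As a hedged sketch, this notification could be sent from a final playbook task that shells out to sendmail; the recipient address and message wording below are assumptions.

  - name: Notify the head admin that the node is production ready
    ansible.builtin.shell: |
      printf "Subject: %s provisioning complete\n\nAll playbook stages finished successfully.\n" \
        "{{ inventory_hostname }}" | sendmail admin@cerberus.example   # hypothetical address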

Services and Other Configuration

In addition to initial provisioning, each machine is configured to be remotely monitored by Zabbix. This means that once a machine is brought up, it can be monitored remotely to check load and configuration details. In addition, a series of scripts was written to automatically inventory and asset-tag the system and add it to the proper asset management system.
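
A hedged sketch of how the Zabbix agent might be installed and pointed at the monitoring server is shown below; the package name, configuration path, and server address are assumptions.

  - name: Configure Zabbix monitoring (illustrative only)
    hosts: all
    become: true
    tasks:
      - name: Install the Zabbix agent
        ansible.builtin.yum:
          name: zabbix-agent
          state: present

      - name: Point the agent at the monitoring server
        ansible.builtin.lineinfile:
          path: /etc/zabbix/zabbix_agentd.conf
          regexp: '^Server='
          line: 'Server=zabbix.cerberus.example'   # hypothetical server address

      - name: Start the agent and enable it at boot
        ansible.builtin.service:
          name: zabbix-agent
          state: started
          enabled: true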

In order to allow full remote control, services such as VNC and remote desktop are configured as well. This allows X server instances to be routed to the host from each machine if required.
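
The sketch below shows one way this remote access could be configured; the VNC package name and the use of SSH X11 forwarding for routing X sessions are assumptions about the setup.

  - name: Configure remote access (illustrative only)
    hosts: all
    become: true
    tasks:
      - name: Install the VNC server package
        ansible.builtin.yum:
          name: tigervnc-server        # assumed CentOS package name
          state: present

      - name: Allow X sessions to be forwarded over SSH
        ansible.builtin.lineinfile:
          path: /etc/ssh/sshd_config
          regexp: '^#?X11Forwarding'
          line: 'X11Forwarding yes'
        notify: restart sshd

    handlers:
      - name: restart sshd
        ansible.builtin.service:
          name: sshd
          state: restarted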

Drive shares and NFS are also set up. Using another set of scripts, the general high-performance shares from the headnode are created and mounted on the node being provisioned. Once this is complete, all Git and other code repositories are present on the machine and the rest of the required provisioning can take place. Once information is uploaded to a share, any other node connected to that share reflects the changes. This is a major component of keeping the cluster synchronized.
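
A hedged sketch of how one of these shares might be mounted with Ansible follows; the export path, mount point, and options are assumptions, and the task assumes the ansible.posix collection is available.

  - name: Mount the headnode share (illustrative paths)
    hosts: all
    become: true
    tasks:
      - name: Mount the high-performance share exported by the headnode
        ansible.posix.mount:
          src: headnode:/export/cerberus   # hypothetical export path
          path: /shares/cerberus           # hypothetical mount point
          fstype: nfs
          opts: rw,hard
          state: mounted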