Best Practices for Monitoring and Measuring Data Center Performance
IT professionals are acutely aware of just how closely their data center infrastructure performance is tied to their business performance in our digitally driven world. Technology consumers – employees, suppliers, customers, and prospects – expect highly available, fast, and responsive interactions from every system they touch. As a result, IT professionals play critical roles in empowering the strategic success and tactical effectiveness of many businesses today. Accordingly, it is vital for IT to know which hardware and software metrics to monitor, and to understand how those metrics relate to each other. This enables IT to continuously optimize the infrastructure that empowers businesses to achieve their goals and objectives.
In addition to knowing what metrics to monitor, cloud administrators often conduct before/after and A/B tests on pre-optimized resources to compare these metrics with metrics from production infrastructure. These tests measure the effectiveness of tuning strategies and performance solutions. In public clouds, it is simple and cost-effective to provision such testing resources.
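A before/after test can be as simple as summarizing the latency samples collected from each run and comparing their averages and tail percentiles. The sketch below is illustrative only – the sample numbers are invented, and real tests would draw thousands of measurements from load-testing tooling:

```python
from statistics import mean, quantiles

def compare_latency(before_ms, after_ms):
    """Summarize a before/after tuning test on two latency samples (in ms)."""
    def p95(sample):
        # quantiles with n=20 yields 19 cut points; the last is the 95th percentile
        return quantiles(sample, n=20)[-1]
    return {
        "mean_before": mean(before_ms), "mean_after": mean(after_ms),
        "p95_before": p95(before_ms),   "p95_after": p95(after_ms),
    }

# Hypothetical samples: baseline vs. tuned infrastructure
baseline = [120, 135, 128, 140, 510, 125, 131, 129, 138, 133]
tuned    = [ 98, 102,  95, 101, 180,  97, 100,  99, 103,  96]
summary = compare_latency(baseline, tuned)
print(summary)
```

Comparing both the mean and a high percentile matters: tuning that improves the average can still leave tail latency (the outlier requests) unchanged.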
The tests and metrics used to monitor the productivity of IT infrastructure generally fall into three categories: quantity measures, quality measures, and responsiveness measures. These groups are applied to every layer of the IT infrastructure stack, from operating systems, CPUs, storage tiers, and networks to the efficiency and effectiveness of application code, computing services, and databases.
- Quantity measures track the amount of work being done by some component of the infrastructure stack. These measures are referred to as “throughput” metrics, and they are usually represented by an absolute number for some unit of time. For an application, throughput is generally measured by the number of concurrent processes managed per minute or second; whereas throughput for a database server is often represented by the number of queries executed per second. For a web server, the number of client requests successfully processed per second is a common measure of throughput.
- Quality measures look at the success or failure of process and application (workload) operations. Success metrics represent the percentage of total work that is processed correctly. Error metrics, in comparison, capture the number of failed or erroneous results. They are commonly expressed as an error rate for some unit of time, or they are normalized by the process’s throughput to yield the number of errors per unit of work.
- Responsiveness measures quantify how efficiently an infrastructure component completes its work – in essence, the speed of an end-to-end operation. Such measures are generally referred to as “latency” metrics, and they are usually expressed as an average or as a percentile of processing time. Latency might measure the time from when a client issues a transaction until it receives a response, or from when a database receives a request until it queues its response. Latency is often reported as the percentage of operations completed within a unit of time, such as “97% returned within 0.3 seconds.”
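All three categories can be computed from the same raw data. Given a hypothetical request log, the sketch below derives a quantity measure (throughput), a quality measure (error rate), and a responsiveness measure (97th-percentile latency) – the records are invented, and `quantiles` on such a small sample is only a rough interpolation:

```python
from statistics import quantiles

# Each record: (timestamp_sec, latency_ms, succeeded) — hypothetical log sample
requests = [
    (0.1, 45, True), (0.4, 52, True), (0.9, 300, False),
    (1.2, 48, True), (1.5, 61, True), (1.8, 47, True),
    (2.2, 55, True), (2.6, 49, False), (2.9, 50, True), (3.0, 46, True),
]

window_sec = requests[-1][0] - requests[0][0]             # observation window
throughput = len(requests) / window_sec                   # quantity: requests/sec
error_rate = sum(1 for *_, ok in requests if not ok) / len(requests)  # quality
latencies = sorted(lat for _, lat, _ in requests)
p97 = quantiles(latencies, n=100)[96]                     # responsiveness: 97th pct

print(f"throughput={throughput:.1f} req/s, "
      f"error_rate={error_rate:.0%}, p97={p97:.0f} ms")
```

Note how the one failed 300 ms request drags the tail percentile far above the median – exactly the kind of signal an average alone would hide.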
The challenge in monitoring these metrics is that the performance of multiple infrastructure components is interrelated. Network capacity and speed, the number and power of CPU cores, the efficiency of application code, the level of contention for shared computing resources, and the various configurations of hypervisors, databases, and other computing services can all affect performance. As a result, focusing on just one layer of the data center infrastructure stack, without considering its multi-dimensional impact on the others, can negate the effectiveness of performance solutions and tuning strategies. Accordingly, multiple metrics are monitored from each group.
Therefore, it is very helpful to use application and system monitoring tools to stay ahead of potential issues. These tools provide alerts to application and hardware problems, often before they are noticed by end users. Lists of various monitoring tools can be found here and here.
So, what are these tools measuring and monitoring?
As you know, computer systems have several types of physical resources – CPU, volatile memory, network, and persistent storage – which collectively affect both data center and application performance. And it is the level of application performance that determines how the data center is judged in achieving its strategic performance goals and objectives: a data center with low operating costs and efficient power usage is still considered a failure if it cannot protect its data or meet its applications’ quantity, quality, and responsiveness targets.
Consequently, monitoring tools continually measure the data center’s:
- Most demanding workloads (applications and processes) that are impacting physical resources
- Physical resource availability to run additional workloads (measures of densities and productivity)
- Current and historic usage patterns of file systems by their various applications and processes
- Current and historic CPU, memory, storage, and network IO usage of various workloads
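The distinction between current and historic usage in the list above amounts to keeping two aggregates per resource: a short sliding window and a long-run average. A minimal sketch of that idea (the `UsageTracker` class is a hypothetical helper – production tools persist samples in a time-series store instead):

```python
from collections import deque

class UsageTracker:
    """Track current vs. historic utilization of one resource (illustrative)."""
    def __init__(self, window=5):
        self.samples = deque(maxlen=window)   # only the most recent samples
        self.total, self.count = 0.0, 0       # lifetime (historic) aggregate

    def record(self, pct):
        self.samples.append(pct)
        self.total += pct
        self.count += 1

    def current(self):                        # recent moving average
        return sum(self.samples) / len(self.samples)

    def historic(self):                       # long-run average
        return self.total / self.count

cpu = UsageTracker(window=3)
for pct in [20, 25, 22, 80, 85, 90]:          # usage spikes at the end
    cpu.record(pct)
print(f"current={cpu.current():.1f}%, historic={cpu.historic():.1f}%")
```

A current reading far above the historic baseline is the usage-pattern anomaly that triggers an alert, even when the absolute number is below a fixed threshold.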
The objective is to uncover any impediments to the efficient AND effective utilization of various physical infrastructure resources in the data center. Monitoring tools look for specific workloads that are:
- CPU bound – meaning that workloads are blocked because CPUs are running at capacity
- IO bound – meaning that workloads are blocked by network and/or storage bandwidth limitations
- Latency bound – meaning that workloads sit waiting for resources to respond
Furthermore, good monitoring tools also measure “load average.” Load average indicates whether a physical server is idle, in full use, highly loaded, or unusable due to overwhelming workloads. On Linux systems, it is derived from run-queue utilization averaged over time; the run-queue lists processes waiting for resources to become available. The best monitoring tools identify which processes are in the run-queue and what they are waiting for. Note that idling servers can signal data center performance problems just as much as highly loaded and unusable servers: idling can be symptomatic of network saturation, poor load balancing, and thread locks or deadlocks.
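Because load average counts runnable processes, it is usually interpreted relative to the number of CPU cores. The sketch below maps a 1-minute load average onto the coarse server states described above; the cut-off values are illustrative assumptions, not fixed rules:

```python
import os

def load_state(load_avg, ncpu):
    """Map a 1-minute load average to a coarse server state.
    Thresholds are illustrative; tune them for your fleet."""
    per_core = load_avg / ncpu
    if per_core < 0.05:
        return "idle"                 # may itself signal a problem upstream
    if per_core <= 1.0:
        return "in full use" if per_core > 0.7 else "lightly loaded"
    return "overloaded" if per_core > 2.0 else "highly loaded"

# On Linux/macOS, the live numbers come from os.getloadavg():
# one_min, five_min, fifteen_min = os.getloadavg()
print(load_state(2.0, ncpu=8))    # lightly loaded
print(load_state(16.5, ncpu=8))   # overloaded
```

A per-core load near 1.0 means the run-queue is roughly keeping pace with the cores; sustained values well above that mean processes are queueing faster than they can be scheduled.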
Monitoring tools can only go so far, however. Application troubleshooting and profiling tools need to be used to help identify causes of performance problems. As an example, a profiling tool, like JProfiler, can check for Java methods that use lots of CPU resources, and it can determine how much time a Java application is spending on Garbage Collection. Some tools also provide the details of transactions within an application server, pinpointing, for instance, the SQL queries that are taking too much time to execute; or identifying which methods in a Java class are slowing down applications.
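For Python workloads, the standard library’s cProfile offers the same kind of method-level visibility the paragraph above describes for Java. A minimal sketch (the `busy_work` function is an invented stand-in for a slow code path):

```python
import cProfile
import io
import pstats

def busy_work():
    # Deliberately CPU-heavy function so it stands out in the profile
    return sum(i * i for i in range(200_000))

profiler = cProfile.Profile()
profiler.enable()
busy_work()
profiler.disable()

# Render the top functions by cumulative time into a string report
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The report attributes time to individual functions, which is exactly how a profiler pinpoints the methods, queries, or garbage-collection pauses that are slowing an application down.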
Once problem processes and applications are identified, it’s time to dig into these workloads to determine exactly how they are negatively impacting performance so fixes can be made. A list of common problems and potential solutions can be found here. Additionally, some good, quick tuning strategies can be found here if short-term, temporary repairs will suffice while long-term solutions are developed and deployed.
As noted, numerous factors impact data center performance. Therefore, IT organizations are constantly looking for ways to proactively identify and respond to problems. Accordingly, knowing what to monitor and understanding how to improve the data center’s performance is critical, because in today’s world, IT’s effectiveness is measured by how much it empowers the strategic and tactical success of the businesses it supports.