
In real-world production environments, people care greatly about the online serving performance of models (e.g., latency, throughput), the degree of hardware utilization (e.g., compute utilization, memory usage), and how friendly the system is to business logic developers (e.g., system API design).

Therefore, optimizing inference before model deployment has recently become a hot research area. This mainly includes pruning, quantization, and various compilation optimizations, all of which are applied to the model itself. However, the performance of the serving system is not determined by the model alone: it is also affected by the deployment framework, the hardware, and the business logic. It is therefore necessary to conduct comprehensive performance testing on the entire system to obtain a more scientific understanding of its performance.

The Typical Model Serving Workflow

Let’s take a look at the typical model serving workflow with a client-server architecture:

The model checkpoints/weights are usually stored in a model repository, and the model serving framework loads the model from the repository to the target hardware, for example from the local disk to CPU memory, or from CPU memory into GPU memory. Once the serving framework is initialized and the model is loaded, the framework is responsible for managing the model's life cycle, including managing the inference workers/jobs and the hardware resources, and providing an API for the client to send requests.

When the serving servers are ready, we can use the client to send inference requests. The client side may include its own request processing and batching. Putting data processing on the client side usually requires more processing time, since the client is typically a mobile device or a web browser with limited computing resources, but in terms of protecting user privacy it is a good practice.

When the server receives a request, it puts the request into a queue and waits for an inference worker to process it. The inference worker loads the model from memory and executes the model inference. After the inference is done, the worker sends the result back to the client. Some serving systems (e.g., Nvidia Triton Inference Server, Tensorflow-serving) support dynamic batching, which introduces a batching step on the server side.
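
The dynamic batching step can be pictured as a small queue-draining loop on the server side. Below is a minimal, framework-agnostic sketch (not how Triton or Tensorflow-serving actually implement it); run_model, req.data, and req.reply are hypothetical placeholders:

import queue
import time

def dynamic_batching_loop(request_queue, run_model, max_batch_size=8, max_wait_ms=5):
    """Drain the request queue into batches: flush when the batch is full or
    when the oldest request has waited longer than max_wait_ms."""
    while True:
        batch = [request_queue.get()]                  # block until at least one request arrives
        deadline = time.time() + max_wait_ms / 1000.0
        while len(batch) < max_batch_size:
            remaining = deadline - time.time()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        results = run_model([req.data for req in batch])   # one forward pass for the whole batch
        for req, result in zip(batch, results):
            req.reply(result)                              # hypothetical: send each result back to its client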

Metrics for Model Serving

Now that we understand the typical model serving workflow, let's talk about the metrics for model serving. The metrics used to measure performance fall into two categories: static metrics and dynamic metrics.

Static Metrics

Static metrics are metrics that can be calculated theoretically or from the model and hardware specifications. These metrics include:

  • Model Tasks: Classification, Detection, Segmentation, etc.
  • Model complexity: FLOPs (time), memory footprint (space)
  • Hardware capability: FLOPS (theoretical compute capability), memory, bandwidth

Dynamic Metrics

Dynamic metrics are metrics that can only be obtained through real-world performance profiling. These metrics include:

For model deployment framework (software):

  • Latency (P50, P95, P99) (ms)
  • Throughput (varying batch sizes) (req/sec)

For model acceleration hardware:

  • Average hardware utilization (%)
  • Average memory utilization (%)
  • Energy consumption per inference (J)
  • Carbon emissions per inference (mg)
  • Cold start time (sec)
  • System startup time (sec)

For the model service pipeline:

  • Pre-processing latency (ms)
  • Post-processing latency (ms)
  • Transmission time (ms)

The above metrics mainly target serving AI models on the cloud. For mobile (edge) devices, inference systems pay more attention to model storage occupancy and battery consumption, so energy consumption and hardware utilization during inference should be the focus. Different systems require different analyses.

There are many excellent works for measuring deep learning model serving performance, including NVIDIA Model Analyzer, MLPerf, and AIBench.

Obtaining Dynamic Metrics

To obtain dynamic metrics, we need to design a profiling tool that can simulate real-world scenarios. Based on the serving pipeline we mentioned above, we can also draw a simple diagram for it:

We need to collect the dynamic metrics from both the client and server sides. The server-side metrics are more fine-grained than the client-side metrics, while end-to-end metrics have to be obtained from the client. Some other metrics need to be aggregated from both sides:

  • Latency percentiles (P50, P95, P99)
  • Tail latency/Distribution
  • Throughput (varying batch sizes)
  • Power consumption per inference
  • Carbon emissions per inference
  • Cloud cost per inference (if deployed on public cloud like AWS, GCP, Azure)

To compute the above metrics, the client needs to store the timestamps when sending the request and receiving the response for each request. The server side needs to store the timestamps when receiving the request, processing the request, and sending the response, as well as when loading the model, initializing the model, and executing the model inference.
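
A minimal sketch of the client-side bookkeeping, assuming a placeholder blocking send_request call and payload:

import time
import numpy as np

def profile_client(send_request, payload, num_requests=1000):
    """Record send/receive timestamps for every request and derive latency percentiles."""
    records = []
    for _ in range(num_requests):
        t_send = time.time()
        send_request(payload)                  # placeholder: blocking call to the serving endpoint
        t_recv = time.time()
        records.append((t_send, t_recv))

    latencies = np.array([recv - send for send, recv in records])
    p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
    throughput = len(records) / (records[-1][1] - records[0][0])   # completed requests per second
    return p50, p95, p99, throughput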

Throughput can be simply calculated from the average latency of all the requests:

\[Throughput = \frac{1}{\text{Average Latency}}\]

If the requests are batched, the throughput can be calculated by the average latency of all the batches:

\[Throughput = \frac{\text{Batch Size}}{\text{Average Latency of Batches}}\]
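
For example, if a batch of 8 requests takes 50 ms on average, the throughput is roughly \(8 / 0.05 = 160\) requests per second.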

For power consumption and carbon emissions, we need to check the device power meter and combine the reading with the average serving latency. The carbon emissions can then be calculated from the power consumption and the carbon intensity of the electricity; here we recommend the open-source tool Carbon Tracker.
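
If you only have an average power reading from the device power meter, the underlying arithmetic is simple; the following sketch is a back-of-the-envelope estimate rather than the Carbon Tracker tooling:

def energy_and_carbon_per_inference(avg_power_watts, avg_latency_sec, carbon_intensity_g_per_kwh):
    """Estimate energy (J) and carbon emissions (mg) for a single inference."""
    energy_joules = avg_power_watts * avg_latency_sec            # 1 W = 1 J/s
    energy_kwh = energy_joules / 3.6e6                           # 1 kWh = 3.6e6 J
    carbon_mg = energy_kwh * carbon_intensity_g_per_kwh * 1000   # grams -> milligrams
    return energy_joules, carbon_mg

# Example: 250 W average draw, 20 ms per inference, 400 gCO2/kWh grid intensity
print(energy_and_carbon_per_inference(250, 0.02, 400))   # roughly 5 J and 0.56 mg per inference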

Simulating Real-world Workloads

A lot of people cannot distinguish between monitoring and performance profiling. The key difference is that monitoring is a passive process, while performance profiling is an active process with generated workloads. So how do we generate testing workloads that simulate real-world scenarios and gather insights for both software and hardware optimizations?

Here are five different ways to generate the inference workload:

  • High workload in a short time, testing the system’s robustness
  • High workload over a continuous period, examining the system’s tail latency
  • Blocking-style (possibly with multiple concurrencies) fixed-quantity requests, observing hardware utilization and model performance
  • Long-term system testing based on workload generated from traces (e.g., Poisson distribution generation)
  • Utilization-aware workload generation, examining the hardware capability

Block Requests Sending

The term “block requests sending” means sending the next request only after the previous request has been processed. Here is an example Python snippet for block requests sending with 1000 requests:

import time

# `send_request` and `data` are placeholders for the actual client call and payload.
for i in range(1000):
    start = time.time()
    response = send_request(data)   # blocking call: returns only after the response arrives
    end = time.time()
    print(f"Request {i} latency: {end - start:.4f} s")

We can also send requests within a fixed time window, for example, sending 1000 requests within 10 seconds.
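
One way to implement the fixed-time-window variant is to keep issuing blocking requests until the window closes; a sketch, with the same placeholder send_request and data as above:

import time

def send_for_duration(duration_sec=10):
    """Keep sending blocking requests until the time window closes."""
    latencies = []
    window_end = time.time() + duration_sec
    while time.time() < window_end:
        start = time.time()
        response = send_request(data)          # same placeholder blocking call as above
        latencies.append(time.time() - start)
    return latencies                           # len(latencies) / duration_sec approximates throughput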

For different serving frameworks and hardware, block sending only gives insights when comparing different settings. For example, with an Nvidia V100 GPU and an Nvidia T4 GPU, we can compare latency and throughput under block sending. However, it fails to simulate real-world scenarios, which usually involve multiple concurrent clients.

Multiple Concurrencies Requests Sending

Instead of using a single client to send requests, we can use multiple clients that each send requests in a blocking way, which increases the utilization of the server-side resources. Here is an example Python snippet for sending requests with multiple concurrencies:

import multiprocessing

# `send_request` and `data` are placeholders for the actual client call and payload.
def worker():
    for i in range(1000):
        response = send_request(data)   # blocking call inside each client process

if __name__ == '__main__':
    processes = []
    for i in range(10):
        p = multiprocessing.Process(target=worker)
        processes.append(p)
        p.start()

    for p in processes:
        p.join()

The above code creates 10 processes, and each process sends 1000 requests, so the server receives 10000 requests in total. This simulates real-world scenarios with multiple concurrent clients. One thing that needs to be considered carefully in this implementation is how time is measured in each process, since per-process CPU time is not the same as wall-clock time. You also need a callback or shared data structure to trace the latency of each request under different concurrency levels, as shown in the sketch below.
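
A minimal sketch that uses a shared multiprocessing.Queue to collect per-request wall-clock latencies from every client process (send_request and data remain placeholders):

import multiprocessing
import time

def worker(worker_id, num_requests, result_queue):
    for i in range(num_requests):
        start = time.time()                      # wall-clock time, not per-process CPU time
        response = send_request(data)            # placeholder blocking call
        result_queue.put((worker_id, i, time.time() - start))

if __name__ == '__main__':
    result_queue = multiprocessing.Queue()
    processes = [multiprocessing.Process(target=worker, args=(w, 1000, result_queue))
                 for w in range(10)]
    for p in processes:
        p.start()

    # Drain results before joining so the children never block on a full queue buffer.
    latencies = [result_queue.get()[2] for _ in range(10 * 1000)]

    for p in processes:
        p.join()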

The advantage of multi-concurrency request sending is that you can increase GPU and CPU utilization by tuning the number of concurrent clients. However, if you want to measure the peak throughput of a specific piece of hardware at 100% utilization, it is hard to pick the right concurrency number.

And to test the robustness of the system, you can send all the requests in a short time using asynchronous sending. Here is an example Python snippet for asynchronous (burst) request sending:

import time

# `async_send_request` and `data` are placeholders for a non-blocking client call
# and its payload; the response arrives later in a callback (see below).
def burst_send():
    for i in range(1000):
        send_time = time.time()
        response = async_send_request(data)   # returns immediately, does not wait for the reply

You need to record the receive time of the response in the asynchronous callback function. With this “burst” sending, you can test the system’s robustness and its tail latency, since the request queue fills up in a short time.
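
For a concrete burst implementation, one option is asyncio with aiohttp; the endpoint URL and payload below are placeholders:

import asyncio
import time
import aiohttp

async def send_one(session, url, payload, latencies):
    start = time.time()
    async with session.post(url, json=payload) as resp:
        await resp.read()                      # response fully received here
    latencies.append(time.time() - start)

async def burst(url, payload, num_requests=1000):
    latencies = []
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(send_one(session, url, payload, latencies)
                               for _ in range(num_requests)))
    return latencies

# Example (placeholder endpoint):
# latencies = asyncio.run(burst("http://localhost:8000/predict", {"inputs": [0.0] * 224}))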

Poisson Distribution Generation

The Poisson distribution is a probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space. It can be used to simulate real-world scenarios, since real requests do not arrive at fixed time intervals. Here is an example Python snippet for generating Poisson-distributed request arrival times:

import random

def gen_arrival_time(duration=60 * 1, arrival_rate=5, seed=None):
    """Generate request arrival times (in seconds) following a Poisson arrival process."""
    start_time = 0
    arrive_time = []

    if seed is not None:
        random.seed(seed)

    # Exponentially distributed inter-arrival times yield a Poisson process
    # with `arrival_rate` requests per second on average.
    while start_time < duration:
        start_time = start_time + random.expovariate(arrival_rate)
        arrive_time.append(start_time)

    return arrive_time

You can set arrival_rate to control the requests’ arrival rate and duration to control the total length of the run. Some companies have their own server workload traces, and you can use those traces to drive the Poisson generation and simulate real-world scenarios.

The above image shows how to use a real-world workload trace to generate a profiling workload (left). The trace data may only include the total number of arrived requests at minute level, so we use the Poisson distribution to generate requests at second level. The right image shows the request latency as a time series, which provides a straightforward way to analyze the system’s performance.
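
Once the arrival times are generated, they can be replayed by sleeping until each scheduled send time; async_send_request and data are placeholders for a non-blocking client call and its payload:

import time

def replay(arrive_time, async_send_request, data):
    """Replay a list of relative arrival times (in seconds) against the serving endpoint."""
    t0 = time.time()
    for t in arrive_time:
        delay = t0 + t - time.time()
        if delay > 0:
            time.sleep(delay)                  # wait until the scheduled arrival time
        async_send_request(data)               # fire without waiting for the response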

Utilization-aware Workload Generation

Tail latency is an important metric for the system’s performance. If you send more requests than the server can handle, many requests will sit in the request queue, resulting in long waiting times and a drop in average throughput. To get an accurate maximum throughput, you need to reduce the queuing time while still utilizing the resources to the maximum, by setting the number of concurrent clients based on the resource utilization.

For example, if you want to test the system’s performance at 80% GPU utilization, you can start with a concurrency of 1 and gradually increase it until the GPU utilization stabilizes around 80%. Although the utilization reading lags behind the newly added clients, which causes the GPU utilization to oscillate, you can still get an accurate average value by running the test for a longer time.

The above images show the GPU utilization oscillating while maintaining a relatively stable average value.
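
A sketch of the concurrency ramp-up loop described above, using pynvml to sample GPU utilization; spawn_client is a hypothetical helper that launches one more blocking client process:

import time
import pynvml

def ramp_to_target_utilization(spawn_client, target_util=80, settle_sec=5):
    """Add one concurrent client at a time until the sampled GPU utilization
    reaches the target value; returns the concurrency level that was needed."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    concurrency = 0
    while pynvml.nvmlDeviceGetUtilizationRates(handle).gpu < target_util:
        spawn_client()                         # hypothetical helper: launch one more client process
        concurrency += 1
        time.sleep(settle_sec)                 # let the utilization settle before sampling again
    return concurrency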

Deep Learning Tasks

Deep learning tasks are also a part of real-world workloads. A good performance testing tool should be able to handle various deep learning tasks, including:

  • Image Classification (ResNet, MobileNet, etc.)
  • Object Detection (YOLO, Faster-RCNN, etc.)
  • Image Generation (StyleGAN, etc.)
  • Text Classification (BERT, etc.)
  • Text Generation (GPT, etc.)
  • Speech Recognition (DeepSpeech, etc.)
  • Recommendation System (Wide&Deep, etc.)

By default, the performance testing tool does not care about the model’s accuracy, so the testing data can be generated randomly. It can be a single data point repeated multiple times, or random data points produced by a data generator. However, for some special inputs, such as an all-zero vector, the model may behave differently. In summary, the data for deep learning tasks mainly includes:

  • Single data point repeated multiple times
  • Random data point generated by the data generator
  • Special data point (e.g., all zeros vector)
  • Real-world dataset (e.g., ImageNet, COCO, etc.)

The above image shows the GPU/CPU speedup for different deep learning tasks. The speedup is calculated as the throughput of the GPU divided by the throughput of the CPU, where OD indicates the Object Detection task, IC the Image Classification task, TC the Text Classification task, and GAN the Image Generation task. Profiling different tasks on various hardware accelerators gives a straightforward understanding of the gains from upgrading hardware.

We are also interested in how different model parameters and settings affect hardware performance, for example the relationship between the number of CNN/MLP layers or Transformer blocks and GPU utilization, or between the batch size and GPU utilization. This requires the profiling tool to be able to generate a set of models with different parameters and settings for a more comprehensive analysis.
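
For instance, a PyTorch sketch that generates MLPs of varying depth for profiling; the layer width, input size, and the use of a CUDA device are arbitrary assumptions:

import torch
import torch.nn as nn

def make_mlp(num_layers, width=1024, input_dim=1024):
    """Build an MLP with a configurable number of hidden layers."""
    layers = [nn.Linear(input_dim, width), nn.ReLU()]
    for _ in range(num_layers - 1):
        layers += [nn.Linear(width, width), nn.ReLU()]
    return nn.Sequential(*layers)

# Sweep model depth and batch size; time the forward pass (and sample GPU
# utilization) inside the real profiling tool.
for num_layers in [2, 8, 32, 128]:
    model = make_mlp(num_layers).eval().cuda()
    for batch_size in [2, 8, 32]:
        x = torch.randn(batch_size, 1024, device="cuda")
        with torch.no_grad():
            model(x)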

Performance Analysis

After gathering the dynamic metrics, we need to analyze the performance data to get insights for both software and hardware optimizations. Here are some common performance analysis methods:

  • KDE (Kernel Density Estimation) and CDF (Cumulative Distribution Function) for latency distribution
  • Heatmap for Different Machine Learning Models
  • Roofline Model Analysis
  • Time Series Analysis

Latency and Throughput Visualization

A good way to analyze latency is to draw a line chart against batch size, where the x-axis is the batch size and the y-axis is the latency, so you can easily see the trade-off between them. Throughput can be calculated as the inverse of latency, and you can draw bar charts of throughput for different batch sizes.

From the right image we can see that some GPUs are more powerful overall but may not be better than other GPUs on certain deep learning tasks; this is highly related to the hardware design and the model’s requirements. For example, transformer-based models are memory bound, meaning they demand memory bandwidth more than raw FLOPS, so a GPU with higher memory bandwidth will perform better on transformer-based models. We will come back to this in the Roofline Model Analysis section.
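
A matplotlib sketch of such latency/throughput charts; the batch sizes and latencies below are hypothetical placeholders, not measurements:

import matplotlib.pyplot as plt

batch_sizes = [1, 2, 4, 8, 16, 32]                     # example settings
avg_latencies_ms = [5.1, 6.0, 8.2, 13.5, 24.8, 47.0]   # hypothetical numbers for illustration only

throughputs = [bs / (lat / 1000.0) for bs, lat in zip(batch_sizes, avg_latencies_ms)]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(batch_sizes, avg_latencies_ms, marker="o")
ax1.set_xlabel("Batch size")
ax1.set_ylabel("Average latency (ms)")
ax2.bar([str(b) for b in batch_sizes], throughputs)
ax2.set_xlabel("Batch size")
ax2.set_ylabel("Throughput (req/sec)")
plt.tight_layout()
plt.show()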

KDE and CDF for Latency Distribution

Kernel Density Estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. The Cumulative Distribution Function (CDF) is the probability that the variable takes a value less than or equal to x.

For latency analysis, KDE is suitable for checking the tail latency of the system, and CDF is suitable for checking the latency distribution (e.g., P50/P95/P99 latency). Here is an example Python snippet for plotting KDE and CDF using seaborn:

import seaborn as sns

# `data` is a 1-D array of per-request latencies collected during profiling.
sns.kdeplot(data, shade=True)  # KDE
sns.ecdfplot(data)             # CDF

The following images show the CDF of the latency distribution under different request-sending settings and different serving frameworks. The dashed lines mark the P50/P95/P99 latency, where smaller values indicate better performance against the system’s SLO.

The shape of the CDF curve also gives an intuitive view of the system’s performance: the steeper the curve, the smaller the average latency and the better the performance.

The KDE plots make it easier to spot the “tails” of the latency distribution. Since the area under each KDE curve is 1, the higher the peak, the fewer requests fall in the tail. Some serving systems achieve very high throughput but also high tail latency, which is not acceptable for some online applications: tail samples destroy the user experience.
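
The percentile markers can be computed with numpy and overlaid on the CDF plot, reusing the data array from the snippet above:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.ecdfplot(data)                             # CDF of the latency samples, as above
for p in (50, 95, 99):
    plt.axvline(np.percentile(data, p), linestyle="--", label=f"P{p}")
plt.legend()
plt.show()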

Heatmap for Different Machine Learning Models

We have talked about deep learning tasks, which are also a part of real-world workloads. A heatmap can give an intuitive view of the hardware performance across different machine learning models. Here is an example Python snippet for plotting the heatmap:

import seaborn as sns

sns.heatmap(data, annot=True, fmt=".2f")

The profiling tool generates serving models with different settings. For example, the number of MLP layers ranges from 2 to 2048 and the serving batch size from 2 to 32. The values inside the heatmap cells are the average GPU utilization, with deeper colors indicating higher utilization. The heatmap gives an intuitive understanding of the relationship between model parameters and hardware performance, and can be assembled as shown below.
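
A sketch using pandas to pivot raw profiling records into the layers-by-batch-size matrix; the record field names are assumptions:

import pandas as pd
import seaborn as sns

# `records` holds one dict per profiling run, e.g.
# {"num_layers": 8, "batch_size": 16, "gpu_util": 73.2}; field names are assumptions.
df = pd.DataFrame(records)
matrix = df.pivot_table(index="num_layers", columns="batch_size",
                        values="gpu_util", aggfunc="mean")
sns.heatmap(matrix, annot=True, fmt=".2f")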

The images from left to right are MLP, CNN, and RNN models with different batch sizes. We can easily assess the compatibility of each model with the GPU by checking the heatmap. The MLP performance is not affected much by the batch size, with GPU utilization increasing roughly linearly with the number of layers. The CNN performance is very sensitive to the batch size and utilizes the GPU well, with overall high GPU utilization. The RNN has poor compatibility with the GPU, with low GPU utilization that is barely affected by the batch size.

The ability to generate test models with a set of different parameters and settings is really important for seeing whether a piece of hardware is suitable for a specific model. Nowadays there is an increasing number of specialized accelerators tailored for certain deep learning tasks, and such analysis can provide insights for hardware design.

Roofline Model Analysis

The Roofline Model is an intuitive visual performance model used to provide performance estimates of a given compute kernel or application running on multi-core, many-core, or accelerator processor architectures, by showing the inherent hardware limitations and the potential benefit and priority of optimizations. It can be drawn as a graph with operational intensity (FLOPs/Byte) on the x-axis and performance (FLOPs/sec) on the y-axis, without running any performance profiling: the “roofline” is the theoretical maximum performance of the hardware.

To compute the roofline, we need to know the hardware’s theoretical compute capability (FLOPs/sec) and memory bandwidth (Bytes/sec), and the model’s FLOPs and memory footprint. The maximum operational intensity is computed by:

\[I_{max} = \frac{\text{max device FLOPS}}{\text{device bandwidth}}\]

In order to get the target model’s operational intensity and attainable performance, we need to get the total accessible memory and total FLOPs for a single inference:

\[I_{m} = \frac{\text{FLOPs}}{Mem_{kernel} + Mem_{output}} = \frac{\text{FLOPs}}{Mem_{total}}\]
\[Mem_{kernel} = (\text{Parameters} \times 4) + Mem_{output}\]

Where \(Mem_{kernel}\) is the total memory footprint of the model, \(Mem_{output}\) is the memory footprint of the output tensor, Parameters is the total number of parameters in the model, and 4 is the size in bytes of a float32 value.

And for the attainable performance:

\[P_{m} = \frac{\text{FLOPs}}{\text{Latency}} = \text{Throughput} \times \text{FLOPs} \quad (\text{FLOPs/sec})\]

The throughput can be obtained from the performance testing, and then we can place the model’s measured performance on the roofline graph.

The left image shows the roofline model, where the sloped line on the left is the memory bandwidth limitation (memory bound) and the flat line on the right is the compute (FLOPS) limitation (compute bound). Ideally, the model inference performance should lie on the roofline: if the serving system really utilizes the hardware, the measured performance should be close to it. The right image shows the roofline model with different deep learning tasks; you will find that none of them reach the roofline, which means the hardware is not fully utilized or some of the computation/memory traffic is wasted.
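
Putting the formulas above together, a minimal roofline sketch with matplotlib; the hardware and model numbers below are placeholders, not measurements:

import numpy as np
import matplotlib.pyplot as plt

peak_flops = 125e12          # placeholder: theoretical peak compute (FLOPs/sec)
bandwidth = 900e9            # placeholder: memory bandwidth (Bytes/sec)

intensity = np.logspace(-1, 4, 200)                     # operational intensity (FLOPs/Byte)
attainable = np.minimum(peak_flops, intensity * bandwidth)
plt.loglog(intensity, attainable, label="roofline")

# One measured model point: I_m = FLOPs / total memory, P_m = FLOPs / latency
model_flops, model_mem_bytes, model_latency = 8e9, 100e6, 5e-3   # placeholders, not measurements
plt.scatter(model_flops / model_mem_bytes, model_flops / model_latency, label="model")
plt.xlabel("Operational intensity (FLOPs/Byte)")
plt.ylabel("Attainable performance (FLOPs/sec)")
plt.legend()
plt.show()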

Pipeline Analysis

Since we are profiling end-to-end model serving performance, we also need to analyze the performance of the entire pipeline, which includes the pre-processing latency, the model inference latency, and the post-processing latency. The fine-grained pipeline metrics are collected on the server side, which requires the inference framework to provide an API for collecting them.

The above images show the latency with different serving batch sizes: the left one shows the breakdown of the pipeline latency, and the right one shows the network transmission time. Fine-grained pipeline analysis is really important for finding where the bottleneck of the system is.

Cold start times also differ across models. This metric matters when the system is deployed on the cloud or in a serverless manner. We measured the cold start time as the interval from when the container starts to launch until the “server ready” signal is received, polled in a while-true loop. The image below shows the cold start time of different models on Tensorflow-serving and Nvidia Triton Inference Server.

As the size of the served model increases, the cold start time also increases, which becomes a factor that cannot be ignored in real-world scenarios.
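
A sketch of how the cold start measurement can be implemented; the launch command and readiness endpoint are assumptions that depend on the serving framework:

import subprocess
import time
import requests

def measure_cold_start(launch_cmd, health_url, poll_interval_sec=0.05, timeout_sec=300):
    """Time from container launch until the server answers its readiness endpoint."""
    start = time.time()
    subprocess.Popen(launch_cmd)               # e.g. ["docker", "run", ...] for the serving image
    while time.time() - start < timeout_sec:
        try:
            if requests.get(health_url, timeout=1).status_code == 200:
                return time.time() - start     # cold start time in seconds
        except requests.exceptions.ConnectionError:
            pass                               # server not up yet, keep polling
        time.sleep(poll_interval_sec)
    raise TimeoutError("server did not become ready in time")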

Analysis on Optimizations

For model serving frameworks, there are many optimizations that can be applied to improve system performance, for example dynamic batching and pipeline parallelism. If we want to dive deeper, we can modify the profiling settings to measure the performance gain of each optimization.

Below is a performance comparison of dynamic batching with different batch sizes on Tensorflow-serving and Nvidia Triton Inference Server. The throughput is calculated from the average latency of the requests.

We found that as the number of concurrent clients increases, the throughput gain of Nvidia Triton Inference Server is more stable and close to linear, while Tensorflow-serving shows a more fluctuating throughput gain. This indicates that the dynamic batch size algorithms differ; Nvidia Triton Inference Server may use a gradually increasing batch size.

Of course, there are many optimizations other than dynamic batching. A good performance profiling tool should provide a set of different settings for these optimizations and give a comprehensive analysis of the performance gain.

A few things on my mind:

  • A clear trend is that deep learning models are getting larger and larger, which brings challenges for hardware/software design and for performance profiling. Hardware design will become specialized for certain deep learning tasks (e.g., language models), and profiling should be fine-grained enough to provide insights.
  • Modern AI systems/applications are not single-model serving but a combination of multiple models and system tools. Pipeline analysis is really important for locating the bottleneck of the system.
