This folder contains a work-in-progress simulation of a Python inference server.
The v0 version has a backend worker that is a single process. It loads a ResNet-18 checkpoint to 'cuda:0' and compiles the model. It accepts requests in the form of (tensor, request_time) from a multiprocessing.Queue, runs inference on the request, and returns (output, request_time) in a separate response multiprocessing.Queue.
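The request/response loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `fake_model`, `backend_worker`, `main`, and the `None` shutdown sentinel are hypothetical names that do not come from the actual script, and a plain Python function stands in for the compiled ResNet-18 so the sketch runs without a GPU.

```python
import multiprocessing as mp
import time


def fake_model(batch):
    # Hypothetical stand-in for the compiled ResNet-18: doubles every element.
    return [2 * x for x in batch]


def backend_worker(request_queue, response_queue, model=fake_model):
    """Serve (data, request_time) requests until a None sentinel arrives."""
    while True:
        item = request_queue.get()
        if item is None:  # assumed shutdown signal, not from the original script
            break
        data, request_time = item
        # Echo request_time back so the frontend can compute latency.
        response_queue.put((model(data), request_time))


def main():
    # One queue per direction, as in the description above.
    request_queue, response_queue = mp.Queue(), mp.Queue()
    worker = mp.Process(target=backend_worker, args=(request_queue, response_queue))
    worker.start()
    request_queue.put(([1.0, 2.0, 3.0], time.time()))
    output, request_time = response_queue.get()
    print(output, f"latency={time.time() - request_time:.4f}s")
    request_queue.put(None)
    worker.join()
```

Calling `main()` from under an `if __name__ == "__main__":` guard runs the worker in a separate process. Passing `request_time` through unchanged is what lets the receiving side measure end-to-end latency without any shared state.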
The frontend worker is a process with three threads.
For now we omit data preprocessing as well as result post-processing.
The togglable command line arguments to the script are as follows:
- num_iters (default: 100): how many requests to send to the backend, excluding the first warmup request.
- batch_size (default: 32): the batch size of the requests.
- model_dir (default: '.'): the directory to load the checkpoint from.
- compile (default: compile), or --no-compile: whether to torch.compile() the model.
- output_file (default: output.csv): the name of the csv file to write the outputs to in the results/ directory.
- num_workers (default: 2): the maximum number of threads passed to the ThreadPoolExecutor in charge of model prediction.

For example, a sample command to run the benchmark:
python -W ignore server.py --num_iters 1000 --batch_size 32
The results will be found in results/output.csv, which will be appended to if the file already exists.
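For orientation, the arguments and defaults listed above could be declared with `argparse` along the following lines. This is a sketch, not the script's actual parser: `build_parser` is a hypothetical name, and using `argparse.BooleanOptionalAction` (Python 3.9+) to get the `--compile` / `--no-compile` pair is an assumption about how the flag might be implemented.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(description="Inference server benchmark (sketch)")
    parser.add_argument("--num_iters", type=int, default=100)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--model_dir", type=str, default=".")
    # BooleanOptionalAction generates both --compile and --no-compile.
    parser.add_argument("--compile", default=True, action=argparse.BooleanOptionalAction)
    parser.add_argument("--output_file", type=str, default="output.csv")
    parser.add_argument("--num_workers", type=int, default=2)
    return parser
```

With this sketch, `build_parser().parse_args(["--num_iters", "1000", "--batch_size", "32"])` leaves every other option at its documented default.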
Note that the m.compile() time in the csv file is not the time for the model to be compiled, which happens during the first iteration, but rather the time for PT2 components to be lazily imported (e.g. triton).
The script runner.sh will run a sweep of the benchmark over different batch sizes with compile on and off, and collect the mean and standard deviation of warmup latency, average latency, throughput and GPU utilization for each. The results/ directory will contain the metrics from running a sweep as we develop this benchmark, where results/output_{batch_size}_{compile}.md will contain the mean and standard deviation of results for a given batch size and compile setting. If the file already exists, the metrics from the run will be appended as a new row in the markdown table.
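The append-a-row behavior described above might look roughly like the sketch below. It is not taken from runner.sh: `append_summary_row`, the column names, and the "mean +/- std" cell format are all assumptions made for illustration.

```python
import os
import statistics


def append_summary_row(path, metrics):
    """Append one 'mean +/- std' row for metrics (name -> list of samples).

    Each list needs at least two samples, since statistics.stdev
    raises StatisticsError otherwise.
    """
    names = list(metrics)
    write_header = not os.path.exists(path)
    with open(path, "a") as f:
        if write_header:
            # A new file gets the markdown table header first.
            f.write("| " + " | ".join(names) + " |\n")
            f.write("| " + " | ".join("---" for _ in names) + " |\n")
        cells = [
            f"{statistics.mean(v):.3f} +/- {statistics.stdev(v):.3f}"
            for v in metrics.values()
        ]
        f.write("| " + " | ".join(cells) + " |\n")
```

Opening the file in append mode and writing the header only when the file is new keeps earlier sweep results intact, so each run adds exactly one row.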