INT8 Inference of Quantization-Aware trained models using ONNX-TensorRT

Description

Accelerating deep neural network (DNN) inference is an important step in realizing latency-critical deployments of real-world applications such as image classification, image segmentation, and natural language processing. The need to reduce DNN inference latency has sparked interest in running these models at lower precisions, such as FP16 and INT8. In particular, running DNNs in INT8 precision can offer faster inference and a much lower memory footprint than their floating-point counterparts. NVIDIA TensorRT supports Quantization-Aware Training (QAT) techniques for converting floating-point DNN models to INT8 precision. In this talk, we demonstrate an end-to-end workflow for converting TensorFlow QAT models into ONNX, a standard intermediate representation, and deploying them with TensorRT. We use the tf2onnx package to convert a quantized TensorFlow model into ONNX. The ONNX format also makes it easier to visualize graphs with Netron, which shows users where the quantized nodes are placed in the graph.
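A minimal sketch of the two steps described above, assuming a Keras model qat_model that was trained with quantization-aware training, the tf2onnx Python API, and the TensorRT 8.x Python API; the file names, opset version, and input signature are illustrative, not prescribed by the talk:

    # Step 1: export the quantization-aware trained Keras model to ONNX with tf2onnx.
    import tensorflow as tf
    import tf2onnx

    spec = (tf.TensorSpec((1, 224, 224, 3), tf.float32, name="input"),)  # illustrative input shape
    tf2onnx.convert.from_keras(qat_model, input_signature=spec, opset=13,
                               output_path="qat_model.onnx")

    # Step 2: parse the ONNX file with TensorRT and build an INT8 engine.
    # The Q/DQ nodes exported from QAT carry the quantization scales, so no
    # post-training calibration cache is required here.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.INFO)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)
    with open("qat_model.onnx", "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.INT8)  # allow INT8 kernels during engine build
    engine = builder.build_serialized_network(network, config)
    with open("qat_model.engine", "wb") as f:
        f.write(engine)

The resulting qat_model.onnx can also be opened in Netron to inspect where the QuantizeLinear/DequantizeLinear nodes were placed in the graph.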

Details