3 minute read

TensorRT is NVIDIA's inference framework for deploying deep learning models on NVIDIA GPUs. It is optimized specifically for NVIDIA hardware and supports all the mainstream frameworks (TensorFlow, Caffe, PyTorch, MXNet, CNTK, etc.) as well as ONNX.
Official site: https://developer.nvidia.com/tensorrt
ONNX TensorRT: https://github.com/onnx/onnx-tensorrt
Documentation archive: https://docs.nvidia.com/deeplearning/sdk/tensorrt-archived/index.html
Python API: https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/
Supported languages: C++ (the core) and Python

0 Background

CNN efficiency has long been a major concern, and the two main approaches are pruning and quantization. TensorRT takes the quantization route: FP32 weights are reduced to FP16 or INT8 without a noticeable drop in inference accuracy (a minimal sketch of how these modes are enabled follows the list below).

  1. It only supports inference, not training;
  2. Under the hood it is optimized for NVIDIA GPUs in many ways beyond quantization, and it can be combined with the CUDA CODEC SDK, i.e. the separate DeepStream SDK;
  3. It is independent of the deep learning frameworks: models are imported by parsing their model files, so no extra DL libraries need to be installed.
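
How the reduced-precision modes are switched on, as a minimal sketch against the TensorRT 4-era C++ builder API used in Appendix C (on other versions the calls differ, e.g. the older setHalf2Mode or the newer IBuilderConfig flags; INT8 additionally requires a calibrator that feeds representative input batches):

// builder is the IBuilder* created with createInferBuilder(gLogger), as in Appendix C
if (builder->platformHasFastFp16())
    builder->setFp16Mode(true);              // build FP16 kernels where the hardware supports them

if (builder->platformHasFastInt8())
{
    builder->setInt8Mode(true);              // enable INT8 kernels
    builder->setInt8Calibrator(calibrator);  // calibrator: an IInt8Calibrator* supplying calibration data
}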

1 Installation

1.1 Linux environment

Solutions to common installation errors are collected in Appendix A.
1) Preparation: download the TensorRT local-repo .deb package matching your CUDA version from the official site.

2) Installation

sudo dpkg -i nv-tensorrt*.deb
sudo apt-get update
sudo apt-get install tensorrt

3) Check that the installation succeeded
dpkg -l | grep TensorRT
If the command lists the TensorRT packages, the installation succeeded.

4) Uninstall: see the uninstall section of the NVIDIA Installation Guide (Appendix B, ref. 2).

The TensorRT file organization is described in Appendix A.2.

2 HelloWorld

2.1 caffe

  1. Model conversion (code in Appendix C; a sketch of saving the serialized engine to disk follows below)
  2. Running inference (code in Appendix C)
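
The conversion step (caffeToGIEModel in Appendix C) serializes the engine into host memory only. In practice the serialized engine is usually also written to disk so the conversion does not have to be repeated at every start-up; a minimal sketch, where the file name model.engine is just an example:

#include <fstream>
#include <iterator>
#include <string>

// gieModelStream is the IHostMemory* produced by caffeToGIEModel()
std::ofstream out("model.engine", std::ios::binary);
out.write(static_cast<const char*>(gieModelStream->data()), gieModelStream->size());
out.close();

// later, possibly in another process: read the file back and deserialize it
std::ifstream in("model.engine", std::ios::binary);
std::string blob((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
IRuntime* runtime = createInferRuntime(gLogger);
ICudaEngine* engine = runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);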

2.2 pyTorch

  1. torch_trt: converts a PyTorch model to TensorRT directly; an alternative route is to export the model to ONNX and parse it with TensorRT's ONNX parser, as sketched below.
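
A hedged sketch of that ONNX route, using the newer explicit-batch builder API introduced around TensorRT 7 (these calls differ from the TensorRT 4-era API used in Appendix C, and the file name model.onnx is illustrative):

#include "NvInfer.h"
#include "NvOnnxParser.h"
using namespace nvinfer1;

// gLogger: an ILogger implementation (see the helper sketch at the end of Appendix C)
IBuilder* builder = createInferBuilder(gLogger);
const auto flags = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
INetworkDefinition* network = builder->createNetworkV2(flags);

// parse the ONNX file directly into the network definition
nvonnxparser::IParser* parser = nvonnxparser::createParser(*network, gLogger);
parser->parseFromFile("model.onnx", static_cast<int>(ILogger::Severity::kWARNING));

// build the engine
IBuilderConfig* config = builder->createBuilderConfig();
config->setMaxWorkspaceSize(1 << 20);
ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);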


Appendix

A Basic operations

1. Installation errors
1) GPG key not registered
Message: The public CUDA GPG key does not appear to be installed. To install the key, run this command: sudo apt-key add /var/nv-tensorrt-repo-cuda8.0-ga-trt4.0.1.6-20180612/7fa2af80.pub

Fix: run the command suggested at the end of the message: sudo apt-key add /var/nv-tensorrt*.pub

2) Missing dependencies
Error message:

The following packages have unmet dependencies:
 tensorrt : Depends: libnvinfer4 (>= 4.1.2) but it is not going to be installed
            Depends: libnvinfer-dev (>= 4.1.2) but it is not going to be installed
            Depends: libnvinfer-samples (>= 4.1.2) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

Analysis: install whatever apt reports as missing; here it eventually comes down to the missing cuda-cublas-8-0 package. The cause is that CUDA on this machine was installed from the .run file, while the TensorRT .deb expects a CUDA installed from .deb packages, which ship some extra dependency packages.

Fix: provide the CUDA .deb packages. Download the two CUDA 8.0 .deb packages used below from https://developer.nvidia.com/cuda-80-ga2-download-archive

# prepare the CUDA dependencies
sudo dpkg -i cuda-repo-ubuntu1604-8-0-local-cublas-performance-update_8.0.61-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64.deb
sudo apt-get update

# install the dependencies
sudo apt-get install python3-libnvinfer-doc

2. TensorRT file organization
After installation, the directory /usr/src/tensorrt is created:

tensorrt
    ├── bin     # compiled binaries are placed here
    ├── data
    ├── python  # data and python hold the resources used by the official samples, e.g. caffemodel files, TensorFlow model files, and some images
    └── samples # source code of the official samples

Entering the samples directory and running make produces the executables in the bin directory; they can then be run one by one for testing and study.

TensorRT itself is not open source; only headers and prebuilt libraries are shipped:
1) The headers are in /usr/include/x86_64-linux-gnu

NvCaffeParser.h, NvInfer.h, NvInferPlugin.h, NvOnnxConfig.h, NvOnnxParser.h, NvUffParser.h, NvUtils.h

2) The libraries are in /usr/lib/x86_64-linux-gnu


libnvinfer.so, libnvToolsExt.so, libnvinfer_plugin.a, libnvinfer_plugin.so.4, libnvcaffe_parser.so, libnvparsers.so.4.1.2, stubs/libnvrtc.so, libnvcaffe_parser.a, libnvidia-opencl.so.1, libnvvm.so, libnvinfer.a, libnvvm.so.3, libnvToolsExt.so.1, libnvrtc.so.7.5, libnvparsers.a, libnvblas.so.7.5, libnvToolsExt.so.1.0.0, libnvcaffe_parser.so.4.1.2, libnvinfer_plugin.so, libnvrtc-builtins.so, libnvparsers.so, libnvrtc-builtins.so.7.5.18, libnvblas.so.7.5.18, libnvvm.so.3.0.0, libnvrtc.so, libnvrtc-builtins.so.7.5, libnvinfer.so.4.1.2, libnvidia-opencl.so.390.30, libnvrtc.so.7.5.17, libnvblas.so, libnvinfer.so.4, libnvparsers.so.4, libnvinfer_plugin.so.4.1.2, libnvcaffe_parser.so.4

B Recommended resources

  1. arleyzhang. TensorRT(5)-INT8校准原理 [EB/OL]. https://arleyzhang.github.io/articles/923e2c40/. 2018-09-03.
  2. NVIDIA. TensorRT Installation Guide [EB/OL]. https://docs.nvidia.com/deeplearning/sdk/tensorrt-archived/tensorrt_401/tensorrt-install-guide/index.html#installing-debian. Accessed 2019-02-12.
  3. arleyzhang. TensorRT(1)-介绍-使用-安装 [EB/OL]. https://arleyzhang.github.io/articles/7f4b25ce/. 2018-08-31. Accessed 2019-02-12.
  4. linolzhang. TensorRT深度学习推理框架介绍 [EB/OL]. https://blog.csdn.net/linolzhang/article/details/79079863. 2018-01-16. Accessed 2019-09-18.

C Code examples

1. Converting a caffe model

void caffeToGIEModel(const std::string& deployFile,             // caffe prototxt file
                     const std::string& modelFile,              // caffe model file
                     const std::vector<std::string>& outputs,   // names of the network outputs
                     unsigned int maxBatchSize,                 // batch size - must be at least as large as the batch we want to run with
                     IHostMemory*& gieModelStream)              // output buffer for the GIE model
{
  // 1. create the builder
  IBuilder* builder = createInferBuilder(gLogger);

  // 2. parse the caffe model into the network definition
  INetworkDefinition* network = builder->createNetwork();
  ICaffeParser* parser = createCaffeParser();
  const IBlobNameToTensor* blobNameToTensor = parser->parse(locateFile(deployFile, directories).c_str(),
                                                            locateFile(modelFile, directories).c_str(),
                                                            *network, DataType::kFLOAT);

  // 3. mark the output tensors
  for (auto& s : outputs)
    network->markOutput(*blobNameToTensor->find(s.c_str()));

  // 4. build the engine
  builder->setMaxBatchSize(maxBatchSize);
  builder->setMaxWorkspaceSize(1 << 20);

  ICudaEngine* engine = builder->buildCudaEngine(*network);
  assert(engine);

  // 5. destroy the network and the parser
  network->destroy();
  parser->destroy();

  // 6. serialize the engine into the GIE model stream, then clean up
  gieModelStream = engine->serialize();
  engine->destroy();
  builder->destroy();
}

2. Running inference

// inference
void doInference(IExecutionContext& context, float* input, float* output, int batchSize)
{
  const ICudaEngine& engine = context.getEngine();
  // pointers to the input and output buffers passed to the engine - there must be exactly
  // IEngine::getNbBindings() of them; here 1 input + 1 output
  assert(engine.getNbBindings() == 2);
  void* buffers[2];

  // 1. to bind the buffers we need the names of the input and output tensors
  int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME),
      outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);

  // 2. create the GPU buffers and the stream
  CHECK(cudaMalloc(&buffers[inputIndex], batchSize * INPUT_H * INPUT_W * sizeof(float)));
  CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));

  cudaStream_t stream;
  CHECK(cudaStreamCreate(&stream));

  // 3. DMA the input to the GPU, run the batch asynchronously, and DMA the result back to the host
  CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));
  context.enqueue(batchSize, buffers, stream, nullptr);
  CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
  cudaStreamSynchronize(stream);

  // 4. release the stream and the buffers
  cudaStreamDestroy(stream);
  CHECK(cudaFree(buffers[inputIndex]));
  CHECK(cudaFree(buffers[outputIndex]));
}

int main(int argc, char const* argv[])
{
  // 1. create the GIE model from the caffe model and serialize it to a stream
  IHostMemory* gieModelStream{nullptr};
  caffeToGIEModel("mnist.prototxt", "mnist.caffemodel", std::vector<std::string>{OUTPUT_BLOB_NAME}, 1, gieModelStream);

  // x. load the input data into a float* named data (omitted)
  // x. parse the mean file (omitted)

  // 2. deserialize the stream into a runtime engine
  IRuntime* runtime = createInferRuntime(gLogger);
  ICudaEngine* engine = runtime->deserializeCudaEngine(gieModelStream->data(), gieModelStream->size(), nullptr);
  if (gieModelStream) gieModelStream->destroy();

  // 3. create the execution context
  IExecutionContext* context = engine->createExecutionContext();

  // 4. run inference
  float prob[OUTPUT_SIZE];
  doInference(*context, data, prob, 1);

  // 5. destroy the context, the engine, and the runtime
  context->destroy();
  engine->destroy();
  runtime->destroy();

  return 0;
}
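
The listings above are adapted from the official sampleMNIST example and rely on helpers that the sample's common code normally provides (gLogger, CHECK, locateFile, directories, the blob names, and the size constants). A minimal sketch of those helpers against the TensorRT 4-era headers, with illustrative MNIST values, so the listings are self-contained:

#include <cassert>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>
#include <cuda_runtime_api.h>
#include "NvInfer.h"
#include "NvCaffeParser.h"

using namespace nvinfer1;
using namespace nvcaffeparser1;

static const int INPUT_H = 28, INPUT_W = 28, OUTPUT_SIZE = 10;     // MNIST input/output sizes
static const char* INPUT_BLOB_NAME = "data";
static const char* OUTPUT_BLOB_NAME = "prob";
static const std::vector<std::string> directories{"data/mnist/"};  // where the model files live

// abort on CUDA errors
#define CHECK(status)                                          \
    do {                                                       \
        auto ret = (status);                                   \
        if (ret != 0) {                                        \
            std::cerr << "CUDA failure: " << ret << std::endl; \
            abort();                                           \
        }                                                      \
    } while (0)

// logger required by createInferBuilder / createInferRuntime
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) override
    {
        if (severity != Severity::kINFO)
            std::cout << msg << std::endl;
    }
} gLogger;

// naive stand-in for the sample's locateFile(): prepend the first search directory
std::string locateFile(const std::string& file, const std::vector<std::string>& dirs)
{
    return dirs.empty() ? file : dirs[0] + file;
}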
