PaddleOCR-VL-1.5 ONNX Optimized (Quantized)

This repository provides the ONNX version of PaddleOCR-VL-1.5, specifically optimized for CPU inference. The models have been quantized using Intel® ONNX Neural Compressor (INC) and NNCF to significantly reduce memory usage and increase inference speed.

🛠 Model Files & Quantization

This project offers multiple levels of quantization so you can choose the best balance between speed and accuracy for your hardware.

1. Vision Encoder & Multi-modal Projector

The vision component handles the initial image processing and maps visual features to the language space.

  • vision_encoder.onnx: Full FP32 precision.
  • vision_encoder_q8.onnx: 8-bit dynamic quantized version.
  • vision_encoder_q4.onnx: 4-bit weight-only quantization, compressed using NNCF (Data-free).

2. Decoder (Language Model)

The autoregressive decoder responsible for generating the text/structured output.

  • decoder.onnx: Full FP32 precision language model.
  • decoder_q8.onnx: 8-bit dynamic quantized version.
  • decoder_q4.onnx: 4-bit quantized version using the GPTQ algorithm via the ONNX Neural Compressor.

3. Embedding

  • embed.onnx: The shared embedding layer required for token-to-vector conversion.
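The practical difference between the `_q8` and `_q4` files above is how the weights are stored. As a rough, self-contained illustration of the idea behind the 4-bit weight-only files (not the exact NNCF/GPTQ pipeline, which also minimizes reconstruction error), here is symmetric per-group int4 quantization in NumPy:

```python
import numpy as np

def quantize_int4_groupwise(w, group_size=128):
    """Symmetric 4-bit weight-only quantization with one scale per group.
    Weights are mapped to integers in [-8, 7]; only q and the scales are stored."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    groups = w.reshape(out_features, in_features // group_size, group_size)
    scales = np.abs(groups).max(axis=-1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    """Reconstruct float weights at load time from int4 values + scales."""
    return (q.astype(np.float32) * scales).reshape(q.shape[0], -1)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 256)).astype(np.float32)
q, s = quantize_int4_groupwise(w)
w_hat = dequantize(q, s)
```

Per-group scales keep the rounding error proportional to each group's own magnitude, which is why weight-only schemes hold up well without calibration data.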

Key Quantization Techniques

  • GPTQ (Generative Pre-trained Transformer Quantization): Applied to the 4-bit decoder to maintain high accuracy in text generation by minimizing the layer-wise error that weight compression introduces.
  • NNCF Weight-Only Quantization: Applied to the vision encoder to reduce the initial "prefill" time (Time to First Token) without requiring a calibration dataset.
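The 8-bit "dynamic" variants listed above keep pre-quantized int8 weights but compute activation scales at run time. A minimal NumPy sketch of that idea (illustrative only, not onnxruntime's actual kernel):

```python
import numpy as np

def quantize_int8(x):
    """Per-tensor symmetric int8 quantization; the scale comes from the
    tensor itself, which is what makes the scheme 'dynamic'."""
    scale = max(float(np.abs(x).max()) / 127.0, 1e-12)
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dynamic_int8_matmul(x, w_q, w_scale):
    """Quantize activations on the fly, multiply in integers, rescale back."""
    x_q, x_scale = quantize_int8(x)
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T  # int32 accumulator
    return acc.astype(np.float32) * (x_scale * w_scale)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 64)).astype(np.float32)
w = rng.standard_normal((16, 64)).astype(np.float32)
w_q, w_scale = quantize_int8(w)          # weights quantized once, offline
y_int8 = dynamic_int8_matmul(x, w_q, w_scale)
y_fp32 = x @ w.T
```

Because no calibration pass is needed to fix activation ranges in advance, dynamic quantization is a drop-in option for CPU inference.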

💻 How to Use

Check out this Colab notebook demo. It uses the OpenVINO execution provider, but you can use whichever ONNX Runtime execution provider suits your hardware.
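A minimal sketch of session creation with provider fallback (assumes `onnxruntime`, or `onnxruntime-openvino` for the OpenVINO provider, is installed, and that the repo's model files have been downloaded locally):

```python
def pick_providers(available,
                   preferred=("OpenVINOExecutionProvider", "CPUExecutionProvider")):
    """Keep only the preferred execution providers that this onnxruntime
    build actually exposes, preserving preference order."""
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]  # CPU is always a safe fallback

def load_session(model_path):
    """Create an InferenceSession for one of this repo's .onnx files."""
    import onnxruntime as ort
    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)

# e.g. session = load_session("decoder_q4.onnx")
```

Passing a provider list lets the same script run unchanged on machines with or without the OpenVINO backend installed.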
