PaddleOCR-VL-1.5 ONNX Optimized (Quantized)
This repository provides the ONNX version of PaddleOCR-VL-1.5, optimized specifically for CPU inference. The models have been quantized with Intel® ONNX Neural Compressor (INC) and NNCF to significantly reduce memory usage and increase inference speed.
Model Files & Quantization
This project offers multiple levels of quantization so you can choose the best balance between speed and accuracy for your hardware.
1. Vision Encoder & Multi-modal Projector
The vision component handles the initial image processing and maps visual features to the language space.
- vision_encoder.onnx: Full FP32 precision.
- vision_encoder_q8.onnx: 8-bit dynamic quantized version.
- vision_encoder_q4.onnx: 4-bit weight-only quantized version, compressed with NNCF (data-free).
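To make the 8-bit variants concrete, here is a minimal numpy sketch of the symmetric int8 weight quantization that dynamic quantization applies to a weight tensor (the actual INC implementation also quantizes activations at runtime; this only illustrates the weight round-trip):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct an fp32 approximation of the original weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Rounding error per weight is bounded by half a quantization step.
max_err = np.abs(w - w_hat).max()
```

The int8 tensor plus one fp32 scale is roughly 4x smaller than the fp32 original, which is where the memory savings come from.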
2. Decoder (Language Model)
The autoregressive decoder responsible for generating the text/structured output.
- decoder.onnx: Full FP32 precision language model.
- decoder_q8.onnx: 8-bit dynamic quantized version.
- decoder_q4.onnx: 4-bit quantized version using the GPTQ algorithm via ONNX Neural Compressor.
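The autoregressive loop the decoder runs can be sketched as below. The `toy_next_logits` function is a stand-in for a real decoder forward pass (which would be an ONNX Runtime session call); the loop structure of greedy decoding is what matters here:

```python
import numpy as np

def toy_next_logits(tokens, vocab_size=8):
    # Stand-in for the decoder's forward pass: deterministic toy logits
    # that put all mass on (sum of tokens) mod vocab_size.
    logits = np.zeros(vocab_size)
    logits[sum(tokens) % vocab_size] = 1.0
    return logits

def greedy_decode(prompt, max_new_tokens=5, eos_id=0):
    """Append the argmax token each step until EOS or the token budget."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_id = int(np.argmax(toy_next_logits(tokens)))
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

out = greedy_decode([3, 4])
```

In a real pipeline each step would also reuse the decoder's KV cache rather than recomputing over the full sequence.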
3. Embedding
embed.onnx: The shared embedding layer required for token-to-vector conversion.
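Token-to-vector conversion is a plain row lookup into the embedding matrix; a minimal numpy sketch (with made-up sizes, since the real table matches the model's vocabulary and hidden size):

```python
import numpy as np

vocab_size, hidden = 10, 4  # illustrative sizes only
rng = np.random.default_rng(0)
embedding = rng.standard_normal((vocab_size, hidden)).astype(np.float32)

token_ids = np.array([2, 5, 5])
vectors = embedding[token_ids]  # one hidden-size vector per token
```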
Key Quantization Techniques
- GPTQ (Generative Pre-trained Transformer Quantization): Applied to the 4-bit decoder to maintain high accuracy in text generation by minimizing layer-wise weight reconstruction error during compression.
- NNCF Weight-Only Quantization: Applied to the vision encoder to reduce the initial "prefill" time (Time to First Token) without requiring a calibration dataset.
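A data-free 4-bit weight-only scheme like the one NNCF applies can be sketched in numpy as group-wise symmetric quantization: each group of weights shares one floating-point scale, and values are rounded onto a 4-bit grid. (NNCF's and GPTQ's actual algorithms are more sophisticated; this only shows the storage format.)

```python
import numpy as np

def quantize_4bit_groupwise(w, group_size=32):
    """Data-free 4-bit weight-only quantization with per-group scales.

    Each group of `group_size` weights shares one scale; values are
    rounded onto the symmetric int4 grid [-7, 7]."""
    flat = w.reshape(-1, group_size)
    scales = np.abs(flat).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero groups
    q = np.clip(np.round(flat / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales, shape):
    """Reconstruct an fp32 approximation from int4 codes and scales."""
    return (q * scales).reshape(shape).astype(np.float32)

rng = np.random.default_rng(1)
w = rng.standard_normal((8, 64)).astype(np.float32)
q, scales = quantize_4bit_groupwise(w)
w_hat = dequantize(q, scales, w.shape)
max_err = np.abs(w - w_hat).max()
```

No calibration data is needed because the scales are derived from the weights alone, which is what makes the scheme "data-free".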
How to Use
Check out the Colab notebook demo. It uses the OpenVINO backend, but you can use whichever ONNX Runtime execution provider best suits your hardware.
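A minimal sketch of choosing an execution provider with onnxruntime; `pick_providers` is a hypothetical helper (not part of any library), and the model filename is illustrative:

```python
def pick_providers(available):
    """Hypothetical helper: build a provider priority list from what
    onnxruntime reports as available on this machine."""
    preferred = [
        "OpenVINOExecutionProvider",  # Intel CPUs/iGPUs (used in the demo)
        "CUDAExecutionProvider",      # NVIDIA GPUs
        "CPUExecutionProvider",       # always-available fallback
    ]
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]

try:
    import onnxruntime as ort

    providers = pick_providers(ort.get_available_providers())
    # Point the path at any of the model files above, e.g. the 8-bit decoder:
    # session = ort.InferenceSession("decoder_q8.onnx", providers=providers)
except ImportError:
    pass  # onnxruntime not installed in this environment
```

ONNX Runtime tries the providers in list order, so putting the hardware-specific backend first with the CPU provider as a fallback is the usual pattern.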