How to use infly/Infinity-Parser2-Pro with libraries and local apps.

Libraries

How to use infly/Infinity-Parser2-Pro with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="infly/Infinity-Parser2-Pro")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
# Keep the result so it is visible when run as a script
outputs = pipe(text=messages)
print(outputs)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("infly/Infinity-Parser2-Pro")
model = AutoModelForImageTextToText.from_pretrained("infly/Infinity-Parser2-Pro")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use infly/Infinity-Parser2-Pro with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "infly/Infinity-Parser2-Pro"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "infly/Infinity-Parser2-Pro",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
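
For callers who prefer Python over curl, the same request can be sketched with the standard library alone. This is a minimal sketch and not part of the model card: the helper names `build_chat_request` and `send_chat_request` are hypothetical, and it assumes the vLLM server started above is listening on localhost:8000 and replies in the usual OpenAI chat-completions shape.

```python
import json
import urllib.request
from typing import Any, Dict


def build_chat_request(model: str, prompt: str, image_url: str) -> Dict[str, Any]:
    """Build the same JSON body that the curl example sends."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def send_chat_request(base_url: str, payload: Dict[str, Any], timeout: float = 120.0) -> Dict[str, Any]:
    """POST the payload to an OpenAI-compatible /v1/chat/completions endpoint."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires the server to be running; base URL is an assumption):
#   payload = build_chat_request(
#       "infly/Infinity-Parser2-Pro",
#       "Describe this image in one sentence.",
#       "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
#   )
#   reply = send_chat_request("http://localhost:8000", payload)
#   print(reply["choices"][0]["message"]["content"])
```

Separating payload construction from the network call keeps the request body easy to inspect before anything is sent. For the SGLang server described further down, only the base URL should need to change (port 30000).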

Use Docker

docker model run hf.co/infly/Infinity-Parser2-Pro

SGLang

How to use infly/Infinity-Parser2-Pro with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "infly/Infinity-Parser2-Pro" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "infly/Infinity-Parser2-Pro",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "infly/Infinity-Parser2-Pro" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use infly/Infinity-Parser2-Pro with Docker Model Runner:
```
docker model run hf.co/infly/Infinity-Parser2-Pro
```

BBoxes are super weird

by glebkudr - opened 3 days ago

glebkudr, 3 days ago (edited 3 days ago)

Hi! Thank you for releasing Infinity-Parser2-Pro.

I am testing the model for document layout parsing and noticed an issue with bounding boxes on some real documents.

For example, I send a non-square document image (1280x923 px) to the model. The raw model output contains bboxes in normalized [0..1000] coordinates, which seems consistent with the postprocessing in the official Python package (restore_abs_bbox_coordinates, which scales bbox coordinates from 0..1000 back to image pixels).

However, after applying the same scaling formula:

x_px = x / 1000 * image_width
y_px = y / 1000 * image_height

some bounding boxes still look inaccurate. They are often shifted or too large, especially around tables, flowcharts, dense text areas, and structured documents. In some cases the boxes look more like rough layout regions than precise text/element boxes.

Could you please clarify:

1. Are raw bbox coordinates always expected to be normalized to a 1000x1000 coordinate space?
2. Should they be scaled independently by image width and image height, as done in restore_abs_bbox_coordinates?
3. Are the bboxes intended to be precise text/element boxes, or only approximate layout regions?
4. Is there a recommended prompt or postprocessing step to improve bbox accuracy?
5. For non-square images, is there any internal resize/letterbox/crop behavior that should be accounted for when mapping bboxes back to pixels?

I can share examples where the image is 1280x923, the model returns bbox values like [53, 103, 109, 144], and scaling them to pixels mostly follows the document layout, but the resulting boxes are still visibly misaligned or overly broad.
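
The mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `scale_bbox_to_pixels` is hypothetical (it is not the package's restore_abs_bbox_coordinates), and the four values are assumed to be ordered [x1, y1, x2, y2], which the thread does not state explicitly.

```python
from typing import Sequence, Tuple


def scale_bbox_to_pixels(
    bbox: Sequence[float],
    image_width: int,
    image_height: int,
    norm_size: int = 1000,
) -> Tuple[float, float, float, float]:
    """Map a bbox from the 0..norm_size space to pixel coordinates.

    Assumes bbox = [x1, y1, x2, y2]; x values scale with the image width
    and y values scale with the image height, independently.
    """
    x1, y1, x2, y2 = bbox
    return (
        x1 * image_width / norm_size,
        y1 * image_height / norm_size,
        x2 * image_width / norm_size,
        y2 * image_height / norm_size,
    )


# The example values from this post: a 1280x923 image and bbox [53, 103, 109, 144].
print(scale_bbox_to_pixels([53, 103, 109, 144], 1280, 923))
# approximately (67.84, 95.07, 139.52, 132.91)
```

Because x and y are scaled independently, this formula carries no letterbox or crop correction; if the model applied such a step internally, an offset would have to be added here.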

Thanks!

KexuanRen (inftech.ai org), 2 days ago

Thanks for the detailed report and for testing Infinity-Parser2-Pro!

To address your questions:

The coordinate pipeline is correct. The model predicts bounding boxes in a 1000×1000 normalized coordinate space, and postprocessing scales them back to pixel coordinates using x_px = x / 1000 * image_width and y_px = y / 1000 * image_height. This is the same strategy used by Qwen3-VL and Qwen3.5.
The model was primarily trained on high-resolution document images. For lower-resolution inputs (such as your 648 × 781 example), localization accuracy may degrade, which can result in shifted or overly broad bounding boxes — especially around dense regions like tables, flowcharts, and structured layouts. We plan to address this in the next release.
Recommended workaround. As a temporary fix, we recommend upscaling the image so that its longer side is at least 1000 pixels before passing it to the model. This should meaningfully improve bounding box accuracy for lower-resolution inputs. We tested your example by resizing the image to 923×1280 (long side ≥ 1000), and the bounding boxes were accurate.
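
The upscaling workaround can be sketched as follows. This is a rough sketch, not the maintainers' code: the helper names `upscaled_size` and `upscale_for_parser` are hypothetical, the threshold of 1000 is taken from the reply above, and the image object is assumed to behave like a Pillow image (a `.size` tuple and a `.resize(size)` method).

```python
from typing import Tuple


def upscaled_size(width: int, height: int, min_long_side: int = 1000) -> Tuple[int, int]:
    """Return (width, height) enlarged so the longer side is at least min_long_side.

    Sizes that already meet the threshold are returned unchanged, and the
    aspect ratio is preserved up to rounding.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if max(width, height) >= min_long_side:
        return width, height
    if width >= height:
        return min_long_side, max(1, round(height * min_long_side / width))
    return max(1, round(width * min_long_side / height)), min_long_side


def upscale_for_parser(image, min_long_side: int = 1000):
    """Resize a Pillow-style image before sending it to the model.

    `image` is assumed to expose `.size` and `.resize(size)` as Pillow does.
    """
    new_size = upscaled_size(*image.size, min_long_side=min_long_side)
    if new_size == tuple(image.size):
        return image  # already large enough
    return image.resize(new_size)


print(upscaled_size(648, 781))   # (830, 1000)
print(upscaled_size(1280, 923))  # (1280, 923), unchanged
```

Since the resize keeps the aspect ratio and the model's boxes are normalized to 0..1000, the returned boxes should still map onto the original image when scaled by its original width and height, up to rounding.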

We appreciate you surfacing this with concrete examples — feedback like this directly informs our next iteration.

zuminghuang changed discussion status to closed 1 day ago
