Libraries

How to use erayalp/blip2-flan-t5-xl-LoRA-image-captioning with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="erayalp/blip2-flan-t5-xl-LoRA-image-captioning")

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("erayalp/blip2-flan-t5-xl-LoRA-image-captioning", dtype="auto")

Local Apps

vLLM

How to use erayalp/blip2-flan-t5-xl-LoRA-image-captioning with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "erayalp/blip2-flan-t5-xl-LoRA-image-captioning"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "erayalp/blip2-flan-t5-xl-LoRA-image-captioning",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/erayalp/blip2-flan-t5-xl-LoRA-image-captioning

SGLang

How to use erayalp/blip2-flan-t5-xl-LoRA-image-captioning with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "erayalp/blip2-flan-t5-xl-LoRA-image-captioning" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "erayalp/blip2-flan-t5-xl-LoRA-image-captioning",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "erayalp/blip2-flan-t5-xl-LoRA-image-captioning" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "erayalp/blip2-flan-t5-xl-LoRA-image-captioning",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Model Summary

This model is a fine-tuned version of BLIP-2 with the flan-t5-xl language decoder, optimized for image captioning tasks. Fine-tuning was performed using LoRA (Low-Rank Adaptation) for parameter-efficient adaptation. The training objective was to generate high-quality, semantically rich captions for images from the Open Images dataset.

It was developed for a captioning competition evaluated using Fréchet GTE Distance (FGD), which uses GTE-small embeddings to assess the alignment of image and caption semantics.
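The competition's exact FGD implementation is not given here. As a rough sketch, assuming FGD follows the usual Fréchet distance between Gaussians fitted to two sets of embeddings (the same form used by FID), it could be computed as below. The function name `frechet_distance` and the random stand-in arrays are illustrative assumptions; in practice the inputs would be GTE-small embeddings, and which two sets are compared depends on the competition's rules.

```python
import numpy as np


def frechet_distance(x, y):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    x, y: arrays of shape (n_samples, dim) with dim >= 2.
    Formula: ||mu_x - mu_y||^2 + Tr(C_x + C_y - 2 * (C_x C_y)^(1/2)).
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)

    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)

    # Tr((C_x C_y)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of C_x C_y, which are real and non-negative for PSD inputs.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x) + np.trace(cov_y) - 2.0 * trace_sqrt)


# Stand-in data (assumption): random vectors instead of real GTE-small embeddings.
rng = np.random.default_rng(0)
a = rng.normal(size=(200, 8))
b = a + 1.0  # same spread, mean shifted by 1 in each of the 8 dimensions

same = frechet_distance(a, a)     # close to 0
shifted = frechet_distance(a, b)  # close to 8, the squared length of the shift
print(same, shifted)
```

Lower values mean the two embedding distributions are closer, which matches the (↓) direction used for this metric.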

Training Objective

Task: Image Captioning
Base Model: Salesforce/blip2-flan-t5-xl
Backbone: Frozen ViT-G + frozen Q-Former
Decoder: Fine-tuned flan-t5-xl with LoRA
Loss: Cross-entropy with optional GTE-aware auxiliary loss
Evaluation: Fréchet GTE Distance (FGD) between image and caption embeddings
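
For readers unfamiliar with LoRA, the following is a minimal NumPy sketch of the idea, not the training code used for this model. It assumes the standard formulation, in which a frozen weight `W` is augmented by a low-rank update scaled by `alpha / r`. The helper names (`lora_linear`, `lora_param_count`) and the toy dimensions are hypothetical.

```python
import numpy as np


def lora_linear(x, W, A, B, alpha):
    """Linear layer with a LoRA update: y = x W^T + (alpha / r) * x A^T B^T.

    W: frozen base weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


def lora_param_count(d_in, d_out, r):
    """Trainable parameters added by one LoRA-adapted linear layer."""
    return r * (d_in + d_out)


# Toy sizes for illustration only (assumption); they are not the model's sizes.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 12, 4, 4
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))
B = np.zeros((d_out, r))  # B starts at zero, so the adapter is a no-op at init
x = rng.normal(size=(3, d_in))

base_out = x @ W.T
init_out = lora_linear(x, W, A, B, alpha)

# After training, the update can be merged into a single weight matrix.
B_trained = rng.normal(size=(d_out, r))
merged_W = W + (alpha / r) * B_trained @ A
adapted_out = lora_linear(x, W, A, B_trained, alpha)

print(np.allclose(init_out, base_out), np.allclose(adapted_out, x @ merged_W.T))
print(lora_param_count(d_in, d_out, r))
```

With rank and alpha both set to 128, as listed under Fine-Tuning Configuration, the scaling factor `alpha / r` is 1.0.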

⸻

Dataset

Training Dataset: Subset of Open Images with curated image-caption pairs
Augmentation: Synthetic captions
Image Features: Preprocessed using BLIP-2’s frozen vision encoder

⸻

Fine-Tuning Configuration

LoRA Rank: 128
Alpha: 128
Dropout: 0.05
Target Modules: q, v, and k in attention blocks
Precision: bfloat16
Optimizer: AdamW
Learning Rate: 3e-5
Scheduler: Cosine with warmup
Batch Size: 32
Epochs: 4
Accumulation: 2 gradient accumulation steps
Logging: Weights & Biases
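
As a rough illustration of the "Cosine with warmup" schedule, the sketch below computes the learning rate at a given optimizer step: linear warmup to the peak rate, then cosine decay to zero. The warmup length and total step count are not stated in this card, so `WARMUP_STEPS` and `TOTAL_STEPS` are hypothetical placeholders. The effective batch size assumes that 32 is the per-step batch size, which is also an assumption.

```python
import math

PEAK_LR = 3e-5         # from the configuration above
BATCH_SIZE = 32        # from the configuration above
GRAD_ACCUM_STEPS = 2   # from the configuration above
TOTAL_STEPS = 1000     # hypothetical placeholder, not stated in the card
WARMUP_STEPS = 100     # hypothetical placeholder, not stated in the card

# Assumption: 32 is the per-step batch size, so each optimizer update sees 64 samples.
EFFECTIVE_BATCH_SIZE = BATCH_SIZE * GRAD_ACCUM_STEPS


def lr_at_step(step, total_steps=TOTAL_STEPS, warmup_steps=WARMUP_STEPS, peak_lr=PEAK_LR):
    """Learning rate with linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step))
```

The rate rises linearly during warmup, peaks at 3e-5 when warmup ends, and falls to zero at the final step.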

⸻

Performance

Fréchet GTE Distance (↓): Achieved a competitive score in the competition
Caption Quality: High semantic alignment and fluency
Improved OCR capabilities

⸻

Usage

from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image
import torch

# BLIP-2 checkpoints use the Blip2* classes, not the original BLIP classes.
model_id = "erayalp/blip2-flan-t5-xl-LoRA-image-captioning"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(model_id).to(device)

# Caption a single image, guided by a text prompt.
img = Image.open("example.jpg").convert("RGB")
prompt = "Provide a detailed caption for this photo."
inputs = processor(images=img, text=prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(outputs[0], skip_special_tokens=True)

print(caption)

⸻

Limitations

May underperform on out-of-domain images or very abstract concepts
Quality of captions may vary depending on scene complexity
Not trained on video or temporal sequences
