Model Summary
This model is a fine-tuned version of BLIP-2 with the flan-t5-xl language decoder, optimized for image captioning tasks. Fine-tuning was performed using LoRA (Low-Rank Adaptation) for parameter-efficient adaptation. The training objective was to generate high-quality, semantically rich captions for images from the Open Images dataset.
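As a rough illustration of what LoRA does to a single attention projection, the sketch below adds a trainable low-rank update to a frozen weight matrix. The rank and alpha values come from the fine-tuning configuration in this card; the 2048-wide layer size, the initialization scales, and the `lora_linear` helper are illustrative assumptions, and dropout is left out for simplicity.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha, rank):
    """Frozen linear layer W plus a scaled low-rank LoRA update.

    Shapes: x (batch, d_in), W (d_out, d_in), A (rank, d_in), B (d_out, rank).
    """
    scaling = alpha / rank            # 128 / 128 = 1.0 with this card's settings
    base = x @ W.T                    # frozen pretrained projection
    update = (x @ A.T) @ B.T          # low-rank path; only A and B are trained
    return base + scaling * update

rng = np.random.default_rng(0)
d_in, d_out = 2048, 2048              # illustrative layer size (assumption)
rank, alpha = 128, 128                # values listed in this card

W = rng.normal(size=(d_out, d_in))               # frozen weight
A = rng.normal(scale=0.01, size=(rank, d_in))    # small random init
B = np.zeros((d_out, rank))                      # zero init: update starts as a no-op
x = rng.normal(size=(4, d_in))

y = lora_linear(x, W, A, B, alpha, rank)
trainable = A.size + B.size           # parameters LoRA updates
frozen = W.size                       # parameters left untouched
print(f"trainable fraction for this layer: {trainable / frozen:.3f}")
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training only has to learn the two small matrices.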
It was developed for a captioning competition evaluated using Fréchet GTE Distance (FGD), which uses GTE-small embeddings to assess the alignment of image and caption semantics.
Training Objective
- Task: Image Captioning
- Base Model: Salesforce/blip2-flan-t5-xl
- Backbone: Frozen ViT-G + frozen Q-Former
- Decoder: Fine-tuned flan-t5-xl with LoRA
- Loss: Cross-entropy with optional GTE-aware auxiliary loss
- Evaluation: Fréchet GTE Distance (FGD) between image and caption embeddings
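The competition's exact FGD definition is not given here. The sketch below assumes it follows the same formula as the Fréchet Inception Distance, applied to two sets of embeddings: fit a Gaussian to each set and compare means and covariances. The random arrays are stand-ins for GTE-small embeddings, and `frechet_distance` is a hypothetical helper name.

```python
import numpy as np

def frechet_distance(emb_a, emb_b):
    """Fréchet distance between Gaussians fitted to two embedding sets (rows = samples)."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) is the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 8))   # stand-in for one set of embeddings
shifted = reference + 0.5               # same spread, mean moved by 0.5 per dimension

same = frechet_distance(reference, reference)
moved = frechet_distance(reference, shifted)
print(same, moved)
```

Lower is better: identical sets give a distance near zero, and moving the mean while keeping the spread adds the squared length of the shift.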
⸻
Dataset
- Training Dataset: Subset of Open Images with curated image-caption pairs
- Augmentation: Synthetic captions
- Image Features: Preprocessed using BLIP-2’s frozen vision encoder
⸻
Fine-Tuning Configuration
- LoRA Rank: 128
- Alpha: 128
- Dropout: 0.05
- Target Modules: q, v, and k in attention blocks
- Precision: bfloat16
- Optimizer: AdamW
- Learning Rate: 3e-5
- Scheduler: Cosine with warmup
- Batch Size: 32
- Epochs: 4
- Accumulation: 2 gradient accumulation steps
- Logging: Weights & Biases
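To make the schedule and batch settings above concrete, here is a minimal sketch of a cosine schedule with linear warmup at the listed peak learning rate. The warmup length, total step count, and decay to zero are hypothetical choices, since the card does not state them. The effective batch size assumes that 32 is the per-step batch before accumulation.

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed floor)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

batch_size, accumulation_steps = 32, 2
# Samples per optimizer update, assuming 32 is the per-step batch size
effective_batch = batch_size * accumulation_steps

# Hypothetical values: the card does not list step counts or warmup length
total_steps, warmup_steps = 1000, 100
schedule = [lr_at(s, total_steps, warmup_steps) for s in range(total_steps + 1)]

print(effective_batch, schedule[0], max(schedule), schedule[-1])
```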
⸻
Performance
- Fréchet GTE Distance (↓): Achieved a competitive score in the competition
- Caption Quality: High semantic alignment and fluency
- Improved OCR capabilities
⸻
Usage
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image
import torch

# BLIP-2 checkpoints load with the Blip2* classes, not the BLIP-1 Blip* classes
model_id = "erayalp/blip2-flan-t5-xl-LoRA-image-captioning"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(model_id).to(device)

# Load one image and pair it with a single text prompt
img = Image.open("example.jpg").convert("RGB")
prompt = "Provide a detailed caption for this photo."
inputs = processor(images=img, text=prompt, return_tensors="pt").to(device)

# Generate token IDs and decode them into the caption string
outputs = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(outputs[0], skip_special_tokens=True)
print(caption)
⸻
Limitations
- May underperform on out-of-domain images or very abstract concepts
- Quality of captions may vary depending on scene complexity
- Not trained on video or temporal sequences