# 🛠️ Code LLM Toolkit: Fine-tune + RAG + Tool-Calling for Internal Codebases
A complete toolkit for building a Python code generation LLM that can search your internal codebase via RAG, call tools, and reason through multi-step tasks.
## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         Code LLM System                         │
│                                                                 │
│  ┌──────────────┐   ┌──────────────┐   ┌─────────────────────┐  │
│  │  Fine-tuned  │   │ RAG Pipeline │   │   Tool Executor     │  │
│  │  Qwen2.5-    │◄──│ (AST-aware   │   │  - search_codebase  │  │
│  │  Coder-7B    │   │  chunking +  │   │  - execute_python   │  │
│  │  + LoRA      │   │  embeddings) │   │  - read_file        │  │
│  └──────┬───────┘   └──────────────┘   │  - run_tests        │  │
│         │                              └──────────┬──────────┘  │
│         │          ReAct Agent Loop               │             │
│         └──────────────────┬──────────────────────┘             │
│                            │                                    │
│                     ┌──────▼──────┐                             │
│                     │  Response   │                             │
│                     └─────────────┘                             │
└─────────────────────────────────────────────────────────────────┘
```
## Research Foundation
Every component is grounded in published research with verified results:
| Component | Paper | Key Result |
|---|---|---|
| Base Model | Qwen2.5-Coder | HumanEval 88.4% (7B), SOTA open-source |
| Tool-Calling Data | ToolACE | Beats GPT-4-turbo on BFCL benchmark |
| Multi-turn Agent Data | APIGen-MT | 78.19% BFCL v3 (#1, beats o1/GPT-4o) |
| Code SFT Data | Magicoder | HumanEval 70.7% from 7B with 185K samples |
| RAG Chunking | cAST | +5.6pp over fixed-size on RepoEval |
| RAG Strategy | AllianceCoder | API signatures > similar code (+20%) |
| Retriever-Aware Training | Gorilla | Outperforms GPT-4 on API accuracy |
| Code Embeddings | CodeSage-v2 | Best open code embedding model |
| LoRA for Code | Astraios | LoRA matches FFT at ≥16B scale |
## Quick Start

### Step 1: Prepare Training Data

prepare_data.py merges four verified datasets (ToolACE + APIGen-MT + Magicoder + CodeAct) into a unified ChatML format:
```bash
pip install datasets

# Test with a small sample first
python prepare_data.py --max_per_source 100 --dry_run

# Full run: pushes the merged dataset to the Hub
python prepare_data.py --output_repo your-username/code-toolcall-sft-data
```
Dataset composition (~110K examples):
| Source | Examples | Purpose |
|---|---|---|
| Team-ACE/ToolACE | 26K | Tool-calling (single-turn) |
| Salesforce/APIGen-MT-5k | 5K | Multi-turn agentic tool use |
| Magicoder-OSS-Instruct-75K | ~25K (Python) | Python code generation |
| xingyaoww/code-act | 7K | Code-as-action (tools via Python) |
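For orientation, each merged example follows a chat-style shape roughly like the one below; the field names and tool-call encoding here are illustrative, and prepare_data.py defines the authoritative schema:

```python
# Hypothetical merged training example; exact keys and the tool-call
# encoding are defined in prepare_data.py, not here.
example = {
    "messages": [
        {"role": "system", "content": "You are a coding assistant with access to tools."},
        {"role": "user", "content": "Where is the auth token validated?"},
        {"role": "assistant", "content": '<tool_call>{"name": "search_codebase", "arguments": {"query": "token validation"}}</tool_call>'},
        {"role": "tool", "content": "[auth/tokens.py] def validate_token(token: str) -> bool: ..."},
        {"role": "assistant", "content": "Token validation happens in auth/tokens.py, in validate_token()."},
    ]
}
```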
### Step 2: Add Your Internal Codebase Data

This is the most impactful step. Use the Gorilla/Magicoder pattern:
- OSS-Instruct on your code: Sample random snippets from your internal repo, then use an LLM (GPT-4o, Claude) to generate instruction-solution pairs seeded from that code (see the sketch after this list)
- Retriever-aware examples: Include retrieved code context in training prompts so the model learns to use RAG at inference time
- Internal API documentation: Convert your docstrings/README into Q&A pairs

See prepare_data.py for the format; add your examples as additional sources.
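A minimal sketch of the OSS-Instruct step, assuming the openai client; `sample_snippets` is a hypothetical helper, and any strong LLM can stand in for gpt-4o:

```python
# Sketch of OSS-Instruct-style data generation seeded from internal code.
import json
import pathlib
import random

from openai import OpenAI

client = OpenAI()

def sample_snippets(repo_path: str, n: int = 100, max_lines: int = 40):
    """Pick random contiguous line windows from Python files in the repo."""
    files = list(pathlib.Path(repo_path).rglob("*.py"))
    for _ in range(n):
        lines = random.choice(files).read_text().splitlines()
        if not lines:
            continue
        start = random.randrange(len(lines))
        yield "\n".join(lines[start : start + max_lines])

for snippet in sample_snippets("/path/to/your/repo"):
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Write a realistic coding task inspired by this internal code, then solve it:\n\n{snippet}",
        }],
    )
    print(json.dumps({"seed": snippet, "pair": resp.choices[0].message.content}))
```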
### Step 3: Fine-tune

```bash
# Edit train_sft.py to set your dataset and model repo IDs, then:

# Option A: Run on HF Jobs (recommended for A100/H100 hardware)
# Use the hf_jobs API or CLI

# Option B: Run locally with a GPU
pip install trl peft transformers datasets trackio accelerate torch
python train_sft.py
```
Training configuration (from the literature; a minimal TRL sketch follows the list):
- Base: Qwen/Qwen2.5-Coder-7B-Instruct (Apache 2.0)
- Method: LoRA (r=32, alpha=64) on all linear layers
- LR: 1e-4 with cosine schedule, 10% warmup
- Epochs: 2
- Context: 8192 tokens
- Loss: Assistant-only (masks user/system/tool tokens)
- Hardware: 1x A100-80GB (or 2x A10G-24GB)
- Time: ~4-6 hours for 110K examples
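A sketch of how these settings map onto TRL + PEFT. Argument names follow recent TRL releases (e.g. `max_length`, `assistant_only_loss`) and may differ in older versions; the batch/accumulation values are illustrative:

```python
# Sketch only: mirrors the configuration bullets above. Dataset and output
# repo IDs are placeholders; see train_sft.py for the authoritative script.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

peft_config = LoraConfig(r=32, lora_alpha=64, target_modules="all-linear", task_type="CAUSAL_LM")

args = SFTConfig(
    output_dir="qwen25-coder-7b-code-toolcall",
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,               # 10% warmup
    num_train_epochs=2,
    max_length=8192,                # context window
    assistant_only_loss=True,       # mask user/system/tool tokens (recent TRL)
    per_device_train_batch_size=2,  # illustrative; tune to your GPU
    gradient_accumulation_steps=8,
    bf16=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-Coder-7B-Instruct",
    args=args,
    train_dataset=load_dataset("your-username/code-toolcall-sft-data", split="train"),
    peft_config=peft_config,
)
trainer.train()
```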
### Step 4: Index Your Codebase (RAG)

```python
from rag_pipeline import CodebaseIndexer

# Index your internal Python codebase
indexer = CodebaseIndexer(
    "/path/to/your/repo",
    embedding_model="jinaai/jina-embeddings-v2-base-code",  # or codesage-large-v2
)
retriever = indexer.index()

# Save for reuse
retriever.save_index("./my_index")

# Search!
results = retriever.search("authentication token validation", top_k=5)
for chunk, score in results:
    print(f"[{score:.3f}] {chunk.file_path}/{chunk.name}: {chunk.signature}")
```
RAG pipeline features:
- AST-aware chunking (cAST): Functions, methods, and classes stay intact, so chunks never cut mid-function (a simplified sketch follows this list)
- Dual embeddings: Code content + metadata strings (NL descriptions) for hybrid search
- AllianceCoder context assembly: API signatures prioritized over full code bodies
- In-context dependencies: Automatically extracts imports and class signatures from the current file
- Embedding models: Jina-Code-v2 (8K context, 161M) or CodeSage-v2 (best quality, 1.3B)
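To make the first bullet concrete, here is a simplified illustration of AST-boundary chunking using Python's standard `ast` module. This is a stand-in for the idea, not the actual rag_pipeline.py code:

```python
# Split a Python source file at function/class boundaries so that no
# chunk cuts through the middle of a definition (the cAST idea).
import ast

def ast_chunks(source: str):
    tree = ast.parse(source)
    lines = source.splitlines()
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            yield {
                "name": node.name,
                "signature": lines[node.lineno - 1].strip(),
                "code": "\n".join(lines[node.lineno - 1 : node.end_lineno]),
            }
```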
### Step 5: Run the Agent

```bash
# Interactive mode
python inference.py \
    --model your-username/qwen25-coder-7b-code-toolcall \
    --repo /path/to/your/codebase \
    --index-dir ./my_index

# Single query
python inference.py \
    --model your-username/qwen25-coder-7b-code-toolcall \
    --repo /path/to/your/codebase \
    --query "Add pagination to the product search endpoint"
```
The agent uses a ReAct loop (sketched below):
- Pre-fetches relevant code via RAG
- Sends the query plus retrieved context to the LLM
- If the LLM calls tools, executes them and feeds the results back
- Repeats until the LLM gives a final answer (max 10 turns)
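A schematic of that loop; the `llm.chat`/tool-call helpers and `format_prompt` are illustrative names, not the actual inference.py API:

```python
# Illustrative ReAct loop; inference.py implements the real version.
MAX_TURNS = 10

def agent_loop(llm, tools, retriever, query):
    context = retriever.search(query, top_k=5)   # 1. pre-fetch code via RAG
    messages = [{"role": "user", "content": format_prompt(query, context)}]  # hypothetical helper
    for _ in range(MAX_TURNS):
        reply = llm.chat(messages, tools=tools)  # 2. query the LLM
        if not reply.tool_calls:
            return reply.content                 # 4. final answer, stop
        messages.append(reply.as_message())
        for call in reply.tool_calls:            # 3. execute each tool call
            result = tools[call.name](**call.arguments)
            messages.append({"role": "tool", "content": str(result)})
    return "Stopped: reached the 10-turn limit."
```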
## Recommended Embedding Models

| Model | Size | Context | Best For | HF Link |
|---|---|---|---|---|
| codesage/codesage-large-v2 | 1.3B | 2048 tok | Best quality (NL→Code 69.4) | Link |
| jinaai/jina-embeddings-v2-base-code | 161M | 8192 tok | Long files, 30 languages | Link |
| codesage/codesage-small-v2 | 130M | 2048 tok | Fast, lightweight | Link |
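As a quick sanity check outside the pipeline, an embedding model can be loaded directly with sentence-transformers (`trust_remote_code` is needed because the Jina model ships custom modeling code):

```python
# Smoke test of an embedding model, independent of rag_pipeline.py.
# Assumes sentence-transformers is installed (see Requirements below).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-code",
    trust_remote_code=True,
)
vecs = model.encode([
    "def validate_token(token: str) -> bool: ...",
    "authentication token validation",
])
print(vecs.shape)  # e.g. (2, 768)
```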
## Advanced: Full Fine-Tuning (FFT)

For maximum performance, skip LoRA and do full fine-tuning:

```python
# In train_sft.py, remove peft_config and adjust:
LEARNING_RATE = 2e-5  # 10x lower than the LoRA LR
BATCH_SIZE = 1        # Lower to fit in memory
GRAD_ACCUM = 16       # Keep effective batch size = 16

# Hardware: 2x A100-80GB minimum for 7B FFT
```
Per Astraios: FFT slightly outperforms LoRA at 7B scale, but LoRA is within 1% and 30x more parameter-efficient.
## Advanced: GRPO Reinforcement Learning (Stage 2)
After SFT, you can further improve the model with GRPO using execution-based rewards:
```python
from trl import GRPOConfig, GRPOTrainer

# Reward function: does the generated code run without raising?
# WARNING: exec() on model output is dangerous; sandbox this properly
# (subprocess with resource limits, a container, etc.) before real use.
def reward_fn(completions, **kwargs):
    rewards = []
    for code in completions:
        try:
            exec(code, {})
            rewards.append(1.0)
        except Exception:
            rewards.append(0.0)
    return rewards

# Train with GRPO on top of the SFT checkpoint
config = GRPOConfig(
    learning_rate=1e-6,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    num_generations=4,  # completions sampled per prompt; must divide the generation batch
)
trainer = GRPOTrainer(
    model="your-username/qwen25-coder-7b-code-toolcall",
    reward_funcs=reward_fn,
    args=config,
    train_dataset=prompt_dataset,  # a datasets.Dataset of prompts (not shown here)
)
trainer.train()
```
## File Structure

```
├── prepare_data.py   # Dataset merging & formatting
├── train_sft.py      # SFT training script (TRL + LoRA)
├── rag_pipeline.py   # AST-aware indexing & retrieval
├── inference.py      # ReAct agent with tool calling
└── README.md         # This file
```
## Requirements

```
transformers>=4.45.0
trl>=1.0.0
peft>=0.12.0
datasets>=3.0.0
accelerate>=1.0.0
trackio>=0.2.0
torch>=2.0.0
sentence-transformers>=3.0.0  # For RAG embeddings
numpy
scikit-learn  # TF-IDF fallback
```
## Citation

If you use this toolkit, please cite the underlying research:

```bibtex
@article{qwen2.5coder,
  title={Qwen2.5-Coder Technical Report},
  author={Hui, Binyuan and others},
  journal={arXiv:2409.12186},
  year={2024}
}

@article{toolace,
  title={ToolACE: Winning the Points of LLM Function Calling},
  author={Liu, Weiwen and others},
  journal={arXiv:2409.00920},
  year={2024}
}

@article{gorilla,
  title={Gorilla: Large Language Model Connected with Massive APIs},
  author={Patil, Shishir G. and others},
  journal={arXiv:2305.15334},
  year={2023}
}
```