
πŸ› οΈ Code LLM Toolkit: Fine-tune + RAG + Tool-Calling for Internal Codebases

A complete toolkit for building a Python code generation LLM that can search your internal codebase via RAG, call tools, and reason through multi-step tasks.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     Code LLM System                             │
│                                                                 │
│  ┌──────────────┐   ┌───────────────┐   ┌─────────────────────┐ │
│  │ Fine-tuned   │   │ RAG Pipeline  │   │ Tool Executor       │ │
│  │ Qwen2.5-     │◄──│ (AST-aware    │   │ - search_codebase   │ │
│  │ Coder-7B     │   │  chunking +   │   │ - execute_python    │ │
│  │ + LoRA       │   │  embeddings)  │   │ - read_file         │ │
│  └──────┬───────┘   └───────────────┘   │ - run_tests         │ │
│         │                               └──────────┬──────────┘ │
│         │            ReAct Agent Loop              │            │
│         └──────────────────┬───────────────────────┘            │
│                            │                                    │
│                     ┌──────▼──────┐                             │
│                     │  Response   │                             │
│                     └─────────────┘                             │
└─────────────────────────────────────────────────────────────────┘

Research Foundation

Every component is grounded in published research with verified results:

| Component | Paper | Key Result |
|---|---|---|
| Base Model | Qwen2.5-Coder | HumanEval 88.4% (7B), SOTA open-source |
| Tool-Calling Data | ToolACE | Beats GPT-4-turbo on BFCL benchmark |
| Multi-turn Agent Data | APIGen-MT | 78.19% BFCL v3 (#1, beats o1/GPT-4o) |
| Code SFT Data | Magicoder | HumanEval 70.7% from 7B with 185K samples |
| RAG Chunking | cAST | +5.6pp over fixed-size on RepoEval |
| RAG Strategy | AllianceCoder | API signatures > similar code (+20%) |
| Retriever-Aware Training | Gorilla | Outperforms GPT-4 on API accuracy |
| Code Embeddings | CodeSage-v2 | Best open code embedding model |
| LoRA for Code | Astraios | LoRA matches FFT at ≥16B scale |

Quick Start

Step 1: Prepare Training Data

prepare_data.py merges four verified datasets (ToolACE + APIGen-MT + Magicoder + CodeAct) into a unified ChatML format:

pip install datasets

# Test with small sample first
python prepare_data.py --max_per_source 100 --dry_run

# Full run: pushes the merged dataset to the Hub
python prepare_data.py --output_repo your-username/code-toolcall-sft-data

Dataset composition (~110K examples):

| Source | Examples | Purpose |
|---|---|---|
| Team-ACE/ToolACE | 26K | Tool-calling (single-turn) |
| Salesforce/APIGen-MT-5k | 5K | Multi-turn agentic tool use |
| Magicoder-OSS-Instruct-75K | ~25K (Python) | Python code generation |
| xingyaoww/code-act | 7K | Code-as-action (tools via Python) |
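A sketch of one merged record in the unified ChatML-style format; the field names and the `<tool_call>` wrapper here are illustrative, not prepare_data.py's exact schema:

```python
import json

# One tool-calling example as ChatML-style messages (illustrative schema).
record = {
    "source": "Team-ACE/ToolACE",
    "messages": [
        {"role": "system", "content": "You are a coding assistant with tool access."},
        {"role": "user", "content": "List all Python files in src/."},
        {"role": "assistant",
         "content": '<tool_call>{"name": "search_codebase", '
                    '"arguments": {"query": "*.py"}}</tool_call>'},
    ],
}

# Records from every source share the same keys, so they merge cleanly.
print(json.dumps(record, indent=2))
```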

Step 2: Add Your Internal Codebase Data

This is the most impactful step. Use the Gorilla/Magicoder pattern:

  1. OSS-Instruct on your code: Sample random snippets from your internal repo, then use an LLM (GPT-4o, Claude) to generate instruction-solution pairs seeded from that code
  2. Retriever-aware examples: Include retrieved code context in training prompts so the model learns to use RAG at inference time
  3. Internal API documentation: Convert your docstrings/README into Q&A pairs
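Step 1 above can be sketched as follows; the prompt wording is an assumption (not Magicoder's exact template), and sending each prompt to GPT-4o/Claude is left to your own client:

```python
import random

def make_oss_instruct_prompt(snippet: str) -> str:
    """Build a seed prompt asking an LLM to invent an instruction-solution
    pair inspired by (not copying) an internal code snippet."""
    return (
        "Here is a snippet from our internal codebase:\n\n"
        f"```python\n{snippet}\n```\n\n"
        "Write one realistic programming task this snippet could inspire, "
        "then a complete solution using our internal APIs. "
        "Return JSON with keys 'instruction' and 'solution'."
    )

# Sample random snippets from the repo, build one seed prompt per snippet.
snippets = ["def validate_token(token): ...", "class ProductSearch: ..."]
prompts = [make_oss_instruct_prompt(s) for s in random.sample(snippets, k=2)]
```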

See prepare_data.py for the format; add your examples as additional sources.

Step 3: Fine-tune

# Edit train_sft.py to set your dataset and model repo IDs, then:

# Option A: Run on HF Jobs (recommended for A100/H100 hardware)
# Use the hf_jobs API or CLI

# Option B: Run locally with GPU
pip install trl peft transformers datasets trackio accelerate torch
python train_sft.py

Training configuration (from literature):

  • Base: Qwen/Qwen2.5-Coder-7B-Instruct (Apache 2.0)
  • Method: LoRA (r=32, alpha=64) on all linear layers
  • LR: 1e-4 with cosine schedule, 10% warmup
  • Epochs: 2
  • Context: 8192 tokens
  • Loss: Assistant-only (masks user/system/tool tokens)
  • Hardware: 1x A100-80GB (or 2x A10G-24GB)
  • Time: ~4-6 hours for 110K examples

Step 4: Index Your Codebase (RAG)

from rag_pipeline import CodebaseIndexer

# Index your internal Python codebase
indexer = CodebaseIndexer(
    "/path/to/your/repo",
    embedding_model="jinaai/jina-embeddings-v2-base-code"  # or codesage-large-v2
)
retriever = indexer.index()

# Save for reuse
retriever.save_index("./my_index")

# Search!
results = retriever.search("authentication token validation", top_k=5)
for chunk, score in results:
    print(f"[{score:.3f}] {chunk.file_path}/{chunk.name}: {chunk.signature}")

RAG pipeline features:

  • AST-aware chunking (cAST): Functions, methods, classes stay intact β€” no mid-function cuts
  • Dual embeddings: Code content + metadata strings (NL descriptions) for hybrid search
  • AllianceCoder context assembly: API signatures prioritized over full code bodies
  • In-context dependencies: Automatically extracts imports and class signatures from the current file
  • Embedding models: Jina-Code-v2 (8K context, 161M) or CodeSage-v2 (best quality, 1.3B)

Step 5: Run the Agent

# Interactive mode
python inference.py \
  --model your-username/qwen25-coder-7b-code-toolcall \
  --repo /path/to/your/codebase \
  --index-dir ./my_index

# Single query
python inference.py \
  --model your-username/qwen25-coder-7b-code-toolcall \
  --repo /path/to/your/codebase \
  --query "Add pagination to the product search endpoint"

The agent uses a ReAct loop:

  1. Pre-fetches relevant code via RAG
  2. Sends query + context to the LLM
  3. If the LLM calls tools, executes them and feeds the results back
  4. Repeats until the LLM gives a final answer (max 10 turns)
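The four steps above can be sketched as a single function; `llm` and `run_tool` are placeholders for the model call and tool dispatch inside inference.py, and the reply dict shape is an assumption:

```python
def react_loop(query, retriever, llm, run_tool, max_turns=10):
    # 1. Pre-fetch relevant code via RAG and seed the conversation with it.
    context = retriever.search(query, top_k=5)
    messages = [
        {"role": "system", "content": f"Relevant code:\n{context}"},
        {"role": "user", "content": query},
    ]
    # 2-4. Alternate model calls and tool executions until a final answer.
    for _ in range(max_turns):
        reply = llm(messages)
        messages.append({"role": "assistant", "content": reply["content"]})
        if not reply.get("tool_calls"):
            return reply["content"]  # final answer, loop ends
        for call in reply["tool_calls"]:
            result = run_tool(call["name"], call["arguments"])
            messages.append({"role": "tool", "content": str(result)})
    return "Stopped after max_turns without a final answer."
```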

Recommended Embedding Models

| Model | Size | Context | Best For |
|---|---|---|---|
| codesage/codesage-large-v2 | 1.3B | 2048 tok | Best quality (NL→Code 69.4) |
| jinaai/jina-embeddings-v2-base-code | 161M | 8192 tok | Long files, 30 languages |
| codesage/codesage-small-v2 | 130M | 2048 tok | Fast, lightweight |

Advanced: Full Fine-Tuning (FFT)

For maximum performance, skip LoRA and do full fine-tuning:

# In train_sft.py, remove peft_config and adjust:
LEARNING_RATE = 2e-5    # 5x lower than LoRA
BATCH_SIZE = 1          # Lower to fit in memory
GRAD_ACCUM = 16         # Keep effective batch = 16
# Hardware: 2x A100-80GB minimum for 7B FFT

Per Astraios: FFT slightly outperforms LoRA at 7B scale, but LoRA is within 1% and 30x more parameter-efficient.

Advanced: GRPO Reinforcement Learning (Stage 2)

After SFT, you can further improve the model with GRPO using execution-based rewards:

from trl import GRPOConfig, GRPOTrainer

# Reward function: does the generated code execute without raising?
def reward_fn(completions, **kwargs):
    rewards = []
    for code in completions:
        try:
            exec(code, {})  # Sandbox this properly (subprocess/container)!
            rewards.append(1.0)
        except Exception:
            rewards.append(0.0)
    return rewards

# Train with GRPO
config = GRPOConfig(
    learning_rate=1e-6,
    num_train_epochs=1,
    per_device_train_batch_size=4,
)
trainer = GRPOTrainer(
    model="your-username/qwen25-coder-7b-code-toolcall",
    reward_funcs=reward_fn,
    args=config,
    train_dataset=prompt_dataset,  # your dataset of prompts to sample from
)
trainer.train()
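The pass/fail reward is worth sanity-checking in isolation before wiring it into the trainer; a standalone copy of the execution-based logic:

```python
def execution_reward(completions, **kwargs):
    """1.0 if the completion executes without raising, else 0.0."""
    rewards = []
    for code in completions:
        try:
            exec(code, {})  # use a real sandbox in practice
            rewards.append(1.0)
        except Exception:
            rewards.append(0.0)
    return rewards

print(execution_reward(["x = 1 + 1", "raise ValueError()", "1/0"]))
# → [1.0, 0.0, 0.0]
```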

File Structure

├── prepare_data.py     # Dataset merging & formatting
├── train_sft.py        # SFT training script (TRL + LoRA)
├── rag_pipeline.py     # AST-aware indexing & retrieval
├── inference.py        # ReAct agent with tool calling
└── README.md           # This file

Requirements

transformers>=4.45.0
trl>=1.0.0
peft>=0.12.0
datasets>=3.0.0
accelerate>=1.0.0
trackio>=0.2.0
torch>=2.0.0
sentence-transformers>=3.0.0  # For RAG embeddings
numpy
scikit-learn  # TF-IDF fallback

Citation

If you use this toolkit, please cite the underlying research:

@article{qwen2.5coder,
  title={Qwen2.5-Coder Technical Report},
  author={Hui, Binyuan and others},
  journal={arXiv:2409.12186},
  year={2024}
}
@article{toolace,
  title={ToolACE: Winning the Points of LLM Function Calling},
  author={Liu, Weiwen and others},
  journal={arXiv:2409.00920},
  year={2024}
}
@article{gorilla,
  title={Gorilla: Large Language Model Connected with Massive APIs},
  author={Patil, Shishir G. and others},
  journal={arXiv:2305.15334},
  year={2023}
}