# 🛠️ Code LLM Toolkit: Fine-tune + RAG + Tool-Calling for Internal Codebases
A complete toolkit for building a Python code generation LLM that can search your internal codebase via RAG, call tools, and reason through multi-step tasks.
## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         Code LLM System                         │
│                                                                 │
│  ┌──────────────┐   ┌──────────────┐   ┌─────────────────────┐  │
│  │  Fine-tuned  │   │ RAG Pipeline │   │   Tool Executor     │  │
│  │  Qwen2.5-    │◄──│ (AST-aware   │   │  - search_codebase  │  │
│  │  Coder-7B    │   │  chunking +  │   │  - execute_python   │  │
│  │  + LoRA      │   │  embeddings) │   │  - read_file        │  │
│  └──────┬───────┘   └──────────────┘   │  - run_tests        │  │
│         │                              └──────────┬──────────┘  │
│         │          ReAct Agent Loop               │             │
│         └──────────────────┬──────────────────────┘             │
│                            │                                    │
│                     ┌──────▼──────┐                             │
│                     │  Response   │                             │
│                     └─────────────┘                             │
└─────────────────────────────────────────────────────────────────┘
```
## Research Foundation
Every component is grounded in published research with verified results:
| Component | Paper | Key Result |
|---|---|---|
| Base Model | Qwen2.5-Coder | HumanEval 88.4% (7B), SOTA open-source |
| Tool-Calling Data | ToolACE | Beats GPT-4-turbo on BFCL benchmark |
| Multi-turn Agent Data | APIGen-MT | 78.19% BFCL v3 (#1, beats o1/GPT-4o) |
| Code SFT Data | Magicoder | HumanEval 70.7% from 7B with 185K samples |
| RAG Chunking | cAST | +5.6pp over fixed-size on RepoEval |
| RAG Strategy | AllianceCoder | API signatures > similar code (+20%) |
| Retriever-Aware Training | Gorilla | Outperforms GPT-4 on API accuracy |
| Code Embeddings | CodeSage-v2 | Best open code embedding model |
| LoRA for Code | Astraios | LoRA matches FFT at ≥16B scale |
## Quick Start

### Step 1: Prepare Training Data

prepare_data.py merges four verified datasets (ToolACE + APIGen-MT + Magicoder + CodeAct) into a unified ChatML format:
```bash
pip install datasets

# Test with a small sample first
python prepare_data.py --max_per_source 100 --dry_run

# Full run: pushes the merged dataset to the Hub
python prepare_data.py --output_repo your-username/code-toolcall-sft-data
```
Dataset composition (~110K examples):
| Source | Examples | Purpose |
|---|---|---|
| Team-ACE/ToolACE | 26K | Tool-calling (single-turn) |
| Salesforce/APIGen-MT-5k | 5K | Multi-turn agentic tool use |
| Magicoder-OSS-Instruct-75K | ~25K (Python) | Python code generation |
| xingyaoww/code-act | 7K | Code-as-action (tools via Python) |
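For orientation, each merged example follows a chat-style shape roughly like the one below; the field names and tool-call encoding here are illustrative, and prepare_data.py defines the authoritative schema:

```python
# Hypothetical merged training example; exact keys and the tool-call
# encoding are defined in prepare_data.py, not here.
example = {
    "messages": [
        {"role": "system", "content": "You are a coding assistant with access to tools."},
        {"role": "user", "content": "Where is the auth token validated?"},
        {"role": "assistant", "content": '<tool_call>{"name": "search_codebase", "arguments": {"query": "token validation"}}</tool_call>'},
        {"role": "tool", "content": "[auth/tokens.py] def validate_token(token: str) -> bool: ..."},
        {"role": "assistant", "content": "Token validation happens in auth/tokens.py, in validate_token()."},
    ]
}
```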
### Step 2: Add Your Internal Codebase Data

This is the most impactful step. Use the Gorilla/Magicoder pattern:
- OSS-Instruct on your code: Sample random snippets from your internal repo, then use an LLM (GPT-4o, Claude) to generate instruction-solution pairs seeded from that code (see the sketch after this list)
- Retriever-aware examples: Include retrieved code context in training prompts so the model learns to use RAG at inference time
- Internal API documentation: Convert your docstrings/README into Q&A pairs

See prepare_data.py for the format; add your examples as additional sources.
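A minimal sketch of the OSS-Instruct step, assuming the openai client; `sample_snippets` is a hypothetical helper, and any strong LLM can stand in for gpt-4o:

```python
# Sketch of OSS-Instruct-style data generation seeded from internal code.
import json
import pathlib
import random

from openai import OpenAI

client = OpenAI()

def sample_snippets(repo_path: str, n: int = 100, max_lines: int = 40):
    """Pick random contiguous line windows from Python files in the repo."""
    files = list(pathlib.Path(repo_path).rglob("*.py"))
    for _ in range(n):
        lines = random.choice(files).read_text().splitlines()
        if not lines:
            continue
        start = random.randrange(len(lines))
        yield "\n".join(lines[start : start + max_lines])

for snippet in sample_snippets("/path/to/your/repo"):
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Write a realistic coding task inspired by this internal code, then solve it:\n\n{snippet}",
        }],
    )
    print(json.dumps({"seed": snippet, "pair": resp.choices[0].message.content}))
```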
### Step 3: Fine-tune

```bash
# Edit train_sft.py to set your dataset and model repo IDs, then:

# Option A: Run on HF Jobs (recommended for A100/H100 hardware)
# Use the hf_jobs API or CLI

# Option B: Run locally with a GPU
pip install trl peft transformers datasets trackio accelerate torch
python train_sft.py
```
Training configuration (from the literature; a minimal TRL sketch follows the list):
- Base: Qwen/Qwen2.5-Coder-7B-Instruct (Apache 2.0)
- Method: LoRA (r=32, alpha=64) on all linear layers
- LR: 1e-4 with cosine schedule, 10% warmup
- Epochs: 2
- Context: 8192 tokens
- Loss: Assistant-only (masks user/system/tool tokens)
- Hardware: 1x A100-80GB (or 2x A10G-24GB)
- Time: ~4-6 hours for 110K examples
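A sketch of how these settings map onto TRL + PEFT. Argument names follow recent TRL releases (e.g. `max_length`, `assistant_only_loss`) and may differ in older versions; the batch/accumulation values are illustrative:

```python
# Sketch only: mirrors the configuration bullets above. Dataset and output
# repo IDs are placeholders; see train_sft.py for the authoritative script.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

peft_config = LoraConfig(r=32, lora_alpha=64, target_modules="all-linear", task_type="CAUSAL_LM")

args = SFTConfig(
    output_dir="qwen25-coder-7b-code-toolcall",
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,               # 10% warmup
    num_train_epochs=2,
    max_length=8192,                # context window
    assistant_only_loss=True,       # mask user/system/tool tokens (recent TRL)
    per_device_train_batch_size=2,  # illustrative; tune to your GPU
    gradient_accumulation_steps=8,
    bf16=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-Coder-7B-Instruct",
    args=args,
    train_dataset=load_dataset("your-username/code-toolcall-sft-data", split="train"),
    peft_config=peft_config,
)
trainer.train()
```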
### Step 4: Index Your Codebase (RAG)

```python
from rag_pipeline import CodebaseIndexer

# Index your internal Python codebase
indexer = CodebaseIndexer(
    "/path/to/your/repo",
    embedding_model="jinaai/jina-embeddings-v2-base-code",  # or codesage-large-v2
)
retriever = indexer.index()

# Save for reuse
retriever.save_index("./my_index")

# Search!
results = retriever.search("authentication token validation", top_k=5)
for chunk, score in results:
    print(f"[{score:.3f}] {chunk.file_path}/{chunk.name}: {chunk.signature}")
```
RAG pipeline features:
- AST-aware chunking (cAST): Functions, methods, and classes stay intact, so chunks never cut mid-function (a simplified sketch follows this list)
- Dual embeddings: Code content + metadata strings (NL descriptions) for hybrid search
- AllianceCoder context assembly: API signatures prioritized over full code bodies
- In-context dependencies: Automatically extracts imports and class signatures from the current file
- Embedding models: Jina-Code-v2 (8K context, 161M) or CodeSage-v2 (best quality, 1.3B)
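To make the first bullet concrete, here is a simplified illustration of AST-boundary chunking using Python's standard `ast` module. This is a stand-in for the idea, not the actual rag_pipeline.py code:

```python
# Split a Python source file at function/class boundaries so that no
# chunk cuts through the middle of a definition (the cAST idea).
import ast

def ast_chunks(source: str):
    tree = ast.parse(source)
    lines = source.splitlines()
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            yield {
                "name": node.name,
                "signature": lines[node.lineno - 1].strip(),
                "code": "\n".join(lines[node.lineno - 1 : node.end_lineno]),
            }
```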
### Step 5: Run the Agent

```bash
# Interactive mode
python inference.py \
    --model your-username/qwen25-coder-7b-code-toolcall \
    --repo /path/to/your/codebase \
    --index-dir ./my_index

# Single query
python inference.py \
    --model your-username/qwen25-coder-7b-code-toolcall \
    --repo /path/to/your/codebase \
    --query "Add pagination to the product search endpoint"
```
The agent uses a ReAct loop (sketched below):
- Pre-fetches relevant code via RAG
- Sends the query plus retrieved context to the LLM
- If the LLM calls tools, executes them and feeds the results back
- Repeats until the LLM gives a final answer (max 10 turns)
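A schematic of that loop; the `llm.chat`/tool-call helpers and `format_prompt` are illustrative names, not the actual inference.py API:

```python
# Illustrative ReAct loop; inference.py implements the real version.
MAX_TURNS = 10

def agent_loop(llm, tools, retriever, query):
    context = retriever.search(query, top_k=5)   # 1. pre-fetch code via RAG
    messages = [{"role": "user", "content": format_prompt(query, context)}]  # hypothetical helper
    for _ in range(MAX_TURNS):
        reply = llm.chat(messages, tools=tools)  # 2. query the LLM
        if not reply.tool_calls:
            return reply.content                 # 4. final answer, stop
        messages.append(reply.as_message())
        for call in reply.tool_calls:            # 3. execute each tool call
            result = tools[call.name](**call.arguments)
            messages.append({"role": "tool", "content": str(result)})
    return "Stopped: reached the 10-turn limit."
```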
## Recommended Embedding Models

| Model | Size | Context | Best For | HF Link |
|---|---|---|---|---|
| codesage/codesage-large-v2 | 1.3B | 2048 tok | Best quality (NL→Code 69.4) | Link |
| jinaai/jina-embeddings-v2-base-code | 161M | 8192 tok | Long files, 30 languages | Link |
| codesage/codesage-small-v2 | 130M | 2048 tok | Fast, lightweight | Link |
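As a quick sanity check outside the pipeline, an embedding model can be loaded directly with sentence-transformers (`trust_remote_code` is needed because the Jina model ships custom modeling code):

```python
# Smoke test of an embedding model, independent of rag_pipeline.py.
# Assumes sentence-transformers is installed (see Requirements below).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-code",
    trust_remote_code=True,
)
vecs = model.encode([
    "def validate_token(token: str) -> bool: ...",
    "authentication token validation",
])
print(vecs.shape)  # e.g. (2, 768)
```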
## Advanced: Full Fine-Tuning (FFT)

For maximum performance, skip LoRA and do full fine-tuning:

```python
# In train_sft.py, remove peft_config and adjust:
LEARNING_RATE = 2e-5  # 10x lower than the LoRA LR
BATCH_SIZE = 1        # Lower to fit in memory
GRAD_ACCUM = 16       # Keep effective batch size = 16

# Hardware: 2x A100-80GB minimum for 7B FFT
```
Per Astraios: FFT slightly outperforms LoRA at 7B scale, but LoRA is within 1% and 30x more parameter-efficient.
## Advanced: GRPO Reinforcement Learning (Stage 2)
After SFT, you can further improve the model with GRPO using execution-based rewards:
```python
from trl import GRPOConfig, GRPOTrainer

# Reward function: does the generated code run without raising?
# WARNING: exec() on model output is dangerous; sandbox this properly
# (subprocess with resource limits, a container, etc.) before real use.
def reward_fn(completions, **kwargs):
    rewards = []
    for code in completions:
        try:
            exec(code, {})
            rewards.append(1.0)
        except Exception:
            rewards.append(0.0)
    return rewards

# Train with GRPO on top of the SFT checkpoint
config = GRPOConfig(
    learning_rate=1e-6,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    num_generations=4,  # completions sampled per prompt; must divide the generation batch
)
trainer = GRPOTrainer(
    model="your-username/qwen25-coder-7b-code-toolcall",
    reward_funcs=reward_fn,
    args=config,
    train_dataset=prompt_dataset,  # a datasets.Dataset of prompts (not shown here)
)
trainer.train()
```
## File Structure

```
├── prepare_data.py   # Dataset merging & formatting
├── train_sft.py      # SFT training script (TRL + LoRA)
├── rag_pipeline.py   # AST-aware indexing & retrieval
├── inference.py      # ReAct agent with tool calling
└── README.md         # This file
```
## Requirements

```
transformers>=4.45.0
trl>=1.0.0
peft>=0.12.0
datasets>=3.0.0
accelerate>=1.0.0
trackio>=0.2.0
torch>=2.0.0
sentence-transformers>=3.0.0  # For RAG embeddings
numpy
scikit-learn  # TF-IDF fallback
```
## Citation

If you use this toolkit, please cite the underlying research:

```bibtex
@article{qwen2.5coder,
  title={Qwen2.5-Coder Technical Report},
  author={Hui, Binyuan and others},
  journal={arXiv:2409.12186},
  year={2024}
}

@article{toolace,
  title={ToolACE: Winning the Points of LLM Function Calling},
  author={Liu, Weiwen and others},
  journal={arXiv:2409.00920},
  year={2024}
}

@article{gorilla,
  title={Gorilla: Large Language Model Connected with Massive APIs},
  author={Patil, Shishir G. and others},
  journal={arXiv:2305.15334},
  year={2023}
}
```