In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions
Paper · 2604.22817 · Published
Automatic song lyric acquisition and synchronization.
Produces word-level synchronized lyrics with roughly 5-10 ms precision from an ordinary audio mix.
```
┌───────────────────────────────────────────────────────────────────────────┐
│                            lyric-sync Pipeline                            │
├───────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│   ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐         │
│   │  Input   │     │  Demucs  │     │ WhisperX │     │  Output  │         │
│   │  Audio   │────▶│  Vocals  │────▶│Transcribe│────▶│  Synced  │         │
│   │  (mix)   │     │Separation│     │ + Timing │     │  Lyrics  │         │
│   └──────────┘     └──────────┘     └──────────┘     └──────────┘         │
│        │                                 ▲                ▲               │
│        │                                 │                │               │
│        ▼                                 │                │               │
│   ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐         │
│   │AcoustID  │────▶│  Fetch   │     │Align ASR │     │  Refine  │         │
│   │ Identify │     │Reference │────▶│to Lyrics │────▶│ Onsets/  │         │
│   │  Song    │     │  Lyrics  │     │(transfer │     │ Offsets  │         │
│   └──────────┘     └──────────┘     │ timings) │     └──────────┘         │
│        │                            └──────────┘                          │
│        ▼ (fallback)                                                       │
│   ┌──────────┐                                                            │
│   │Transcript│                                                            │
│   │  Search  │                                                            │
│   └──────────┘                                                            │
│                                                                           │
└───────────────────────────────────────────────────────────────────────────┘
```
- Separation: `htdemucs_ft` (best available: ~9.2 dB SDR on MUSDB18-HQ)
- Alignment: `difflib.SequenceMatcher` (LCS-based global alignment)

```bash
# Core (separation + refinement)
pip install lyric-sync

# With WhisperX transcription (recommended)
pip install "lyric-sync[whisperx]"

# With song identification
pip install "lyric-sync[identify]"

# Everything
pip install "lyric-sync[all]"

# System dependency: chromaprint (for AcoustID fingerprinting)
# Ubuntu/Debian:
sudo apt-get install chromaprint-tools ffmpeg
# macOS:
brew install chromaprint ffmpeg
```
```bash
# Full automatic (identify + fetch lyrics + sync)
lyric-sync song.mp3 --acoustid-key YOUR_KEY -v

# With known metadata (faster, skips fingerprinting)
lyric-sync song.mp3 --artist "Radiohead" --title "Creep" -o synced.lrc

# JSON output for apps
lyric-sync song.mp3 --artist "Queen" --title "Bohemian Rhapsody" --format json

# ASS karaoke subtitles
lyric-sync song.mp3 --artist "Artist" --title "Song" --format ass -o karaoke.ass

# CPU-only processing (slower but no GPU needed)
lyric-sync song.mp3 --device cpu --artist "Artist" --title "Song"
```
```python
from lyric_sync import LyricSyncPipeline

# Initialize
pipeline = LyricSyncPipeline(
    acoustid_key="YOUR_ACOUSTID_KEY",  # optional
    device="cuda",                     # or "cpu"
)

# Full auto
result = pipeline.sync("song.mp3")

# With known metadata
result = pipeline.sync(
    "song.mp3",
    artist="Radiohead",
    title="Creep",
)

# Access results
print(result.song)           # SongIdentification(title=..., artist=...)
print(result.quality_score)  # 0.85 (0-1 quality estimate)

# Export
print(result.to_lrc())   # Enhanced LRC with word-level timestamps
print(result.to_json())  # JSON array of {word, start, end, confidence}
print(result.to_srt())   # SRT subtitles
print(result.to_ass())   # ASS karaoke with \k tags
```
```python
from lyric_sync.separate import VocalSeparator
from lyric_sync.transcribe import transcribe_vocals
from lyric_sync.lyrics import fetch_lyrics
from lyric_sync.align import align_words
from lyric_sync.refine import refine_timings

# 1. Separate vocals
separator = VocalSeparator(device="cuda")
vocals_16k, sr = separator.extract_vocals("song.mp3", target_sr=16000)
vocals_full, sr_full = separator.extract_vocals_full_rate("song.mp3")

# 2. Transcribe
transcript = transcribe_vocals(vocals_16k, sr=sr, backend="whisperx")

# 3. Fetch lyrics
lyrics = fetch_lyrics(artist="Radiohead", title="Creep")

# 4. Align
aligned_words, stats = align_words(
    asr_words=transcript.words,
    ref_words=lyrics.words,
)

# 5. Refine
refined_words = refine_timings(vocals_full, sr_full, aligned_words)
```
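The package's exact refinement algorithm isn't spelled out here, but the idea behind step 5 can be illustrated with a simple energy-based sketch: snap each word's start time to the strongest local rise in short-time RMS energy of the vocal stem within a small search window. `refine_onsets_sketch` and its parameters are hypothetical names for illustration, not the `lyric_sync.refine` API.

```python
import numpy as np

def refine_onsets_sketch(vocals, sr, words, win_ms=20, search_ms=50):
    """Snap word start times to the nearest local energy rise.

    vocals: mono vocal stem as a float numpy array
    words:  list of dicts with "word", "start", "end" (seconds)
    Returns a new list with adjusted "start" values.
    """
    hop = int(sr * win_ms / 1000)
    # Short-time RMS energy per hop-sized frame.
    n_frames = max(1, len(vocals) // hop)
    frames = vocals[: n_frames * hop].reshape(n_frames, hop)
    energy = np.sqrt((frames ** 2).mean(axis=1))
    # Frame-to-frame energy increase: a crude onset-strength signal.
    rise = np.maximum(np.diff(energy, prepend=energy[0]), 0.0)

    radius = max(1, int(search_ms / win_ms))
    refined = []
    for w in words:
        f = int(w["start"] * sr / hop)
        lo = max(0, f - radius)
        hi = min(len(rise), f + radius + 1)
        # Move the start to the frame with the sharpest energy rise
        # inside the +/- search_ms window around the aligned estimate.
        best = lo + int(np.argmax(rise[lo:hi]))
        refined.append({**w, "start": best * hop / sr})
    return refined
```

With a 20 ms frame hop this caps the correction at the frame resolution; a production refiner would work at finer hops (or sample level) on the full-rate stem to reach the ~5-10 ms range.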
| Format | Description | Use Case |
|---|---|---|
| `lrc` (enhanced) | `[MM:SS.cc] <MM:SS.cc> word ...` | Music players with word-level sync |
| `lrc_standard` | `[MM:SS.cc] Line of text` | Standard music players |
| `json` | `[{"word": ..., "start": ..., "end": ...}]` | Apps, programmatic use |
| `srt` | Standard SRT subtitles | Video players |
| `ass` | ASS with `\kf` karaoke tags | Karaoke / video editing |
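To make the enhanced-LRC layout concrete, the snippet below renders one line of word-timing records in that format. `to_lrc_line` is an illustrative helper written for this example, not part of the package API (use `result.to_lrc()` for real output).

```python
def lrc_ts(seconds: float) -> str:
    """Format seconds as the MM:SS.cc body used in LRC timestamps."""
    m, s = divmod(seconds, 60.0)
    return f"{int(m):02d}:{s:05.2f}"

def to_lrc_line(words) -> str:
    """Render one lyric line of {word, start, end} dicts as enhanced LRC:
    a line-level [MM:SS.cc] tag followed by per-word <MM:SS.cc> tags."""
    line_start = words[0]["start"]
    body = " ".join(f"<{lrc_ts(w['start'])}> {w['word']}" for w in words)
    return f"[{lrc_ts(line_start)}] {body}"

words = [
    {"word": "Hello", "start": 12.34, "end": 12.70},
    {"word": "world", "start": 12.80, "end": 13.10},
]
print(to_lrc_line(words))
# [00:12.34] <00:12.34> Hello <00:12.80> world
```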
| Variable | Description |
|---|---|
| `ACOUSTID_API_KEY` | AcoustID API key (free, register at acoustid.org) |
| `GENIUS_TOKEN` | Genius API token (free, for plain lyrics fallback) |
| Component | GPU (CUDA) | CPU |
|---|---|---|
| Demucs (htdemucs_ft) | ~4-6 GB VRAM | ~8 GB RAM, slower |
| WhisperX (large-v2) | ~5-6 GB VRAM | ~8 GB RAM, much slower |
| Total | ~10-12 GB VRAM | ~16 GB RAM |
| Processing time (4min song) | ~30-60s | ~5-10 min |
| Backend | Quality (singing) | Speed | Dependencies |
|---|---|---|---|
| WhisperX (recommended) | Best (phoneme alignment) | Fast (batched) | `whisperx` |
| Whisper (pipeline) | Good (attention-based) | Fast | transformers |
| Granite Speech | Unknown (speech-trained) | Medium | transformers |
The core challenge: ASR makes errors on singing (WER ~15-25%), but we need timestamps on the correct lyrics. We solve this with sequence alignment: `equal` opcode blocks get a direct timestamp copy, while `replace` blocks are filled by linear interpolation. Alignment alone gives ~20-50 ms accuracy; refining onsets/offsets against the full-rate vocal stem then brings this to ~5-10 ms.
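The alignment step maps naturally onto `difflib.SequenceMatcher` opcodes. The sketch below is illustrative, not the package's exact implementation: it assumes word lists are already normalized (lowercased, punctuation stripped), copies timestamps through `equal` blocks, and spreads the covered ASR time span evenly across reference words in `replace`/`insert` blocks.

```python
from difflib import SequenceMatcher

def align_words_sketch(asr_words, ref_words):
    """Transfer ASR timings onto reference lyrics via LCS alignment.

    asr_words: list of (word, start, end) tuples from the recognizer
    ref_words: list of reference lyric words (the "correct" text)
    Returns a list of (ref_word, start, end) tuples.
    """
    asr_text = [w for w, _, _ in asr_words]
    sm = SequenceMatcher(a=asr_text, b=ref_words, autojunk=False)
    aligned = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            # Matched words: copy timestamps directly.
            for off in range(i2 - i1):
                w, s, e = asr_words[i1 + off]
                aligned.append((ref_words[j1 + off], s, e))
        elif tag in ("replace", "insert"):
            # ASR misheard (or missed) these reference words:
            # interpolate linearly across the covered time span.
            if i2 > i1:
                span_start = asr_words[i1][1]
                span_end = asr_words[i2 - 1][2]
            elif aligned:
                span_start = span_end = aligned[-1][2]
            else:
                span_start = span_end = 0.0
            n = j2 - j1
            step = (span_end - span_start) / n if n else 0.0
            for k in range(n):
                aligned.append((ref_words[j1 + k],
                                span_start + k * step,
                                span_start + (k + 1) * step))
        # tag == "delete": ASR word with no lyric counterpart -> dropped
    return aligned
```

For example, if the recognizer hears "wurld" where the lyrics say "world", the two words sit in a `replace` block and "world" inherits the timing of the misrecognized span.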
License: MIT
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.