Principles of Vocal Melody Extraction
Vocal melody extraction is generally achieved through two core stages:
Predominant-F0 Extraction: Identifies the most salient pitch (fundamental frequency) at each time instant in the mixture; in most musical contexts, this salient pitch corresponds to the melody line.
Singing Voice Detection: Determines which segments actually correspond to a singing voice, distinguishing them from instrumental sounds.
Traditional Signal Processing Methods
Earlier approaches used mathematical transforms and harmonic analysis:
Constant-Q Transform (CQT): Converts the audio signal onto a logarithmic frequency scale that aligns with musical pitch perception, commonly used to identify note partials (see the salience sketch after this list).
Sinusoidal modeling & sparse representation: Enhances tonal components while suppressing percussion for clearer melody contour identification.
Dynamic programming with melodic smoothness constraints: Smooths out pitch transitions to reflect human singing tendencies.
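As a concrete illustration of the salience idea (not tied to any specific method above), the sketch below computes a constant-Q spectrogram with librosa and takes the strongest bin in each frame as a rough predominant-pitch estimate; the file name and CQT parameters are assumptions.

```python
# Rough salience-based pitch sketch: strongest CQT bin per frame.
import librosa
import numpy as np

y, sr = librosa.load("song.wav", mono=True)          # illustrative input file

# Constant-Q spectrogram aligned with musical pitch (C2 upward, 6 octaves).
C = np.abs(librosa.cqt(y, sr=sr, fmin=librosa.note_to_hz("C2"),
                       n_bins=72, bins_per_octave=12))

# Per-frame "salience": the strongest bin, mapped back to Hz.
freqs = librosa.cqt_frequencies(n_bins=72, fmin=librosa.note_to_hz("C2"),
                                bins_per_octave=12)
f0_rough = freqs[C.argmax(axis=0)]
times = librosa.times_like(f0_rough, sr=sr)
```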
Modern Deep Learning Methods (2025)
Recent advances use neural architectures for superior accuracy:
Quadratic Fluctuation Equation (QFE) Model (2025): Uses iterative pre-emphasis filtering and amplitude modeling to extract the vocal melody precisely even in complex polyphony, outperforming CNN- and CRNN-based benchmarks.
Attention U-Net and Voice Activity Networks: These deep neural networks accurately track melodic lines and differentiate vocal sources from instrumental ones.
Graph modeling and harmonic-aware networks: Improve temporal coherence and handle overlapping frequencies in chords.
Practical AI Tools
If your goal is to extract the melody or isolate vocals for remixing or analysis, these modern AI tools can help:
| Tool | Description |
|---|---|
| LALAL.AI | AI-based stem separator that isolates vocals, drums, melody, and more with high fidelity. |
| PhonicMind | Studio-grade AI stem splitter that isolates vocal or instrumental tracks for melody analysis or a cappella generation. |
| VocalRemover.org | Free browser-based AI splitter that separates vocals efficiently from MP3/WAV files. |
| ReMusic.ai | AI vocal remover offering quick, high-accuracy extraction with real-time processing. |
| Melody.ml | Simple online platform powered by Spleeter AI to isolate vocal and instrumental stems. |
Workflow for Extracting Vocal Melody
1. Preprocess audio: Convert to mono or keep stereo, and normalize levels.
2. Separate sources: Use an AI stem splitter (e.g., LALAL.AI) to isolate the vocal track.
3. Pitch detection: Run the isolated vocal through pitch-tracking software such as Melodia, Essentia (e.g., its PitchYinFFT algorithm), or pYIN to obtain the melody line.
4. Post-processing: Smooth pitch contours and quantize to note values for MIDI output or visualization (a minimal sketch of steps 3 and 4 follows this list).
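A minimal sketch of steps 3 and 4, assuming the vocal stem has already been separated to a file such as vocals.wav and using librosa's pYIN implementation in place of Melodia or Essentia; the frame parameters are assumptions.

```python
# Steps 3-4: pitch detection on an isolated vocal, then smoothing/quantization.
import librosa
import numpy as np
from scipy.signal import medfilt

y, sr = librosa.load("vocals.wav", mono=True)         # illustrative separated stem

# Step 3: frame-wise F0 with voiced/unvoiced decisions (pYIN).
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, sr=sr,
    fmin=librosa.note_to_hz("C2"),
    fmax=librosa.note_to_hz("C6"))

# Step 4: median-smooth the contour and quantize voiced frames to MIDI notes.
f0_smooth = medfilt(np.nan_to_num(f0), kernel_size=5)
midi_notes = np.where(voiced_flag & (f0_smooth > 0),
                      np.round(librosa.hz_to_midi(np.maximum(f0_smooth, 1e-6))),
                      np.nan)
times = librosa.times_like(f0, sr=sr)
```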
Summary
To extract vocal melody from polyphonic audio:
For research or precision analysis, use models like the QFE deep learning model or Attention U-Net.
For practical remixing or melody isolation, use online tools like LALAL.AI, PhonicMind, or Melody.ml.
Combine source separation with pitch tracking for a clean, musically accurate melody output.
Best open-source tools for vocal melody extraction
1. Melodia + audio_to_midi_melodia (by Justin Salamon)
- GitHub: justinsalamon/audio_to_midi_melodia
- Approach: Implements the Melodia algorithm, a spectral salience-based method for predominant melody estimation.
- Extracts continuous pitch (F0) contour of the melody.
- Converts melody to MIDI for further analysis or music transcription.
- Uses the Vamp plugin interface from Python (a minimal call sketch follows this entry).
- Best for: Research, melody transcriptions, and creating symbolic music data from polyphonic recordings.
- Tech stack: Python + Vamp plugin + Librosa.
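For reference, a minimal sketch of calling the Melodia Vamp plugin from Python via the vamp module, which is essentially what audio_to_midi_melodia does internally; it assumes the Melodia plugin (mtg-melodia:melodia) is installed on the Vamp plugin path, and the input file name is illustrative.

```python
# Run the Melodia Vamp plugin on an audio file and collect the F0 contour.
import librosa
import vamp

audio, sr = librosa.load("song.mp3", sr=44100, mono=True)

# Melodia reports unvoiced frames as negative frequencies.
result = vamp.collect(audio, sr, "mtg-melodia:melodia")
hop, melody = result["vector"]          # frame hop (timestamp) and F0 array in Hz
melody_hz = [float(f) if f > 0 else 0.0 for f in melody]
```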
2. Spleeter (by Deezer)
- GitHub: deezer/spleeter
- Approach: Deep learning-based source separation, splitting audio into stems (vocals, bass, drums).
- Use for melody extraction: Once vocals are separated, run a pitch tracker on the vocal stem to obtain the melody (see the sketch after this entry).
- Fast inference with pretrained TensorFlow models.
- Separate into 2, 4, or 5 stems.
- High performance on CPU or GPU.
- Best for: Producers or researchers extracting clean vocal stems before melody tracking.
- Tech stack: TensorFlow (Python).
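A minimal sketch using Spleeter's documented Python API for 2-stem separation; the input and output paths are illustrative.

```python
# Separate a track into vocals + accompaniment with Spleeter's pretrained model.
from spleeter.separator import Separator

separator = Separator("spleeter:2stems")              # downloads weights on first use
separator.separate_to_file("song.mp3", "output/")
# Writes output/song/vocals.wav and output/song/accompaniment.wav,
# ready for downstream pitch tracking (Melodia, pYIN, etc.).
```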
3. Ultimate Vocal Remover (UVR5)
- GitHub: Anjok07/ultimatevocalremovergui
- Approach: Combines multiple open-source neural models (MDX-Net, Demucs, VR Arch) for high-fidelity vocal separation.
- GUI and CLI both available.
- Supports Windows, macOS, and Linux.
- Easy-to-use interface with export options.
- Best for: Extracting isolated vocals for further melody or pitch-contour analysis (a command-line sketch follows this entry).
- Tech stack: PyTorch, ONNX, Demucs models.
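UVR itself is driven through its GUI, so there is no single scripting API to show here; as a stand-in, the sketch below calls the standalone Demucs command line (one of UVR's underlying engines) from Python to produce a vocals stem. The file name is illustrative, and this is Demucs's own CLI, not UVR's.

```python
# Use the Demucs CLI (one of the engines bundled in UVR) for vocal separation.
import subprocess

subprocess.run(["demucs", "--two-stems=vocals", "song.mp3"], check=True)
# Demucs writes separated/<model_name>/song/vocals.wav and no_vocals.wav,
# which can then be fed to a pitch tracker for melody analysis.
```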
4. MelodyExtraction_JDC (Joint Detection & Classification Network)
- GitHub: keums/melodyExtraction_JDC
- Approach: Convolutional Recurrent Neural Network (CRNN) for joint singing voice detection and pitch estimation.
- High-accuracy vocal melody prediction.
- Outputs time–frequency pitch track in Hz.
- Trained on multiple melody extraction datasets.
- Best for: Deep-learning researchers and developers building melody tracking models.
- Tech stack: Keras + Librosa + Madmom.
5. Vocal Melody Extraction (Wei-Tsung Lu & Li Su)
- GitHub: s603122001/Vocal-Melody-Extraction
- Approach: Deep learning combining semantic segmentation with audio-symbolic domain transfer learning.
Key features:
- TensorFlow + Keras implementation.
- Trained on ISMIR melody datasets.
- Includes symbolic-to-audio transfer learning.
- Best for: Research-heavy applications with labeled datasets and experiments in transfer learning.
- Tech stack: TensorFlow, NumPy, Mido.
6. Spotify Basic Pitch
- GitHub: spotify/basic-pitch
- Approach: ML-based audio-to-MIDI engine developed by Spotify.
Key features:
- Converts polyphonic audio directly to MIDI.
- Robust to pitch bending and vibrato.
- Fully open source and actively maintained.
- Best for: Converting isolated vocals or full mixes to editable MIDI melodies (see the sketch after this entry).
- Tech stack: TensorFlow, Python.
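A minimal sketch of Basic Pitch's documented Python inference API; the input file name is illustrative.

```python
# Transcribe an (ideally isolated) vocal recording directly to MIDI.
from basic_pitch.inference import predict

model_output, midi_data, note_events = predict("vocals.wav")
midi_data.write("vocals_melody.mid")                  # PrettyMIDI object -> .mid file
for event in note_events[:5]:                         # (start_s, end_s, MIDI pitch, ...)
    start, end, pitch = event[0], event[1], event[2]
    print(f"{start:.2f}-{end:.2f}s  MIDI note {int(pitch)}")
```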
7. Melody-extraction-with-Melodic-SegNet
- GitHub: bill317996/Melody-extraction-with-melodic-segnet
- Approach: Deep learning SegNet architecture optimized for melody extraction.
- Output: Simple text file with timestamp and frequency values (Hz); a short parsing sketch follows this entry.
- Best for: Batch analysis or dataset creation for MIR research.
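Assuming the output is two whitespace-separated columns (time in seconds, frequency in Hz, with zero or negative values for unvoiced frames), loading it and quantizing voiced frames to MIDI note numbers might look like this; the file name is illustrative.

```python
# Load a timestamp/frequency melody file and convert voiced frames to MIDI notes.
import numpy as np
import librosa

data = np.loadtxt("melody_output.txt")                # shape (n_frames, 2): time, Hz
times, freqs = data[:, 0], data[:, 1]
voiced = freqs > 0
midi_notes = np.full_like(freqs, np.nan)
midi_notes[voiced] = np.round(librosa.hz_to_midi(freqs[voiced]))
```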
Summary Table
| Tool | Type | Strength | Ideal Use Case |
|---|---|---|---|
| Melodia + audio_to_midi_melodia | DSP algorithm | Accurate F0 tracking | Academic melody studies |
| Spleeter | Deep learning | Fast vocal separation | Preprocessing for vocal extraction |
| Ultimate Vocal Remover (UVR5) | Neural separation | Cleanest stems | Melody isolation for remixing |
| MelodyExtraction_JDC | CRNN | Joint voice detection and pitch estimation | Deep learning research |
| Vocal Melody Extraction (Lu & Su) | Deep learning | Strong semantic segmentation | Melody + symbolic domain research |
| Spotify Basic Pitch | ML transcription | Direct audio-to-MIDI | MIDI generation |
| Melodic-SegNet | CNN | Lightweight and efficient | Real-time or batch extraction |
For both research and production, pairing Spleeter or UVR5 (for isolation) with Melodia or Basic Pitch (for F0 tracking/MIDI conversion) delivers the most effective open-source vocal melody extraction pipeline.
Compare UVR and Spleeter for melody extraction
Overview
| Feature | Ultimate Vocal Remover (UVR5) | Spleeter (by Deezer) |
|---|---|---|
| Release year | 2020 (actively updated, latest: 2025) | 2019 (stable but aging) |
| Core model | Multiple deep learning engines: Demucs v4, MDX-Net, VR Arch, and BS-Roformer SW | U-Net CNN trained on spectrograms |
| Domain of operation | Hybrid time-domain and frequency-domain (Demucs) | Frequency-domain only (spectrogram masking) |
| Output quality | Cleaner separation, higher accuracy, fewer artifacts | Faster and lighter on resources, but prone to bleed and artifacts |
| Performance on melody extraction | Excellent: isolates vocals robustly up to 20 kHz with preserved timbre and pitch stability, ideal for downstream F0 tracking | Decent: effective for simple mixes, but cuts off above 11 kHz and causes partial pitch smearing, reducing melody-tracking accuracy |
| Ease of use | GUI-based (cross-platform); supports advanced model selection, batch processing, and post-processing | CLI-based (Python required), though also usable via third-party GUIs such as Splitter.ai; more technical setup |
| Speed | Slower (larger models, deep convolutional layers) | Much faster; optimized TensorFlow execution |
| Ideal use case | Professional-grade vocal isolation for remixing or melody extraction | Academic or lightweight source separation for analysis |
| Artifact handling | Excellent; retains reverb and overtones naturally | Moderate; may dull vocal clarity, especially in dense mixes |
Technical Comparison
- Model Design
- UVR: Leverages cutting-edge architectures like Demucs v4 and MDX-Net, combining time-domain recognition (which preserves phase relationships) with frequency-domain precision. This results in more natural and artifact-free isolated vocals suitable for accurate pitch contour analysis.
- Spleeter: Employs an encoder-decoder U-Net structure that creates spectrogram masks per instrument class, then reconstructs the waveform through inverse STFT. It performs well on simple mixes but struggles with overlapping harmonics or high-frequency details.
- UVR retains fine-grained harmonics and high-frequency formant cues critical for precise fundamental frequency (F0) tracking — essential for melody extraction workflows like Melodia or CREPE.
- Spleeter tends to smear pitch or produce attenuated harmonics around 11 kHz, often resulting in less accurate pitch traces or “blended” tones when analyzed with downstream melody extractors.
- Best pipeline for UVR: UVR (Demucs v4 model) → isolate vocals → Melodia / CREPE / Basic Pitch for F0 tracking → export to MIDI.
- Best pipeline for Spleeter: Spleeter (2-stem mode: vocals/accompaniment) → librosa.pyin or Melodia → post-process noise removal.
Verdict
For melody extraction, Ultimate Vocal Remover (UVR) greatly outperforms Spleeter:
- Maintains higher harmonic integrity and less distortion, leading to cleaner F0 curves.
- Offers modern AI models (Demucs v4, MDX-Net) trained on richer datasets.
- Includes a GUI and model flexibility ideal for professional or content-creation workflows.
Spleeter remains valuable for lightweight, high-speed tasks or educational contexts, but in 2025 it is viewed as a baseline tool compared to UVR’s state-of-the-art separation quality.
