
Extracting a vocal melody from polyphonic audio (recordings that mix several sound sources, such as vocals plus instruments) requires advanced signal processing or AI-based source separation. The process can be broken into several steps depending on the desired precision and the available tools.

Principles of Vocal Melody Extraction

Vocal melody extraction is generally achieved through two core stages:

Predominant-F0 Extraction: Identifies the most salient pitch (fundamental frequency) at each time instant in the mixture. This is the likely melody line in most music contexts.

Singing Voice Detection: Determines which segments actually correspond to a singing voice, distinguishing them from instrumental sounds.

Traditional Signal Processing Methods

Earlier approaches used mathematical transforms and harmonic analysis:

Constant-Q Transform (CQT): Converts the audio signal into a frequency scale that aligns with musical pitch perception, commonly used to identify note partials.

Sinusoidal modeling & sparse representation: Enhances tonal components while suppressing percussion for clearer melody contour identification.

Dynamic programming with melodic smoothness constraints: Smooths out pitch transitions to reflect human singing tendencies.
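The smoothness idea can be sketched as a tiny Viterbi-style dynamic program. The candidate format, the jump penalty, and the weight below are illustrative assumptions, not any specific published algorithm:

```python
# Minimal sketch of dynamic-programming pitch-path smoothing.
# Each frame has candidate pitches (Hz) with salience scores; the DP
# picks the path maximizing total salience minus a pitch-jump penalty,
# i.e. the "melodic smoothness constraint" described above.

def smooth_pitch_path(frames, jump_weight=0.01):
    """frames: list of frames, each a list of (pitch_hz, salience).
    Returns the smoothed pitch track as a list of Hz values."""
    prev = frames[0]
    scores = [sal for _, sal in prev]          # best score per candidate so far
    back = []                                  # backpointers per frame
    for frame in frames[1:]:
        new_scores, ptrs = [], []
        for pitch, sal in frame:
            # Best predecessor: accumulated score minus a jump penalty.
            j_best = max(range(len(prev)),
                         key=lambda j: scores[j] - jump_weight * abs(pitch - prev[j][0]))
            new_scores.append(scores[j_best]
                              - jump_weight * abs(pitch - prev[j_best][0]) + sal)
            ptrs.append(j_best)
        back.append(ptrs)
        scores, prev = new_scores, frame
    # Backtrack from the best final candidate.
    k = max(range(len(scores)), key=lambda i: scores[i])
    path = [prev[k][0]]
    for t in range(len(back) - 1, -1, -1):
        k = back[t][k]
        path.append(frames[t][k][0])
    return path[::-1]

frames = [[(220.0, 1.0), (440.0, 0.9)],
          [(222.0, 0.5), (445.0, 0.6)],
          [(221.0, 1.0), (880.0, 1.2)]]
# The 880 Hz octave spike loses to the smooth 220 Hz neighborhood.
path = smooth_pitch_path(frames)
```

Even though the 880 Hz candidate has the highest salience in its frame, the jump penalty keeps the path on the continuous low contour, which is exactly the behavior singing-voice trackers want.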

Modern Deep Learning Methods (2025)

Recent advances use neural architectures for superior accuracy:

Quadratic Fluctuation Equation (QFE) Model (2025): Uses iterative pre-emphasis filtering and amplitude modeling to precisely extract vocal melody even under complex polyphony, outperforming CNN- and CRNN-based benchmarks.

Attention U-Net and Voice Activity Networks: These deep neural networks accurately track melodic lines and differentiate vocal sources from instrumental ones.

Graph modeling and harmonic-aware networks: Improve temporal coherence and handle overlapping frequencies in chords.

Practical AI Tools

If your goal is to extract the melody or isolate vocals for remixing or analysis, these modern AI tools can help:

| Tool | Description |
| --- | --- |
| LALAL.AI | AI-based stem separator that isolates vocals, drums, melody, and more with high fidelity. |
| PhonicMind | Studio-grade AI stem splitter that isolates vocal or instrumental tracks for melody analysis or a cappella generation. |
| VocalRemover.org | Free browser-based AI splitter that separates vocals efficiently from MP3/WAV files. |
| ReMusic.ai | AI vocal remover offering quick, high-accuracy extraction with real-time processing. |
| Melody.ml | Simple online platform powered by Spleeter AI to isolate vocal and instrumental stems. |


Workflow for Extracting Vocal Melody

Preprocess audio: Convert to mono or maintain stereo, normalize levels.

Separate sources: Use an AI stem-splitter (e.g., LALAL.AI) to isolate the vocal track.

Pitch detection: Run the isolated vocal through pitch-tracking software such as Melodia or Essentia's PitchYinFFT algorithm to obtain the melody line.

Post-processing: Smooth pitch contours and quantize to note values for MIDI or visualization.
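As a toy illustration of the pitch-detection step, here is a minimal autocorrelation-based F0 estimator run on a synthetic 440 Hz sine standing in for an isolated vocal. Real workflows would use Melodia, pYIN, or CREPE on the stem produced by the separator; the frame size and search range below are arbitrary choices:

```python
import numpy as np

# Toy pitch detector: find the autocorrelation peak within a plausible
# vocal lag range and convert the lag back to a frequency.

SR = 22050  # assumed sample rate

def detect_f0_autocorr(frame, sr, fmin=80.0, fmax=1000.0):
    """Estimate the F0 of one frame from its autocorrelation peak."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # lag bounds for the F0 range
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

t = np.arange(SR) / SR
vocal = np.sin(2 * np.pi * 440.0 * t)          # stand-in for a separated vocal
f0 = detect_f0_autocorr(vocal[:2048], SR)      # close to 440 Hz
```

A real vocal needs this run frame-by-frame with voicing decisions; this sketch only shows why a periodic signal's autocorrelation peak recovers its pitch.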

Summary

To extract vocal melody from polyphonic audio:

For research or precision analysis, use models like the QFE deep learning model or Attention U-Net.

For practical remixing or melody isolation, use online tools like LALAL.AI, PhonicMind, or Melody.ml.

Combine source separation with pitch tracking for a clean, musically accurate melody output.

Best open-source tools for vocal melody extraction

Here are the best open-source tools for vocal melody extraction, useful for isolating and analyzing melody lines (especially vocals) in polyphonic audio. These range from classical DSP-based algorithms to modern deep learning implementations.

1. Melodia + audio_to_midi_melodia (by Justin Salamon)

Key features:
  • Extracts the continuous pitch (F0) contour of the melody.
  • Converts the melody to MIDI for further analysis or music transcription.
  • Uses the Vamp plugin interface and Python.

Best for: Research, melody transcription, and creating symbolic music data from polyphonic recordings.
Tech stack: Python + Vamp plugin + Librosa.
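The contour-to-MIDI idea can be illustrated with the standard Hz-to-MIDI mapping. This is a sketch of just one step; the actual audio_to_midi_melodia script additionally performs note segmentation and duration handling, which this toy version omits:

```python
import math

# Quantize a continuous pitch contour (Hz) to MIDI note numbers,
# using the equal-temperament reference A4 = 440 Hz = MIDI note 69.

def hz_to_midi(f_hz):
    """Nearest MIDI note number for a frequency in Hz."""
    return round(69 + 12 * math.log2(f_hz / 440.0))

contour = [220.0, 221.5, 440.0, 261.63]   # e.g. values from a Melodia contour
notes = [hz_to_midi(f) for f in contour]  # A3, A3, A4, C4
```

Rounding to the nearest semitone is what turns a wavering vocal contour (vibrato, scoops) into discrete, editable notes.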


2. Spleeter (by Deezer)

GitHub: deezer/spleeter
Approach: Deep learning-based source separation that splits audio into stems (vocals, bass, drums, etc.).
Use for melody extraction: Once the vocals are separated, run pitch tracking on the isolated stem to extract the melody.

Key features:
  • Fast, pretrained TensorFlow models.
  • Separates into 2, 4, or 5 stems.
  • High performance on CPU or GPU.

Best for: Producers or researchers extracting clean vocal stems before melody tracking.
Tech stack: TensorFlow (Python).
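Spleeter's U-Net predicts a spectrogram mask per source; the masking-and-reconstruction step itself can be illustrated without any network by computing an ideal Wiener-style ratio mask from known sources. The random "spectra" below are synthetic stand-ins, so this is a teaching sketch of the masking math, not a separator:

```python
import numpy as np

# Spectrogram masking in isolation: given magnitude spectra of vocals
# and accompaniment, a soft ratio mask applied to the mixture recovers
# the vocal magnitude. A real system predicts the mask with a network;
# here we cheat and build the "ideal" mask from the known sources.

rng = np.random.default_rng(0)
vocals = rng.random((513, 100))   # fake |STFT| of the vocal stem
accomp = rng.random((513, 100))   # fake |STFT| of the accompaniment
mixture = vocals + accomp         # magnitudes only; phase is ignored here

mask = vocals**2 / (vocals**2 + accomp**2 + 1e-12)  # Wiener-style ratio mask
vocal_est = mask * mixture        # masked mixture approximates the vocals
```

The estimate is far closer to the true vocal spectrum than the raw mixture is, which is why mask quality (the part the network learns) dominates separation quality.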

3. Ultimate Vocal Remover (UVR5)

GitHub: Anjok07/ultimatevocalremovergui
Approach: Combines multiple open-source neural models (MDX-Net, Demucs, VR Arch) for high-fidelity vocal separation.

Key features:
  • GUI and CLI both available.
  • Supports Windows, macOS, and Linux.
  • Easy-to-use interface with export options.

Best for: Extracting isolated vocals for further melody or pitch contour analysis.
Tech stack: PyTorch, ONNX, Demucs models.


4. MelodyExtraction_JDC (Joint Detection & Classification Network)

GitHub: keums/melodyExtraction_JDC
Approach: Convolutional recurrent neural network (CRNN) for joint singing voice detection and pitch estimation.

Key features:
  • High-accuracy vocal melody prediction.
  • Outputs a time-frequency pitch track in Hz.
  • Trained on multiple melody extraction datasets.

Best for: Deep-learning researchers and developers building melody tracking models.
Tech stack: Keras + Librosa + Madmom.

5. Vocal Melody Extraction (Wei-Tsung Lu & Li Su)

Key features:
  • TensorFlow + Keras implementation.
  • Trained on ISMIR melody datasets.
  • Includes symbolic-to-audio transfer learning.

Best for: Research-heavy applications with labeled datasets and experiments in transfer learning.
Tech stack: TensorFlow, NumPy, Mido.

6. Spotify Basic Pitch

Key features:
  • Converts polyphonic audio directly to MIDI.
  • Robust to pitch bending and vibrato.
  • Fully open source and actively maintained.

Best for: Converting isolated vocals or full music to editable MIDI melodies.
Tech stack: TensorFlow, Python.

7. Melody-extraction-with-Melodic-SegNet

GitHub: bill317996/Melody-extraction-with-melodic-segnet
Approach: Deep learning SegNet architecture optimized for melody extraction.
Output: A simple text file with timestamp and frequency values (Hz).
Best for: Batch analysis or dataset creation for MIR research.
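Plain-text "timestamp frequency" output like this is easy to consume in batch pipelines. The parser below assumes a hypothetical two-column whitespace-separated format with 0 Hz meaning unvoiced; check the actual files before relying on it:

```python
# Hypothetical parser for the "timestamp  frequency(Hz)" text files
# that melody extractors such as Melodic-SegNet typically emit.
# Column layout and the 0-Hz-means-unvoiced convention are assumptions.

def parse_melody_txt(text):
    """Return (times, freqs); frequencies <= 0 become None (unvoiced)."""
    times, freqs = [], []
    for line in text.strip().splitlines():
        t_str, f_str = line.split()[:2]
        f = float(f_str)
        times.append(float(t_str))
        freqs.append(f if f > 0 else None)
    return times, freqs

sample = """0.000 0.0
0.010 220.5
0.020 221.0"""
times, freqs = parse_melody_txt(sample)
```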


Summary Table


| Tool | Type | Strength | Ideal Use Case |
| --- | --- | --- | --- |
| Melodia + audio_to_midi_melodia | DSP algorithm | Accurate F0 tracking | Academic melody studies |
| Spleeter | Deep learning | Fast vocal separation | Preprocessing vocal extraction |
| Ultimate Vocal Remover (UVR5) | Neural separation | Cleanest stems | Melody isolation for remixing |
| MelodyExtraction_JDC | CRNN | Joint voice detection and pitch estimation | Deep learning research |
| Vocal Melody Extraction (Lu & Su) | Deep learning | High semantic detection | Melody + symbolic domain research |
| Spotify Basic Pitch | ML transcription | Direct audio-to-MIDI | MIDI generation |
| Melodic-SegNet | CNN | Lightweight and efficient | Real-time or batch extraction |

For both research and production, pairing Spleeter or UVR5 (for isolation) with Melodia or Basic Pitch (for F0 tracking/MIDI conversion) delivers the most effective open-source vocal melody extraction pipeline.


Compare UVR and Spleeter for melody extraction


Ultimate Vocal Remover (UVR) and Spleeter are two of the most widely used open-source tools for vocal and melody isolation, but they differ significantly in architecture, sound fidelity, processing methods, and practical use in melody extraction workflows.

Overview


| Feature | Ultimate Vocal Remover (UVR5) | Spleeter (by Deezer) |
| --- | --- | --- |
| Release year | 2020 (actively updated; latest release 2025) | 2019 (stable but aging) |
| Core model | Multiple deep learning engines: Demucs v4, MDX-Net, VR Arch, and BS-Roformer SW | U-Net CNN trained on spectrograms |
| Domain of operation | Hybrid time- and frequency-domain (Demucs) | Frequency-domain only (spectrogram masking) |
| Output quality | Cleaner separation, high accuracy, fewer artifacts | Faster and lighter on resources, but prone to bleed and artifacts |
| Performance on melody extraction | Excellent: isolates vocals robustly up to 20 kHz with preserved timbre and pitch stability, ideal for downstream F0 tracking | Decent: effective for simple mixes, but cuts off above 11 kHz and introduces partial pitch smearing, hurting melody tracking accuracy |
| Ease of use | GUI-based (cross-platform); supports advanced model selection, batch processing, and post-processing | CLI-based (Python required), also usable via third-party GUIs like Splitter.ai; more technical setup |
| Speed | Slower (larger models, deep convolutional layers) | Much faster: optimized TensorFlow execution |
| Ideal use case | Professional-grade vocal isolation for remixing or melody extraction | Academic or lightweight source separation for analysis |
| Artifact handling | Excellent; retains reverb and overtones naturally | Moderate; may dull vocal clarity, especially in dense mixes |


Technical Comparison

  1. Model Design

  • UVR: Leverages cutting-edge architectures like Demucs v4 and MDX-Net, combining time-domain processing (which preserves phase relationships) with frequency-domain precision. The result is more natural, artifact-free isolated vocals suitable for accurate pitch contour analysis.
  • Spleeter: Employs an encoder-decoder U-Net that predicts a spectrogram mask per instrument class, then reconstructs the waveform through an inverse STFT. It performs well on simple mixes but struggles with overlapping harmonics and high-frequency detail.

  2. Vocal Pitch Accuracy

  • UVR retains the fine-grained harmonics and high-frequency formant cues critical for precise fundamental frequency (F0) tracking, which is essential for melody extraction workflows like Melodia or CREPE.
  • Spleeter tends to smear pitch or attenuate harmonics around 11 kHz, often resulting in less accurate pitch traces or "blended" tones when analyzed with downstream melody extractors.

  3. Workflow Integration for Melody Extraction

  • Best pipeline for UVR: UVR (Demucs v4 model) → isolate vocals → Melodia / CREPE / Basic Pitch for F0 tracking → export to MIDI.
  • Best pipeline for Spleeter: Spleeter (2-stem mode: vocals/accompaniment) → librosa.pyin or Melodia → post-process noise removal.
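The "post-process noise removal" stage in these pipelines often amounts to cleaning the raw pitch track. A minimal sketch is a median filter over voiced frames, which suppresses single-frame octave spikes; the window width is a guess, and dedicated tools use more elaborate contour cleaning:

```python
# Sketch of pitch-track post-processing: median-filter voiced frames
# (None = unvoiced) so that isolated glitches, e.g. octave spikes from
# separation artifacts, are replaced by their smooth neighbors.

def median_filter_pitch(track, width=3):
    """Median-filter voiced frames; unvoiced (None) frames pass through."""
    half = width // 2
    out = []
    for i, f in enumerate(track):
        if f is None:
            out.append(None)
            continue
        window = sorted(g for g in track[max(0, i - half):i + half + 1]
                        if g is not None)
        out.append(window[(len(window) - 1) // 2])  # lower median
    return out

# 441 Hz here is a one-frame octave/detection glitch in a ~220 Hz line.
track = [220.0, 220.5, 441.0, 221.0, None, 220.0]
clean = median_filter_pitch(track)
```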

Verdict

For melody extraction, Ultimate Vocal Remover (UVR) greatly outperforms Spleeter:

  • Maintains higher harmonic integrity and less distortion, leading to cleaner F0 curves.
  • Offers modern AI models (Demucs v4, MDX-Net) trained on richer datasets.
  • Includes a GUI and model flexibility ideal for professional or content-creation workflows.

Spleeter remains valuable for lightweight, high-speed tasks or educational contexts, but in 2025 it is viewed as a baseline tool compared to UVR’s state-of-the-art separation quality.
