While end-to-end video-to-audio generation has improved greatly, producing high-fidelity audio that authentically captures the nuances of visual content remains challenging. Like professional sound designers in the creative industries, a generation system must reason about visual dynamics, acoustic environments, and temporal relationships. We present ThinkSound, a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. At each stage, a multimodal large language model (MLLM) generates contextually aligned CoT reasoning that guides a unified audio foundation model. Furthermore, we introduce AudioCoT, a comprehensive dataset with structured reasoning annotations that connects visual content, textual descriptions, and sound synthesis. Experiments demonstrate that ThinkSound achieves state-of-the-art video-to-audio performance on both audio metrics and CoT metrics, and excels on the out-of-distribution Movie Gen Audio benchmark.
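For concreteness, the sketch below shows how the three stages compose. All interfaces here (`generate_cot`, `audio_model.generate`, and their arguments) are hypothetical illustrations of the described pipeline, not the released ThinkSound API:

```python
from dataclasses import dataclass

@dataclass
class CoT:
    text: str  # structured chain-of-thought produced by the MLLM

def generate_cot(mllm, video, instruction: str) -> CoT:
    """Stage-specific reasoning: the MLLM inspects the video (and any user
    instruction) and emits a step-by-step plan for the audio model."""
    return CoT(mllm.reason(video=video, instruction=instruction))

def thinksound_pipeline(mllm, audio_model, video, user_click=None, edit_instruction=None):
    # Stage 1: foundational foley -- a semantically coherent soundscape.
    cot = generate_cot(mllm, video, "Describe all sound events and their timing.")
    audio = audio_model.generate(video=video, cot=cot.text)

    # Stage 2 (optional): object-centric refinement around a user-selected region.
    if user_click is not None:
        cot = generate_cot(mllm, video, f"Refine the sound of the object at {user_click}.")
        audio = audio_model.generate(video=video, cot=cot.text, init_audio=audio)

    # Stage 3 (optional): instruction-guided editing of the existing track.
    if edit_instruction is not None:
        cot = generate_cot(mllm, video, edit_instruction)
        audio = audio_model.generate(video=video, cot=cot.text, init_audio=audio)
    return audio
```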
ThinkSound can also provide soundtracks for videos produced by video generation models: every video below was generated by the corresponding video generation model, and all audio was produced by ThinkSound.
Play or pause any video by clicking on it.
Objective metrics: FD, KLPaSST, KLPaNNs, DeSync, CLAPcap, CLAPCoT. Subjective metrics: MOS-Q, MOS-A. Efficiency: Params, Time(s).

Method | FD ↓ | KLPaSST ↓ | KLPaNNs ↓ | DeSync ↓ | CLAPcap ↑ | CLAPCoT ↑ | MOS-Q ↑ | MOS-A ↑ | Params | Time(s) ↓
---|---|---|---|---|---|---|---|---|---|---
GT | - | - | - | 0.55 | 0.28 | 0.45 | 4.37±0.21 | 4.56±0.19 | - | -
See&Hear | 118.95 | 2.26 | 2.30 | 1.20 | 0.32 | 0.35 | 2.75±1.08 | 2.87±0.99 | 415M | 19.42
V-AURA† | 46.99 | 2.23 | 1.83 | 0.65 | 0.23 | 0.37 | 3.42±1.03 | 3.20±1.17 | 695M | 14.00
FoleyCrafter | 39.15 | 2.06 | 1.89 | 1.21 | 0.41 | 0.34 | 3.08±1.21 | 2.63±0.88 | 1.20B | 3.84
Frieren† | 74.96 | 2.55 | 2.64 | 1.00 | 0.37 | 0.34 | 3.27±1.11 | 2.95±1.09 | 159M | -
V2A-Mapper† | 48.10 | 2.50 | 2.34 | 1.23 | 0.38 | 0.32 | 3.31±1.02 | 3.16±1.04 | 229M | -
MMAudio | 43.26 | 1.65 | 1.40 | 0.44 | 0.31 | 0.40 | 3.84±0.89 | 3.97±0.82 | 1.03B | 3.01
ThinkSound | 34.56 | 1.52 | 1.32 | 0.46 | 0.33 | 0.46 | 4.02±0.73 | 4.18±0.79 | 1.30B | 1.07
w/o CoT Reasoning | 39.84 | 1.59 | 1.40 | 0.48 | 0.29 | 0.41 | 3.91±0.83 | 4.04±0.75 | 1.30B | 0.98
ThinkSound outperforms all baselines on most objective metrics and on all subjective metrics. Compared to the strongest baseline (MMAudio), our model substantially improves audio quality and semantic alignment while remaining comparable on the objective temporal synchronization metric (DeSync).
To better understand the contribution of each component in ThinkSound and to validate our design choices, we conduct comprehensive ablation studies on the VGGSound test set, focusing on (1) text encoding strategies and (2) multi-modal integration mechanisms. For further ablations and exploratory results, see Supplementary Material D.
We evaluate different text encoding strategies with and without CoT reasoning; the results are shown in Table 1. First, CoT reasoning substantially improves audio fidelity: FD improves from 39.84 with CLIP-only captions to 37.65 with T5-encoded CoT. Second, integrating CLIP's contrastive features with T5's contextual reasoning further improves performance, reducing both KLPaSST and KLPaNNs (a sketch of this combined conditioning follows the table).
Method | FD ↓ | KLPaSST ↓ | KLPaNNs ↓ | DeSync ↓ | CLAPCoT ↑ |
---|---|---|---|---|---|
CLIP | 39.84 | 1.59 | 1.40 | 0.48 | 0.41 |
T5 (CoT) | 37.65 | 1.54 | 1.35 | 0.46 | 0.44 |
CLIP + T5 | 34.56 | 1.52 | 1.32 | 0.46 | 0.46 |
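Below is a minimal sketch of the best-performing configuration (CLIP + T5). The feature dimensions and the sequence-concatenation scheme are illustrative assumptions; the paper does not specify ThinkSound's exact projection details here:

```python
import torch
import torch.nn as nn

class TextConditioner(nn.Module):
    """Combine CLIP's contrastive text features with T5's contextual
    features over the CoT reasoning (dimensions are assumptions)."""
    def __init__(self, clip_dim=512, t5_dim=1024, model_dim=768):
        super().__init__()
        self.clip_proj = nn.Linear(clip_dim, model_dim)  # contrastive caption features
        self.t5_proj = nn.Linear(t5_dim, model_dim)      # contextual CoT features

    def forward(self, clip_feats, t5_feats):
        # clip_feats: (B, L1, clip_dim) from the CLIP text tower
        # t5_feats:   (B, L2, t5_dim) from T5 over the CoT text
        # Concatenate along the sequence axis so the generator can attend to
        # both the compact semantic summary and the step-by-step plan.
        return torch.cat([self.clip_proj(clip_feats), self.t5_proj(t5_feats)], dim=1)

cond = TextConditioner()
clip_feats = torch.randn(2, 77, 512)    # dummy CLIP token features
t5_feats = torch.randn(2, 128, 1024)    # dummy T5 token features
print(cond(clip_feats, t5_feats).shape)  # torch.Size([2, 205, 768])
```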
We investigate different ways to integrate video and audio features before feeding them into the single-stream transformer. As shown in Table 2, element-wise addition of video and audio features performs better than audio-only input, especially in synchronization (DeSync drops from 0.50 to 0.46). Moreover, the gated fusion mechanism outperforms both alternatives across all metrics; the sketch after the table illustrates these variants.
Integration | FD ↓ | KLPaSST ↓ | KLPaNNs ↓ | DeSync ↓ | CLAPCoT ↑ |
---|---|---|---|---|---|
audio only | 37.13 | 1.58 | 1.37 | 0.50 | 0.43 |
linear video | 38.96 | 1.58 | 1.38 | 0.46 | 0.45 |
gated video | 34.56 | 1.52 | 1.32 | 0.46 | 0.46 |
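The sketch below illustrates the gated fusion variant (with the other two rows as special cases), assuming a shared feature dimension; it is a sketch of the mechanism, not ThinkSound's exact implementation:

```python
import torch
import torch.nn as nn

class GatedVideoFusion(nn.Module):
    """Inject video features into the audio stream through a learned gate."""
    def __init__(self, dim=768):
        super().__init__()
        self.video_proj = nn.Linear(dim, dim)
        # The gate decides, per position and channel, how much video signal
        # to add to the audio stream (sigmoid keeps it in [0, 1]).
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, audio, video):
        # audio, video: (B, T, dim), temporally aligned
        v = self.video_proj(video)
        g = self.gate(torch.cat([audio, v], dim=-1))
        # "linear video" in Table 2 corresponds to a fixed g = 1 (plain
        # addition); "audio only" corresponds to g = 0.
        return audio + g * v

fusion = GatedVideoFusion()
audio = torch.randn(2, 250, 768)
video = torch.randn(2, 250, 768)
print(fusion(audio, video).shape)  # torch.Size([2, 250, 768])
```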
We compare three model sizes of ThinkSound: Large (1.3B), Medium (724M), and Small (533M). The results are shown in Table 3. The Large model achieves the best performance across all metrics. As model size decreases, performance degrades substantially, highlighting the necessity of adequate model capacity for effective audio generation.
Size | FD ↓ | KLPaSST ↓ | KLPaNNs ↓ | DeSync ↓ | CLAPCoT ↑ |
---|---|---|---|---|---|
Small | 40.80 | 1.64 | 1.38 | 0.46 | 0.41 |
Medium | 36.80 | 1.56 | 1.34 | 0.46 | 0.44 |
Large | 34.56 | 1.52 | 1.32 | 0.46 | 0.46 |
The code and dataset will be released soon.