ThinkSound

If you find this project useful,
a star ⭐ on GitHub would be appreciated!

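ThinkSound is a unified Any2Audio generation framework that uses flow matching guided by Chain-of-Thought (CoT) reasoning. It is a PyTorch implementation of multimodal audio generation and editing: it generates or edits audio from video, text, and audio, driven by step-by-step reasoning from Multimodal Large Language Models (MLLMs).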

Teaser

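📰 News

2025.11.25 🔥 The PrismAudio online demo is live. Try it now!
2025.11.25 🔥 The PrismAudio paper is on arXiv: the first multi-dimensional CoT-RL framework for video-to-audio generation!
2025.09.19 🎉 ThinkSound has been accepted to the NeurIPS 2025 Main Conference!
2025.09.01 Our AudioCoT dataset is now open-sourced and available on Hugging Face!
2025.07.17 🧠 Finetuning enabled: training and finetuning code is now public, with detailed usage instructions to help you customize and extend ThinkSound with your own data.
2025.07.15 📦 Simplified installation and usage: PyPI dependencies for easy cross-platform setup; Windows .bat scripts automate environment creation and script running.
2025.07.08 🔧 Major update: the model is now lighter, with optimized memory and GPU usage, and supports high-throughput audio generation at scale!
2025.07.01 Online demos on Hugging Face Spaces and ModelScope for an interactive experience!
2025.07.01 Released inference scripts and the web interface.
2025.06 The ThinkSound paper is on arXiv!
2025.06 The online demo is live. Try it now!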

---

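🚀 Features

Any2Audio: Generate audio from arbitrary modalities (video, text, audio, or any combination of them).
Video-to-Audio SOTA: Achieves state-of-the-art results on multiple V2A benchmarks.
CoT-Driven Reasoning: Chain-of-Thought reasoning via MLLMs enables compositional and controllable audio generation.
Interactive Object-Centric Editing: Refine or edit specific sound events by clicking on visual objects or giving text instructions.
Unified Framework: A single foundation model supports generation, editing, and interactive workflows.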

---

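✨ Method Overview

ThinkSound decomposes audio generation and editing into three interactive stages, all guided by MLLM-based Chain-of-Thought (CoT) reasoning:

Foley Generation: Generate a foundational soundscape from the video that is semantically and temporally aligned with it.
Object-Centric Refinement: Refine or add sounds for specific objects, selected by clicking on or marking them in the video.
Targeted Audio Editing: Modify the generated audio with high-level natural language instructions.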
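The three stages can be read as a chain: each step receives a CoT instruction and, after the first step, the audio produced so far. The sketch below is purely illustrative and shows only that control flow. `Stage`, `Step`, and `run_pipeline` are hypothetical names that do not come from the ThinkSound code base, and plain strings stand in for video and audio.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Stage(Enum):
    # Names mirror the three stages described above (hypothetical identifiers).
    FOLEY_GENERATION = "foley_generation"
    OBJECT_CENTRIC_REFINEMENT = "object_centric_refinement"
    TARGETED_EDITING = "targeted_editing"


@dataclass
class Step:
    stage: Stage
    cot_instruction: str  # CoT text an MLLM would supply for this step


def run_pipeline(video: str, steps: List[Step]) -> List[str]:
    """Return a log of what each stage would consume; no audio is generated."""
    if not steps or steps[0].stage is not Stage.FOLEY_GENERATION:
        raise ValueError("the first step must be foley generation")
    log: List[str] = []
    audio: Optional[str] = None  # placeholder for the audio produced so far
    for i, step in enumerate(steps):
        # The first stage sees only the video; later stages also see prior audio.
        source = video if audio is None else f"{video} + {audio}"
        audio = f"audio_v{i + 1}"
        log.append(f"{step.stage.value}({step.cot_instruction}): {source} -> {audio}")
    return log


if __name__ == "__main__":
    plan = [
        Step(Stage.FOLEY_GENERATION, "rain on a street"),
        Step(Stage.OBJECT_CENTRIC_REFINEMENT, "add the passing car"),
        Step(Stage.TARGETED_EDITING, "make the rain softer"),
    ]
    for line in run_pipeline("clip.mp4", plan):
        print(line)
```

Keeping the stages as an ordered list makes the interactive nature explicit: refinement and editing steps can be appended or repeated without regenerating the base soundscape.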

⚡ Quick Start

Environment preparation:

git clone https://github.com/liuhuadai/ThinkSound.git
cd ThinkSound
conda create -n thinksound python=3.10
conda activate thinksound
pip install thinksound
conda install -y -c conda-forge 'ffmpeg<7'
# Download the pretrained weights from https://huggingface.co/liuhuadai/ThinkSound into the ckpts/ directory
# (the model weights can also be downloaded from https://www.modelscope.cn/models/iic/ThinkSound)
git lfs install
git clone https://huggingface.co/liuhuadai/ThinkSound ckpts
# To improve inference and training speed, you may optionally install a FlashAttention backend compatible with your system and PyTorch version.
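
Because FlashAttention is optional, it can be useful to check whether it is installed before relying on it. This is a minimal sketch, assuming the backend is importable under the module name `flash_attn` (the name used by the flash-attn package; other backends may differ). `has_module` is a helper introduced here for illustration only.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # "flash_attn" is an assumed module name; adjust it for your backend.
    if has_module("flash_attn"):
        print("FlashAttention found")
    else:
        print("FlashAttention not found (it is optional)")
```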

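✅ Windows tip:

Windows users can simply run setup_windows.bat (or double-click it) to automatically create the conda environment, install all dependencies (including FFmpeg), and download the pretrained model, with no manual setup required.

Before running the script, make sure conda and git are installed and available on your system PATH.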
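
One way to confirm the PATH requirement before launching the setup script is a small preflight check. The sketch below uses only the Python standard library. `missing_tools` is a name chosen for this example and is not part of the repository.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str]) -> List[str]:
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(["conda", "git"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("conda and git are both available")
```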

▶️ Run the Demo

#### Linux/macOS

chmod +x scripts/demo.sh
./scripts/demo.sh <path-to-your-demo-video> <title> <CoT description> [use-half]
#### Windows

You can use the provided .bat script instead:

.\scripts\demo.bat <path-to-your-demo-video> <title> <CoT description> [use-half]
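
CoT descriptions usually contain spaces and punctuation, so passing them through a shell invites quoting mistakes. The sketch below builds the argument list in the documented order (`<path-to-your-demo-video> <title> <CoT description> [use-half]`) so it can be handed to `subprocess.run` without a shell. `build_demo_command` and the sample values are hypothetical. The default script path assumes the command is run from the repository root.

```python
from typing import List


def build_demo_command(video: str, title: str, cot: str,
                       use_half: bool = False,
                       script: str = "./scripts/demo.sh") -> List[str]:
    """Build the argument list in the order documented for the demo script."""
    cmd = [script, video, title, cot]
    if use_half:
        cmd.append("use-half")  # optional trailing flag
    return cmd


if __name__ == "__main__":
    # Sample values are placeholders, not files shipped with the repository.
    cmd = build_demo_command(
        "videos/demo.mp4",
        "dog barking",
        "A dog barks twice, then a door closes.",
        use_half=True,
    )
    print(cmd)
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

On Windows, the same list can be built by passing `script=r".\scripts\demo.bat"`.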
Note:

<path-to-your-demo-video>: The path to a single video.
[use-half] (optional): Add use-half at the end to enable half-precision feature extraction.

---

📦 Batch Inference

#### Linux/macOS

chmod +x scripts/eval_batch.sh
./scripts/eval_batch.sh <video_path> <csv_path> <save_path (optional)> [use-half]
#### Windows

Use the equivalent .bat script:

.\scripts\eval_batch.bat <video_path> <csv_path> <save_path (optional)> [use-half]
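
The batch script expects every .mp4 under the video root to have the same duration, so a quick check before a long run can save time. This is a minimal sketch, assuming `ffprobe` (installed alongside FFmpeg) is on PATH and that the videos sit directly in the root directory. The 0.05-second tolerance, the `videos` directory name, and both function names are assumptions made for this example.

```python
import statistics
import subprocess
from pathlib import Path
from typing import Dict, List


def probe_duration(path: Path) -> float:
    """Ask ffprobe for the container duration in seconds."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", str(path)],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())


def find_outliers(durations: Dict[str, float], tolerance: float = 0.05) -> List[str]:
    """Return names whose duration differs from the median by more than `tolerance` seconds."""
    if not durations:
        return []
    reference = statistics.median(durations.values())
    return sorted(name for name, seconds in durations.items()
                  if abs(seconds - reference) > tolerance)


if __name__ == "__main__":
    root = Path("videos")  # hypothetical <video_path>
    durations = {p.name: probe_duration(p) for p in sorted(root.glob("*.mp4"))}
    outliers = find_outliers(durations)
    if outliers:
        print("Videos with a different duration:", outliers)
    else:
        print(f"All {len(durations)} videos have matching durations")
```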
Note:

<video_path>: Path to the root directory containing all .mp4 videos to be processed (all videos must have the same duration).
<csv_path>: A CSV file with the text prompts for each video (see demo_test.csv for the format).
<save_path> (optional): Where to save the generated audio. Defaults to results/features.
[use-half] (optional): Add use-half at the end to enable half-precision feature extraction.

---

Web Interface Usage

For an interactive experience, launch the Gradio web interface:

python app.py

🏋️ Train the Model

See Training.md (https://raw.githubusercontent.com/FunAudioLLM/ThinkSound/master/docs/Training.md)

---

📝 TODO & Future Plans
- [ ] Release a more powerful foundation model covering multiple domains, for more engaging and immersive foley creation
- [ ] Add support for additional modalities and downstream tasks
- [ ] Release models at different scales
- [x] Open-source the AudioCoT dataset and automated pipeline
- [x] Release training scripts for ThinkSound models
- [x] Provide a beginner-friendly Windows quick-start README

---

📄 License

This project is released under the Apache 2.0 License.

Note:
The code, models, and dataset are for research and educational purposes only.
Commercial use is NOT permitted.
For commercial licensing, please contact the authors.

📦 Third-Party Components

Stable Audio Open VAE (by Stability AI): This repository includes a fine-tuned VAE from Stable Audio Open (https://huggingface.co/stabilityai/stable-audio-open-1.0/), licensed under the Stability AI Community License (https://raw.githubusercontent.com/FunAudioLLM/ThinkSound/master/./third_party/LICENSE_StabilityAI.md). Commercial use and redistribution require prior permission from Stability AI.
📘 All other code and models are released under the Apache License 2.0.

---

Acknowledgements

Many thanks to:

stable-audio-tools (by Stability AI): For providing an easy-to-use framework for audio generation, as well as the VAE module and weights.
MMAudio: For the implementation of the MM-DiT backbone in the audio domain.

---

📖 Citation

If you find ThinkSound useful in your research or work, please cite our paper:

@misc{liu2025thinksoundchainofthoughtreasoningmultimodal,
      title={ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing},
      author={Huadai Liu and Jialei Wang and Kaicheng Luo and Wen Wang and Qian Chen and Zhou Zhao and Wei Xue},
      year={2025},
      eprint={2506.21448},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2506.21448},
}

---

📬 Contact

✨ Feel free to open an issue (https://github.com/liuhuadai/ThinkSound/issues) or contact us via email (liuhuadai@zju.edu.cn) if you have any questions or suggestions!

---

Translated by Open Ai Tx (https://github.com/OpenAiTx/OpenAiTx) | Last indexed: 2026-01-07

---
        </div>
        
        <div class="original-link">
            <strong>Original README:</strong> <a href="https://raw.githubusercontent.com/FunAudioLLM/ThinkSound/master/README.md" target="_blank" rel="noopener noreferrer">View on GitHub</a>
        </div>
    </div>
    
    
</body>
</html>