ZipVoice⚡

Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching

Overview

ZipVoice is a series of fast and high-quality zero-shot TTS models based on flow matching.

1. Key features

- Fast and high-quality zero-shot TTS based on flow matching, with a distilled variant (ZipVoice-Distill) for even faster inference.
- Zero-shot voice cloning from a short prompt recording, in both Chinese and English.
- Two-party spoken dialogue generation, in single-channel or stereo form (ZipVoice-Dialog, ZipVoice-Dialog-Stereo).

2. Model variants

| Model Name | Description |
| ---------- | ----------- |
| ZipVoice | The basic model supporting zero-shot single-speaker TTS in both Chinese and English. |
| ZipVoice-Distill | The distilled version of ZipVoice, featuring improved speed with minimal performance degradation. |
| ZipVoice-Dialog | A dialogue generation model built on ZipVoice, capable of generating single-channel two-party spoken dialogues. |
| ZipVoice-Dialog-Stereo | The stereo version of ZipVoice-Dialog, supporting two-channel dialogue generation with each speaker assigned to a separate channel. |

News

2025/07/14: ZipVoice-Dialog and ZipVoice-Dialog-Stereo, two spoken dialogue generation models, are released (arXiv paper, demo page).

2025/07/14: The OpenDialog dataset, a 6.8k-hour spoken dialogue dataset, is released. Download from HuggingFace or ModelScope; see details in the arXiv paper.

2025/06/16: ZipVoice and ZipVoice-Distill are released (arXiv paper, demo page).

Installation

1. Clone the ZipVoice repository

git clone https://github.com/k2-fsa/ZipVoice.git

2. (Optional) Create a Python virtual environment

python3 -m venv zipvoice
source zipvoice/bin/activate

3. Install the required packages

pip install -r requirements.txt

4. Install k2 for training or efficient inference

k2 is required for training and can speed up inference. However, you can still run ZipVoice inference without installing k2.

Note: Make sure to install the k2 version that matches your PyTorch and CUDA versions. For example, if you are using PyTorch 2.5.1 and CUDA 12.1, you can install k2 as follows:

pip install k2==1.24.4.dev20250208+cuda12.1.torch2.5.1 -f https://k2-fsa.github.io/k2/cuda.html

Please refer to https://k2-fsa.org/get-started/k2/ for details. Users in mainland China can refer to https://k2-fsa.org/zh-CN/get-started/k2/.

To verify that k2 is installed successfully:

python3 -c "import k2; print(k2.__file__)"

Usage

1. Single-speaker speech generation

To generate single-speaker speech with our pre-trained ZipVoice or ZipVoice-Distill models, use the following commands (Required models will be downloaded from HuggingFace):

#### 1.1 Inference of a single sentence

python3 -m zipvoice.bin.infer_zipvoice \
    --model-name zipvoice \
    --prompt-wav prompt.wav \
    --prompt-text "I am the transcription of the prompt wav." \
    --text "I am the text to be synthesized." \
    --res-wav-path result.wav
#### 1.2 Inference for a list of sentences

python3 -m zipvoice.bin.infer_zipvoice \
    --model-name zipvoice \
    --test-list test.tsv \
    --res-dir results

2. Dialogue speech generation

#### 2.1 Inference command

To generate two-party spoken dialogues with our pre-trained ZipVoice-Dialog or ZipVoice-Dialog-Stereo models, use the following command (required models will be downloaded from HuggingFace):

python3 -m zipvoice.bin.infer_zipvoice_dialog \
    --model-name "zipvoice_dialog" \
    --test-list test.tsv \
    --res-dir results
This command uses ZipVoice-Dialog to generate mono dialogues; to generate stereo dialogues instead, select the ZipVoice-Dialog-Stereo model via --model-name.

#### 2.2 Input formats

Each line of test.tsv is in one of the following formats:

(1) Merged prompt format, where the prompt audios and transcriptions of the two speakers are merged into a single prompt wav file:

{wav_name}\t{prompt_transcription}\t{prompt_wav}\t{text}

(2) Split prompt format, where the prompt audios and transcriptions of the two speakers are provided in separate files:

{wav_name}\t{spk1_prompt_transcription}\t{spk2_prompt_transcription}\t{spk1_prompt_wav}\t{spk2_prompt_wav}\t{text}
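
For illustration only: all file names and texts below are made-up placeholders, \t stands for a literal tab character, and how speaker turns are marked inside the transcriptions and dialogue text is not specified here. A merged-prompt line following the template above could look like:

dialog_001.wav\tHello, nice to meet you. Nice to meet you too.\tprompt_merged.wav\tHow is the weather today? It is sunny and warm, let's go for a walk.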

3. Guidance for better usage

#### 3.1 Prompt length

We recommend a short prompt wav file (e.g., less than 3 seconds for single-speaker speech generation, less than 10 seconds for dialogue speech generation) for faster inference speed. A very long prompt will slow down inference and degrade speech quality.

#### 3.2 Speed optimization

If the inference speed is unsatisfactory, you can speed it up in two ways mentioned above: use the distilled model ZipVoice-Distill, which offers improved speed with minimal performance degradation, and install k2, which accelerates inference (see the Installation section). A command sketch follows.
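
For example, switching to the distilled model only changes the model name in the inference command. This is a sketch: the exact --model-name string for the distilled model (zipvoice_distill below) is an assumption based on the model naming, not confirmed here.

# "zipvoice_distill" is an assumed model-name string for ZipVoice-Distill.
python3 -m zipvoice.bin.infer_zipvoice \
    --model-name zipvoice_distill \
    --test-list test.tsv \
    --res-dir results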

#### 3.3 Memory control

The given text will be split into chunks based on punctuation (for single-speaker speech generation) or speaker-turn symbol (for dialogue speech generation). Then, the chunked texts will be processed in batches. Therefore, the model can process arbitrarily long text with almost constant memory usage. You can control memory usage by adjusting the --max-duration parameter.
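
As an illustration, a sketch reusing the list-inference command above; the value 20 is an arbitrary placeholder, and the unit and default of --max-duration are not specified here:

# The --max-duration value below is an arbitrary placeholder, not a recommendation.
python3 -m zipvoice.bin.infer_zipvoice \
    --model-name zipvoice \
    --test-list test.tsv \
    --res-dir results \
    --max-duration 20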

#### 3.4 "Raw" evaluation

By default, we preprocess inputs (prompt wav, prompt transcription, and text) for efficient inference and better performance. If you want to evaluate the model's "raw" performance using the exact provided inputs (e.g., to reproduce the results in our paper), you can pass --raw-evaluation True.
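
For example, a sketch that appends the flag to the list-inference command above (assuming the flag is accepted in this mode):

# --raw-evaluation True disables the default input preprocessing (assumed usage).
python3 -m zipvoice.bin.infer_zipvoice \
    --model-name zipvoice \
    --test-list test.tsv \
    --res-dir results \
    --raw-evaluation True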

#### 3.5 Short text

When generating speech for very short texts (e.g., one or two words), the generated speech may sometimes omit certain pronunciations. To resolve this issue, you can pass --speed 0.3 (where 0.3 is a tunable value) to extend the duration of the generated speech.
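
For example, a sketch based on the single-sentence command above, using the tunable value 0.3 mentioned in the text:

# --speed 0.3 extends the duration of the generated speech (slower speaking rate).
python3 -m zipvoice.bin.infer_zipvoice \
    --model-name zipvoice \
    --prompt-wav prompt.wav \
    --prompt-text "I am the transcription of the prompt wav." \
    --text "Hi." \
    --res-wav-path result.wav \
    --speed 0.3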

#### 3.6 Correcting mispronounced Chinese polyphonic characters

We use pypinyin to convert Chinese characters to pinyin. However, it can occasionally mispronounce polyphonic characters (多音字).

To manually correct these mispronunciations, enclose the corrected pinyin in angle brackets < > and include the tone mark.

Example (illustrative): to force the character 行 in 银行 to be read as háng, write 银<hang2>行.

> Note: If you want to manually assign multiple pinyins, enclose each pinyin with <>, e.g., 这把十公分

#### 3.7 Remove long silences from the generated speech

The model will automatically determine the positions and lengths of silences in the generated speech. Sometimes, it may produce long silences in the middle of the speech. If you do not want this, you can pass --remove-long-sil to remove long silences in the middle of the generated speech (edge silences will be removed by default).
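
For example, a sketch based on the single-sentence command above (assuming --remove-long-sil is a boolean switch that takes no value):

# --remove-long-sil removes long silences in the middle of the generated speech (assumed boolean switch).
python3 -m zipvoice.bin.infer_zipvoice \
    --model-name zipvoice \
    --prompt-wav prompt.wav \
    --prompt-text "I am the transcription of the prompt wav." \
    --text "I am the text to be synthesized." \
    --res-wav-path result.wav \
    --remove-long-sil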

#### 3.8 Model downloading

If you encounter difficulties connecting to HuggingFace when downloading the pre-trained models, try switching the endpoint to the mirror site: export HF_ENDPOINT=https://hf-mirror.com.

Train Your Own Model

See the egs directory for training, fine-tuning, and evaluation examples.

C++ Deployment

Check sherpa-onnx for the C++ deployment solution on CPU.

Discussion & Communication

You can discuss directly on GitHub Issues.

You can also scan the QR code to join our WeChat group or follow our WeChat official account.


Citation

@article{zhu2025zipvoice,
      title={ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching},
      author={Zhu, Han and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Li, Zhaoqing and Zhuang, Weiji and Lin, Long and Povey, Daniel},
      journal={arXiv preprint arXiv:2506.13053},
      year={2025}
}

@article{zhu2025zipvoicedialog,
      title={ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching},
      author={Zhu, Han and Kang, Wei and Guo, Liyong and Yao, Zengwei and Kuang, Fangjun and Zhuang, Weiji and Li, Zhaoqing and Han, Zhifeng and Zhang, Dong and Zhang, Xin and Song, Xingchen and Lin, Long and Povey, Daniel},
      journal={arXiv preprint arXiv:2507.09318},
      year={2025}
}
