index-tts-lora
This project builds on Bilibili's index-tts and provides LoRA fine-tuning recipes for both single-speaker and multi-speaker scenarios. The goal is to improve the prosody and naturalness of speech synthesized from high-quality speaker audio.
Training & Inference
#### 1. Audio token and speaker condition extraction
    # Extract tokens and speaker conditions
    python tools/extract_codec.py --audio_list ${audio_list} --extract_condition

`audio_list` format: `audio_path` + transcript, separated by `\t`

    /path/to/audio.wav	小朋友们,大家好,我是凯叔,今天我们讲一个龟兔赛跑的故事。

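The `audio_list` file is plain text with one tab-separated `audio_path` and transcript per line. As a minimal sketch, assuming the (path, transcript) pairs are already available in memory, the file could be written and checked as shown here. The helper names `write_audio_list` and `read_audio_list` are hypothetical and are not part of this repository.

```python
from pathlib import Path


def write_audio_list(pairs, out_path):
    """Write (audio_path, transcript) pairs, one tab-separated entry per line.

    Hypothetical helper for illustration; not part of index-tts-lora.
    """
    lines = []
    for audio_path, transcript in pairs:
        # A tab or newline inside either field would break the two-column layout.
        if any(ch in audio_path or ch in transcript for ch in "\t\n"):
            raise ValueError(f"tab or newline inside entry: {audio_path!r}")
        lines.append(f"{audio_path}\t{transcript}")
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def read_audio_list(path):
    """Parse an audio_list file back into (audio_path, transcript) pairs."""
    pairs = []
    text = Path(path).read_text(encoding="utf-8")
    for line_no, line in enumerate(text.splitlines(), 1):
        if not line.strip():
            continue  # tolerate blank lines
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError(
                f"line {line_no}: expected 2 tab-separated fields, got {len(parts)}"
            )
        pairs.append((parts[0], parts[1]))
    return pairs
```

Reading the file back with `read_audio_list` before running extraction is a cheap way to catch lines where the tab was replaced by spaces during copy and paste.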
After extraction, the processed files and `speaker_info.json` will be generated under the `finetune_data/processed_data/` directory. For example:

    [
        {
            "speaker": "kaishu_30min",
            "avg_duration": 6.6729,
            "sample_num": 270,
            "total_duration_in_seconds": 1801.696,
            "total_duration_in_minutes": 30.028,
            "total_duration_in_hours": 0.500,
            "train_jsonl": "/path/to/kaishu_30min/metadata_train.jsonl",
            "valid_jsonl": "/path/to/kaishu_30min/metadata_valid.jsonl",
            "medoid_condition": "/path/to/kaishu_30min/medoid_condition.npy"
        }
    ]

#### 2. Training
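Before launching a run, it may be worth sanity-checking the `speaker_info.json` produced by the extraction step. The sketch below assumes each entry carries the same keys as the example in step 1 and only verifies that the duration fields agree with each other. The function names `check_speaker_entry` and `check_speaker_info` and the 1% tolerance are illustrative assumptions, not part of this repository.

```python
import json
import math

# Keys taken from the example entry in step 1; assumed to be present in every entry.
REQUIRED_KEYS = (
    "speaker", "avg_duration", "sample_num",
    "total_duration_in_seconds", "total_duration_in_minutes",
    "total_duration_in_hours", "train_jsonl", "valid_jsonl", "medoid_condition",
)


def check_speaker_entry(entry, rel_tol=1e-2):
    """Return a list of human-readable problems found in one speaker entry."""
    missing = [key for key in REQUIRED_KEYS if key not in entry]
    if missing:
        return [f"missing keys: {', '.join(missing)}"]

    total_s = entry["total_duration_in_seconds"]
    # Each expression should reproduce the total duration in seconds,
    # up to the rounding applied when the file was written.
    derived = {
        "avg_duration * sample_num": entry["avg_duration"] * entry["sample_num"],
        "total_duration_in_minutes * 60": entry["total_duration_in_minutes"] * 60,
        "total_duration_in_hours * 3600": entry["total_duration_in_hours"] * 3600,
    }
    problems = []
    for label, value in derived.items():
        if not math.isclose(value, total_s, rel_tol=rel_tol):
            problems.append(
                f"{label} = {value:.3f}, but total_duration_in_seconds = {total_s:.3f}"
            )
    return problems


def check_speaker_info(path):
    """Map each speaker name in the JSON file to its list of problems."""
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    return {
        entry.get("speaker", "<unknown>"): check_speaker_entry(entry)
        for entry in entries
    }
```

Calling `check_speaker_info("finetune_data/processed_data/speaker_info.json")` returns a dictionary keyed by speaker name; an empty list for a speaker means its entry passed these checks.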
Launch fine-tuning:

    python train.py

#### 3. Inference

Run inference:

    python indextts/infer.py

Fine-tuning Results
This experiment uses Chinese audio data from Kai Shu Tells Stories: about 30 minutes in total across 270 audio clips, split into 244 training samples and 26 validation samples. Note: the transcripts were generated automatically by ASR and punctuation models and were not manually corrected, so some errors are expected.
Example training sample: "他上了马车,来到了皇宫之中。" ("He got into the carriage and arrived at the palace."), audio: kaishu_train_01.wav
#### 1. Speech Synthesis Examples
| Text | Audio |
| --- | --- |
| The old house clock stopped at midnight three o’clock, and a string of unfamiliar footprints appeared in the dust. The detective crouched down and found a blood-stained ring hidden in the floorboard gap. | kaishu_cn_1.wav |
| Under the moonlight, the pumpkin suddenly grew a smiling face, and the vines twisted to push open the garden fence. The little girl stood on tiptoe, hearing the mushrooms humming an ancient lullaby. | kaishu_cn_2.wav |
| Then in Java, intermediate levels still need to learn M as well as external front-end application system development, need to learn JavaScript databases, and need to learn to make dynamic websites. | kaishu_cn_en_mix_1.wav |
| This financial report analyzes in detail the company’s revenue performance and expenditure trends in the past quarter. | kaishu_cn_en_mix_2.wav |
| Going up and down the mountain, one mountain after another, running three li three meters three, climbed a high mountain, altitude three hundred and three. On the mountain, shouted loudly: I am three chi three taller than the mountain. | kaishu_raokouling.wav |
| A thin man lies against the side of the street with his shirt and a shoe off and bags nearby. | kaishu_en_1.wav |
| As research continued, the protective effect of fluoride against dental decay was demonstrated. | kaishu_en_2.wav |
#### 2. Model Evaluation
For details of the evaluation set, see: 2025 Benchmark of Mainstream TTS Models: Who Is the Best Voice Synthesis Solution?
Acknowledgements