index-tts-lora

This project builds on Bilibili's index-tts and provides LoRA fine-tuning for both single-speaker and multi-speaker scenarios. It aims to improve the prosody and naturalness of synthesized speech for speakers with high-quality audio data.

Training & Inference

#### 1. Audio token and speaker condition extraction

Extract tokens and speaker conditions:

    python tools/extract_codec.py --audio_list ${audio_list} --extract_condition

Each line of the audio_list file holds an audio path and its transcript, separated by a tab (\t):

    /path/to/audio.wav	小朋友们，大家好，我是凯叔，今天我们讲一个龟兔赛跑的故事。
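
Since the audio_list format is one tab-separated line per clip, a small helper can write and validate such a file before running the extraction. The following is a minimal sketch, not part of this repository; the function names `write_audio_list` and `parse_audio_list_line` are hypothetical.

```python
from typing import Iterable, Tuple


def parse_audio_list_line(line: str) -> Tuple[str, str]:
    """Split one audio_list line into (audio_path, transcript)."""
    # Split on the first tab only; everything after it is the transcript.
    audio_path, sep, transcript = line.rstrip("\r\n").partition("\t")
    if not sep or not audio_path.strip() or not transcript.strip():
        raise ValueError(f"expected 'audio_path<TAB>transcript', got: {line!r}")
    return audio_path.strip(), transcript.strip()


def write_audio_list(pairs: Iterable[Tuple[str, str]], out_path: str) -> None:
    """Write (audio_path, transcript) pairs, one tab-separated line each."""
    with open(out_path, "w", encoding="utf-8") as f:
        for audio_path, transcript in pairs:
            # Collapse embedded tabs/newlines so each clip stays on one line.
            clean = " ".join(transcript.split())
            f.write(f"{audio_path}\t{clean}\n")
```

Running `parse_audio_list_line` over every line of an existing file is a cheap way to catch rows where the tab was replaced by spaces.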

After extraction, the processed files and speaker_info.json will be generated under the finetune_data/processed_data/ directory. For example:

[
    {
        "speaker": "kaishu_30min",
        "avg_duration": 6.6729,
        "sample_num": 270,
        "total_duration_in_seconds": 1801.696,
        "total_duration_in_minutes": 30.028,
        "total_duration_in_hours": 0.500,
        "train_jsonl": "/path/to/kaishu_30min/metadata_train.jsonl",
        "valid_jsonl": "/path/to/kaishu_30min/metadata_valid.jsonl",
        "medoid_condition": "/path/to/kaishu_30min/medoid_condition.npy"
    }
]
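
The duration fields in speaker_info.json appear to be related to each other, so they can be sanity-checked after extraction. The sketch below assumes, based only on the field names, that `avg_duration * sample_num` approximates the total in seconds and that the minute and hour fields are rounded conversions of it. The helper names are hypothetical and the tolerance is an arbitrary choice to absorb rounding.

```python
import json
import math
from typing import List


def check_speaker_entry(entry: dict, rel_tol: float = 0.01) -> List[str]:
    """Return a list of inconsistencies found in one speaker entry."""
    problems = []
    seconds = entry["total_duration_in_seconds"]
    # Assumed relation: average clip length times clip count gives the total.
    if not math.isclose(entry["avg_duration"] * entry["sample_num"], seconds, rel_tol=rel_tol):
        problems.append("avg_duration * sample_num does not match total seconds")
    if not math.isclose(seconds / 60, entry["total_duration_in_minutes"], rel_tol=rel_tol):
        problems.append("total_duration_in_minutes does not match total seconds")
    if not math.isclose(seconds / 3600, entry["total_duration_in_hours"], rel_tol=rel_tol):
        problems.append("total_duration_in_hours does not match total seconds")
    return problems


def check_speaker_info(path: str) -> dict:
    """Map each speaker name to its list of problems (empty list means OK)."""
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    return {e["speaker"]: check_speaker_entry(e) for e in entries}
```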

#### 2. Training

    python train.py
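
The internals of train.py are not described here. As general background on what LoRA fine-tuning does, the sketch below shows the standard low-rank update on a single linear layer: the frozen weight `W` is left untouched and a trainable product `B A` of rank `r` is added, scaled by `alpha / r`. All names are illustrative and this is not the project's implementation.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector, alpha: float) -> Vector:
    """Compute y = W x + (alpha / r) * B (A x).

    W is d_out x d_in (frozen), A is r x d_in, B is d_out x r (both trainable).
    """
    r = len(A)  # LoRA rank = number of rows of A
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]
```

In the usual LoRA setup `B` starts at zero, so the adapted layer initially reproduces the base model's output and only drifts from it as `A` and `B` are trained.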

#### 3. Inference

    python indextts/infer.py

Fine-tuning Results

This experiment uses Chinese audio data from Kai Shu Tells Stories, with a total duration of ~30 minutes and 270 audio clips. The dataset is split into 244 training samples and 26 validation samples. Note: Transcripts were generated automatically via ASR and punctuation models, without manual correction, so some errors are expected.
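
How the 244/26 split was produced is not stated. One simple way to get a reproducible split of that size is a seeded shuffle followed by a fixed-size hold-out, sketched below; the function name and the seed are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_valid(samples: Sequence[T], valid_size: int, seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle deterministically and hold out `valid_size` samples for validation."""
    if not 0 <= valid_size <= len(samples):
        raise ValueError("valid_size must be between 0 and len(samples)")
    shuffled = list(samples)
    # A private Random instance keeps the split independent of global RNG state.
    random.Random(seed).shuffle(shuffled)
    return shuffled[valid_size:], shuffled[:valid_size]


# Example: 270 clips split into 244 training and 26 validation samples.
train, valid = split_train_valid(range(270), valid_size=26)
```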

Example training sample ("He got on the carriage and arrived at the palace."): kaishu_train_01.wav

#### 1. Speech Synthesis Examples

| Text | Audio |
| --- | --- |
| The old clock stopped at midnight three, and a string of unfamiliar footprints emerged from the dust. The detective crouched down and found a bloodstained ring hidden in the floor crack. | kaishu_cn_1.wav |
| Under the moonlight, the pumpkin suddenly grew a smiling face, the vines twisted and pushed open the garden gate. The little girl tiptoed and heard the mushrooms humming an ancient lullaby. | kaishu_cn_2.wav |
| So for intermediate level in Java, you still need to learn, M and then external frontend application system development, need to learn Java Script databases, and dynamic website creation. | kaishu_cn_en_mix_1.wav |
| This financial report provides a detailed analysis of the company's revenue performance and expenditure trends over the past quarter. | kaishu_cn_en_mix_2.wav |
| Up the mountain, down the mountain, up one mountain, then another, ran three li and three meters and three, climbed a tall mountain, altitude three hundred and three. Reached the top and shouted loudly: I am three feet three taller than the mountain. | kaishu_raokouling.wav |
| A thin man lies against the side of the street with his shirt and a shoe off and bags nearby. | kaishu_en_1.wav |
| As research continued, the protective effect of fluoride against dental decay was demonstrated. | kaishu_en_2.wav |

#### 2. Model Evaluation

Acknowledgements

index-tts

finetune-index-tts
