FAC-Synthesis

Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs

This is the official implementation of the paper: Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs.


Core Insight

Work smarter, not harder.

In the post-training stage of LLMs, instead of blindly adding massive amounts of surface-level diverse text, it is more effective to precisely identify and synthesize those truly missing key features. With only a small number of targeted synthetic samples, we can significantly fill the gaps in Feature Activation Coverage (FAC), leading to clear performance improvements on downstream tasks.

Why is this insight simple yet powerful?

Traditional data synthesis focuses on quantity and surface diversity (vocabulary, sentence patterns, topic distribution), but these are often just weak proxies. What truly determines a model's downstream performance is the coverage of key features required by the target task.

Our work reveals:

Figure 1: The Efficiency Frontier of Instruction Following Datasets. Our proposed method achieves a Win Rate on AlpacaEval 2.0 comparable to MAGPIE while using only 2K synthetic samples (vs. 300K for MAGPIE).


Getting Started

Installation

git clone https://github.com/Zhongzhi660/FAC-Synthesis.git
cd FAC-Synthesis
pip install -r requirements.txt


Repository Structure

FAC-Synthesis/
├── LICENSE
├── README.md
├── requirements.txt
│
├── sae_pretrain/                 # SAE pretraining
│   ├── datasets/                 # pretraining corpora (constructed from public sources)
│   └── outputs/                  # SAE pre-trained weights
│
├── sae_feature_analysis/         # SAE feature analysis pipeline
│   ├── interpret_features/       # feature interpretation (span collection + annotation)
│   ├── identify_task_relevant_features/   # task-relevant feature identification
│   └── identify_missing_features/         # missing-feature discovery (coverage gap)
│
├── fac_synthesis/                # FAC synthesis pipeline
│   ├── step1_contrastive_pair_construction/      # Step-1: contrastive pair construction
│   └── step2_feature_covered_sample_synthesis/   # Step-2: feature-covered synthesis
│
└── training_scripts/             # Downstream training / evaluation scripts
    ├── toxicity_detection/
    ├── reward_modeling/
    ├── instruction_following/
    └── behavior_steering/

Pre-training Sparse Autoencoders

Most of the scripts for SAE pretraining are located in sae_pretrain/. We provide pre-trained SAE checkpoints on Hugging Face. To pre-train SAEs yourself, run the following commands:

# Step-1: Collect hidden activations from the backbone LLM (e.g., layer 16)
python create_actvs_uni.py 0 0 1 meta-llama/Llama-3.1-8B-Instruct 16

# Step-2: Train SAEs on the target layer (e.g., layer 16)
python train_SAEs.py 0 16 meta-llama/Llama-3.1-8B-Instruct /sae_input/prompt_actvs_l16
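
For readers new to sparse autoencoders, the following is a minimal sketch of what an SAE computes on one hidden-activation vector: a ReLU encoder, a linear decoder, and a reconstruction loss with an L1 sparsity penalty. It is a generic textbook form written in plain Python for illustration only. The class name `TinySAE`, the dimensions, and the `l1_coeff` value are assumptions, and the architecture actually trained by train_SAEs.py may differ.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinySAE:
    """Illustrative ReLU sparse autoencoder (hypothetical, not the repository's model)."""

    def __init__(self, d_model, d_features, seed=0):
        rng = random.Random(seed)
        # Encoder maps d_model -> d_features; decoder maps back.
        self.W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_features)]
        self.b_enc = [0.0] * d_features
        self.W_dec = [[rng.gauss(0, 0.1) for _ in range(d_features)] for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        # Feature activations: ReLU(W_enc x + b_enc), so every entry is >= 0.
        pre = matvec(self.W_enc, x)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, f):
        # Reconstruction: W_dec f + b_dec.
        return [r + b for r, b in zip(matvec(self.W_dec, f), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        # Squared reconstruction error plus an L1 penalty on the activations.
        f = self.encode(x)
        x_hat = self.decode(f)
        recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        sparsity = sum(abs(a) for a in f)
        return recon + l1_coeff * sparsity


# Toy activation vector standing in for one token's hidden state.
sae = TinySAE(d_model=4, d_features=8)
x = [1.0, -0.5, 0.25, 2.0]
features = sae.encode(x)
reconstruction = sae.decode(features)
loss_value = sae.loss(x)
```

In a real run, `x` would be a hidden state collected in Step-1, and the weights would be updated by gradient descent on this loss over many activations.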

Analyzing SAE Features

Feature analysis scripts are located in sae_feature_analysis/. To group activation spans and generate human-readable feature interpretations, run:

# Step-1: Group extracted activation spans
python groupby_textspans.py /xxx/threshold_0.0

# Step-2: Annotate feature explanations based on grouped spans
python annotate_explanations.py /xxx/threshold_0.0.tsv

# Step-3: Identify task-relevant features from the explanations
python annotate_toxicity.py /xxx/threshold_0.0_explained.tsv

# Step-4: Identify missing features via FAC analysis
# anchor_features.tsv: the complete feature set; task_features.tsv: the features currently available
python identify_fac.py anchor_features.tsv task_features.tsv
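
The comparison in Step-4 can be read as a set difference between the complete (anchor) feature set and the features currently available. The sketch below illustrates that reading under stated assumptions: the `feature_id` column name, the in-memory TSV contents, and the helper names `read_feature_ids` and `coverage_gap` are all hypothetical, and the real file layout and logic of identify_fac.py may differ.

```python
import csv
import io


def read_feature_ids(handle, column="feature_id"):
    """Collect the feature ids from a TSV file-like object (column name is assumed)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[column] for row in reader}


def coverage_gap(anchor_ids, task_ids):
    """Return the anchor features that are not yet covered, plus the coverage ratio."""
    missing = sorted(anchor_ids - task_ids)
    covered = len(anchor_ids & task_ids)
    ratio = covered / len(anchor_ids) if anchor_ids else 0.0
    return missing, ratio


# Toy stand-ins for anchor_features.tsv and task_features.tsv.
anchor_tsv = io.StringIO("feature_id\texplanation\n12\tinsults\n57\tthreats\n90\tslurs\n")
task_tsv = io.StringIO("feature_id\texplanation\n12\tinsults\n")

missing, ratio = coverage_gap(read_feature_ids(anchor_tsv), read_feature_ids(task_tsv))
print(missing, round(ratio, 3))
```

The features returned in `missing` are the ones a synthesis step would then target.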

Coverage-guided data synthesis

Coverage-guided synthesis scripts are located in fac_synthesis/. To generate synthetic queries, run:

# Step-1 (1): Contrastive Pair Construction
python generate_data_llama_r1.py \
  --features xxx.tsv \
  --out xxx \
  --temperature 0.8

# Step-1 (2): Feature-Covered Sample Synthesis
python analyze_step1_synthetic_data.py
python merge_step1_failed_cases.py

# Step-2: Feature-Covered Sample Synthesis
python generate_data_llama_r2.py \
  --features xxx.tsv \
  --out xxx \
  --temperature 0.8
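
If you prefer to drive the synthesis steps from a single script, a small wrapper along the following lines may help. It only assembles the commands listed in this section, in the order they are listed, and prints them as a dry run. The function name `build_synthesis_commands` is hypothetical, the `xxx` placeholders are left for you to fill in, and reusing the same `--features` and `--out` values for both rounds is an assumption made for brevity.

```python
import shlex


def build_synthesis_commands(features, out, temperature=0.8):
    """Assemble the synthesis commands in the order given in this README."""
    gen_flags = ["--features", features, "--out", out, "--temperature", str(temperature)]
    return [
        ["python", "generate_data_llama_r1.py", *gen_flags],
        ["python", "analyze_step1_synthetic_data.py"],
        ["python", "merge_step1_failed_cases.py"],
        ["python", "generate_data_llama_r2.py", *gen_flags],
    ]


# Placeholders kept as in the README; replace them with your own paths.
commands = build_synthesis_commands("xxx.tsv", "xxx")
for cmd in commands:
    # Dry run: print each command instead of executing it.
    print(shlex.join(cmd))
```

To execute the steps rather than print them, each `cmd` can be passed to `subprocess.run(cmd, check=True)` from the directory that contains the scripts.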


Acknowledgements

In the evaluation stage, our downstream training and testing scripts are adapted from the following open-source repositories:

Citation

If you find this work helpful for your research, please cite our paper 🤩:

@article{li2026less,
  title={Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs},
  author={Li, Zhongzhi and Wu, Xuansheng and Li, Yijiang and Hu, Lijie and Liu, Ninghao},
  journal={arXiv preprint arXiv:2602.10388},
  year={2026}
}
