🌹 Unifloral: Unified Offline Reinforcement Learning

Unified implementations and rigorous evaluation for offline reinforcement learning - built by Matthew Jackson, Uljad Berdica, and Jarek Liesen.

💡 Code Philosophy

Inspired by CORL and CleanRL - check them out!

🤖 Algorithms

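We provide two types of algorithm implementation: standalone single-file scripts and configs for a unified implementation. These are listed in the Standalone and Unified columns of the tables below.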

After training, final evaluation results are saved to .npz files in final_returns/ for analysis using our evaluation protocol.

All scripts support D4RL and use Weights & Biases for logging, with configs provided as WandB sweep files.
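To check what a training run wrote to final_returns/, the .npz files can be inspected with NumPy. The following is a minimal sketch, not part of the repository: the helper name `summarise_npz_dir` is hypothetical, and it makes no assumption about which array names the files contain; it only lists whatever is stored.

```python
from pathlib import Path

import numpy as np


def summarise_npz_dir(directory):
    """Map each .npz file under `directory` to {array_name: shape}.

    Hypothetical helper for illustration; it does not assume any
    particular array names inside the files.
    """
    root = Path(directory)
    summary = {}
    for path in sorted(root.rglob("*.npz")):
        shapes = {}
        with np.load(path) as data:
            for key in data.files:
                try:
                    shapes[key] = data[key].shape
                except ValueError:
                    # Object arrays (e.g. pickled metadata) cannot be read
                    # with the default allow_pickle=False; record None.
                    shapes[key] = None
        summary[str(path.relative_to(root))] = shapes
    return summary
```

Calling `summarise_npz_dir("final_returns")` after a run returns one entry per saved file, which is a quick way to confirm the array shapes before passing them to the evaluation script.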

Model-free

| Algorithm | Standalone | Unified | Extras |
| --- | --- | --- | --- |
| BC | bc.py | unifloral/bc.yaml | - |
| SAC-N | sac_n.py | unifloral/sac_n.yaml | [[ArXiv]](https://arxiv.org/abs/2110.01548) |
| EDAC | edac.py | unifloral/edac.yaml | [[ArXiv]](https://arxiv.org/abs/2110.01548) |
| CQL | cql.py | - | [[ArXiv]](https://arxiv.org/abs/2006.04779) |
| IQL | iql.py | unifloral/iql.yaml | [[ArXiv]](https://arxiv.org/abs/2110.06169) |
| TD3-BC | td3_bc.py | unifloral/td3_bc.yaml | [[ArXiv]](https://arxiv.org/abs/2106.06860) |
| ReBRAC | rebrac.py | unifloral/rebrac.yaml | [[ArXiv]](https://arxiv.org/abs/2305.09836) |
| TD3-AWR | - | unifloral/td3_awr.yaml | [[ArXiv]](https://arxiv.org/abs/2504.11453) |

Model-based

We implement a single script for dynamics model training: dynamics.py, with config dynamics.yaml.

| Algorithm | Standalone | Unified | Extras |
| --- | --- | --- | --- |
| MOPO | mopo.py | - | [[ArXiv]](https://arxiv.org/abs/2005.13239) |
| MOReL | morel.py | - | [[ArXiv]](https://arxiv.org/abs/2005.05951) |
| COMBO | combo.py | - | [[ArXiv]](https://arxiv.org/abs/2102.08363) |
| MoBRAC | - | unifloral/mobrac.yaml | [[ArXiv]](https://arxiv.org/abs/2504.11453) |

New ones coming soon 👀

📊 Evaluation

Our evaluation script (evaluation.py) implements the protocol described in our paper, analysing the performance of a UCB bandit over a range of policy evaluations.

```python
from evaluation import load_results_dataframe, bootstrap_bandit_trials
import jax.numpy as jnp

# Load all results from the final_returns directory
df = load_results_dataframe("final_returns")

# Run bandit trials with bootstrapped confidence intervals
results = bootstrap_bandit_trials(
    returns_array=jnp.array(policy_returns),  # Shape: (num_policies, num_rollouts)
    num_subsample=8,      # Number of policies to subsample
    num_repeats=1000,     # Number of bandit trials
    max_pulls=200,        # Maximum pulls per trial
    ucb_alpha=2.0,        # UCB exploration coefficient
    n_bootstraps=1000,    # Bootstrap samples for confidence intervals
    confidence=0.95,      # Confidence level
)

# Access results
pulls = results["pulls"]                         # Number of pulls at each step
means = results["estimated_bests_mean"]          # Mean score of estimated best policy
ci_low = results["estimated_bests_ci_low"]       # Lower confidence bound
ci_high = results["estimated_bests_ci_high"]     # Upper confidence bound
```
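For intuition about what a single bandit trial does, here is a minimal NumPy sketch of a UCB bandit choosing among policies using stored rollout returns. It is an illustration under stated assumptions, not the code in evaluation.py: the function name `ucb_bandit_trial`, the exact bonus formula `sqrt(ucb_alpha * ln(t) / n)`, and resampling stored rollouts as "pulls" are all assumptions, and the real script may normalise returns or differ in other details.

```python
import numpy as np


def ucb_bandit_trial(returns, max_pulls, ucb_alpha=2.0, seed=0):
    """Run one UCB trial over pre-collected policy returns (illustrative).

    returns: array of shape (num_policies, num_rollouts).
    Output: for each pull, the true mean return of the policy the bandit
    currently estimates to be best.
    """
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns, dtype=float)
    num_policies, num_rollouts = returns.shape
    true_means = returns.mean(axis=1)

    counts = np.zeros(num_policies)  # pulls per policy
    sums = np.zeros(num_policies)    # summed observed returns per policy
    estimated_bests = []

    for t in range(1, max_pulls + 1):
        if t <= num_policies:
            arm = t - 1  # pull every policy once before using UCB
        else:
            bonus = np.sqrt(ucb_alpha * np.log(t) / counts)
            arm = int(np.argmax(sums / counts + bonus))

        # A "pull" here resamples one stored rollout return (assumption).
        reward = returns[arm, rng.integers(num_rollouts)]
        counts[arm] += 1
        sums[arm] += reward

        # Policies not yet pulled cannot be the estimated best.
        est = np.where(counts > 0, sums / np.maximum(counts, 1), -np.inf)
        estimated_bests.append(true_means[int(np.argmax(est))])

    return np.array(estimated_bests)
```

Averaging this curve over many trials, each on a random subsample of policies, gives a curve of expected performance against the number of online evaluations. That is the kind of quantity the `estimated_bests_mean` output above reports.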

📝 Cite us!

@misc{jackson2025clean,
      title={A Clean Slate for Offline Reinforcement Learning},
      author={Matthew Thomas Jackson and Uljad Berdica and Jarek Liesen and Shimon Whiteson and Jakob Nicolaus Foerster},
      year={2025},
      eprint={2504.11453},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.11453},
}
