🔍 Wimsey
Wimsey is a lightweight, flexible and fully open-source data contract library.
- 🐋 Bring your own dataframe library: Built on top of Narwhals so your tests are carried out natively in your own dataframe library (including Pandas, Polars, Pyspark, Dask, DuckDB, CuDF, Rapids, Arrow and Modin)
- 🎍 Bring your own contract format: Write contracts in yaml, json or python - whichever you prefer!
- 🪶 Ultra Lightweight: Built for fast imports and minimal overhead with only two dependencies (Narwhals and FSSpec)
- 🥔 Simple, easy API: Low mental overheads with two simple functions for testing dataframes, and a simple dataclass for results.
What is a data contract?
As well as being a good buzzword to mention at your next data event, data contracts are a good way of testing data values at boundary points. Ideally, all data would be usable when you receive it, but you have probably already figured out that's not always the case.
A data contract is an expression of what should be true of some data - we might want to check that the only columns that exist are first_name, last_name and rating, or we might want to check that rating is a number no greater than 10.
Wimsey lets you write contracts in json, yaml or python. Here's how the above checks would look in yaml:
- test: columns_should
  be:
    - first_name
    - last_name
    - rating
- column: rating
  test: max_should
  be_less_than_or_equal_to: 10
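Since contracts can also be written in json, the same two checks could be kept in a json file instead. The snippet below is a minimal sketch using only the Python standard library; it assumes the json form mirrors the yaml keys one-to-one, which is an assumption rather than something stated here.

```python
import json

# The same two checks as the yaml contract above, expressed as plain Python data.
# Assumption: a json contract uses exactly the same keys as the yaml one.
contract = [
    {"test": "columns_should", "be": ["first_name", "last_name", "rating"]},
    {"column": "rating", "test": "max_should", "be_less_than_or_equal_to": 10},
]

# Serialise to json text; this string could be saved as e.g. "tests.json".
contract_json = json.dumps(contract, indent=2)
print(contract_json)
```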
Wimsey can then execute tests for you in a couple of ways: validate, which throws an error if any test fails and otherwise passes back your dataframe, and test, which gives you a detailed rundown of individual test successes and failures.
Validate is designed to work nicely with polars or pandas pipe methods as a handy guard:
import polars as pl
import wimsey

df = (
    pl.read_csv("hopefully_nice_data.csv")
    .pipe(wimsey.validate, "tests.json")  # errors if the contract fails, otherwise passes the dataframe on
    .group_by("name").agg(pl.col("value").sum())
)
Test is a single function call, returning a FinalResult data-type:
import pandas as pd
import wimsey

df = pd.read_csv("hopefully_nice_data.csv")
results = wimsey.test(df, "tests.yaml")
if results.success:
    print("Yay we have good data! 🥳")
else:
    print("Oh nooo, something's up! 😭")
    # Show only the individual tests that failed
    print([i for i in results.results if not i.success])
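The failure list in the example above can be wrapped in a small helper that returns one line per failed test. This is a minimal sketch and not part of Wimsey: the name failure_report is hypothetical, and the stand-in objects only mimic the two attributes used in the example above (success and results), so the snippet runs without Wimsey or a data file.

```python
from types import SimpleNamespace


def failure_report(results):
    """Return one line per failed test, or an empty list if everything passed."""
    # Relies only on `.success` and `.results`, as used in the example above.
    if results.success:
        return []
    return [repr(item) for item in results.results if not item.success]


# Stand-in objects for illustration only (not Wimsey's real result classes).
passed = SimpleNamespace(success=True, label="columns check")
failed = SimpleNamespace(success=False, label="rating check")
fake_results = SimpleNamespace(success=False, results=[passed, failed])

for line in failure_report(fake_results):
    print(line)
```

With a real result from wimsey.test, the same helper should work as long as the object exposes those two attributes.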
Roadmap, Contributing & Feedback
Wimsey is mirrored on GitHub, but hosted and developed on Codeberg. Issues and pull requests are accepted on both.
The current focus is on refining profiling and test generation; if you have tests or features that would be helpful to you, feel free to reach out!