📊 🔎 ✅
Data validation for scientists, engineers, and analysts seeking correctness.
Pandera is a Union.ai open source project that provides a flexible and expressive API for performing data validation on dataframe-like objects. The goal of Pandera is to make data processing pipelines more readable and robust with statistically typed dataframes.
Pandera supports multiple dataframe libraries, including pandas, polars, pyspark, and more. To validate pandas DataFrames, install Pandera with the pandas extra:
With pip:
pip install 'pandera[pandas]'
With uv:
uv pip install 'pandera[pandas]'
With conda:
conda install -c conda-forge pandera-pandas
First, create a dataframe:
import pandas as pd
import pandera.pandas as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 2, 3],
    "column2": [1.1, 1.2, 1.3],
    "column3": ["a", "b", "c"],
})
Validate the data using the object-based API:
# define a schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, pa.Check.ge(0)),
    "column2": pa.Column(float, pa.Check.lt(10)),
    "column3": pa.Column(
        str,
        [
            pa.Check.isin([*"abc"]),
            pa.Check(lambda series: series.str.len() == 1),
        ]
    ),
})

print(schema.validate(df))
#    column1  column2 column3
# 0        1      1.1       a
# 1        2      1.2       b
# 2        3      1.3       c
Or validate the data using the class-based API:
# define a schema
class Schema(pa.DataFrameModel):
    column1: int = pa.Field(ge=0)
    column2: float = pa.Field(lt=10)
    column3: str = pa.Field(isin=[*"abc"])

    @pa.check("column3")
    def custom_check(cls, series: pd.Series) -> pd.Series:
        return series.str.len() == 1

print(Schema.validate(df))
#    column1  column2 column3
# 0        1      1.1       a
# 1        2      1.2       b
# 2        3      1.3       c
[!WARNING]
Pandera `v0.24.0` introduces the `pandera.pandas` module, which is now the (highly) recommended way of defining `DataFrameSchema`s and `DataFrameModel`s for `pandas` data structures like `DataFrame`s. Defining a dataframe schema from the top-level `pandera` module will produce a `FutureWarning`:
```python
import pandera as pa

schema = pa.DataFrameSchema({"col": pa.Column(str)})
```
Update your import to `import pandera.pandas as pa`, and all of the rest of your pandera code should work. Using the top-level `pandera` module to access `DataFrameSchema` and the other pandera classes or functions will be deprecated in version `0.29.0`.
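Since the recommended import depends on the installed version, it can help to check it first. This is a minimal sketch using only the standard library; `supports_pandera_pandas` is a hypothetical helper, and it assumes version strings that start with numeric `major.minor` components.

```python
from importlib.metadata import PackageNotFoundError, version


def supports_pandera_pandas(ver: str) -> bool:
    """Hypothetical helper: True if `ver` is 0.24 or later."""
    major, minor = (int(part) for part in ver.split(".")[:2])
    return (major, minor) >= (0, 24)


try:
    installed = version("pandera")
except PackageNotFoundError:
    # pandera is not installed in this environment
    installed = None

if installed is not None:
    print(installed, supports_pandera_pandas(installed))
```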
See the official documentation to learn more.