github.com/Data-Centric-AI-Community/fg-data-profiling @4.19.1 sqlite

1,426 symbols 5,725 edges 288 files 247 documented · 17%

README

fg-data-profiling

ydata-profiling is now fg-data-profiling. Please follow the Migration Guide as soon as possible: the old package will no longer receive updates or bug fixes.

Documentation | Discord | Stack Overflow | Latest changelog

Do you like this project? Show us your love and give feedback!

fg-data-profiling's primary goal is to provide a one-line Exploratory Data Analysis (EDA) experience that is consistent and fast. Like the handy pandas df.describe() function, fg-data-profiling delivers an extended analysis of a DataFrame and allows that analysis to be exported in different formats such as HTML and JSON.

The package outputs a simple and digested analysis of a dataset, including time-series and text.

Looking for a scalable solution that can fully integrate with your database systems?

Use YData Fabric Data Catalog to connect to different databases and storage systems (Oracle, Snowflake, PostgreSQL, GCS, S3, etc.) and get an interactive, guided profiling experience in Fabric. Check out the Community Version.

Migration Guide

1. Uninstall the old package

pip uninstall ydata-profiling

2. Install the new package

pip install fg-data-profiling

3. Update your imports

Find and replace all occurrences of the old import in your codebase:

# Before
import ydata_profiling
from ydata_profiling import ProfileReport

# After
import data_profiling
from data_profiling import ProfileReport

You can use this one-liner to find all affected files:

grep -r "ydata_profiling" . --include="*.py"
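
Once the affected files are known, the rename itself can be scripted. The following is a minimal sketch, not an official migration tool: the helper names `rewrite_imports` and `migrate_tree` are hypothetical, and it assumes the only change needed is swapping the module name `ydata_profiling` for `data_profiling`, as the Before/After snippet above suggests. It also rewrites the name inside comments and strings, so review the diff before committing.

```python
import re
from pathlib import Path

# Match the old module name only as a whole identifier, so names such as
# "my_ydata_profiling_utils" or "pkg.ydata_profiling" are left untouched.
OLD_MODULE = re.compile(r"(?<![\w.])ydata_profiling\b")


def rewrite_imports(source: str) -> str:
    """Return `source` with the old module name replaced by the new one."""
    return OLD_MODULE.sub("data_profiling", source)


def migrate_tree(root: str) -> list[Path]:
    """Rewrite every *.py file under `root` in place; return the files changed."""
    changed = []
    for path in Path(root).rglob("*.py"):
        original = path.read_text(encoding="utf-8")
        updated = rewrite_imports(original)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed
```

Calling `migrate_tree(".")` from the project root would apply the rewrite to the same files the grep command lists.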

▶️ Quickstart

Install

pip install fg-data-profiling

conda install -c conda-forge fg-data-profiling

Start profiling

Start by loading your pandas DataFrame as you normally would, e.g. by using:

import numpy as np
import pandas as pd
from data_profiling import ProfileReport

df = pd.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])

To generate the standard profiling report, merely run:

profile = ProfileReport(df, title="Profiling Report")

📊 Key features

Type inference: automatic detection of columns' data types (Categorical, Numerical, Date, etc.)
Warnings: A summary of the problems/challenges in the data that you might need to work on (missing data, inaccuracies, skewness, etc.)
Univariate analysis: including descriptive statistics (mean, median, mode, etc.) and informative visualizations such as distribution histograms
Multivariate analysis: including correlations, a detailed analysis of missing data, duplicate rows, and visual support for pairwise interactions between variables
Time-Series: including statistical information specific to time-dependent data, such as auto-correlation and seasonality, along with ACF and PACF plots.
Text analysis: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrillic)
File and Image analysis: file sizes, creation dates, dimensions, indication of truncated images and existence of EXIF metadata
Compare datasets: one-line solution to enable a fast and complete report on the comparison of datasets
Flexible output formats: all analyses can be exported to an HTML report that can be easily shared with different parties, as JSON for easy integration in automated systems, and as a widget in a Jupyter Notebook.
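
To make the auto-correlation mentioned under Time-Series concrete, here is a minimal, dependency-free sketch of the standard sample autocorrelation at a given lag. It illustrates the statistic only; it is not the code the package uses to build its ACF plots, and the function name `autocorrelation` is hypothetical.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at `lag` (0 <= lag < len(series))."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    # Denominator: total squared deviation from the mean.
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    # Numerator: products of deviations that are `lag` steps apart.
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return num / denom


# A series that flips sign every step is negatively correlated with itself
# at lag 1 and positively correlated at lag 2.
alternating = [1, -1, 1, -1, 1, -1, 1, -1]
print(autocorrelation(alternating, 1))  # -0.875
print(autocorrelation(alternating, 2))  # 0.75
```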

The report contains three additional sections:

Overview: mostly global details about the dataset (number of records, number of variables, overall missingness and duplicates, memory footprint)
Alerts: a comprehensive and automatic list of potential data quality issues (high correlation, skewness, uniformity, zeros, missing values, constant values, among others)
Reproduction: technical details about the analysis (time, version and configuration)
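
As a rough illustration of the kind of figures the Overview and Alerts sections summarise, the sketch below computes missingness, duplicate rows, and constant columns for a small table held as a list of dictionaries. It uses plain Python to stay dependency-free and is only a toy: the function name `overview` and the returned keys are hypothetical, and the package's own calculations and alert thresholds may differ.

```python
def overview(rows):
    """Summarise a table given as a list of dicts sharing the same keys."""
    columns = list(rows[0]) if rows else []
    n_cells = len(rows) * len(columns)
    # A cell counts as missing when its value is None.
    n_missing = sum(row[col] is None for row in rows for col in columns)
    # A row is a duplicate when every column value matches an earlier row.
    seen, n_duplicates = set(), 0
    for row in rows:
        key = tuple(row[col] for col in columns)
        if key in seen:
            n_duplicates += 1
        seen.add(key)
    # A column is constant when it holds exactly one distinct value.
    constant = [c for c in columns if len({row[c] for row in rows}) == 1]
    return {
        "n_records": len(rows),
        "n_variables": len(columns),
        "missing_fraction": n_missing / n_cells if n_cells else 0.0,
        "n_duplicates": n_duplicates,
        "constant_columns": constant,
    }


# Illustrative data: one missing score, one repeated row, one constant column.
rows = [
    {"id": 1, "city": "Porto", "score": 0.5},
    {"id": 2, "city": "Porto", "score": None},
    {"id": 1, "city": "Porto", "score": 0.5},
]
summary = overview(rows)
print(summary)
```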

🎁 Latest features

Want to scale? Check the latest release with ⭐⚡Spark support!
Looking for how you can do an EDA for Time-Series 🕛 ? Check this blogpost.
You want to compare 2 datasets and get a report? Check this blogpost

✨ Spark

Spark support has been released, but we are always looking for an extra pair of hands 👐. Check the current work in progress!

📝 Use cases

fg-data-profiling can be used for a variety of use cases. The documentation includes guides, tips and tricks for tackling them:

| Use case | Description |
|---|---|
| Comparing datasets | Comparing multiple versions of the same dataset |
| Profiling a Time-Series dataset | Generating a report for a time-series dataset with a single line of code |
| Profiling large datasets | Tips on how to prepare data and configure `fg-data-profiling` for working with large datasets |
| Handling sensitive data | Generating reports which are mindful about sensitive data in the input dataset |
| Dataset metadata and data dictionaries | Complementing the report with dataset details and column-specific data dictionaries |
| Customizing the report's appearance | Changing the appearance of the report's page and of the contained visualizations |
| Profiling Databases | For a seamless profiling experience in your organization's databases, check Fabric Data Catalog, which can consume data from different types of storage such as RDBMSs (Azure SQL, PostgreSQL, Oracle, etc.) and object storage (Google Cloud Storage, AWS S3, Snowflake, etc.), among others. |

Using inside Jupyter Notebooks

There are two interfaces to consume the report inside a Jupyter notebook: through widgets and through an embedded HTML report.

Notebook Widgets

To display the report as a set of widgets in a Jupyter Notebook, run:

profile.to_widgets()

The HTML report can be directly embedded in a cell in a similar fashion:

profile.to_notebook_iframe()

Exporting the report to a file

To generate an HTML report file, assign the ProfileReport to a variable and call its to_file() method:

profile.to_file("your_report.html")

Alternatively, the report's data can be obtained as JSON, either as a string or as a file:

# As a JSON string
json_data = profile.to_json()

# As a file
profile.to_file("your_report.json")
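
Because `to_json()` returns a string, the result can be inspected with Python's standard `json` module. The sketch below lists the top-level sections of a report without assuming their names. The helper `report_sections` is hypothetical, and the sample string stands in for real output, whose keys depend on the package version.

```python
import json


def report_sections(json_data: str) -> list[str]:
    """Return the sorted top-level keys of a JSON-encoded report."""
    data = json.loads(json_data)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return sorted(data)


# Stand-in for `profile.to_json()`; these keys are placeholders, not the real schema.
sample = '{"section_b": {}, "section_a": []}'
print(report_sections(sample))  # ['section_a', 'section_b']
```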

Using in the command line

For standard-formatted CSV files (which can be read directly by pandas without additional settings), the data_profiling executable can be used in the command line. The example below processes the data.csv dataset with a configuration file called default.yaml and writes a report titled Example Profiling Report to the file report.html.

data_profiling --title "Example Profiling Report" --config_file default.yaml data.csv report.html

Additional details on the CLI are available in the documentation.
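
When the CLI needs to be driven from a script, for example to profile several CSV files in a loop, the same command can be assembled as an argument list. This is a sketch under the assumption that the `data_profiling` executable is on `PATH` and accepts the options shown in the example above; `build_command` is a hypothetical helper, not part of the package.

```python
import subprocess


def build_command(csv_path, report_path, title, config_file=None):
    """Assemble the argument list for the data_profiling executable."""
    cmd = ["data_profiling", "--title", title]
    if config_file is not None:
        cmd += ["--config_file", config_file]
    # Positional arguments: input dataset first, output report second.
    cmd += [csv_path, report_path]
    return cmd


cmd = build_command("data.csv", "report.html", "Example Profiling Report", "default.yaml")
print(cmd)
# To actually run it (requires the package to be installed):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with titles or paths that contain spaces.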

👀 Examples

The following example reports showcase the package's capabilities across a wide range of datasets and data types:

Census Income (US Adult Census data relating income with other demographic properties)
NASA Meteorites (comprehensive set of meteorite landing - object properties and locations)
Titanic (the "Wonderwall" of datasets)
NZA (open data from the Dutch Healthcare Authority)
Stata Auto (1978 Automobile data)
Colors (a simple colors dataset)
Vektis (Vektis Dutch Healthcare data)
UCI Bank Dataset (marketing dataset from a bank)
Russian Vocabulary (100 most common Russian words, showcasing unicode text analysis)
Website Inaccessibility (website accessibility analysis, showcasing support for URL data)
Orange prices and Coal prices (simple pricing evolution datasets, showcasing the theming options)
USA Air Quality (Time-series air quality dataset EDA example)
HCC (open healthcare dataset, showcasing the comparison of two sets of data, before and after preprocessing)

🛠️ Installation

Additional details, including information about widget support, are available in the documentation.

Using pip

Core symbols most depended-on inside this repo

fmt_numeric

called by 68

src/data_profiling/report/formatters.py

to_html

called by 48

src/data_profiling/profile_report.py

to_file

called by 43

src/data_profiling/profile_report.py

fmt

called by 39

src/data_profiling/report/formatters.py

fmt_percent

called by 31

src/data_profiling/report/formatters.py

sum

called by 28

src/data_profiling/model/spark/missing_spark.py

get_description

called by 27

src/data_profiling/profile_report.py

cache_file

called by 22

src/data_profiling/utils/cache.py

Shape

Function 713

Method 509

Class 188

Route 16

Languages

Python 73%

TypeScript 27%

Modules by API surface

src/data_profiling/report/presentation/flavours/html/templates/wrapper/assets/bootstrap.bundle.min.js · 390 symbols

src/data_profiling/model/alerts.py · 83 symbols

src/data_profiling/config.py · 40 symbols

src/data_profiling/model/typeset.py · 39 symbols

src/data_profiling/visualisation/plot.py · 35 symbols

src/data_profiling/profile_report.py · 29 symbols

src/data_profiling/model/summary_algorithms.py · 21 symbols

src/data_profiling/report/formatters.py · 19 symbols

tests/unit/test_ge_integration_expectations.py · 16 symbols

src/data_profiling/model/typeset_relations.py · 16 symbols

src/data_profiling/model/correlations.py · 15 symbols

src/data_profiling/compare_reports.py · 15 symbols

For agents

$ claude mcp add fg-data-profiling \
  -- python -m otcore.mcp_server <graph>