De-identify text by detecting and redacting PII with intelligent merging.

Implements multiple de-identification strategies for HIPAA compliance:

- **mask**: Replace with placeholders like [NAME], [EMAIL], etc.
- **remove**: Remove PII text entirely (empty string)
- **replace**: Replace with fake but realistic data
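To make the strategy names concrete, here is a minimal sketch of how `mask`, `remove`, and a hash-style replacement could be applied once PII spans are known. It is not the library's implementation: `redact`, the `(start, end, label)` span shape, and the `LABEL_digest` token format are all assumptions made for illustration, and the spans are supplied by hand rather than detected by a model.

```python
import hashlib
from typing import List, Tuple

# Hypothetical span shape: (start offset, end offset, entity label).
Span = Tuple[int, int, str]


def redact(text: str, spans: List[Span], method: str = "mask") -> str:
    """Apply one illustrative strategy to non-overlapping spans."""
    out = text
    # Work right to left so replacing a span never shifts earlier offsets.
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        original = text[start:end]
        if method == "mask":
            replacement = f"[{label}]"
        elif method == "remove":
            replacement = ""
        elif method == "hash":
            # Same input always yields the same token, so repeated
            # mentions of one entity can still be linked.
            digest = hashlib.sha256(original.encode("utf-8")).hexdigest()[:8]
            replacement = f"{label}_{digest}"  # assumed token format
        else:
            raise ValueError(f"unsupported method in this sketch: {method}")
        out = out[:start] + replacement + out[end:]
    return out


text = "Contact John Doe at john@example.com"
name_start = text.index("John Doe")
email_start = text.index("john@example.com")
spans = [
    (name_start, name_start + len("John Doe"), "NAME"),
    (email_start, email_start + len("john@example.com"), "EMAIL"),
]

masked = redact(text, spans, "mask")
removed = redact(text, spans, "remove")
hashed = redact(text, spans, "hash")
```

Replacing from the end of the string backwards is a design choice that keeps the remaining offsets valid without recomputing them after each substitution.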
(
text: str,
method: DeidentificationMethod = "mask",
model_name: str = _DEFAULT_EN_MODEL,
confidence_threshold: float = 0.7, # Higher threshold for safety
keep_year: bool = False,
shift_dates: Optional[bool] = None,
date_shift_days: Optional[int] = None,
patient_key: Optional[str | bytes] = None,
date_shift_max_days: Optional[int] = None,
date_shift_secret: Optional[str | bytes] = None,
keep_mapping: bool = False,
config: Optional[OpenMedConfig] = None,
use_smart_merging: bool = True,
lang: str = "en",
normalize_accents: Optional[bool] = None,
use_safety_sweep: bool = True,
*,
consistent: bool = False,
seed: Optional[int] = None,
locale: Optional[str] = None,
surrogate_vault: Optional["SurrogateVault"] = None,
loader: Optional["ModelLoader"] = None,
policy: Optional[str] = None,
calibration_thresholds_path: Optional[str | Path] = None,
custom_recognizer: Any = None,
audit: bool = False,
cache_results: bool = False,
max_cache_entries: int = 128,
)
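The `use_smart_merging` flag refers to the regex-based merging of fragmented entities that the docstring describes (a date detected as `'01'` and `'/15/1970'` becoming `'01/15/1970'`). The sketch below shows one way such a merge could work. It is an assumption-laden illustration, not the library's code: `merge_fragments`, `DATE_RE`, and the span tuple shape are hypothetical, and only one simple date pattern is covered.

```python
import re
from typing import List, Tuple

# Hypothetical span shape: (start offset, end offset, entity label).
Span = Tuple[int, int, str]

# Illustrative pattern only; a real system would likely use several.
DATE_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")


def merge_fragments(text: str, spans: List[Span]) -> List[Span]:
    """Merge two touching spans when their joined text is a full date."""
    merged: List[Span] = []
    for start, end, label in sorted(spans):
        if merged:
            prev_start, prev_end, _ = merged[-1]
            # Merge only if the fragments are contiguous and the combined
            # text matches the whole date pattern.
            if prev_end == start and DATE_RE.fullmatch(text[prev_start:end]):
                merged[-1] = (prev_start, end, "DATE")
                continue
        merged.append((start, end, label))
    return merged


text = "DOB: 01/15/1970"
first = text.index("01")
second = text.index("/15/1970")
fragments = [
    (first, first + len("01"), "DATE"),
    (second, second + len("/15/1970"), "DATE"),
]
result = merge_fragments(text, fragments)
```

This version only joins a pair of fragments whose combined text is already a complete date; a date split into three or more pieces would need a lookahead over several spans.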
def deidentify(
    text: str,
    method: DeidentificationMethod = "mask",
    model_name: str = _DEFAULT_EN_MODEL,
    confidence_threshold: float = 0.7,  # Higher threshold for safety
    keep_year: bool = False,
    shift_dates: Optional[bool] = None,
    date_shift_days: Optional[int] = None,
    patient_key: Optional[str | bytes] = None,
    date_shift_max_days: Optional[int] = None,
    date_shift_secret: Optional[str | bytes] = None,
    keep_mapping: bool = False,
    config: Optional[OpenMedConfig] = None,
    use_smart_merging: bool = True,
    lang: str = "en",
    normalize_accents: Optional[bool] = None,
    use_safety_sweep: bool = True,
    *,
    consistent: bool = False,
    seed: Optional[int] = None,
    locale: Optional[str] = None,
    surrogate_vault: Optional["SurrogateVault"] = None,
    loader: Optional["ModelLoader"] = None,
    policy: Optional[str] = None,
    calibration_thresholds_path: Optional[str | Path] = None,
    custom_recognizer: Any = None,
    audit: bool = False,
    cache_results: bool = False,
    max_cache_entries: int = 128,
) -> DeidentificationResult | "AuditReport":
    """De-identify text by detecting and redacting PII with intelligent merging.

    Implements multiple de-identification strategies for HIPAA compliance:

    - **mask**: Replace with placeholders like [NAME], [EMAIL], etc.
    - **remove**: Remove PII text entirely (empty string)
    - **replace**: Replace with fake but realistic data
    - **hash**: Replace with consistent hashed values for entity linking
    - **format_preserve**: Replace structured identifiers with synthetic
      values that keep shape and separators, masking unsupported labels
    - **shift_dates**: Shift dates by random offset while preserving intervals

    Smart merging uses regex patterns to merge fragmented entities (e.g., dates
    split into '01' and '/15/1970' are merged into complete '01/15/1970').

    Args:
        text: Input text to de-identify
        method: De-identification method (mask, remove, replace, hash,
            shift_dates, format_preserve)
        model_name: PII detection model
        confidence_threshold: Minimum confidence for redaction (default 0.7 for safety)
        keep_year: For dates, keep the year unchanged
        shift_dates: Deprecated alias for ``method="shift_dates"``.
        date_shift_days: Specific number of days to shift when ``patient_key``
            is omitted. When ``patient_key`` is supplied, this is treated as a
            legacy maximum absolute offset bound unless ``date_shift_max_days``
            is also supplied.
        patient_key: Optional stable patient identifier used only to derive a