hub / github.com/maziyarpanahi/openmed / deidentify

Function deidentify

openmed/core/pii.py:1698–1897

De-identify text by detecting and redacting PII with intelligent merging. Implements multiple de-identification strategies for HIPAA compliance:

- **mask**: Replace with placeholders like [NAME], [EMAIL], etc.
- **remove**: Remove PII text entirely (empty string)
- **replace**: Rep…
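To make the strategies concrete, here is a minimal sketch of how mask, remove, and hash (all listed in the function's docstring) could be applied once PII spans are known. It is not the library's implementation: the `redact` and `span_of` helpers, the `(start, end, label)` span format, and the `LABEL_digest` hash output are assumptions made for illustration, and the spans are located by string search rather than by a model.

```python
import hashlib
from typing import List, Tuple

# Hypothetical span format for this sketch: (start, end, label).
Span = Tuple[int, int, str]


def span_of(text: str, needle: str, label: str) -> Span:
    """Locate a literal substring; stands in for a real PII detector."""
    start = text.index(needle)
    return (start, start + len(needle), label)


def redact(text: str, spans: List[Span], method: str = "mask") -> str:
    """Apply a mask, remove, or hash strategy to already-detected spans."""
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])  # copy untouched text before the span
        original = text[start:end]
        if method == "mask":
            out.append(f"[{label}]")
        elif method == "remove":
            out.append("")
        elif method == "hash":
            # Same input always yields the same token, which allows linking.
            digest = hashlib.sha256(original.encode("utf-8")).hexdigest()[:8]
            out.append(f"{label}_{digest}")
        else:
            raise ValueError(f"unsupported method: {method}")
        cursor = end
    out.append(text[cursor:])  # copy the tail after the last span
    return "".join(out)


text = "Contact Jane Doe at jane@example.com."
spans = [span_of(text, "Jane Doe", "NAME"), span_of(text, "jane@example.com", "EMAIL")]

masked = redact(text, spans, "mask")      # "Contact [NAME] at [EMAIL]."
removed = redact(text, spans, "remove")   # "Contact  at ."
hashed = redact(text, spans, "hash")
```

An unsalted, truncated SHA-256 is used here only to show consistency across calls; short identifiers hashed this way can be recovered by brute force, so a production system would normally use a keyed hash.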

(
    text: str,
    method: DeidentificationMethod = "mask",
    model_name: str = _DEFAULT_EN_MODEL,
    confidence_threshold: float = 0.7,  # Higher threshold for safety
    keep_year: bool = False,
    shift_dates: Optional[bool] = None,
    date_shift_days: Optional[int] = None,
    patient_key: Optional[str | bytes] = None,
    date_shift_max_days: Optional[int] = None,
    date_shift_secret: Optional[str | bytes] = None,
    keep_mapping: bool = False,
    config: Optional[OpenMedConfig] = None,
    use_smart_merging: bool = True,
    lang: str = "en",
    normalize_accents: Optional[bool] = None,
    use_safety_sweep: bool = True,
    *,
    consistent: bool = False,
    seed: Optional[int] = None,
    locale: Optional[str] = None,
    surrogate_vault: Optional["SurrogateVault"] = None,
    loader: Optional["ModelLoader"] = None,
    policy: Optional[str] = None,
    calibration_thresholds_path: Optional[str | Path] = None,
    custom_recognizer: Any = None,
    audit: bool = False,
    cache_results: bool = False,
    max_cache_entries: int = 128,
)
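
The signature exposes `patient_key`, `date_shift_secret`, and `date_shift_max_days`, which suggests that date offsets can be derived per patient rather than drawn fresh on every call. The sketch below shows one way such a derivation might work, assuming an HMAC-SHA256 over the patient key; the source shown here does not confirm that scheme, and `derive_offset_days` and `shift_date` are hypothetical names.

```python
import hashlib
import hmac
from datetime import date, timedelta


def derive_offset_days(patient_key: bytes, secret: bytes, max_days: int) -> int:
    """Map a patient key to a stable, nonzero offset in [-max_days, max_days]."""
    if max_days < 1:
        raise ValueError("max_days must be at least 1")
    digest = hmac.new(secret, patient_key, hashlib.sha256).digest()
    magnitude = int.from_bytes(digest[:8], "big") % max_days + 1  # 1..max_days
    sign = 1 if digest[8] & 1 else -1
    return sign * magnitude


def shift_date(d: date, offset_days: int) -> date:
    """Shift a single date by a fixed number of days."""
    return d + timedelta(days=offset_days)


offset = derive_offset_days(b"patient-123", b"demo-secret", max_days=30)
admitted = date(2020, 1, 15)
discharged = date(2020, 1, 22)
shifted = (shift_date(admitted, offset), shift_date(discharged, offset))
```

Because every date for the same patient moves by the same offset, the seven-day gap between admission and discharge survives the shift, which is the interval-preserving property the docstring describes for `shift_dates`.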

Source from the content-addressed store, hash-verified

def deidentify(
    text: str,
    method: DeidentificationMethod = "mask",
    model_name: str = _DEFAULT_EN_MODEL,
    confidence_threshold: float = 0.7,  # Higher threshold for safety
    keep_year: bool = False,
    shift_dates: Optional[bool] = None,
    date_shift_days: Optional[int] = None,
    patient_key: Optional[str | bytes] = None,
    date_shift_max_days: Optional[int] = None,
    date_shift_secret: Optional[str | bytes] = None,
    keep_mapping: bool = False,
    config: Optional[OpenMedConfig] = None,
    use_smart_merging: bool = True,
    lang: str = "en",
    normalize_accents: Optional[bool] = None,
    use_safety_sweep: bool = True,
    *,
    consistent: bool = False,
    seed: Optional[int] = None,
    locale: Optional[str] = None,
    surrogate_vault: Optional["SurrogateVault"] = None,
    loader: Optional["ModelLoader"] = None,
    policy: Optional[str] = None,
    calibration_thresholds_path: Optional[str | Path] = None,
    custom_recognizer: Any = None,
    audit: bool = False,
    cache_results: bool = False,
    max_cache_entries: int = 128,
) -> DeidentificationResult | "AuditReport":
    """De-identify text by detecting and redacting PII with intelligent merging.

    Implements multiple de-identification strategies for HIPAA compliance:

    - **mask**: Replace with placeholders like [NAME], [EMAIL], etc.
    - **remove**: Remove PII text entirely (empty string)
    - **replace**: Replace with fake but realistic data
    - **hash**: Replace with consistent hashed values for entity linking
    - **format_preserve**: Replace structured identifiers with synthetic
      values that keep shape and separators, masking unsupported labels
    - **shift_dates**: Shift dates by random offset while preserving intervals

    Smart merging uses regex patterns to merge fragmented entities (e.g., dates
    split into '01' and '/15/1970' are merged into complete '01/15/1970').

    Args:
        text: Input text to de-identify
        method: De-identification method (mask, remove, replace, hash,
            shift_dates, format_preserve)
        model_name: PII detection model
        confidence_threshold: Minimum confidence for redaction (default 0.7 for safety)
        keep_year: For dates, keep the year unchanged
        shift_dates: Deprecated alias for ``method="shift_dates"``.
        date_shift_days: Specific number of days to shift when ``patient_key``
            is omitted. When ``patient_key`` is supplied, this is treated as a
            legacy maximum absolute offset bound unless ``date_shift_max_days``
            is also supplied.
        patient_key: Optional stable patient identifier used only to derive a
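
The docstring's smart-merging example (a date detected as '01' and '/15/1970') can be sketched as follows. This is an illustrative approximation only: the single `MM/DD/YYYY` pattern, the `(start, end, label)` span format, and the `merge_fragments` helper are assumptions, and the library's actual patterns and merge rules are not shown in this excerpt.

```python
import re
from typing import List, Tuple

# Hypothetical span format for this sketch: (start, end, label).
Span = Tuple[int, int, str]

# Assumed pattern for this sketch: MM/DD/YYYY dates only.
DATE_RE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")


def merge_fragments(text: str, spans: List[Span], label: str = "DATE") -> List[Span]:
    """Replace fragments that fall inside one regex match with the full match."""
    merged: List[Span] = []
    consumed = set()
    for match in DATE_RE.finditer(text):
        inside = [
            i for i, (start, end, _) in enumerate(spans)
            if start >= match.start() and end <= match.end()
        ]
        if inside:
            consumed.update(inside)
            merged.append((match.start(), match.end(), label))
    # Keep every span that was not absorbed into a regex match.
    merged.extend(s for i, s in enumerate(spans) if i not in consumed)
    return sorted(merged)


text = "DOB: 01/15/1970, seen by Dr. Lee"
fragments = [(5, 7, "DATE"), (7, 15, "DATE"), (29, 32, "NAME")]
result = merge_fragments(text, fragments)
# result == [(5, 15, "DATE"), (29, 32, "NAME")]
```

Merging before redaction matters because masking the two fragments separately would emit two placeholders for one date, and a fragment that falls below the confidence threshold could leave part of the date in the output.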

Calls 6

run · Method · 0.95
make_cache_key · Function · 0.85
get_result_cache · Function · 0.85
Pipeline · Class · 0.85
get · Method · 0.45
set · Method · 0.45