Normalize an entity label to the canonical taxonomy.
(label: str, lang: str = "en")
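The lookup order of `normalize_label` (alias map first, then an uppercase match against the canonical set, then `OTHER`) can be tried out with the minimal sketch below. The module's real `_key`, `_ALIAS_MAP`, `CANONICAL_LABELS`, and `OTHER` are not shown in this section, so the versions here are hypothetical stand-ins chosen only to make the example runnable. In particular, the BIOES-prefix stripping in `_key` is an assumption about how tagged forms such as `B-NAME` are handled.

```python
import re

# Hypothetical stand-ins: the real module defines these elsewhere,
# and their actual contents are not shown in this section.
OTHER = "OTHER"
CANONICAL_LABELS = {"NAME", "EMAIL", "ID_NUM", "CREDIT_CARD", "PHONE", OTHER}
_ALIAS_MAP = {
    "firstname": "NAME",
    "name": "NAME",
    "email": "EMAIL",
    "idnum": "ID_NUM",
    "creditcard": "CREDIT_CARD",
}

_BIOES_PREFIX = re.compile(r"^[BIOES]-")


def _key(label: str) -> str:
    """Assumed behaviour: drop a BIOES prefix, lowercase, strip separators."""
    untagged = _BIOES_PREFIX.sub("", label.strip().upper())
    return re.sub(r"[^a-z0-9]", "", untagged.lower())


def normalize_label(label: str, lang: str = "en") -> str:
    """Same control flow as the listing; ``lang`` is accepted but unused."""
    if not label:
        return OTHER
    key = _key(label)
    if not key:
        return OTHER
    canonical = _ALIAS_MAP.get(key)
    if canonical is not None:
        return canonical
    # Fallback: the input may already be a canonical label missing from the alias map.
    upper = re.sub(r"[^A-Z0-9_]", "", label.upper().replace("-", "_").replace(" ", "_"))
    return upper if upper in CANONICAL_LABELS else OTHER


if __name__ == "__main__":
    for raw in ["first_name", "FIRSTNAME", "B-NAME", "First Name", "PHONE", "mystery"]:
        print(f"{raw!r} -> {normalize_label(raw)}")
```

With these stand-in tables, `PHONE` is deliberately left out of the alias map so that it exercises the uppercase fallback, while `mystery` matches neither path and resolves to `OTHER`.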

def normalize_label(label: str, lang: str = "en") -> str:
    """Normalize an entity label to the canonical taxonomy.

    Accepts any of:
      - English lowercase ``snake_case`` (``first_name``)
      - Portuguese ``UPPERCASE`` no-separator (``FIRSTNAME``)
      - BIOES-tagged forms (``B-NAME``, ``I-EMAIL``)
      - Mixed case with arbitrary separators (``First-Name``, ``First Name``)

    Unknown labels fall through to ``OTHER`` rather than raising; callers
    that need strict checking should compare against ``CANONICAL_LABELS``
    explicitly.

    Args:
        label: Source label as emitted by a model or registered in a config.
        lang: ISO 639-1 language hint (currently unused but reserved for
            language-conditional disambiguation, e.g. mapping ambiguous
            tokens differently per locale).

    Returns:
        A canonical label in ``UPPER_SNAKE_CASE``.
    """
    if not label:
        return OTHER
    key = _key(label)
    if not key:
        return OTHER
    canonical = _ALIAS_MAP.get(key)
    if canonical is not None:
        return canonical
    # If the input already matches a canonical label after stripping
    # separators (e.g. ``ID_NUM`` -> key ``idnum``, or ``CREDIT_CARD`` ->
    # key ``creditcard``), the alias map already covers it. The ``upper``
    # fallback handles any future canonical label not yet in the alias map.
    upper = re.sub(r"[^A-Z0-9_]", "", label.upper().replace("-", "_").replace(" ", "_"))
    if upper in CANONICAL_LABELS:
        return upper
    return OTHER


def id_subtype_for(label: str, lang: str = "en") -> str | None: