| 276 | |
| 277 | |
| 278 | def parse_email(data: Union[bytes, str]) -> Tuple[RawMetadata, Dict[str, List[str]]]: |
| 279 | """Parse a distribution's metadata stored as email headers (e.g. from ``METADATA``). |
| 280 | |
| 281 | This function returns a two-item tuple of dicts. The first dict is of |
| 282 | recognized fields from the core metadata specification. Fields that can be |
| 283 | parsed and translated into Python's built-in types are converted |
| 284 | appropriately. All other fields are left as-is. Fields that are allowed to |
| 285 | appear multiple times are stored as lists. |
| 286 | |
| 287 | The second dict contains all other fields from the metadata. This includes |
| 288 | any unrecognized fields. It also includes any fields which are expected to |
| 289 | be parsed into a built-in type but were not formatted appropriately. Finally, |
| 290 | any fields that are expected to appear only once but are repeated are |
| 291 | included in this dict. |
| 292 | |
| 293 | """ |
| 294 | raw: Dict[str, Union[str, List[str], Dict[str, str]]] = {} |
| 295 | unparsed: Dict[str, List[str]] = {} |
| 296 | |
| 297 | if isinstance(data, str): |
| 298 | parsed = email.parser.Parser(policy=email.policy.compat32).parsestr(data) |
| 299 | else: |
| 300 | parsed = email.parser.BytesParser(policy=email.policy.compat32).parsebytes(data) |
| 301 | |
| 302 | # We have to wrap parsed.keys() in a set, because in the case of multiple |
| 303 | # values for a key (a list), the key will appear multiple times in the |
| 304 | # list of keys, but we're avoiding that by using get_all(). |
| 305 | for name in frozenset(parsed.keys()): |
| 306 | # Header names in RFC are case insensitive, so we'll normalize to all |
| 307 | # lower case to make comparisons easier. |
| 308 | name = name.lower() |
| 309 | |
| 310 | # We use get_all() here, even for fields that aren't multiple use, |
| 311 | # because otherwise someone could have e.g. two Name fields, and we |
| 312 | # would just silently ignore it rather than doing something about it. |
| 313 | headers = parsed.get_all(name) or [] |
| 314 | |
| 315 | # The way the email module works when parsing bytes is that it |
| 316 | # unconditionally decodes the bytes as ascii using the surrogateescape |
| 317 | # handler. When you pull that data back out (such as with get_all() ), |
| 318 | # it looks to see if the str has any surrogate escapes, and if it does |
| 319 | # it wraps it in a Header object instead of returning the string. |
| 320 | # |
| 321 | # As such, we'll look for those Header objects, and fix up the encoding. |
| 322 | value = [] |
| 323 | # Flag if we have run into any issues processing the headers, thus |
| 324 | # signalling that the data belongs in 'unparsed'. |
| 325 | valid_encoding = True |
| 326 | for h in headers: |
| 327 | # It's unclear if this can return more types than just a Header or |
| 328 | # a str, so we'll just assert here to make sure. |
| 329 | assert isinstance(h, (email.header.Header, str)) |
| 330 | |
| 331 | # If it's a header object, we need to do our little dance to get |
| 332 | # the real data out of it. In cases where there is invalid data |
| 333 | # we're going to end up with mojibake, but there's no obvious, good |
| 334 | # way around that without reimplementing parts of the Header object |
| 335 | # ourselves. |
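
The header handling described in the comments above can be tried out with only the standard library. The following is a minimal sketch, not the function's actual implementation. `collect_headers` and `SAMPLE` are hypothetical names introduced here, and decoding every `Header` chunk as UTF-8 with `errors="replace"` is an assumption made for illustration, since the listing stops before the real decoding logic appears.

```python
import email.header
import email.parser
import email.policy

# Hypothetical sample input: a repeated Name field and a Summary whose
# value contains UTF-8 encoded, non-ASCII bytes.
SAMPLE = b"Name: foo\nName: bar\nSummary: caf\xc3\xa9\n\n"


def collect_headers(data: bytes) -> dict:
    """Group header values by lowercased name (illustrative helper only)."""
    parsed = email.parser.BytesParser(policy=email.policy.compat32).parsebytes(data)
    grouped = {}
    # keys() repeats a name once per occurrence, so deduplicate first and
    # let get_all() return every value for that name.
    for name in frozenset(parsed.keys()):
        name = name.lower()
        values = []
        for h in parsed.get_all(name) or []:
            if isinstance(h, email.header.Header):
                # Under compat32, values holding surrogate escapes come back
                # as Header objects. decode_header() exposes their chunks as
                # (bytes, charset) pairs; joining and decoding them as UTF-8
                # is an assumption of this sketch.
                raw = b"".join(chunk for chunk, _ in email.header.decode_header(h))
                values.append(raw.decode("utf-8", errors="replace"))
            else:
                values.append(h)
        grouped[name] = values
    return grouped


result = collect_headers(SAMPLE)
```

With this input, both `Name` values end up in one list under the key `name`, in their original order, and the `Summary` value is recovered from a `Header` object rather than a plain `str`. Using `errors="replace"` keeps the sketch from raising on bytes that are not valid UTF-8. The real function instead tracks such cases with its `valid_encoding` flag, so that the affected field can be reported in `unparsed`.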