    return None

def _getCharacterEncoding(http_headers, xml_data):
    '''Get the character encoding of the XML document

    http_headers is a dictionary
    xml_data is a raw string (not Unicode)

    This is so much trickier than it sounds, it's not even funny.
    According to RFC 3023 ('XML Media Types'), if the HTTP Content-Type
    is application/xml, application/*+xml,
    application/xml-external-parsed-entity, or application/xml-dtd,
    the encoding given in the charset parameter of the HTTP Content-Type
    takes precedence over the encoding given in the XML prefix within the
    document, and defaults to 'utf-8' if neither is specified.  But, if
    the HTTP Content-Type is text/xml, text/*+xml, or
    text/xml-external-parsed-entity, the encoding given in the XML prefix
    within the document is ALWAYS IGNORED and only the encoding given in
    the charset parameter of the HTTP Content-Type header should be
    respected, and it defaults to 'us-ascii' if not specified.

    Furthermore, discussion on the atom-syntax mailing list with the
    author of RFC 3023 leads me to the conclusion that any document
    served with a Content-Type of text/* and no charset parameter
    must be treated as us-ascii.  (We now do this.)  And also that it
    must always be flagged as non-well-formed.  (We now do this too.)

    If Content-Type is unspecified (input was local file or non-HTTP source)
    or unrecognized (server just got it totally wrong), then go by the
    encoding given in the XML prefix of the document and default to
    'iso-8859-1' as per the HTTP specification (RFC 2616).

    Then, assuming we didn't find a character encoding in the HTTP headers
    (and the HTTP Content-Type allowed us to look in the body), we need
    to sniff the first few bytes of the XML data and try to determine
    whether the encoding is ASCII-compatible.  Appendix F of the XML
    specification shows the way here:
    http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info

    If the sniffed encoding is not ASCII-compatible, we need to make it
    ASCII-compatible so that we can sniff further into the XML declaration
    to find the encoding attribute, which will tell us the true encoding.

    Of course, none of this guarantees that we will be able to parse the
    feed in the declared character encoding (assuming it was declared
    correctly, which is often not the case).  CJKCodecs and iconv_codec
    help a lot; you should definitely install them if you can.
    http://cjkpython.i18n.org/
    '''

    def _parseHTTPContentType(content_type):
        '''Takes an HTTP Content-Type header and returns (content type, charset)

        If no charset is specified, returns (content type, '')
        If no content type is specified, returns ('', '')
        Both return parameters are guaranteed to be lowercase strings
        '''
        content_type = content_type or ''
        # cgi.parse_header() lowercases parameter names but leaves the media
        # type and the parameter values untouched, so lowercase both results
        # here to honor the guarantee stated in the docstring.
        content_type, params = cgi.parse_header(content_type)
        charset = params.get('charset', '').replace("'", '')
        return content_type.lower(), charset.lower()
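
The precedence rules described in the docstring can be hard to follow in prose, so here is a minimal Python 3 sketch of them. It is not the module's implementation: the names `to_ascii_compatible`, `declared_xml_encoding`, and `choose_encoding` are hypothetical, the function takes an already-parsed media type and charset rather than a headers dictionary, and two simplifications are assumed: non-ASCII-compatible input is detected by byte order mark only (the byte-pattern checks from the XML specification are omitted), and any `text/*` type with a charset parameter is treated the same way as `text/xml`.

```python
import codecs
import re

# Hypothetical names for illustration; none of these exist in the module.
APPLICATION_TYPES = ('application/xml', 'application/xml-dtd',
                     'application/xml-external-parsed-entity')

# Matches the encoding pseudo-attribute of an XML declaration in ASCII-compatible bytes.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')

# UTF-32 must be tested first: its little-endian BOM begins with the UTF-16 one.
_WIDE_BOMS = ((codecs.BOM_UTF32_LE, 'utf-32'), (codecs.BOM_UTF32_BE, 'utf-32'),
              (codecs.BOM_UTF16_LE, 'utf-16'), (codecs.BOM_UTF16_BE, 'utf-16'))


def to_ascii_compatible(xml_data: bytes) -> bytes:
    """Transcode BOM-marked UTF-16/UTF-32 input to UTF-8 (assumption: BOM only)."""
    for bom, codec in _WIDE_BOMS:
        if xml_data.startswith(bom):
            # The generic codec reads the BOM to pick the byte order and drops it.
            return xml_data.decode(codec).encode('utf-8')
    if xml_data.startswith(codecs.BOM_UTF8):
        return xml_data[len(codecs.BOM_UTF8):]
    return xml_data


def declared_xml_encoding(xml_data: bytes) -> str:
    """Return the lowercased encoding from the XML declaration, or ''."""
    match = _XML_DECL.match(to_ascii_compatible(xml_data))
    return match.group(1).decode('ascii').lower() if match else ''


def choose_encoding(content_type: str, http_charset: str, xml_data: bytes) -> str:
    """Pick an encoding following the precedence rules in the docstring."""
    content_type = content_type.lower()
    http_charset = http_charset.lower()
    if content_type in APPLICATION_TYPES or (
            content_type.startswith('application/')
            and content_type.endswith('+xml')):
        # HTTP charset wins, then the XML declaration, then utf-8.
        return http_charset or declared_xml_encoding(xml_data) or 'utf-8'
    if content_type.startswith('text/'):
        # The XML declaration is ignored entirely for text/* types.
        return http_charset or 'us-ascii'
    # Unspecified or unrecognized Content-Type: trust the document itself.
    return declared_xml_encoding(xml_data) or 'iso-8859-1'


doc = b'<?xml version="1.0" encoding="ISO-8859-2"?><feed/>'
print(choose_encoding('application/atom+xml', '', doc))
print(choose_encoding('text/xml', '', doc))
```

The two calls at the end show the asymmetry the docstring warns about: the same document resolves to its declared encoding when served as `application/atom+xml`, but to `us-ascii` when served as `text/xml` without a charset parameter.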