hub / github.com/pyload/pyload / _getCharacterEncoding

Function _getCharacterEncoding

module/lib/feedparser.py:3372–3510

Get the character encoding of the XML document. http_headers is a dictionary; xml_data is a raw string (not Unicode).

(http_headers, xml_data)

Source from the content-addressed store, hash-verified

def _getCharacterEncoding(http_headers, xml_data):
    '''Get the character encoding of the XML document

    http_headers is a dictionary
    xml_data is a raw string (not Unicode)

    This is so much trickier than it sounds, it's not even funny.
    According to RFC 3023 ('XML Media Types'), if the HTTP Content-Type
    is application/xml, application/*+xml,
    application/xml-external-parsed-entity, or application/xml-dtd,
    the encoding given in the charset parameter of the HTTP Content-Type
    takes precedence over the encoding given in the XML prefix within the
    document, and defaults to 'utf-8' if neither are specified.  But, if
    the HTTP Content-Type is text/xml, text/*+xml, or
    text/xml-external-parsed-entity, the encoding given in the XML prefix
    within the document is ALWAYS IGNORED and only the encoding given in
    the charset parameter of the HTTP Content-Type header should be
    respected, and it defaults to 'us-ascii' if not specified.

    Furthermore, discussion on the atom-syntax mailing list with the
    author of RFC 3023 leads me to the conclusion that any document
    served with a Content-Type of text/* and no charset parameter
    must be treated as us-ascii.  (We now do this.)  And also that it
    must always be flagged as non-well-formed.  (We now do this too.)

    If Content-Type is unspecified (input was local file or non-HTTP source)
    or unrecognized (server just got it totally wrong), then go by the
    encoding given in the XML prefix of the document and default to
    'iso-8859-1' as per the HTTP specification (RFC 2616).

    Then, assuming we didn't find a character encoding in the HTTP headers
    (and the HTTP Content-type allowed us to look in the body), we need
    to sniff the first few bytes of the XML data and try to determine
    whether the encoding is ASCII-compatible.  Section F of the XML
    specification shows the way here:
    http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info

    If the sniffed encoding is not ASCII-compatible, we need to make it
    ASCII compatible so that we can sniff further into the XML declaration
    to find the encoding attribute, which will tell us the true encoding.

    Of course, none of this guarantees that we will be able to parse the
    feed in the declared character encoding (assuming it was declared
    correctly, which many are not).  CJKCodecs and iconv_codec help a lot;
    you should definitely install them if you can.
    http://cjkpython.i18n.org/
    '''

    def _parseHTTPContentType(content_type):
        '''takes HTTP Content-Type header and returns (content type, charset)

        If no charset is specified, returns (content type, '')
        If no content type is specified, returns ('', '')
        Both return parameters are guaranteed to be lowercase strings
        '''
        content_type = content_type or ''
        content_type, params = cgi.parse_header(content_type)
        return content_type, params.get('charset', '').replace("'", '')
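
The precedence rules in the docstring are easier to follow as a small decision function. The following is a minimal Python 3 sketch written from the docstring alone, not the body of _getCharacterEncoding (the listing above stops before it). The names parse_content_type and choose_encoding are hypothetical, the header parser is a deliberately naive split on ';' rather than cgi.parse_header, and treating every text/* type (not only the XML ones) the same way is an assumption.

```python
# Hypothetical helper names; a sketch of the docstring's rules only.
APPLICATION_XML_TYPES = {
    "application/xml",
    "application/xml-dtd",
    "application/xml-external-parsed-entity",
}


def parse_content_type(header_value):
    """Return (content type, charset), both lowercase; '' when missing.

    Naive parser: splits on ';' and does not handle quoted semicolons.
    """
    parts = (header_value or "").split(";")
    content_type = parts[0].strip().lower()
    charset = ""
    for param in parts[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            charset = value.strip().strip("\"'").lower()
    return content_type, charset


def choose_encoding(content_type_header, xml_declared_encoding):
    """Return (encoding, flag_non_well_formed) per the docstring's rules."""
    content_type, charset = parse_content_type(content_type_header)

    is_application_xml = content_type in APPLICATION_XML_TYPES or (
        content_type.startswith("application/") and content_type.endswith("+xml")
    )
    if is_application_xml:
        # HTTP charset wins over the XML declaration; default is utf-8.
        return charset or xml_declared_encoding or "utf-8", False

    if content_type.startswith("text/"):
        # The XML declaration is ignored; a missing charset means us-ascii
        # and the document is flagged as non-well-formed.
        return charset or "us-ascii", not charset

    # Unspecified or unrecognized Content-Type: trust the XML declaration,
    # falling back to iso-8859-1.
    return xml_declared_encoding or "iso-8859-1", False
```

For example, choose_encoding("text/xml", "utf-8") ignores the declared utf-8 and yields us-ascii with the flag set, while choose_encoding(None, "koi8-r") trusts the declaration.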

Callers 1

parse · Function · 0.85

Calls 10

_parseHTTPContentType · Function · 0.85
_l2bytes · Function · 0.85
_ebcdic_to_ascii · Function · 0.85
_s2bytes · Function · 0.85
compile · Method · 0.80
decode · Method · 0.80
get · Method · 0.45
encode · Method · 0.45
match · Method · 0.45
has_key · Method · 0.45
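
The sniffing step described in the docstring (Section F of the XML specification) can be sketched as follows. This is an illustrative Python 3 version, not the function's actual body: SNIFF_TABLE, sniff_codec, and declared_encoding are hypothetical names, the byte patterns are the ones listed in the XML specification, and using Python's cp037 codec for EBCDIC is an assumption standing in for the _ebcdic_to_ascii helper listed above.

```python
import re

# (leading bytes, codec for reading the declaration, BOM length to skip).
# Order matters: the UTF-32LE BOM begins with the UTF-16LE BOM.
SNIFF_TABLE = [
    (b"\x00\x00\xfe\xff", "utf-32-be", 4),
    (b"\xff\xfe\x00\x00", "utf-32-le", 4),
    (b"\xfe\xff", "utf-16-be", 2),
    (b"\xff\xfe", "utf-16-le", 2),
    (b"\xef\xbb\xbf", "utf-8", 3),
    (b"\x00\x00\x00\x3c", "utf-32-be", 0),
    (b"\x3c\x00\x00\x00", "utf-32-le", 0),
    (b"\x00\x3c\x00\x3f", "utf-16-be", 0),
    (b"\x3c\x00\x3f\x00", "utf-16-le", 0),
    (b"\x4c\x6f\xa7\x94", "cp037", 0),  # "<?xm" in EBCDIC (assumed cp037)
]

# Matches the encoding attribute of an XML declaration at the start of text.
XML_DECL = re.compile(
    r"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_codec(xml_data):
    """Guess a codec that makes the first bytes readable; returns (codec, bom_length)."""
    for prefix, codec, bom_length in SNIFF_TABLE:
        if xml_data.startswith(prefix):
            return codec, bom_length
    # Anything else is assumed ASCII-compatible; latin-1 decodes every byte.
    return "latin-1", 0


def declared_encoding(xml_data):
    """Return the lowercase encoding named in the XML declaration, or ''."""
    codec, bom_length = sniff_codec(xml_data)
    # Decode only a short prefix; 400 bytes is a multiple of 2 and 4.
    head = xml_data[bom_length:bom_length + 400].decode(codec, errors="replace")
    match = XML_DECL.match(head)
    return match.group(1).lower() if match else ""
```

Decoding only a prefix with the sniffed codec plays the role of "making it ASCII-compatible": once the declaration is readable text, a single regular expression can pull out the encoding attribute.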

Tested by

no test coverage detected