hub / github.com/pyload/pyload / _getCharacterEncoding

Function _getCharacterEncoding

module/lib/feedparser.py:3372–3510

Get the character encoding of the XML document. http_headers is a dictionary; xml_data is a raw string (not Unicode).

(http_headers, xml_data)

Source from the content-addressed store, hash-verified

def _getCharacterEncoding(http_headers, xml_data):
    '''Get the character encoding of the XML document

    http_headers is a dictionary
    xml_data is a raw string (not Unicode)

    This is so much trickier than it sounds, it's not even funny.
    According to RFC 3023 ('XML Media Types'), if the HTTP Content-Type
    is application/xml, application/*+xml,
    application/xml-external-parsed-entity, or application/xml-dtd,
    the encoding given in the charset parameter of the HTTP Content-Type
    takes precedence over the encoding given in the XML prefix within the
    document, and defaults to 'utf-8' if neither are specified.  But, if
    the HTTP Content-Type is text/xml, text/*+xml, or
    text/xml-external-parsed-entity, the encoding given in the XML prefix
    within the document is ALWAYS IGNORED and only the encoding given in
    the charset parameter of the HTTP Content-Type header should be
    respected, and it defaults to 'us-ascii' if not specified.

    Furthermore, discussion on the atom-syntax mailing list with the
    author of RFC 3023 leads me to the conclusion that any document
    served with a Content-Type of text/* and no charset parameter
    must be treated as us-ascii.  (We now do this.)  And also that it
    must always be flagged as non-well-formed.  (We now do this too.)

    If Content-Type is unspecified (input was local file or non-HTTP source)
    or unrecognized (server just got it totally wrong), then go by the
    encoding given in the XML prefix of the document and default to
    'iso-8859-1' as per the HTTP specification (RFC 2616).

    Then, assuming we didn't find a character encoding in the HTTP headers
    (and the HTTP Content-type allowed us to look in the body), we need
    to sniff the first few bytes of the XML data and try to determine
    whether the encoding is ASCII-compatible.  Section F of the XML
    specification shows the way here:
    http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info

    If the sniffed encoding is not ASCII-compatible, we need to make it
    ASCII compatible so that we can sniff further into the XML declaration
    to find the encoding attribute, which will tell us the true encoding.

    Of course, none of this guarantees that we will be able to parse the
    feed in the declared character encoding (assuming it was declared
    correctly, which many are not).  CJKCodecs and iconv_codec help a lot;
    you should definitely install them if you can.
    http://cjkpython.i18n.org/
    '''

    def _parseHTTPContentType(content_type):
        '''takes HTTP Content-Type header and returns (content type, charset)

        If no charset is specified, returns (content type, '')
        If no content type is specified, returns ('', '')
        Both return parameters are guaranteed to be lowercase strings
        '''
        content_type = content_type or ''
        content_type, params = cgi.parse_header(content_type)
        return content_type, params.get('charset', '').replace("'", '')
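
The precedence rules described in the docstring can be sketched as a small standalone example. This is a minimal illustration written for Python 3, not the code in feedparser.py: `parse_http_content_type` and `choose_encoding` are hypothetical names, the header parsing is a naive split on `;` (it does not handle semicolons inside quoted values), and applying the text/* rule to non-XML text types that carry a charset is an assumption.

```python
# Media types named in the docstring (the application/* group of RFC 3023).
APPLICATION_XML_TYPES = {
    "application/xml",
    "application/xml-dtd",
    "application/xml-external-parsed-entity",
}


def parse_http_content_type(content_type):
    """Hypothetical stand-in for the nested helper: (media type, charset)."""
    parts = (content_type or "").split(";")
    media_type = parts[0].strip().lower()
    charset = ""
    for param in parts[1:]:
        name, sep, value = param.partition("=")
        if sep and name.strip().lower() == "charset":
            # Drop surrounding quotes and normalize case.
            charset = value.strip().strip("\"'").lower()
            break
    return media_type, charset


def choose_encoding(content_type_header, xml_encoding):
    """Hypothetical helper applying the docstring's precedence rules."""
    media_type, http_charset = parse_http_content_type(content_type_header)
    if media_type in APPLICATION_XML_TYPES or (
        media_type.startswith("application/") and media_type.endswith("+xml")
    ):
        # HTTP charset wins, then the XML declaration, then utf-8.
        return http_charset or xml_encoding or "utf-8"
    if media_type.startswith("text/"):
        # The XML declaration is ignored for text/*; the default is us-ascii.
        # Assumption: non-XML text/* types with a charset follow the same rule.
        return http_charset or "us-ascii"
    # No header, or an unrecognized type: trust the XML declaration.
    return xml_encoding or "iso-8859-1"
```

For instance, `choose_encoding("text/xml", "utf-8")` ignores the declared `utf-8` and falls back to `us-ascii`, while `choose_encoding("application/atom+xml", "iso-8859-5")` honors the XML declaration because no charset parameter was sent.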

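The byte-sniffing step mentioned in the docstring (Appendix F of the XML specification) might look roughly like the following. It is a sketch under stated assumptions rather than the function's actual body: `sniff_xml_encoding` is a hypothetical name, Python's built-in `cp037` codec stands in for whatever EBCDIC handling the real code uses, and `utf-8` is assumed when no pattern matches.

```python
import codecs
import re

# (byte prefix, codec to decode with, bytes to skip). Order matters: the
# UTF-32 LE BOM begins with the UTF-16 LE BOM, so UTF-32 is checked first.
_SNIFF_TABLE = [
    (codecs.BOM_UTF32_BE, "utf-32-be", 4),
    (codecs.BOM_UTF32_LE, "utf-32-le", 4),
    (codecs.BOM_UTF8, "utf-8", 3),
    (codecs.BOM_UTF16_BE, "utf-16-be", 2),
    (codecs.BOM_UTF16_LE, "utf-16-le", 2),
    # No BOM: recognize "<" / "<?" / "<?xm" as encoded in each family.
    (b"\x00\x00\x00<", "utf-32-be", 0),
    (b"<\x00\x00\x00", "utf-32-le", 0),
    (b"\x00<\x00?", "utf-16-be", 0),
    (b"<\x00?\x00", "utf-16-le", 0),
    (b"\x4c\x6f\xa7\x94", "cp037", 0),  # "<?xm" in EBCDIC (cp037 assumed)
]

_DECL_RE = re.compile(r"<\?xml[^>]*?encoding\s*=\s*[\"']([^\"']+)[\"']")


def sniff_xml_encoding(xml_data):
    """Return (sniffed codec, encoding named in the XML declaration or '')."""
    family, skip = "utf-8", 0  # assumed ASCII-compatible when nothing matches
    for prefix, codec, bom_len in _SNIFF_TABLE:
        if xml_data.startswith(prefix):
            family, skip = codec, bom_len
            break
    # Decode only the head of the document so the declaration becomes
    # ordinary text; 400 is a multiple of both 2 and 4 bytes.
    head = xml_data[skip:skip + 400].decode(family, "replace")
    match = _DECL_RE.match(head)
    return family, (match.group(1).lower() if match else "")
```

Decoding the first bytes with the sniffed codec is what makes a non-ASCII-compatible document readable enough to find the `encoding` attribute, which then names the true encoding.
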
Callers 1

parse (Function) · 0.85

Calls 10

_parseHTTPContentType (Function) · 0.85

_l2bytes (Function) · 0.85

_ebcdic_to_ascii (Function) · 0.85

_s2bytes (Function) · 0.85

compile (Method) · 0.80

decode (Method) · 0.80

get (Method) · 0.45

encode (Method) · 0.45

match (Method) · 0.45

has_key (Method) · 0.45

Tested by

no test coverage detected