Python - Unicode and UTF-8 encoding issue with Scrapy XPath selector text
I'm using Scrapy and Python (as part of a Django project) to scrape a site with German content. I have libxml2 installed as the backend for Scrapy selectors.
If I extract the word 'hüftsitz' (this is how it is displayed on the site) through selectors, I get: u'h\ufffd\ufffdftsitz' (Scrapy XPath selectors return Unicode strings).
If I encode this to UTF-8, I get: 'h\xef\xbf\xbd\xef\xbf\xbdftsitz'. And if I print that, I get 'h??ftsitz', which isn't correct. I am wondering why this may be happening.
The character set on the site is set to UTF-8. I am testing the above in a Python shell with sys.getdefaultencoding set to UTF-8. Using the Django application, where the data from the XPath selectors is written to a MySQL database with a UTF-8 character set, I see the same behaviour.
Am I overlooking something obvious here? Any clues or help would be appreciated.
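u'\ufffd' is the "Unicode replacement character", usually printed as a question mark inside a black diamond. It is not a u umlaut, so the original character was already lost before you encoded the string. The problem must be somewhere upstream. Check which encoding the web page headers say is being returned, and verify that the content is, in fact, in the encoding it says it is.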
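As a rough illustration of the check described above, here is a minimal Python 3 sketch. The helper names `declared_charset` and `decodes_cleanly` are hypothetical and not part of Scrapy; the header value and byte strings are made-up samples, not data from the site in the question.

```python
import unicodedata
from email.message import Message

# The value reported in the question (Python 3 string literal).
scraped = "h\ufffd\ufffdftsitz"

# U+FFFD is the replacement character; its UTF-8 encoding is EF BF BD,
# which is exactly the byte sequence seen after encoding.
print(unicodedata.name(scraped[1]))
print(scraped.encode("utf-8"))


def declared_charset(content_type):
    """Return the charset named in a Content-Type header value, or None.

    Hypothetical helper: it only parses the header text, it does not
    fetch anything.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decodes_cleanly(raw_bytes, charset):
    """Return True if raw_bytes decode under charset without errors."""
    try:
        raw_bytes.decode(charset)  # strict error handling by default
    except (UnicodeDecodeError, LookupError):
        return False
    return True


# Sample usage with an assumed header and an assumed Latin-1 body.
charset = declared_charset("text/html; charset=UTF-8")
print(charset, decodes_cleanly(b"h\xfcftsitz", charset))
```

If the raw response body does not decode cleanly under the charset the headers declare, that mismatch is a likely source of the replacement characters.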
The Unicode replacement character is usually inserted as the replacement for an illegal or unrecognized character. That can be caused by several things, but the likeliest is that the encoding is not what it claims to be.
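To see how such replacements can arise, the following Python 3 sketch decodes the same word with a mismatched codec and lenient error handling. It is only a reproduction under assumed conditions; which mismatch actually occurred in the scraping pipeline is not known from the question.

```python
# The word from the question, written with an escape so the example does
# not depend on the source file's own encoding.
word = "h\u00fcftsitz"

# Case 1: the bytes really are UTF-8, but something decodes them as ASCII
# with errors="replace". The u umlaut is two bytes in UTF-8, so two
# replacement characters appear.
utf8_bytes = word.encode("utf-8")
as_ascii = utf8_bytes.decode("ascii", errors="replace")

# Case 2: the bytes are Latin-1, but are decoded as UTF-8 with
# errors="replace". The single byte 0xFC is invalid in UTF-8, so one
# replacement character appears.
latin1_bytes = word.encode("latin-1")
as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

print(ascii(as_ascii))  # 'h\ufffd\ufffdftsitz'
print(ascii(as_utf8))   # 'h\ufffdftsitz'
```

The two replacement characters in the question's output match the first case, which hints that UTF-8 bytes may have been decoded with the wrong codec somewhere before the selector returned them, though that remains an assumption.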