python - Unicode and UTF-8 encoding issue with Scrapy XPath selector text


I'm using Scrapy and Python (as part of a Django project) to scrape a site with German content. I have libxml2 installed as the backend for Scrapy selectors.

If I extract the word 'hüftsitz' (this is how it is displayed on the site) through the selectors, I get: u'h\ufffd\ufffdftsitz' (Scrapy XPath selectors return Unicode strings).

If I encode this to UTF-8, I get: 'h\xef\xbf\xbd\xef\xbf\xbdftsitz'. And if I print that, I get 'h??ftsitz', which isn't correct. I'm wondering why this may be happening.

The character set on the site is set to UTF-8. I'm testing the above in a Python shell with the default encoding (sys.getdefaultencoding()) set to utf-8. In the Django application, where the data from the XPath selectors is written to a MySQL database with a UTF-8 character set, I see the same behaviour.

Am I overlooking something obvious here? Any clues or help would be appreciated.

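u'\ufffd' is the "Unicode replacement character", which is usually printed as a question mark inside a black diamond. It is not a u umlaut. So the problem must be somewhere upstream. Check which encoding the web page headers say is being returned, and verify that the content is in fact encoded the way the headers say it is.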
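One way to run that check is sketched below, for Python 3 and using only the standard library. The helper names `declared_charset` and `decodes_cleanly` are made up for this illustration, and the header value and byte strings are stand-ins rather than data from the actual site. In Scrapy, the raw bytes would come from `response.body`.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() lowercases the value and returns None if absent.
    return msg.get_content_charset()


def decodes_cleanly(raw, encoding):
    """Return True if the raw bytes decode strictly under the given encoding."""
    try:
        raw.decode(encoding)  # strict mode: raises instead of inserting U+FFFD
    except (UnicodeDecodeError, LookupError):
        return False
    return True


# Stand-in values for illustration only.
header = 'text/html; charset=UTF-8'
charset = declared_charset(header)

print(charset)
print(decodes_cleanly(b'h\xc3\xbcftsitz', charset))  # real UTF-8 bytes
print(decodes_cleanly(b'h\xfcftsitz', charset))      # Latin-1 bytes, mislabelled
```

If the strict decode fails under the declared charset, the page is not encoded the way its headers claim, and the decoding step is the place to look.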

The Unicode replacement character is usually inserted as a replacement for an illegal or unrecognized character. That can be caused by several things, but the likeliest is that the encoding is not what it claims to be.
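As a rough illustration of how such a mismatch produces replacement characters, the Python 3 sketch below decodes sample bytes with the wrong codec. The byte strings are constructed here for the example and are not taken from the site, so this shows the mechanism only and is not a diagnosis of where the scraper goes wrong.

```python
# Illustrative bytes only; not taken from the actual site.
word = 'h\u00fcftsitz'                 # 'hüftsitz'
utf8_bytes = word.encode('utf-8')      # ü becomes the two bytes C3 BC
latin1_bytes = word.encode('latin-1')  # ü becomes the single byte FC

# UTF-8 bytes decoded as ASCII: each non-ASCII byte is replaced separately,
# so one ü turns into two replacement characters.
as_ascii = utf8_bytes.decode('ascii', errors='replace')

# Latin-1 bytes decoded as UTF-8: the lone FC byte is invalid and is
# replaced by a single replacement character.
as_utf8 = latin1_bytes.decode('utf-8', errors='replace')

print(ascii(as_ascii))  # 'h\ufffd\ufffdftsitz'
print(ascii(as_utf8))   # 'h\ufffdftsitz'

# Encoding afterwards faithfully encodes the damage; it cannot undo it.
reencoded = as_ascii.encode('utf-8')
print(reencoded)        # b'h\xef\xbf\xbd\xef\xbf\xbdftsitz'
```

The last line shows that '\xef\xbf\xbd' is simply the UTF-8 encoding of U+FFFD. The original character is already lost by the time the string is encoded, so the fix has to happen where the bytes are first decoded.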

