python - Unicode and UTF-8 encoding issue with Scrapy XPath selector text

- June 15, 2015

I'm using Scrapy and Python (as part of a Django project) to scrape a site with German content. I have libxml2 installed as the backend for Scrapy selectors.

If I extract the word 'hüftsitz' (this is how it is displayed on the site) through selectors, I get: u'h\ufffd\ufffdftsitz' (Scrapy XPath selectors return Unicode strings).

If I encode this as UTF-8, I get: 'h\xef\xbf\xbd\xef\xbf\xbdftsitz'. And if I print that, I get 'h??ftsitz', which isn't correct. I'm wondering why this may be happening.

The character set on the site is set to UTF-8. I am testing the above in a Python shell with sys.getdefaultencoding set to utf-8. In the Django application, where the data from the XPath selectors is written to a MySQL database with a UTF-8 character set, I see the same behaviour.

Am I overlooking something obvious here? Any clues or help would be appreciated.

u'\ufffd' is the "Unicode replacement character", which is usually printed as a question mark inside a black diamond. It is not a u umlaut. So the problem must be somewhere upstream. Check which encoding the web page headers say is being returned, and verify that it is, in fact, what it says it is.
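One way to run that check is sketched below, assuming you already have the raw response body as bytes and the value of the `Content-Type` header. The helper names `declared_charset` and `decodes_cleanly` are hypothetical and not part of Scrapy, and the sample header and body are made up for illustration. A strict decode that succeeds does not prove the declared charset is right, but one that fails shows it is wrong.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decodes_cleanly(body, charset):
    """Return True if the raw bytes decode under `charset` without errors."""
    try:
        body.decode(charset)  # strict mode: raises instead of inserting U+FFFD
    except (UnicodeDecodeError, LookupError):
        return False
    return True


# Assumed sample data: a header claiming UTF-8, a body that is really Latin-1.
header = "text/html; charset=UTF-8"
body = "h\u00fcftsitz".encode("latin-1")  # 'hüftsitz' as Latin-1 bytes

charset = declared_charset(header)
print(charset, decodes_cleanly(body, charset))
```

If the strict decode fails for the declared charset but succeeds for another candidate such as `latin-1`, the declaration is the likely culprit.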

The Unicode replacement character is usually inserted as a replacement for an illegal or unrecognized character. That can be caused by several things, but the likeliest is that the encoding is not what it claims to be.
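As a rough illustration of how a mismatched decode produces U+FFFD, the sketch below decodes the bytes of 'hüftsitz' with the wrong codec in two ways. It uses only the Python 3 standard library. The first case happens to reproduce the string from the question, but that is only one possible cause and does not establish what actually happened on the site in question.

```python
import unicodedata

word = "h\u00fcftsitz"  # 'hüftsitz'
utf8_bytes = word.encode("utf-8")      # ü becomes two bytes: 0xC3 0xBC
latin1_bytes = word.encode("latin-1")  # ü becomes one byte: 0xFC

# UTF-8 bytes decoded as ASCII: each of the two non-ASCII bytes is replaced.
as_ascii = utf8_bytes.decode("ascii", errors="replace")

# Latin-1 bytes decoded as UTF-8: the lone 0xFC byte is invalid and is replaced.
as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

print(unicodedata.name("\ufffd"))
print(ascii(as_ascii))
print(ascii(as_utf8))

# Re-encoding cannot bring the umlaut back; it only encodes U+FFFD itself.
print(as_ascii.encode("utf-8"))
```

Once the replacement character is in the string, the original byte is gone, so the fix has to happen where the bytes are first decoded.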
