java - Hadoop gzip compressed files -
I am new to Hadoop and am trying to process a Wikipedia dump. It's a 6.7 GB gzip-compressed XML file. I read that Hadoop supports gzip-compressed files, but they can only be processed by a single mapper, since only one mapper can decompress the file. This seems to put a limitation on the processing. Is there an alternative? Such as decompressing and splitting the XML file into multiple chunks and recompressing them with gzip.
I read about Hadoop and gzip here: http://researchcomputing.blogspot.com/2008/04/hadoop-and-compressed-files.html
Thanks for the help.
A file compressed with the gzip codec cannot be split because of the way the codec works. A single split in Hadoop can only be processed by a single mapper, so a single gzip file can only be processed by a single mapper.
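To see why a gzip stream has no mid-file entry points, here is a small JDK-only sketch (the class name is mine, not from Hadoop): reading from byte 0 works, but handing `GZIPInputStream` an offset into the middle of the same stream fails, because the DEFLATE state depends on everything that came before, so a second mapper has no valid place to start.

```java
import java.io.*;
import java.util.zip.*;

public class GzipNotSplittable {
    public static void main(String[] args) throws IOException {
        // Compress some sample text with gzip.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write("line one\nline two\nline three\n".getBytes("UTF-8"));
        }
        byte[] compressed = buf.toByteArray();

        // Reading from the start of the stream works fine.
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(compressed)), "UTF-8"))) {
            System.out.println("from start: " + r.readLine());
        }

        // Starting from the middle of the stream fails: there is no gzip
        // header there, and the compressed data is meaningless without
        // the bytes that precede it.
        int mid = compressed.length / 2;
        try {
            new GZIPInputStream(new ByteArrayInputStream(
                    compressed, mid, compressed.length - mid));
            System.out.println("from middle: unexpectedly succeeded");
        } catch (IOException e) {
            System.out.println("from middle: failed (" + e.getMessage() + ")");
        }
    }
}
```

This is exactly the situation a mapper assigned the second half of a gzip file would be in, which is why Hadoop declares the format non-splittable.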
There are at least 3 ways of getting around this limitation:
- As a preprocessing step: uncompress the file and recompress it using a splittable codec (LZO).
- As a preprocessing step: uncompress the file, split it into smaller sets and recompress. (See this.)
- Use a patch for Hadoop (which I wrote) that allows a way around this: splittable gzip.
hth