java - Hadoop gzip compressed files -


I'm new to Hadoop and I'm trying to process the Wikipedia dump. It's a 6.7 GB gzip compressed XML file. I read that Hadoop supports gzip compressed files, but they can only be processed by a mapper in a single job, as only one mapper can decompress the file. This seems to put a limitation on the processing. Is there an alternative? Like decompressing and splitting the XML file into multiple chunks and recompressing them with gzip.

I read about Hadoop and gzip here: http://researchcomputing.blogspot.com/2008/04/hadoop-and-compressed-files.html

Thanks for your help.

A file compressed with the gzip codec cannot be split because of the way that codec works. A single split in Hadoop can only be processed by a single mapper; so a single gzip file can only be processed by a single mapper.
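You can see why mid-file splits are impossible with nothing but the JDK's own gzip support (`java.util.zip`, not Hadoop itself): a reader that starts in the middle of the stream has neither the gzip header nor the DEFLATE state built up from the earlier bytes, so decompression fails immediately. A minimal sketch:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipException;

public class GzipSplitDemo {
    public static void main(String[] args) throws Exception {
        // Compress some sample data.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            for (int i = 0; i < 10000; i++) {
                gz.write(("record " + i + "\n").getBytes("UTF-8"));
            }
        }
        byte[] compressed = buf.toByteArray();

        // Reading from the start of the stream works fine.
        GZIPInputStream ok = new GZIPInputStream(new ByteArrayInputStream(compressed));
        System.out.println("from start: first byte = " + ok.read());

        // Starting halfway in, as a second mapper on a second split would
        // have to, fails: the header and earlier DEFLATE state are missing.
        byte[] secondHalf = new byte[compressed.length / 2];
        System.arraycopy(compressed, compressed.length / 2,
                         secondHalf, 0, secondHalf.length);
        try {
            new GZIPInputStream(new ByteArrayInputStream(secondHalf)).read();
            System.out.println("mid-stream read unexpectedly succeeded");
        } catch (ZipException e) {
            System.out.println("mid-stream read failed: " + e.getMessage());
        }
    }
}
```

This is exactly the situation a second mapper would be in if Hadoop handed it a split that begins in the middle of a gzip file, which is why the framework refuses to split such files in the first place.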

There are at least 3 ways of getting around this limitation:

  1. As a preprocessing step: uncompress the file and recompress it using a splittable codec (e.g. LZO).
  2. As a preprocessing step: uncompress the file, split it into smaller sets and recompress each of them. (See this)
  3. Use a patch for Hadoop (which I wrote) that allows a way around this: Splittable Gzip.
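Option 2 can be done as a plain JVM preprocessing step, again with only `java.util.zip`. The sketch below (chunk size and the `.partN.gz` naming are my own assumptions, not anything Hadoop requires) streams the gzip input, cuts it on line boundaries roughly every `chunkBytes`, and gzips each chunk separately; the demo in `main` runs it on a small generated file, but you would point it at the real dump:

```java
import java.io.*;
import java.nio.file.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class SplitAndRecompress {

    /** Stream-decompress {@code in}, cut on line boundaries roughly every
     *  chunkBytes of uncompressed data, gzip each chunk separately, and
     *  return the number of chunks written. */
    static int splitGzip(Path in, long chunkBytes) throws IOException {
        int part = 0;
        long written = 0;
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(in)), "UTF-8"))) {
            Writer w = openPart(in, part);
            String line;
            while ((line = r.readLine()) != null) {
                w.write(line);
                w.write('\n');
                written += line.length() + 1;
                // For a Wikipedia dump you would cut on </page> boundaries
                // instead of arbitrary lines, so no record straddles chunks.
                if (written >= chunkBytes) {
                    w.close();
                    part++;
                    w = openPart(in, part);
                    written = 0;
                }
            }
            w.close();
        }
        return part + 1;
    }

    static Writer openPart(Path in, int part) throws IOException {
        Path out = in.resolveSibling(in.getFileName() + ".part" + part + ".gz");
        return new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(out)), "UTF-8");
    }

    public static void main(String[] args) throws IOException {
        // Demo on a small generated file; for the real 6.7 GB dump, pass its
        // path instead and use a chunk size near your HDFS block size.
        Path in = Files.createTempFile("dump", ".xml.gz");
        try (Writer w = new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(in)), "UTF-8")) {
            for (int i = 0; i < 1000; i++) w.write("<line n=\"" + i + "\"/>\n");
        }
        int parts = splitGzip(in, 4096);  // ~4 KB uncompressed per chunk
        System.out.println("chunks written: " + parts);

        // Sanity check: reading the chunks back yields every original line.
        int total = 0;
        for (int p = 0; p < parts; p++) {
            Path f = in.resolveSibling(in.getFileName() + ".part" + p + ".gz");
            try (BufferedReader r = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(Files.newInputStream(f)), "UTF-8")))
            {
                while (r.readLine() != null) total++;
            }
        }
        System.out.println("total lines after split: " + total);
    }
}
```

Each resulting chunk is a complete, independently readable gzip file, so Hadoop can assign one mapper per chunk instead of one mapper for the whole dump.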

hth

