java - Hadoop gzip compressed files -
I am new to Hadoop and am trying to process a Wikipedia dump. It's a 6.7 GB gzip-compressed XML file. I read that Hadoop supports gzip-compressed files, but they can only be processed by a single mapper, since only one mapper can decompress the file. This seems to put a limitation on the processing. Is there an alternative? Such as decompressing and splitting the XML file into multiple chunks and recompressing them with gzip.
I read about Hadoop and gzip here: http://researchcomputing.blogspot.com/2008/04/hadoop-and-compressed-files.html
Thanks for the help.
A file compressed with the gzip codec cannot be split because of the way the codec works. A single split in Hadoop can only be processed by a single mapper, so a single gzip file can only be processed by a single mapper.
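To see why a gzip stream has no mid-file entry points, here is a small JDK-only sketch (the class name is mine, not from Hadoop): reading from byte 0 works, but handing `GZIPInputStream` an offset into the middle of the same stream fails, because the DEFLATE state depends on everything that came before, so a second mapper has no valid place to start.

```java
import java.io.*;
import java.util.zip.*;

public class GzipNotSplittable {
    public static void main(String[] args) throws IOException {
        // Compress some sample text with gzip.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write("line one\nline two\nline three\n".getBytes("UTF-8"));
        }
        byte[] compressed = buf.toByteArray();

        // Reading from the start of the stream works fine.
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(compressed)), "UTF-8"))) {
            System.out.println("from start: " + r.readLine());
        }

        // Starting from the middle of the stream fails: there is no gzip
        // header there, and the compressed data is meaningless without
        // the bytes that precede it.
        int mid = compressed.length / 2;
        try {
            new GZIPInputStream(new ByteArrayInputStream(
                    compressed, mid, compressed.length - mid));
            System.out.println("from middle: unexpectedly succeeded");
        } catch (IOException e) {
            System.out.println("from middle: failed (" + e.getMessage() + ")");
        }
    }
}
```

This is exactly the situation a mapper assigned the second half of a gzip file would be in, which is why Hadoop declares the format non-splittable.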
There are at least 3 ways of getting around this limitation:
- As a preprocessing step: uncompress the file and recompress it using a splittable codec (LZO).
- As a preprocessing step: uncompress the file, split it into smaller sets and recompress. (See this.)
- Use a patch for Hadoop (which I wrote) that allows a way around this: splittable gzip.
hth