zipfile - Strange "BadZipfile: Bad CRC-32" problem


This code is a simplification of code in a Django app that receives an uploaded zip file via HTTP multi-part POST and does read-only processing of the data inside:

#!/usr/bin/env python

import csv, sys, StringIO, traceback, zipfile
try:
    import io
except ImportError:
    sys.stderr.write('could not import `io` module.\n')

def get_zip_file(filename, method):
    if method == 'direct':
        return zipfile.ZipFile(filename)
    elif method == 'stringio':
        data = file(filename).read()
        return zipfile.ZipFile(StringIO.StringIO(data))
    elif method == 'bytesio':
        data = file(filename).read()
        return zipfile.ZipFile(io.BytesIO(data))


def process_zip_file(filename, method, open_defaults_file):
    zip_file    = get_zip_file(filename, method)
    items_file  = zip_file.open('items.csv')
    csv_file    = csv.DictReader(items_file)

    try:
        for idx, row in enumerate(csv_file):
            image_filename = row['image1']

            if open_defaults_file:
                # Open a second member while items.csv is still being read
                z = zip_file.open('defaults.csv')
                z.close()

        sys.stdout.write('processed %d items.\n' % idx)
    except zipfile.BadZipfile:
        sys.stderr.write('processing failed on item %d\n\n%s'
                         % (idx, traceback.format_exc()))


process_zip_file(sys.argv[1], sys.argv[2], int(sys.argv[3]))

Pretty simple. We open the zip file and one or two CSV files inside the zip file.

What's weird is that it fails towards the end of processing if I run this with a large zip file (~13 MB), have it instantiate the ZipFile from a StringIO.StringIO or an io.BytesIO, and have it open two CSV files rather than just one. (Perhaps this happens with anything other than a plain filename? I had similar problems in the Django app when trying to create a ZipFile from a TemporaryUploadedFile, or from a file object created by calling os.tmpfile() and shutil.copyfileobj().) Here's the output that I see on a Linux system:

$ ./test_zip_file.py ~/data.zip direct 1
processed 250 items.

$ ./test_zip_file.py ~/data.zip stringio 1
processing failed on item 242

Traceback (most recent call last):
  File "./test_zip_file.py", line 26, in process_zip_file
    for idx, row in enumerate(csv_file):
  File ".../python2.7/csv.py", line 104, in next
    row = self.reader.next()
  File ".../python2.7/zipfile.py", line 523, in readline
    return io.BufferedIOBase.readline(self, limit)
  File ".../python2.7/zipfile.py", line 561, in peek
    chunk = self.read(n)
  File ".../python2.7/zipfile.py", line 581, in read
    data = self.read1(n - len(buf))
  File ".../python2.7/zipfile.py", line 641, in read1
    self._update_crc(data, eof=eof)
  File ".../python2.7/zipfile.py", line 596, in _update_crc
    raise BadZipfile("Bad CRC-32 for file %r" % self.name)
BadZipfile: Bad CRC-32 for file 'items.csv'

$ ./test_zip_file.py ~/data.zip bytesio 1
processing failed on item 242

Traceback (most recent call last):
  File "./test_zip_file.py", line 26, in process_zip_file
    for idx, row in enumerate(csv_file):
  File ".../python2.7/csv.py", line 104, in next
    row = self.reader.next()
  File ".../python2.7/zipfile.py", line 523, in readline
    return io.BufferedIOBase.readline(self, limit)
  File ".../python2.7/zipfile.py", line 561, in peek
    chunk = self.read(n)
  File ".../python2.7/zipfile.py", line 581, in read
    data = self.read1(n - len(buf))
  File ".../python2.7/zipfile.py", line 641, in read1
    self._update_crc(data, eof=eof)
  File ".../python2.7/zipfile.py", line 596, in _update_crc
    raise BadZipfile("Bad CRC-32 for file %r" % self.name)
BadZipfile: Bad CRC-32 for file 'items.csv'

$ ./test_zip_file.py ~/data.zip stringio 0
processed 250 items.

$ ./test_zip_file.py ~/data.zip bytesio 0
processed 250 items.

Incidentally, the code fails under the same conditions but in a different way on an OS X system. Instead of raising the BadZipfile exception, it seems to read corrupted data and gets very confused.

This all suggests to me that I am doing something in this code that you are not supposed to do. For example: is it wrong to call ZipFile.open on a file while already having another file within the same zip file object open? This doesn't seem to be a problem when using ZipFile(filename), but perhaps it's problematic when passing ZipFile a file-like object, because of some implementation details in the zipfile module?

Perhaps I missed something in the zipfile docs? Or maybe it's not documented yet? Or (least likely) is it a bug in the zipfile module?

I might have just found the problem and the solution, but unfortunately I had to replace Python's zipfile module with a hacked one of my own (called myzipfile here).

$ diff -u ~/run/lib/python2.7/zipfile.py myzipfile.py
--- /home/msabramo/run/lib/python2.7/zipfile.py 2010-12-22 17:02:34.000000000 -0800
+++ myzipfile.py        2011-04-11 11:51:59.000000000 -0700
@@ -5,6 +5,7 @@
 import binascii, cStringIO, stat
 import io
 import re
+import copy
 
 try:
     import zlib # We may need its compression method
@@ -877,7 +878,7 @@
         # Only open a new file for instances where we were not
         # given a file object in the constructor
         if self._filePassed:
-            zef_file = self.fp
+            zef_file = copy.copy(self.fp)
         else:
             zef_file = open(self.filename, 'rb')

The problem in the standard zipfile module is that when it is passed a file object (not a filename), it uses that same passed-in file object for every call to the open method. This means that tell and seek are getting called on the same file, so opening multiple files within the zip file causes the file position to be shared, and the multiple open calls end up stepping all over each other. In contrast, when passed a filename, open opens a new file object each time. My solution, for the case when a file object is passed in, is to create a copy of that file object instead of using it directly.
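
To make the shared-position effect concrete, here is a minimal sketch that uses a plain io.BytesIO rather than zipfile itself. It is written for Python 3, and the byte string and variable names are made up for illustration. Two "readers" take turns on one object, then on two independent objects over the same bytes:

```python
import io

data = b"AAaaBBbb"

# One shared object: both "readers" move the same file position.
shared = io.BytesIO(data)
head = shared.read(2)             # reader 1 consumes b"AA", position is now 2
shared.seek(4)                    # reader 2 seeks to its own region
other = shared.read(2)            # reader 2 consumes b"BB", position is now 6
resumed_shared = shared.read(2)   # reader 1 resumes at 6 instead of 2

# Independent objects over the same bytes: positions do not interfere.
reader1 = io.BytesIO(data)
reader2 = io.BytesIO(data)
reader1.read(2)
reader2.seek(4)
reader2.read(2)
resumed_independent = reader1.read(2)  # reader 1 resumes at 2 as expected

print(resumed_shared, resumed_independent)
```

In the shared case reader 1 picks up b"bb", which belongs to reader 2's region. That is the same kind of interference that would make a decompressor see the wrong bytes and fail its CRC check.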

This change to zipfile fixes the problems I was seeing:

$ ./test_zip_file.py ~/data.zip stringio 1
processed 250 items.

$ ./test_zip_file.py ~/data.zip bytesio 1
processed 250 items.

$ ./test_zip_file.py ~/data.zip direct 1
processed 250 items.

But I don't know if this has other negative impacts on zipfile...

Edit: I just found a mention of this in the Python docs that I had somehow overlooked before. At http://docs.python.org/library/zipfile.html#zipfile.zipfile.open, it says:

Note: If the ZipFile was created by passing in a file-like object as the first argument to the constructor, then the object returned by open() shares the ZipFile’s file pointer. Under these circumstances, the object returned by open() should not be used after any additional operations are performed on the ZipFile object. If the ZipFile was created by passing in a string (the filename) as the first argument to the constructor, then open() will create a new file object that will be held by the ZipExtFile, allowing it to operate independently of the ZipFile.
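
Given that note, one way to avoid the problem without patching zipfile is to give each member that is read concurrently its own ZipFile over its own in-memory wrapper. The following is only a sketch under stated assumptions: it is written for Python 3, build_sample_zip and count_items are hypothetical helper names, and the sample archive contents are invented to mirror the items.csv and defaults.csv members from the question.

```python
import csv
import io
import zipfile


def build_sample_zip():
    """Build a small in-memory archive (hypothetical sample data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        rows = "".join("img%d.png\n" % i for i in range(250))
        zf.writestr("items.csv", "image1\n" + rows)
        zf.writestr("defaults.csv", "key,value\ncolor,red\n")
    return buf.getvalue()


def count_items(data):
    """Iterate items.csv while opening defaults.csv on every row."""
    # Separate ZipFile objects over separate BytesIO wrappers, so the two
    # members never share a file position.
    items_zip = zipfile.ZipFile(io.BytesIO(data))
    defaults_zip = zipfile.ZipFile(io.BytesIO(data))
    count = 0
    with items_zip, defaults_zip:
        with items_zip.open("items.csv") as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.DictReader(text):
                # Reading the second member goes through its own ZipFile.
                with defaults_zip.open("defaults.csv") as defaults:
                    defaults.read()
                count += 1
    return count


print(count_items(build_sample_zip()))
```

The cost is that the central directory is parsed once per ZipFile. If the second member is small, reading it fully before starting to iterate over the first is another option that stays within what the note allows.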

