What is the best way to avoid overloading a parallel file-system when running embarrassingly parallel jobs?
We have an embarrassingly parallel problem: we run a large number of instances of a single program, each with a different data set, by submitting the application to the batch queue many times with different parameters each time.
However, out of a large number of jobs, not all of them complete. It does not appear to be a problem with the queue itself; all of the jobs are started.
The issue appears when a large number of instances of the application are running: lots of jobs finish at the same time and all try to write out their data to the parallel file-system at pretty much the same time.
The issue seems to be that either the program is unable to write to the file-system and crashes in some manner, or it sits there waiting to write and the batch queue system kills the job after it has been sitting there waiting too long. (From what I have gathered about the problem, most of the jobs that fail to complete, if not all of them, do not leave core files.)
What is the best way to schedule the disk-writes to avoid this problem? I mention that our program is embarrassingly parallel to highlight the fact that each process is not aware of the others, so they cannot talk to each other to schedule their writes in some manner.
Although we have the source code for the program, we'd like to solve the problem without having to modify it if possible, as we don't maintain or develop it (plus most of the comments are in Italian).
I have had some thoughts on the matter:
- Each job writes to the local (scratch) disk of its node at first. We then run another job which checks every so often which jobs have completed and moves the files from the local disks to the parallel file-system.
- Use an MPI wrapper around the program in a master/slave system, where the master manages a queue of jobs and farms these off to each slave; the slave wrapper runs the application and catches the exception (could I reliably catch a file-system timeout in C++, or possibly Java?), then sends a message to the master to re-run the job.
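A minimal sketch of the first idea, assuming each job can be launched through a small wrapper script. Everything here is illustrative rather than part of the original program: the function name `run_and_stage_out`, the use of `/tmp` as node-local scratch, and the retry count are all assumptions.

```python
import os
import shutil
import subprocess
import tarfile
import tempfile
import time

def run_and_stage_out(cmd, dest_dir, retries=3):
    """Run the (unmodified) application in node-local scratch, then copy
    its output to the parallel file-system as a single tarball, retrying
    if the shared file-system is temporarily overloaded."""
    workdir = tempfile.mkdtemp(prefix="job_", dir="/tmp")  # node-local scratch
    try:
        # Run the application with its working directory on local disk,
        # so all of its writes hit local scratch, not the shared FS.
        subprocess.run(cmd, cwd=workdir, check=True)

        # Aggregate everything into one archive: one big write to the
        # parallel file-system instead of many small ones.
        archive = os.path.join(workdir, "output.tar.gz")
        with tarfile.open(archive, "w:gz") as tar:
            for name in os.listdir(workdir):
                if name != "output.tar.gz":
                    tar.add(os.path.join(workdir, name), arcname=name)

        # Stage out with a few retries, backing off between attempts.
        for attempt in range(retries):
            try:
                shutil.copy(archive, dest_dir)
                return True
            except OSError:
                time.sleep(2 ** attempt)
        return False
    finally:
        shutil.rmtree(workdir, ignore_errors=True)
```

The same pattern could be written as a shell prologue/epilogue in the SGE job script itself, which avoids touching the application at all.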
in meantime need pester supervisors more information on error - i've never run personally, haven't had use program large number of datasets (yet).
in case it's useful: run solaris on our hpc system sge (sun gridengine) batch queue system. file-system nfs4, , storage servers run solaris. hpc nodes , storage servers communicate on fibre channel links.
Most parallel file systems, particularly those at supercomputing centres, are targeted at HPC applications rather than serial-farm type workloads. As a result, they're painstakingly optimized for bandwidth, not for IOPs (I/O operations per second); that is, they are aimed at big (1000+ process) jobs writing a handful of mammoth files, rather than zillions of little jobs outputting octillions of tiny files. It is all too easy for users to run something that works fine(ish) on a desktop and naively scale it up to hundreds of simultaneous jobs that starve the system of IOPs, hanging their own jobs and typically those of others on the same system.
The main thing you can do here is aggregate, aggregate, aggregate. It would be best if you could tell us where you're running so we can get more information on the system. In any case, some tried-and-true strategies:
- If you are outputting many files per job, change your output strategy so that each job writes out one file which contains all the others. If you have a local ramdisk, you can do something as simple as writing them to the ramdisk, then tar-gzipping them out to the real filesystem.
- Write in binary, not in ASCII. Big data never goes in ASCII. Binary formats are roughly 10x faster to write, somewhat smaller, and can be written in big chunks at a time rather than a few numbers at a time in a loop, which leads to:
- Big writes are better than little writes. Every IO operation is something the file system has to do. Make few, big writes rather than looping over tiny ones.
- Similarly, don't write in formats which require you to seek around to write to different parts of the file at different times. Seeks are slow and useless.
- If you're running many jobs on a node, you can use the same ramdisk trick as above (or the local disk) to tar up all the jobs' outputs and send them out to the parallel file system all at once.
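The binary and big-writes points can be illustrated with a small sketch. This is only a comparison under assumed conditions: Python's `array` module stands in for whatever format the application actually uses, and the data is a made-up list of integers.

```python
import array
import os
import tempfile

values = list(range(100000))
tmpdir = tempfile.mkdtemp()

# ASCII, one small write per number: many tiny IO operations,
# each costing a formatting step plus several bytes on disk.
ascii_path = os.path.join(tmpdir, "data.txt")
with open(ascii_path, "w") as f:
    for v in values:
        f.write("%d\n" % v)

# Binary, one big write: a single contiguous buffer (4 bytes per
# integer) handed to the OS in one call.
bin_path = os.path.join(tmpdir, "data.bin")
with open(bin_path, "wb") as f:
    array.array("i", values).tofile(f)

print(os.path.getsize(ascii_path), os.path.getsize(bin_path))
```

The binary file is both smaller and written in one operation instead of 100,000; for floating-point data the size advantage is usually larger still, since ASCII floats carry many digits per value.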
The suggestions above will benefit the I/O performance of your code everywhere, not just on parallel file systems. IO is slow everywhere, and the more you can do in memory and the fewer actual IO operations you execute, the faster it will go. Some systems may be more sensitive than others, so you may not notice it so much on your laptop, but it will help.
Similarly, having fewer big files rather than many small files will speed up everything from directory listings to backups on your filesystem; it is good all around.