Parallel Mass-File Processing
Task at hand: Process ca. 400,000 files. In our case each file needed to be converted from EBCDIC to ASCII.
Obviously, you could do this sequentially. But on a multiprocessor machine you should make use of all the processing power. The chosen approach is as follows:
- Generate a list of all files to be processed, i.e., a file with all filenames, henceforth called fl. For example:
find . -mindepth 2 > fl
- Split fl into 32 parts ("chunks"):
split -nl/32 fl fl\.
- Each chunk is now processed in parallel:
for i in fl.??; do processEachChunk $i & done
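The split-and-fan-out steps above can be tried on a toy file list first. The following is only a sketch under assumptions: it runs in a scratch directory, uses 4 chunks instead of 32, and its processEachChunk is a stand-in that merely counts lines rather than converting anything. The added wait makes the script block until all background jobs are done.

```shell
#!/bin/sh
# Sketch: split a file list into chunks and process the chunks in parallel.
# processEachChunk below is a stand-in that only counts lines; it is NOT the
# conversion routine from the article.
set -eu

work=$(mktemp -d)            # scratch directory, so nothing real is touched
cd "$work"

seq 1 100 > fl               # stand-in for the real list of filenames

split -n l/4 fl fl.          # 4 chunks, lines kept whole: fl.aa .. fl.ad

processEachChunk() {
    wc -l < "$1" > "$1.count"
}

for i in fl.??; do
    processEachChunk "$i" &  # one background job per chunk
done
wait                         # block until every background job has finished

# Sum up the per-chunk results to check that no line was lost.
total=0
for c in fl.??.count; do
    total=$((total + $(cat "$c")))
done
echo "processed $total lines in $(ls fl.?? | wc -l) chunks"
```

Without the wait, a calling script would continue (or exit) while chunks are still being processed.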
In our case each file is processed as below, i.e., processEachChunk looks like:
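T=/tmp/mvscvtInp.$$
# Read one filename per line from the chunk file given as first argument
while read -r fn; do
    #echo $fn
    if [ -f "$fn" ]; then
        # Move the original aside; skip this file if the move fails
        mv "$fn" "$T" || { echo "Error: i=|$fn|, T=|$T|"; continue; }
        mvscvt -a < "$T" > "$fn"
    fi
done < "$1"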
mvscvt is the homegrown program to convert EBCDIC to ASCII. If your EBCDIC files are not special in any way then you can use
dd conv=ascii if=...
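To see what dd does here, the following minimal sketch feeds it five EBCDIC bytes through a pipe instead of a file. It assumes GNU dd, and that the input uses the standard EBCDIC letter codes covered by the POSIX conversion table; the byte values are an illustration, not data from the original files.

```shell
#!/bin/sh
# Sketch: convert a few EBCDIC bytes to ASCII with dd (GNU dd assumed).
# The octal escapes are the EBCDIC codes for the letters H E L L O
# (0xC8 0xC5 0xD3 0xD3 0xD6).
converted=$(printf '\310\305\323\323\326' | dd conv=ascii 2>/dev/null)
echo "$converted"
```

The 2>/dev/null only hides the transfer statistics that dd prints on stderr.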
If possible, i.e., if all data fits into main memory, do this operation on a RAM disk. On Arch Linux /tmp is mounted as tmpfs, i.e., a RAM disk.
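Whether /tmp really is a RAM disk on a given machine can be checked before starting. This is a small sketch assuming GNU df, which supports the --output option; on a system like the one described it would report tmpfs.

```shell
#!/bin/sh
# Sketch: report the filesystem type backing /tmp (GNU df assumed).
# The first line of df output is the column header, so keep only the last line.
fstype=$(df --output=fstype /tmp | tail -n 1)
echo "/tmp is on: $fstype"
```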
Added 13-Apr-2023: Alternative route. Split into 64 files with all the filenames:
find . -type f > fl
split -nl/64 fl flsp
Now run a program, which can handle multiple arguments, and therefore does not need to be started over and over again.
for i in flsp*; do mvscvt -ar `cat $i` & done
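A possible variation, not the route taken above, is to let xargs do both the batching and the parallelism. That avoids the intermediate chunk files and copes with filenames containing spaces, which the backtick expansion would split. The sketch below assumes find and xargs with -print0, -0 and -P (GNU and BSD versions have them), and uses ls -d as a harmless stand-in for mvscvt -ar.

```shell
#!/bin/sh
# Sketch: let xargs batch the filenames and run several jobs in parallel.
# 'ls -d' is a stand-in for the real command (e.g. mvscvt -ar).
set -eu

work=$(mktemp -d)                      # scratch directory for the demo
cd "$work"
mkdir d
touch "d/a b.dat" d/c.dat d/e.dat      # includes a name with a space

# -print0 / -0 : NUL-separated names, so spaces in names survive
# -n 2         : at most 2 filenames per invocation of the command
# -P 4         : run up to 4 invocations at the same time
find d -type f -print0 | xargs -0 -n 2 -P 4 ls -d > seen

handled=$(wc -l < seen)
echo "handled $handled files"
```

For the real workload, -n would be set much higher so that the program is started only a few times per CPU.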