2nd October 2021

Parallel Mass-File Processing

Task at hand: Process ca. 400,000 files. In our case each file needed to be converted from EBCDIC to ASCII.

Obviously, you could do this sequentially. But on a multiprocessor machine you should make use of all the processing power. The chosen approach is as follows:

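  1. Generate a list of all files to be processed, i.e., a file containing all filenames, henceforth called fl. For example: find . -mindepth 2 > fl
  2. Split fl into 32 parts ("chunks") on line boundaries, giving fl.aa, fl.ab, and so on: split -n l/32 fl fl.
  3. Each chunk is now processed in parallel, one background job per chunk, and the final wait blocks until all jobs have finished: for i in fl.??; do processEachChunk "$i" & done; wait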
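
The three steps can be combined into one small driver. The following is only a sketch under a few assumptions: process_each_chunk is a hypothetical stand-in that merely prints the files it would convert, run_parallel is a name made up for this example, -type f is added so that only regular files end up in the list, and split -n requires GNU coreutils.

```shell
#!/bin/sh
# Hypothetical stand-in for the real per-chunk worker:
# it only reports which files it would convert.
process_each_chunk() {
    # $1 = chunk file holding one filename per line
    while IFS= read -r fn; do
        [ -f "$fn" ] && printf 'would convert: %s\n' "$fn"
    done < "$1"
}

# Run steps 1-3 in the current directory.
run_parallel() {
    # $1 = number of chunks (32 in the text above)
    find . -mindepth 2 -type f > fl      # step 1: list of all files
    split -n "l/$1" fl fl.               # step 2: fl.aa, fl.ab, ...
    for i in fl.??; do                   # step 3: one background job per chunk
        process_each_chunk "$i" &
    done
    wait                                 # block until every job has finished
}
```

Calling run_parallel 32 in the top-level data directory reproduces the setup described above. Because the list has one filename per line, filenames containing newlines are not handled by this sketch.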

In our case each file is processed as below, i.e., processEachChunk looks like:

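T=$(mktemp)    # assumption: scratch file; the original does not show how T is set
while IFS= read -r fn; do
    #echo "$fn"
    if [ -f "$fn" ]; then
        # move the EBCDIC original aside, then write the ASCII version under the old name
        mv "$fn" "$T"  ||  { echo "Error: fn=|$fn|, T=|$T|"; continue; }
        mvscvt -a < "$T" > "$fn"
    fi
done < "$1"    # assumption: the chunk file is passed as the first argument
rm -f "$T"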

Here mvscvt is a homegrown program to convert EBCDIC to ASCII. If your EBCDIC files are not special in any way then you can use

dd conv=ascii if=...

instead of mvscvt.
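
To see what dd conv=ascii does, here is a minimal sketch that feeds it the five EBCDIC bytes for "HELLO". The helper names ebcdic_hello and ebcdic_to_ascii are made up for this example. Without further options dd only translates byte by byte, so this says nothing about record layouts or binary fields inside real files.

```shell
# EBCDIC bytes for "HELLO", written as octal escapes
# (0xC8 0xC5 0xD3 0xD3 0xD6).
ebcdic_hello() { printf '\310\305\323\323\326'; }

# Convert stdin from EBCDIC to ASCII. dd prints its transfer
# statistics on stderr, which are discarded here.
ebcdic_to_ascii() { dd conv=ascii 2>/dev/null; }

ebcdic_hello | ebcdic_to_ascii; echo    # prints HELLO
```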

If possible, i.e., if all data fits into main memory, do this operation on a RAM disk. On Arch Linux /tmp is mounted as tmpfs, i.e., a RAM disk.
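
Whether the data fits can be checked before copying anything. The following is a rough sketch, with fits_on being a hypothetical helper name: it compares the size of a directory tree with the free space reported for a target filesystem such as /tmp. Note that the free space of a tmpfs is its configured size limit, which is not necessarily the same as the amount of free RAM.

```shell
# Succeeds if the data under $1 would fit into the free space of the
# filesystem holding $2 (for example /tmp when it is a tmpfs).
fits_on() {
    need=$(du -sk "$1" | awk '{print $1}')           # size of the data in KiB
    avail=$(df -Pk "$2" | awk 'NR==2 {print $4}')    # free space in KiB
    [ "$need" -lt "$avail" ]
}
```

Usage could look like: fits_on ./data /tmp && cp -a ./data /tmp/ where ./data is a placeholder for the directory holding the files.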

Categories: programming
Tags: shell, split
Author: Elmar Klausmeier