22nd February 2018

Unix Command comm: Compare Two Files

One lesser known Unix command is comm. This command is far less known than diff. comm needs two already sorted files FILE1 and FILE2. With the options

  1. -1 suppress column 1 (lines unique to FILE1)
  2. -2 suppress column 2 (lines unique to FILE2)
  3. -3 suppress column 3 (lines that appear in both files)

For example, comm -12 F1 F2 prints all common lines in files F1 and F2.

I thought that comm had a bug, so I wrote a short Perl script to simulate the behaviour of comm. Of course, there was no bug, I just missed to notice that the records in the two files did not match due to white space.

#!/bin/perl -W
use strict;

use Getopt::Std;
my %opts = ('d' => 0, 's' => 0);
getopts('ds:',\%opts);
my $debug = ($opts{'d'} != 0);
my $member = defined($opts{'s'}) ? $opts{'s'} : 0;

my ($set,$prev) = (1,"");
my %H;

while (<>) {
        $prev = $ARGV if ($prev eq "");
        if ($ARGV ne $prev) {
                $set *= 2;
                $prev = $ARGV;
        }
        chomp;
        $H{$_} |= $set;
        printf("\t>>\t%s: %s -> %d\n",$ARGV,$_,$H{$_}) if ($debug);
}

$member = 2*$set - 1 if ($member == 0);
printf("\t>>\tmember = %d\n",$member) if ($debug);
for my $i (sort keys %H) {
        printf("%s\n",$i) if ($H{$i} == $member);
}

Above Perl scripts does not need sorted input files, as it stores all records of the files in memory, in a hash. It uses a bitmask as a set. For example, mycomm -s2 F1 F2 prints only those records, which are only in file F2 but not in F1.




Categories: Perl, programming
Tags: ,
Author: Elmar Klausmeier