, 2 min read

Text Analysis using Concordance

When analyzing longer text, especially if this text was written by oneself, it helps to read the text in a different way, here using a concordance.

Assume your text is provided as PDF. Convert PDF to text using pdftotext, which is part of package poppler. Replace line breaks in text file with spaces using below C program (called linebreak.c):

#include <stdio.h>

int main(int argc, char *argv[]) {
        int c, flag=0;
        FILE *fp;

        if (argc >= 2) {
                if ((fp = fopen(argv[1],"rb")) == NULL)
                        return 1;
        } else {
                fp = stdin;
        }

        while ((c = fgetc(fp)) != EOF) {
                if (c == '\n') {
                        flag += 1;
                        if (flag > 1) { putchar(c); flag = 0; }
                        else putchar(' ');
                } else {
                        flag = 0;
                        putchar(c);
                }
        }

        return 0;
}

Then generate a list of (single) words with below Perl program:

#!/bin/perl -W
# Print word concordances

use strict;

my (%H,@F);

while () {
        chomp;
        s/\s+$//;       # rtrim
        @F = split;
        foreach my $w (@F) {
                $w =~ s/^\s+//; # ltrim
                $w =~ s/\s+$//; # rtrim
                $H{$w} += 1;
        }
}

foreach my $w (sort keys %H) {
        printf("\t%6d\t%s\n",$H{$w},$w);
}

To print all word pairs replace above loop with

while () {
        chomp;
        s/\s+$//;       # rtrim
        @F = split;
        for(my $i=0; $i<$#F; ++$i) {
                $F[$i] =~ s/^\s+//;     # ltrim
                $F[$i] =~ s/\s+$//;     # rtrim
                $F[$i+1] =~ s/^\s+//;   # ltrim
                $F[$i+1] =~ s/\s+$//;   # rtrim
                $H{$F[$i] . " " . $F[$i+1]} += 1;
        }
}

Similar, for word triples replace the loop with

while () {
        chomp;
        s/\s+$//;       # rtrim
        @F = split;
        for(my $i=0; $i+1<$#F; ++$i) {
                $F[$i] =~ s/^\s+//;     # ltrim
                $F[$i] =~ s/\s+$//;     # rtrim
                $F[$i+1] =~ s/^\s+//;   # ltrim
                $F[$i+1] =~ s/\s+$//;   # rtrim
                $F[$i+2] =~ s/^\s+//;   # ltrim
                $F[$i+2] =~ s/\s+$//;   # rtrim
                $H{$F[$i] . " " . $F[$i+1] . " " . $F[$i+2]} += 1;
        }
}

Printing concordances using Perl hashes is very simple, as one can see.

Here is an example from the man-page of expect using below sequence of commands:

( TERM=dumb; man expect ) | linebreak | word3concord | sort -r

Truncated result is

            16  For example, the
            13  example, the following
            12  the current process.
             9  the end of
             8  using Expectk, this
             8  this option is
             8  sent to the
             8  flag causes the
             8  body is executed
             8  Expectk, this option
             8  (When using Expectk,
             7  to the current
             7  the spawn id
             7  the most recent
             7  the current process
             7  the corresponding body
             7  option is specified
             7  is specified as
             7  corresponding body is
             7  by Don Libes,
             7  be used to
             6  set for the
             6  of the current
             6  is set for
             6  is an alias