Text Analysis using Concordance
When analyzing a longer text, especially one you wrote yourself, it helps to read it in a different way. Here that different way is a concordance: a list of words or word sequences together with how often each occurs.
Assume your text is provided as a PDF. Convert the PDF to text using pdftotext, which is part of the poppler package. Replace line breaks in the text file with spaces using the C program below (called linebreak.c):
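#include <stdio.h>

/* A lone newline becomes a space; every second consecutive newline is kept,
   so a blank line still ends the paragraph. */
int main(int argc, char *argv[]) {
    int c, flag = 0;    /* flag counts consecutive newlines seen so far */
    FILE *fp;

    if (argc >= 2) {
        if ((fp = fopen(argv[1],"rb")) == NULL)
            return 1;
    } else {
        fp = stdin;
    }

    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n') {
            flag += 1;
            if (flag > 1) { putchar(c); flag = 0; }
            else putchar(' ');
        } else {
            flag = 0;
            putchar(c);
        }
    }

    return 0;
}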
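The two preparation steps can be chained in the shell. The following is only a sketch: the function name pdf_to_flat_text is hypothetical, and it assumes that pdftotext is installed and that linebreak.c has already been compiled into ./linebreak in the current directory.

```shell
# Hypothetical helper: convert a PDF to text with wrapped lines joined.
# Assumes pdftotext (poppler) is installed and linebreak.c was compiled
# beforehand, for example with: cc -o linebreak linebreak.c
pdf_to_flat_text() {
    # "-" as the output file tells pdftotext to write to standard output.
    pdftotext "$1" - | ./linebreak
}

# Example call (paper.pdf is a placeholder file name):
#   pdf_to_flat_text paper.pdf > paper.txt
```

Writing to standard output avoids an intermediate text file, so the result can be piped straight into the counting scripts.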
Then count the individual words with the Perl program below:
#!/bin/perl -W
# Print word concordances
use strict;
my (%H,@F);

while (<>) {    # read lines from stdin or from files named on the command line
    chomp;
    s/\s+$//;    # rtrim
    @F = split;
    foreach my $w (@F) {
        $w =~ s/^\s+//;    # ltrim
        $w =~ s/\s+$//;    # rtrim
        $H{$w} += 1;
    }
}

foreach my $w (sort keys %H) {
    printf("\t%6d\t%s\n",$H{$w},$w);
}
To count all word pairs, replace the while loop above with:
while (<>) {
    chomp;
    s/\s+$//;    # rtrim
    @F = split;
    for(my $i=0; $i<$#F; ++$i) {
        $F[$i] =~ s/^\s+//;    # ltrim
        $F[$i] =~ s/\s+$//;    # rtrim
        $F[$i+1] =~ s/^\s+//;    # ltrim
        $F[$i+1] =~ s/\s+$//;    # rtrim
        $H{$F[$i] . " " . $F[$i+1]} += 1;
    }
}
Similarly, for word triples, replace the loop with:
while (<>) {
    chomp;
    s/\s+$//;    # rtrim
    @F = split;
    for(my $i=0; $i+1<$#F; ++$i) {
        $F[$i] =~ s/^\s+//;    # ltrim
        $F[$i] =~ s/\s+$//;    # rtrim
        $F[$i+1] =~ s/^\s+//;    # ltrim
        $F[$i+1] =~ s/\s+$//;    # rtrim
        $F[$i+2] =~ s/^\s+//;    # ltrim
        $F[$i+2] =~ s/\s+$//;    # rtrim
        $H{$F[$i] . " " . $F[$i+1] . " " . $F[$i+2]} += 1;
    }
}
Printing concordances using Perl hashes is very simple, as one can see.
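The single-word, pair, and triple loops differ only in the window length, so the same counting can be written once with the length as a parameter. The following is a sketch in shell and awk, not the Perl scripts from this post: the function name ngrams is hypothetical, and awk's default field splitting stands in for Perl's split.

```shell
# Hypothetical helper: count word N-grams within each input line.
# Usage: ngrams N < text    (N = 1 for words, 2 for pairs, 3 for triples)
ngrams() {
    awk -v n="$1" '
    {
        # Slide a window of n words across the fields of this line.
        for (i = 1; i + n - 1 <= NF; i++) {
            key = $i
            for (j = 1; j < n; j++) key = key " " $(i + j)
            count[key]++
        }
    }
    END {
        # Same output layout as the Perl printf: tab, count, tab, words.
        for (key in count) printf "\t%6d\t%s\n", count[key], key
    }' | sort
}
```

As in the Perl loops, a window never crosses a line boundary, so after linebreak has joined each paragraph into one line the N-grams stay within a paragraph. Lines with fewer than N words contribute nothing.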
Here is an example from the man page of expect, using the following sequence of commands:
( TERM=dumb; man expect ) | linebreak | word3concord | sort -r
The truncated result is:
16 For example, the
13 example, the following
12 the current process.
9 the end of
8 using Expectk, this
8 this option is
8 sent to the
8 flag causes the
8 body is executed
8 Expectk, this option
8 (When using Expectk,
7 to the current
7 the spawn id
7 the most recent
7 the current process
7 the corresponding body
7 option is specified
7 is specified as
7 corresponding body is
7 by Don Libes,
7 be used to
6 set for the
6 of the current
6 is set for
6 is an alias