AWStats and Hiawatha
Hiawatha is a secure and reliable web-server; it is used for this blog. AWStats is a collection of Perl-scripts to analyze log-files from web-servers. By default, AWStats can read Apache log-files, but it cannot directly read log-files from Hiawatha.
The Hiawatha log-file format is a sequence of pipe-separated fields (a sample line follows the list):
- host, this is the IP address
- date + time
- HTTP status code, e.g., 200
- size in bytes
- URL including the method (GET/POST/etc.)
- referrer
- user agent, e.g., Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0
- and other fields
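For illustration, a single Hiawatha log line might then look like this (host, size, and referrer are made up; the trailing "..." stands for the remaining fields):
93.184.216.34|Sat 01 Jan 2022 12:34:56 +0100|200|5123|GET /index.html HTTP/1.1|https://example.com/|Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0|...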
I already use GoAccess to analyze my log-files, see Using GoAccess with Hiawatha Web-Server. GoAccess is pretty fast and the output looks good, but it cannot really filter the data: for example, it shows huge amounts of traffic generated by bots. So I hoped that AWStats could fill this gap.
1. Modifying the Perl program. I first tried to configure AWStats in
/etc/awstats/awstats.eklausmeier.goip.de.conf
Unfortunately, even after preformatting the Hiawatha log-files this didn't work. So I had to change the source code in awstats.pl.
I created a new LogFormat=5:
elsif ( $LogFormat eq '5' ) {    # Hiawatha web-server log-format
    $PerlParsingFormat = "special Hiawatha web-server log-format";
    $pos_host    = 0;
    $pos_date    = 1;
    $pos_code    = 2;
    $pos_size    = 3;
    $pos_method  = 4;    # together with url
    $pos_url     = 5;    # together with method
    $pos_referer = 6;
    $pos_agent   = 7;
    @fieldlib = (
        'host', 'date', 'code', 'size',
        'method', 'url', 'referer', 'ua'
    );
}
There are two places in awstats.pl which actually read from the log-file. I changed both places to call a small subroutine, which handles Hiawatha log-files natively and without hassle.
# split log line into fields
sub splitLog (@) {
    my ($PerlParsingFormat, $line) = @_;
    if ($PerlParsingFormat eq '(?^:^special Hiawatha web-server log-format)') {
        my @F = split('\|',$line);
        my @R;
        ($R[0],$R[2],$R[3],$R[6],$R[7]) = ($F[0],$F[2],$F[3],$F[5],$F[6]);
        my ($day,$month,$year,$hms) = ($F[1] =~ /\w\w\w\s+(\d+)\s+(\w+)\s+(\d+)\s+(\d+:\d+:\d+)/);
        $R[1] = sprintf("%02d/%s/%04d:%s",$day,$month,$year,$hms);    # DD/Month/YYYY:HH:MM:SS (Apache)
        ($R[4],$R[5]) = ($F[4] =~ /^(\w+)\s+([^\s]+)\s+[^\s]+$/);    # GET /index.html HTTP/x.x
        return @R;
    }
    return map( /$PerlParsingFormat/, $line );
}
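To convince yourself that the subroutine works, you can call it on a sample line (a quick test sketch; the values are made up, and splitLog from above must be in scope):
my $fmt  = '(?^:^special Hiawatha web-server log-format)';
my $line = '93.184.216.34|Sat 01 Jan 2022 12:34:56 +0100|200|5123|'
         . 'GET /index.html HTTP/1.1|https://example.com/|Mozilla/5.0|';
my @field = splitLog($fmt, $line);
print "host=$field[0] date=$field[1] method=$field[4] url=$field[5]\n";
# prints: host=93.184.216.34 date=01/Jan/2022:12:34:56 method=GET url=/index.html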
This subroutine is then called as
@field = splitLog($PerlParsingFormat,$line);
instead of
@field = map( /$PerlParsingFormat/, $line );
and as
if ( !( @field = splitLog($PerlParsingFormat,$line) ) ) {
instead of
if ( !( @field = map( /$PerlParsingFormat/, $line ) ) ) {
In total this occurs two times in awstats.pl.
2. Configuring AWStats. To actually run awstats.pl
you have to symlink the lib-directory first:
ln -s /usr/share/webapps/awstats/cgi-bin/lib lib
assuming that the AWStats package was installed under /usr/share/webapps/awstats.
In /etc/awstats/awstats.eklausmeier.goip.de.conf
I set:
LogFile="/tmp/access.log"
LogType=W
LogFormat=5
LogSeparator="\|"
SiteDomain="eklausmeier.goip.de"
DNSLookup=2
DirIcons="/awstatsicon"
You have to set a symbolic link in your web-root:
ln -s /usr/share/webapps/awstats/icon awstatsicon
3. Running AWStats. AWStats is then started like this:
./awstats.pl -config=eklausmeier.goip.de -output -staticlinks > /srv/http/awstats.html
Generating all the reports:
/usr/share/awstats/tools/awstats_buildstaticpages.pl -config=eklausmeier.goip.de -dir=/srv/http
Hiawatha rotates the log-files and gzips them. To concatenate them all, use something like:
L=/tmp/access.log; rm -f $L; for i in $(seq 52 -1 2); do zcat access.log.$i.gz >> $L; done; cat access.log.1 access.log >> $L
4. Example output. The AWStats overview of spiders and bots looks similar to this:
The detailed overview of the most requested URLs looks similar to this:
The list of used operating systems looks like this:
Added 02-Apr-2022: One of the heuristics AWStats uses to detect bots is examining access to robots.txt: machines that access robots.txt are assumed to be bots. Well, it turns out many bots do not bother to look at robots.txt, so this heuristic is not very reliable. Just for the record: Google and Yandex do honor robots.txt.
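To check for yourself which hosts fetch robots.txt, a Perl one-liner over the pipe-separated log works (a sketch, assuming the concatenated log in /tmp/access.log as above):
perl -F'\|' -lane 'print $F[0] if $F[4] =~ m{^GET /robots\.txt\b}' /tmp/access.log | sort -u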