13th December 2021

Filtering Bots and Crawlers from Access.log

1. Problem statement. When you run a web-server on your machine, many bots and crawlers will visit it. When analysing how many "real" visitors you have, you should therefore suppress these entries in the web-server's log file from your analysis.

This blog is served by the Hiawatha web-server. Every visitor produces at least one entry in the web-server's log file, which Hiawatha and many other web-servers call access.log. Hiawatha has further log files, e.g., error.log, garbage.log, and system.log. An entry in access.log looks something like this:

192.168.178.24|Mon 13 Dec 2021 11:37:03 +0100|200|6148|GET /blog/2014/09-20-advances-in-automotive-technology/ HTTP/1.1||Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36|Host: nucsaaze|Connection: keep-alive|Cache-Control: max-age=0|Upgrade-Insecure-Requests: 1|Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9|Sec-GPC: 1|Accept-Encoding: gzip, deflate|Accept-Language: en-US,en;q=0.9
192.168.178.24|Mon 13 Dec 2021 11:37:04 +0100|200|168499|GET /img/RynoSingleWheel.jpg HTTP/1.1|http://nucsaaze/blog/2014/09-20-advances-in-automotive-technology/|Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36|Host: nucsaaze|Connection: keep-alive|Accept: image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8|Sec-GPC: 1|Accept-Encoding: gzip, deflate|Accept-Language: en-US,en;q=0.9
192.168.178.24|Mon 13 Dec 2021 11:37:04 +0100|304|182|GET /img/LitMotorBike.jpg HTTP/1.1|http://nucsaaze/blog/2014/09-20-advances-in-automotive-technology/|Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36|Host: nucsaaze|Connection: keep-alive|Accept: image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8|Sec-GPC: 1|Accept-Encoding: gzip, deflate|Accept-Language: en-US,en;q=0.9|If-Modified-Since: Mon, 13 Dec 2021 10:14:34 GMT
171.25.193.77|Mon 13 Dec 2021 12:58:31 +0100|200|6419|GET / HTTP/1.1||Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36|Host: 94.114.1.108:8443|Connection: close|Accept: */*|Accept-Encoding: gzip
185.220.100.242|Mon 13 Dec 2021 12:58:36 +0100|200|2456|GET /favicon.ico HTTP/1.1||Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36|Host: 94.114.1.108:8443|Connection: close|Accept: */*|Accept-Encoding: gzip

Hiawatha's access.log has the following fields, separated by |; a minimal parsing sketch in Perl follows the list:

  1. host
  2. date
  3. code
  4. size
  5. URL
  6. referer
  7. user agent
  8. and other fields
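
For illustration, here is a minimal sketch of how such a line can be split into these fields in Perl; the field names follow the list above, and everything after the user agent is a variable number of HTTP request headers:

use strict;

# Minimal sketch: split each access.log line at the | separator
while (<>) {
    chomp;
    my ($host,$date,$code,$size,$url,$referer,$ua) = split /\|/;
    print "$host\t$code\t$url\n";
}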

2. Examples. Here is the Google-Bot as seen in access.log; it is a highly welcome crawler:

66.249.76.169|Fri 10 Dec 2021 12:32:14 +0100|200|7037|GET /blog/2021/07-13-performance-comparison-c-vs-java-vs-javascript-vs-luajit-vs-pypy-vs-php-vs-python-vs-perl/ HTTP/1.1||Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)|Host: eklausmeier.goip.de|Connection: keep-alive|Accept: text/html,application/xhtml+xml,application/signed-exchange;v=b3,application/xml;q=0.9,*/*;q=0.8|From: googlebot(at)googlebot.com|Accept-Encoding: gzip, deflate, br|If-Modified-Since: Fri, 19 Nov 2021 04:23:35 GMT

Here are two entries from the Yandex-Bot, also highly welcome:

5.255.253.106|Wed 08 Dec 2021 12:06:43 +0100|200|2761|GET /blog/2014/01-19-cisco-2014-annual-security-report-java-continues-to-be-most-vulnerable-of-all-web-exploits HTTP/1.1||Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)|Host: eklausmeier.goip.de|Connection: keep-alive|From: support@search.yandex.ru|Accept-Encoding: gzip,deflate|Accept: */*
213.180.203.141|Wed 08 Dec 2021 12:06:44 +0100|200|2434|GET /blog/2013/12-08-surfing-the-internet-with-100-mbits HTTP/1.1||Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)|Host: eklausmeier.goip.de|Connection: keep-alive|From: support@search.yandex.ru|Accept-Encoding: gzip,deflate|Accept: */*

Here is an example of a bot whose purpose is doubtful and probably just nonsense:

45.155.205.233|Wed 08 Dec 2021 19:45:48 +0100|303|228|GET /index.php?s=/Index/\think\app/invokefunction&function=call_user_func_array&vars[0]=md5&vars[1][]=HelloThinkPHP21 HTTP/1.1||Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36|Host: 94.114.1.108:443|Accept-Encoding: gzip|Connection: close

Some bots and crawlers identify themselves in the user-agent field, e.g., Google and Yandex. Unfortunately, many do not. Quite the contrary: they try to disguise themselves as normal browsers.

The Perl script accesslogFilter below filters access.log so that only the "real" visitors remain. The script filters according to:

  1. IP addresses: some bots and crawlers do not identify themselves, so we have to resort to naked addresses
  2. HTTP status codes
  3. identifying string in the user-agent field

3. Configuration. Here is the list of IP addresses I found to be "annoying" in any analysis:

my %ips = (
    '18.212.118.57' => 0, '18.232.89.176' => 0, '3.94.81.106' => 0,
    '34.239.184.105' => 0, '54.80.126.99' => 0, '54.162.60.209' => 0,       # compute-1.amazonaws.com
    '62.138.2.14' => 0, '62.138.2.214' => 0, '62.138.2.160' => 0,
    '62.138.3.52' => 0, '62.138.6.15' => 0, '85.25.210.23' => 0,    # startdedicated.de
    '66.240.192.138' => 0, '66.240.219.133' => 0, '66.240.219.146' => 0, '66.240.236.119' => 0,
    '71.6.135.131' => 0, '71.6.146.185' => 0, '71.6.158.166' => 0, '71.6.165.200' => 0,
    '71.6.167.142' => 0, '71.6.199.23' => 0, '80.82.77.33' => 0, '80.82.77.139' => 0,
    '82.221.105.6' => 0, '82.221.105.7' => 0,       # shodan.io
    '138.246.253.24' => 0, '106.55.250.60' => 0,    # various robots.txt reader
    '127.0.0.1' => 0,       # localhost
    '192.168.178.2' => 0, '192.168.178.20' => 0, '192.168.178.24' => 0,
    '192.168.178.118' => 0, '192.168.178.249' => 0  # local network
);
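
How does one find candidates for this list? One way is to count the requests per client address and look at the busiest ones; here is a minimal sketch (not part of accesslogFilter):

use strict;

# Count requests per client IP and print the busiest addresses first
my %count;
while (<>) {
    my ($ip) = split /\|/;
    $count{$ip} += 1;
}
for (sort { $count{$b} <=> $count{$a} } keys %count) {
    printf("%7d\t%s\n", $count{$_}, $_);
}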

For example, shodan.io does not identify itself in the user-agent string. Therefore I simply identified its addresses by using nslookup:

$ nslookup 82.221.105.7
7.105.221.82.in-addr.arpa       name = census11.shodan.io.
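
The same reverse lookup can also be done from within Perl, e.g., when checking many addresses; a minimal sketch, using the address from above:

use strict;
use Socket;     # provides inet_aton() and AF_INET

# Reverse-DNS lookup, like nslookup above
my $ip = '82.221.105.7';
my $name = gethostbyaddr(inet_aton($ip), AF_INET);
printf("%s\t%s\n", $ip, defined($name) ? $name : "no PTR record");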

Here is the list of HTTP status codes which are filtered out:

my %errorCode = ( 301 => 0, 302 => 0, 400 => 0, 403 => 0, 404 => 0, 405 => 0, 500 => 0, 501 => 0, 503 => 0, 505 => 0 );

Status code 404 would also be very useful for finding links on my web-server which point to nowhere. But the bot requests are so numerous that it seems more beneficial to just filter it out.
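
If one nevertheless wants to use the 404 entries to hunt for dead internal links, a small sketch like the following could help: it keeps only 404 entries whose referer points to the own site. The host name eklausmeier.goip.de is taken from the log excerpts above and would need to be adapted.

use strict;

# Minimal sketch: print internal dead links, i.e., 404 responses
# whose referer field points to a page on this site
while (<>) {
    my @F = split /\|/;
    next unless $F[2] eq '404';                                 # status code field
    next unless $F[5] =~ m{^https?://eklausmeier\.goip\.de};    # referer field, internal links only
    print "$F[5] -> $F[4]\n";                                   # referer -> request line
}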

Here is the list of user-agent strings which identify a bot or crawler:

my %bots = (
    adsbot => 0, adscanner => 0, ahrefsbot => 0, applebot => 0, 'archive.org_bot' => 0,
    baiduspider => 0, bingbot => 0, blexbot => 0,
    ccbot => 0, censysinspect => 0,
    'clark-crawler2' => 0, crawler => 0, crawler2 => 0, 'crawler.php' => 0,
    criteobot => 0, curl => 0,
    dataforseobot => 0, dotbot => 0, 'feedsearch-crawler' => 0,
    facebookexternalhit => 0, fluid => 0, 'go-http-client' => 0,
    googlebot => 0, ioncrawl => 0, lighthouse => 0, ltx71 => 0,
    mediapartners => 0, 'mediapartners-google' => 0,
    netestate => 0, nuclei => 0, nicecrawler => 0, petalbot => 0,
    proximic => 0, 'pulsepoint-ads.txt-crawler' => 0, 'python-requests' => 0,
    semrushbot => 0, 'semrushbot-ba' => 0, sitecheckerbotcrawler => 0, sitelockspider => 0,
    twitterbot => 0, ucrawl => 0,
    virustotal => 0, 'web-crawler' => 0, 'webmeup-crawler' => 0,
    'x-fb-crawlerbot' => 0, yandexbot => 0,
);
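
The script in the next section lowercases the user-agent field and splits it on a set of delimiters; each resulting token is then looked up in %bots. A minimal sketch with the Googlebot user-agent from the examples above shows which tokens come out of this split:

use strict;

# Same split pattern as in accesslogFilter below; "googlebot" is among the tokens
my $ua = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
for ( split(/[ :;,\/\(\)\@]/, lc $ua) ) {
    print "token: $_\n" if length;
}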

4. Perl script. With those configurations in place, the rest of the Perl script accesslogFilter is pretty straightforward:

use strict;

use Getopt::Std;
my %opts = ();
getopts('o:',\%opts);   # option -o <file>: write filter statistics to <file>
my $statout = (defined($opts{'o'}) ? $opts{'o'} : undef);
my ($emptyUA,$badURL,$smallUA) = (0,0,0);       # counters: empty user-agent, bad URL, very short user-agent

Put the above configuration, i.e., the hash tables %ips, %errorCode, and %bots, after these declarations. Now the actual filtering code:

W: while (<>) {
        my @F = split /\|/;
        if (defined($ips{$F[0]})) { $ips{$F[0]} += 1; next; }   # known bot/crawler/local IP address
        if ($#F <= 5) { $emptyUA += 1; next; }  # too few fields, no user-agent present
        if (defined($errorCode{$F[2]})) { $errorCode{$F[2]} += 1; next; }       # filtered status codes, e.g., 404, are ignored
        if ($F[4] =~ /XDEBUG_SESSION_START|HelloThink(CMF|PHP)/) { $badURL += 1; next; }        # bad/silly probe URLs
        if ($#F >= 6) { # Is UA field available?
                if (length($F[6]) <= 3) { $smallUA += 1; next W; }
                for ( split(/[ :;,\/\(\)\@]/,lc $F[6]) ) {
                        if (defined($bots{$_})) { $bots{$_} += 1; next W; }     # skip bots
                }
        }
        print;
}

The following code is for reporting purposes only, i.e., when the command line option -o report.txt is given; it is not required for the filtering itself:

if (defined($statout)) {
        open(F,">$statout") || die("Cannot write to $statout");
        for (sort keys %ips) {
                next if ($ips{$_} == 0);
                printf(F "IP\t%d\t%s\n",$ips{$_},$_);
        }
        printf(F "eUA\t%d\n",$emptyUA);
        printf(F "badURL\t%d\n",$badURL);
        for (sort keys %errorCode) {
                next if ($errorCode{$_} == 0);
                printf(F "code\t%d\t%d\n",$errorCode{$_},$_);
        }
        printf(F "sUA\t%d\n",$smallUA);
        for (sort keys %bots) {
                next if ($bots{$_} == 0);
                printf(F "bot\t%d\t%s\n",$bots{$_},$_);
        }
}

The script is on GitHub: eklausme/bin/accesslogFilter.
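
A typical invocation might look like this (the file names are just examples):

$ perl accesslogFilter -o report.txt access.log > access-filtered.log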

5. Reporting. Now to get a feeling for how much filtering actually happens when applying the above rules:

  1. Unmodified access.log for ca. one year: 181,282 entries
  2. Filtered access.log with accesslogFilter: 32,451 entries remain, i.e., less than 20%

So more than 80% of the visits to my web-server stem from bots, crawlers, junk, or from myself (1 − 32,451/181,282 ≈ 82%).

Unfiltered output of goaccess looks like this:

Filtered output of goaccess looks like this:

I have written about goaccess here: Using GoAccess with Hiawatha Web-Server. Unfortunately, goaccess is not good at filtering.

The statistics of accesslogFilter are below. First, the statistics for filtering according to IP address. The host "startdedicated" (85.25.210.23) is at the top:

IP	704	106.55.250.60
IP	154	127.0.0.1
IP	419	138.246.253.24
IP	335	18.212.118.57
IP	333	18.232.89.176
IP	1753	192.168.178.118
IP	1341	192.168.178.2
IP	1554	192.168.178.20
IP	6180	192.168.178.24
IP	496	192.168.178.249
IP	331	3.94.81.106
IP	335	34.239.184.105
IP	335	54.162.60.209
IP	329	54.80.126.99
IP	282	62.138.2.14
IP	6757	62.138.2.160
IP	286	62.138.2.214
IP	1353	62.138.3.52
IP	631	62.138.6.15
IP	11	66.240.192.138
IP	14	66.240.219.133
IP	6	66.240.219.146
IP	17	66.240.236.119
IP	32	71.6.135.131
IP	10	71.6.146.185
IP	14	71.6.158.166
IP	5	71.6.165.200
IP	10	71.6.167.142
IP	17	71.6.199.23
IP	32	80.82.77.139
IP	44	80.82.77.33
IP	52	82.221.105.6
IP	41	82.221.105.7
IP	15688	85.25.210.23

The above numbers are depicted in the pie-chart below:

Special filtering according to empty or very short user-agent, or bad/silly URL:

eUA	143
badURL	1835
sUA	3580

Distribution of filtered HTTP status codes. As mentioned above, code 404 dominates by far:

code	16033	301
code	15	302
code	16	400
code	67	403
code	48222	404
code	3877	405
code	91	500
code	126	501
code	981	503

The above numbers as a pie-chart:

Statistics on filtered entries according to user-agent strings. Google is at the top, followed by Semrush:

bot	326	adsbot
bot	932	adscanner
bot	5163	ahrefsbot
bot	325	applebot
bot	18	archive.org_bot
bot	426	baiduspider
bot	1011	bingbot
bot	1105	blexbot
bot	2	ccbot
bot	1131	censysinspect
bot	182	clark-crawler2
bot	39	crawler
bot	88	crawler.php
bot	4	criteobot
bot	196	curl
bot	379	dataforseobot
bot	1688	dotbot
bot	8	facebookexternalhit
bot	1	feedsearch-crawler
bot	16	fluid
bot	226	go-http-client
bot	7016	googlebot
bot	34	ioncrawl
bot	19	ltx71
bot	2059	mediapartners-google
bot	410	netestate
bot	1	nicecrawler
bot	346	nuclei
bot	1088	petalbot
bot	77	proximic
bot	4	pulsepoint-ads.txt-crawler
bot	342	python-requests
bot	6609	semrushbot
bot	159	semrushbot-ba
bot	26	sitecheckerbotcrawler
bot	699	sitelockspider
bot	68	twitterbot
bot	2	ucrawl
bot	4	virustotal
bot	1715	yandexbot

The above data as a pie-chart:

This gives a good graphical indication of why Bing or Baidu are way inferior to Google search.

6. References. The links below might provide further information on bots & crawlers.

  1. Web Crawlers: Love the Good, but Kill the Bad and the Ugly: This post talks about limiting the bots & crawlers accessing your web-site, as they slow down the entire server. The author mentions facebookexternalhit visiting his site 541 times per hour!
  2. IAB/ABC International Spiders and Bots List: A commercial list with bots, spiders, and crawlers. The list costs 15,000 USD.
  3. Bots and the Adobe Experience Cloud: AEC uses IAB.
  4. List of bots in StopBadBots: aBots.php
  5. AWStats robot list: robots.pm



Categories: web
Tags: web-server, access.log, bot, crawler
Author: Elmar Klausmeier