17th January 2022

Generate RSS from HTML

As described in Generate RSS from Markdown, extracting RSS from Markdown with frontmatter is simple. This time I took a slightly different approach and generate the RSS from the HTML files directly.

For this blog I still want an RSS feed, but Simplified Saaze does not provide this functionality. Simplified Saaze, like the original Saaze, is supposed to be "stupidly simple" by design, so it does not suffer from feature creep.

An RSS feed contains a header with some fixed XML. Each RSS entry consists of:

  1. link / URL
  2. publication date
  3. title
  4. the full blog post content

The feed closes with a few XML tags. Because these blog posts are generated by Saaze from Markdown with frontmatter, the HTML contains all the information from the Markdown source, provided the templates pass it through.
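To make the structure concrete, here is a minimal sketch of how one entry with these four fields could be wrapped into an `<item>` element. The helper name `rss_item` and the sample values are made up for illustration; they are not part of bloghtmlrss.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: wrap the four fields of one post into an RSS <item>.
# The content goes into a CDATA section, so the HTML needs no further
# escaping as long as it does not itself contain the sequence "]]>".
sub rss_item {
    my ($link, $pubDate, $title, $content) = @_;
    return "\t<item>\n"
         . "\t\t<link>$link</link>\n"
         . "\t\t<guid>$link</guid>\n"
         . "\t\t<title>$title</title>\n"
         . "\t\t<pubDate>$pubDate</pubDate>\n"
         . "\t\t<description><![CDATA[\n$content\n\t\t]]></description>\n"
         . "\t</item>\n";
}

# Sample values, invented for this example
print rss_item(
    "https://example.com/blog/2022/01-17-sample/",
    "Mon, 17 Jan 2022 12:00:00 GMT",
    "Sample post",
    "<p>Hello, RSS.</p>"
);
```

The link is used twice, once as `<link>` and once as `<guid>`, which is also what the script further down does.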

The Perl script below extracts this information from the generated HTML. It takes the title from the <h1> tag and the date from a <p class=...> element, and it stops reading once it sees the <footer> tag.

#!/bin/perl -W
# Create RSS XML file ("feed") from Saaze generated HTML files
#
# Input: List of HTML files (order of files determines order of <item>)
# Output: RSS
#
# Example:
#      bloghtmlrss `find blog/2021 -name index.html | sort -r`

use strict;
use POSIX qw(strftime);	# mktime is not needed: strftime normalizes its arguments itself

my $dt = strftime("%a, %d %b %Y %H:%M:%S GMT",gmtime());	# RFC-822 format: Wed, 02 Oct 2002 13:00:00 GMT
print <<"EOT";
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
    <title>Elmar Klausmeier's Blog</title>
    <description>Elmar Klausmeier's Blog</description>
    <lastBuildDate>$dt</lastBuildDate>
    <link>https://eklausmeier.goip.de</link>
    <atom:link href="https://eklausmeier.goip.de/feed.xml" rel="self" type="application/rss+xml" />
    <generator>bloghtmlrss</generator>

EOT

my %monthNr = (	# Convert full month name to month-number minus one
    "January" => 0, "February" => 1, "March" => 2, "April" => 3,
    "May" => 4, "June" => 5, "July" => 6, "August" => 7,
    "September" => 8, "October" => 9, "November" => 10, "December" => 11
);

sub item(@) {
    my $f = $_[0];
    return if ($f =~ /\/\d{4}\/index\.html$/);	# ignore .../2021/index.html etc.
    open(F,"< $f") || die("Cannot open $f: $!");

    my $link = $f;
    $link =~ s/index\.html$//;
    print "\t<item>\n"
    . "\t\t<link>https://eklausmeier.goip.de/$link</link>\n"
    . "\t\t<guid>https://eklausmeier.goip.de/$link</guid>\n";

    my ($dt,$year,$month,$day,$hour,$minute,$sec);
    my ($title,$linecnt) = (0,0);
    while (<F>) {
        chomp;
        if (/^<h1.*?>(.+?)<\/h1>/) {
            printf("\t\t<title>%s</title>\n",$1);
            $title = 1;
        } elsif (/^\s*<p class=.+?>(\d+)..\s+(\w+)\s+(\d\d\d\d)<\/p>/) {
            ($year,$month,$day,$hour,$minute,$sec) = ($3,$monthNr{$2},$1,12,0,0);
            # RFC-822 format: Wed, 02 Oct 2002 13:00:00 GMT
            $dt = strftime("%a, %d %b %Y %H:%M:%S GMT",$sec,$minute,$hour,$day,$month,$year-1900);
            printf("\t\t<pubDate>%s</pubDate>\n",$dt);
        } elsif ($title) {
            if ($linecnt++ == 0) {
                print "\t\t<description><![CDATA[\n";
            }
            last if (/^\t<footer>/);
            s/<a href="\.\.\/\.\.\/\.\.\//<a href="https:\/\/eklausmeier\.goip\.de\//g;
            s/<a href="\.\.\/\.\.\/2/<a href="https:\/\/eklausmeier\.goip\.de\/blog\/2/g;
            s/<img src="\.\.\/\.\.\/\.\.\/img\//<img src="https:\/\/eklausmeier\.goip\.de\/img\//g;
            print $_ . "\n";
        }
    }
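    # Close the CDATA section only if it was opened, i.e., if content followed the <h1>
    print "\t\t]]></description>\n" if ($linecnt > 0);
    print "\t</item>\n";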

    close(F) || die("Cannot close $f");
}


# Iterate over the file names directly; <@ARGV> would run them through glob()
foreach my $f (@ARGV) {
    item($f);
}


print "</channel>\n</rss>\n";
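The date handling is the least obvious part of the script, so here is a small sketch of it in isolation. It assumes the date line looks like the one at the top of this post, wrapped in a `<p class=...>` element; the class name `dateline` and the function name `pub_date` are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);

my %monthNr = (	# full month name to month-number minus one, as in bloghtmlrss
    "January" => 0, "February" => 1, "March" => 2, "April" => 3,
    "May" => 4, "June" => 5, "July" => 6, "August" => 7,
    "September" => 8, "October" => 9, "November" => 10, "December" => 11
);

# Hypothetical helper: turn an HTML date line into an RFC-822 style date string.
# Returns undef if the line does not match or the month name is unknown.
sub pub_date {
    my ($line) = @_;
    # (\d+) is the day, the two dots skip the ordinal suffix ("st", "nd", "rd", "th")
    return unless $line =~ /^\s*<p class=.+?>(\d+)..\s+(\w+)\s+(\d\d\d\d)<\/p>/;
    my ($day, $mon, $year) = ($1, $monthNr{$2}, $3);
    return unless defined $mon;
    # The HTML carries no time of day, so noon is used, as in the script.
    # strftime normalizes its arguments, which fills in the weekday for %a.
    return strftime("%a, %d %b %Y %H:%M:%S GMT", 0, 0, 12, $day, $mon, $year - 1900);
}

my $d = pub_date('<p class="dateline">17th January 2022</p>');
print defined $d ? "$d\n" : "no date found\n";
```

The weekday and month abbreviations produced by `%a` and `%b` may depend on the active locale, so it is probably safest to run the script in the C or an English locale when generating the feed.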

The source code for bloghtmlrss is on GitHub.

I checked the validity of the generated RSS with the W3C Feed Validation Service.