Example Theme for Simplified Saaze: Lemire

Another theme for Simplified Saaze, called "Lemire". You can inspect it here. This theme is modeled after the blog of Daniel Lemire. That blog is powered by WordPress, hosted on SiteGround, and has been performance-enhanced by Cloudflare since 2019. Prof. Lemire started blogging in 2004. The number of posts per year is given in the table below. Year 2023 is not complete.

Year        04   05   06   07   08   09   10   11   12   13   14   15   16   17   18   19   20   21   22   23
#posts     118  267  224  217  196  104   67   63   53   64   55   59   81  132  123  112   85   66   58   80
#comments  223  458  215  361  647  836  892  743  888  903  744  656 1340 1165 1005 1269  832  560  501  671

The post counts are obtained with:

for i in `seq 2004 2023`; do grep 'h2 class="entry-title"' b*.html | grep -c me/blog/$i/; done

In total there are 2,224 blog posts over 20 years of continuous blogging. The blog is clearly updated on a regular basis, and many readers interact with the content.
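As a quick sanity check, the total can be recomputed from the #posts row of the table. This is only a small sketch; the variable names `posts` and `total` are chosen here for illustration.

```shell
# Per-year post counts, copied from the #posts row of the table (2004 to 2023)
posts="118 267 224 217 196 104 67 63 53 64 55 59 81 132 123 112 85 66 58 80"

# Add up the yearly counts
total=0
for n in $posts; do
    total=$((total + n))
done

echo "$total"   # prints 2224
```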

Prof. Lemire values having control over his blog and therefore doesn't use Medium or similar services. Some key characteristics of the blog:

  1. Allows WordPress comments
  2. Informs e-mail subscribers about new posts; he has over 12,500 e-mail subscribers
  3. Provides search functionality on his blog
  4. Doesn't show any advertisements
  5. Provides an Atom RSS feed
  6. Blog posts are all in English
  7. Doesn't use categories or tags
  8. Doesn't use the <!--more--> tag
  9. WordPress theme is based on "Twenty-Fifteen"
  10. There is no regular sitemap.xml for the blog posts

1. Converting WordPress blog. Download all blog posts via the Perl script bloglemirecurl. This script downloads the so-called "pages", each of which contains 20 blog posts. Each such HTML file is then converted to Markdown via the Perl script bloglemiremd.

bloglemiremd b*.html

The Markdown files are placed in /tmp/lemire. As usual you might need a few rounds to eliminate obvious conversion errors. Finally you copy the Markdown files from /tmp/lemire to your final destination.

There are 14 blog posts that reside at the top of the directory and are not part of the timeline. These posts are accessed via the left navigation bar (in blue). To convert these posts use

bloglemiremd -t *-*.html pred*.html

Again, the resulting Markdown files are stored under /tmp/lemire for inspection. Once you are fine with them, copy them to the final destination.

Go to .../content/blog and run the loop below, which uses blogdate to create an index.md for each year:

for i in `seq 2004 2023`; do blogdate -p/lemire/blog/ -y$i $i/*.md > $i/index.md; done

Embedding icon in head-template file:

  1. Download icon: curl https://lemire.me/blog/wp-content/uploads/2015/10/profile2011_152-150x150.jpg -o pr.jpg
  2. Converting to 32x32 size: convert -resize 32x32 pr.jpg pr32x32.jpg
  3. Base64-encoding file: base64 -w0 pr32x32.jpg

Size comparison for this icon: original JPG is 6,699 bytes, converted image is 934 bytes, base64-encoded is finally 1,248 bytes.
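The base64 size follows directly from the encoding: every 3 input bytes become 4 output characters, and the last group is padded. The sketch below recomputes the figure for the 934-byte image and cross-checks it against the `base64` tool using 934 dummy zero bytes, since the actual image file is not assumed to be present.

```shell
# Base64 maps each group of 3 bytes to 4 characters, padding the last group,
# so the encoded length is 4 * ceil(n / 3).
raw_size=934
b64_size=$(( (raw_size + 2) / 3 * 4 ))
echo "$b64_size"   # prints 1248

# Cross-check with the base64 tool on 934 dummy bytes (not the real icon);
# line breaks are removed so that only the encoded characters are counted.
actual=$(head -c "$raw_size" /dev/zero | base64 | tr -d '\n' | wc -c)
echo "$actual"
```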

2. Installation. The entire theme including content and Simplified Saaze is installed via composer.

$ time composer create-project eklausme/saaze-lemire
Creating a "eklausme/saaze-lemire" project at "./saaze-lemire"
Installing eklausme/saaze-lemire (v1.0)
  - Downloading eklausme/saaze-lemire (v1.0)
  - Installing eklausme/saaze-lemire (v1.0): Extracting archive
Created project in /tmp/saaze-lemire
Loading composer repositories with package information
Updating dependencies
Lock file operations: 1 install, 0 updates, 0 removals
  - Locking eklausme/saaze (v1.34)
Writing lock file
Installing dependencies from lock file (including require-dev)
Package operations: 1 install, 0 updates, 0 removals
  - Downloading eklausme/saaze (v1.34)
  - Installing eklausme/saaze (v1.34): Extracting archive
Generating optimized autoload files
No security vulnerability advisories found.
        real 1.85s
        user 0.27s
        sys 0
        swapped 0
        total space 0

You need to compile a single C file once:

cd vendor/eklausme/saaze
cc -fPIC -Wall -O2 -shared php_md4c_toHtml.c -o php_md4c_toHtml.so -lmd4c-html

Now you can run php saaze.

As mentioned, Simplified Saaze is already installed by the composer command above. In case you want to look at the Simplified Saaze source code separately, see saaze.

3. Building static site. Running Simplified Saaze on all 2,224 blog posts:

saaze-lemire: time php saaze -rb /tmp/build
Building static site in /tmp/build...
        execute(): filePath=/home/klm/php/saaze-lemire/content/blog.yml, nentries=2224, totalPages=112, entries_per_page=20
Finished creating 1 collections, 1 with index, and 2259 entries (0.39 secs / 22.55MB)
#collections=1, YamlParser=0.0314/2260-1, md2html=0.0362, MathParser=0.0167/2259, renderEntry=2259, content=2259/0, excerpt=0/0
        real 0.41s
        user 0.26s
        sys 0
        swapped 0
        total space 0

In less than half a second the generation of all static files is completed. Machine in question: CPU is Ryzen 7 5700G, max clock 4.6 GHz, running on Arch Linux with kernel 6.6.8.
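The totalPages=112 value in the log is consistent with a ceiling division of the number of entries by the entries per page. The snippet below only reproduces that arithmetic; it is an assumption that Simplified Saaze computes the value this way internally.

```shell
# Values taken from the execute() line of the build log
nentries=2224
entries_per_page=20

# Ceiling division: the last page may hold fewer than 20 entries
total_pages=$(( (nentries + entries_per_page - 1) / entries_per_page ))
echo "$total_pages"   # prints 112
```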

A screenshot of the theme is here:

Photo

The screenshot shows the results of a search, here for "WordPress".

The theme also features Pagefind. I have written on Pagefind: Searching in Static Sites. Creating the Pagefind index goes like this:

/tmp/build: time pagefind -s . --exclude-selectors aside --exclude-selectors footer

Running Pagefind v1.0.4
Running from: "/tmp/build"
Source:       ""
Output:       "pagefind"

[Walking source directory]
Found 2372 files matching **/*.{html}

[Parsing files]
Did not find a data-pagefind-body element on the site.
↳ Indexing all <body> elements on the site.

[Reading languages]
Discovered 1 language: en

[Building search indexes]
Total:
  Indexed 1 language
  Indexed 2372 pages
  Indexed 29164 words
  Indexed 0 filters
  Indexed 0 sorts

Finished in 5.325 seconds
        real 5.43s
        user 4.50s
        sys 0
        swapped 0
        total space 0

Index creation, at about 5.4 seconds, is more than ten times slower than generating all static pages (0.41 seconds).

4. Webserver rewrite rules. The conversion from WordPress to Markdown placed all blog posts from one year into a single directory at the same level. For example, the post

https://lemire.me/blog/2006/01/03/are-debuggers-obselete/

is in directory .../content/blog/2006 and in file

01-03-are-debuggers-obselete.md

On my webserver both of the following URLs work; note the dash vs. slash:

  1. https://eklausmeier.goip.de/lemire/blog/2006/01-03-are-debuggers-obselete
  2. https://eklausmeier.goip.de/lemire/blog/2006/01/03/are-debuggers-obselete

This is accomplished by the rewrite rule below in the NGINX configuration file:

rewrite "^/lemire/blog/(\d\d\d\d)/(\d\d)/(\d\d)/(.*)"  "/lemire/blog/$1/$2-$3-$4";

Instead of using the above rewrite rule, one could place the above Markdown file in the following directory

.../content/blog/2006/01/03

But this would create a lot of directories, almost all of which would contain only a single file.
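The effect of the NGINX rewrite rule can be tried out on the command line by emulating it with sed. This is only an illustration: the function name `rewrite_path` is made up here, and `\d` is written as `[0-9]` because sed uses POSIX extended regular expressions.

```shell
# Emulate the NGINX rewrite: /yyyy/mm/dd/title becomes /yyyy/mm-dd-title
rewrite_path() {
    printf '%s\n' "$1" |
        sed -E 's#^/lemire/blog/([0-9]{4})/([0-9]{2})/([0-9]{2})/(.*)#/lemire/blog/\1/\2-\3-\4#'
}

# The WordPress-style path with slashes is rewritten ...
rewrite_path /lemire/blog/2006/01/03/are-debuggers-obselete
# ... while the path with dashes does not match and passes through unchanged
rewrite_path /lemire/blog/2006/01-03-are-debuggers-obselete
```

Both calls print /lemire/blog/2006/01-03-are-debuggers-obselete, which is why both URL forms end up at the same file.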

5. Fetching comments from WordPress. The Perl script bloglemirecurlcomment scans through the above "pages", i.e., collections of 20 blog posts. Each page contains the URLs of its 20 posts, and these URLs are fetched via curl. Essentially, this duplicates the blog posts, but at least we now have the comments for each post as well.

for i in `seq 1 112`; do bloglemirecurlcomment ../b$i.html; done

These HTML files are then processed by bloglemirecomment, which scans for <h2 class="comments-title"> and writes out the comment file. Each comment file is generated from the original blog post file by adding the word -comment- to the file name after the day.

Type          File name
Blog post     /blog/yyyy/mm/dd/title.html
Comment file  /blog/yyyy/mm-dd-comment-title.md
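The naming scheme can be expressed as a single substitution. The sketch below is not the bloglemirecomment script itself; the function name `comment_file` is hypothetical and merely illustrates the mapping from a blog post path to its comment file.

```shell
# Map /blog/yyyy/mm/dd/title.html to /blog/yyyy/mm-dd-comment-title.md
comment_file() {
    printf '%s\n' "$1" |
        sed -E 's#^/blog/([0-9]{4})/([0-9]{2})/([0-9]{2})/(.*)\.html$#/blog/\1/\2-\3-comment-\4.md#'
}

comment_file /blog/2006/01/03/are-debuggers-obselete.html
# prints /blog/2006/01-03-comment-are-debuggers-obselete.md
```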

Each comment file has index: false, i.e., it will not show up in the index. However, all content remains fully searchable.

In addition, the Perl script blogdate adds a link to each comment file. It is called like this:

for i in `seq 2004 2023`; do ( cd $i; ~/php/saaze-lemire/bin/blogdate -y$i *.md > index.md ) done

The number of comments per year is counted with:

#!/bin/perl -W
# Count comments per year

use strict;

my ($year,%H) = (0,());

while (<>) {
    # The canonical link of each post carries the year of publication
    $year = $1 if (/<link rel="canonical" href="https:\/\/lemire\.me\/blog\/(\d\d\d\d)\/(\d\d)\/(\d\d)\//);
    # WordPress prints either "One thought on" or "<n> thoughts on"
    if (/(\w+) thoughts? on &ldquo;/) {
        my $cnt = $1;
        $cnt = 1 if ($cnt eq 'One');
        $H{$year} += $cnt;
    }
}

# Print year and comment count, tab-separated
for (sort keys %H) {
    printf("%04d\t%d\n",$_,$H{$_});
}

6. Building static site with separate comment pages. Generating all static pages for the entire blog, including comments, looks like this:

saaze-lemire: time php saaze -rb /tmp/build
Building static site in /tmp/build...
        execute(): filePath=/home/klm/php/saaze-lemire/content/blog.yml, nentries=2224, totalPages=112, entries_per_page=20
Finished creating 1 collections, 1 with index, and 3935 entries (0.89 secs / 66.49MB)
#collections=1, YamlParser=0.0630/3936-1, md2html=0.0895, MathParser=0.0575/3935, renderEntry=3935, content=3935/0, excerpt=0/0
        real 0.91s
        user 0.56s
        sys 0
        swapped 0
        total space 0

This time can be reduced to 0.46 seconds, see Parallelizing the Output of Simplified Saaze.

Generating the Pagefind index for 4048 files takes roughly 12 seconds:

/tmp/build: time pagefind -s . --exclude-selectors aside --exclude-selectors footer

Running Pagefind v1.0.4
Running from: "/tmp/build"
Source:       ""
Output:       "pagefind"

[Walking source directory]
Found 4048 files matching **/*.{html}

[Parsing files]
Did not find a data-pagefind-body element on the site.
↳ Indexing all <body> elements on the site.

[Reading languages]
Discovered 1 language: en

[Building search indexes]
Total:
  Indexed 1 language
  Indexed 4048 pages
  Indexed 60783 words
  Indexed 0 filters
  Indexed 0 sorts

Finished in 11.412 seconds
        real 11.59s
        user 10.22s
        sys 0
        swapped 0
        total space 0

Simplified Saaze can also generate single files, i.e., process only a single blog post; see Single file generation. This can be used to significantly reduce the generation time.

7. HTML validation. The original site lemire.me contains more than 90 warnings and errors. See W3C Nu Html Checker.

The new site contains no errors or warnings.

8. Recap. Prof. Lemire is quite hesitant to move to a fully static site:

Several commenters pointed out that I could just drop WordPress and use something else. I fear that they greatly underestimate how hard this would be. Yes, I know about things like Hugo. My relatively simple home page is built using Hugo… and it took me nearly took weeks of hacking to get it to be how I want. Porting my blog to something like Hugo would be a major disruption, might imply moving to disqus (see point above) and so forth.

Porting Prof. Lemire's blog started on 12-Dec-2023 and was "finished" on 14-Jan-2024, including porting all comments to HashOver. Of course, I did not work on this full-time.

There are still some open issues pending regarding conversion and functionality:

  1. Some pages have wrong formatting, e.g., there is bold printing in the converted site not present in the original.
  2. Left and right double quotes have been converted to HTML codes. Entering those is not very convenient. We clearly want SmartyPants.
  3. Five URLs were not correctly mapped as they contain special characters.
  4. E-mail subscriptions are absent. I doubt that there are really 12,500 active subscribers, but there are probably many who want to be notified when something new arrives. One possible approach is to use Buttondown. For example, Buttondown can send e-mails based on RSS, see the screenshot below from the "Settings" dialog in Buttondown.

Tool              Purpose                Technology
Simplified Saaze  Static site generator  PHP, C
HashOver          Commenting system      PHP, XML/JSON/SQLite
Pagefind          Static search          JavaScript, Rust, WebAssembly