
MD4C PHP Extension

This blog uses MD4C to convert Markdown to HTML. So far I used PHP FFI to link PHP with the MD4C C library. FFI stands for "Foreign Function Interface"; it allows calling C functions from PHP without writing a PHP extension. Using FFI is very easy.

Previous profiling measurements with XHProf and phpspy indicated that handling the return value from MD4C via FFI::string() takes some time. So I replaced the FFI binding with a "real" PHP extension and measured again. Result: no difference between FFI and the PHP extension. So the profiling measurements were misleading.

Also the following claim in the PHP manual is downright false:

it makes no sense to use the FFI extension for speed; however, it may make sense to use it to reduce memory consumption.

Nevertheless, writing a PHP extension was a good exercise.

Literature on writing PHP extensions:

  1. Sara Golemon: Extending and Embedding PHP, Sams Publishing, 2006, xx+410 p.
  2. PHP Internals: Zend extensions
  3. https://github.com/dstogov/php-extension

The PHP extension code is on GitHub: php-md4c.

1. Walk through the C code. For this simple extension there is no need for a separate header file. The extension starts with basic includes for PHP, for the phpinfo(), and for MD4C:

// MD4C extension for PHP: Markdown to HTML conversion

#ifdef HAVE_CONFIG_H
#include "config.h"
#endif

#include <php.h>
#include <ext/standard/info.h>
#include <md4c-html.h>

The following code is directly from the FFI part php_md4c_toHtml.c:

struct membuffer {
    char* data;
    size_t asize;	// allocated size = max usable size
    size_t size;	// current size
};

The following routines are almost the same as in the FFI case, except that memory is allocated with safe_pemalloc() instead of native malloc(). The last argument, 1, requests persistent memory, i.e. memory that survives across requests. In our case this doesn't make any difference.

static void membuf_init(struct membuffer* buf, MD_SIZE new_asize) {
    buf->size = 0;
    buf->asize = new_asize;
    if ((buf->data = safe_pemalloc(buf->asize,sizeof(char),0,1)) == NULL)
        php_error_docref(NULL, E_ERROR, "php-md4c.c: membuf_init: safe_pemalloc() failed with asize=%ld.\n",(long)buf->asize);
}

Next routine uses safe_perealloc() instead of realloc().

static void membuf_grow(struct membuffer* buf, size_t new_asize) {
    // element size is sizeof(char); sizeof(char*) would over-allocate by the pointer size
    buf->data = safe_perealloc(buf->data, new_asize, sizeof(char), 0, 1);
    if (buf->data == NULL)
        php_error_docref(NULL, E_ERROR, "php-md4c.c: membuf_grow: safe_perealloc() failed, new_asize=%ld.\n",(long)new_asize);
    buf->asize = new_asize;
}
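
As a rough illustration of what the "safe" allocators add over plain malloc(): they compute nmemb * size + offset and refuse requests where that arithmetic would overflow. The sketch below mimics this in plain C. checked_alloc() and checked_alloc_selfcheck() are hypothetical names, and returning NULL on overflow is an assumption of this sketch (PHP itself raises a fatal error instead).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an overflow-checked allocator:
 * allocates nmemb * size + offset bytes, or returns NULL
 * if that expression would not fit into a size_t. */
static void *checked_alloc(size_t nmemb, size_t size, size_t offset) {
    if (size != 0 && nmemb > (SIZE_MAX - offset) / size)
        return NULL;                      /* multiplication or addition would overflow */
    return malloc(nmemb * size + offset);
}

/* Small self-check: a sane request succeeds, an overflowing one is refused. */
static void checked_alloc_selfcheck(void) {
    void *p = checked_alloc(16, sizeof(char), 1);
    assert(p != NULL);
    free(p);
    assert(checked_alloc(SIZE_MAX, 2, 0) == NULL);
}
```

For a byte buffer the element size is 1, so the check only matters for very large requests, which fits the remark that it makes no difference here.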

The rest is identical to FFI.

static void membuf_append(struct membuffer* buf, const char* data, MD_SIZE size) {
    if (buf->asize < buf->size + size)
        membuf_grow(buf, buf->size + buf->size / 2 + size);
    memcpy(buf->data + buf->size, data, size);
    buf->size += size;
}

static void process_output(const MD_CHAR* text, MD_SIZE size, void* userdata) {
    membuf_append((struct membuffer*) userdata, text, size);
}

static struct membuffer mbuf = { NULL, 0, 0 };
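
To try the growth logic outside of PHP, here is a minimal stand-alone sketch of the same idea with plain realloc(). The names demo_buf, demo_buf_append() and demo_output() are hypothetical and not part of php-md4c; the growth rule (current size plus half of it plus the requested bytes) follows membuf_append() above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-alone variant of the membuffer, using plain realloc(). */
struct demo_buf {
    char  *data;
    size_t asize;   /* allocated size */
    size_t size;    /* bytes currently in use */
};

/* Append n bytes; grow to size + size/2 + n when the buffer is too small.
 * Returns 0 on success, -1 if realloc() fails (buffer left unchanged). */
static int demo_buf_append(struct demo_buf *b, const char *src, size_t n) {
    if (b->asize < b->size + n) {
        size_t new_asize = b->size + b->size / 2 + n;
        char *p = realloc(b->data, new_asize);   /* realloc(NULL, n) acts like malloc(n) */
        if (p == NULL) return -1;
        b->data = p;
        b->asize = new_asize;
    }
    memcpy(b->data + b->size, src, n);
    b->size += n;
    assert(b->size <= b->asize);
    return 0;
}

/* Callback in the shape of process_output(): userdata carries the buffer. */
static void demo_output(const char *text, unsigned size, void *userdata) {
    (void) demo_buf_append((struct demo_buf *) userdata, text, size);
}
```

Starting from an empty buffer, two calls demo_output("<p>", 3, &b) and demo_output("hi</p>", 6, &b) leave nine bytes in b.data. Resetting b.size to 0 reuses the allocation for the next document, which is what the extension does with its static mbuf.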

Now we come to something PHP specific. We wrap the C function in PHP_FUNCTION. The arguments are parsed with ZEND_PARSE_PARAMETERS_START(1, 2): the function takes at least one and at most two arguments, i.e. the second one is optional. The result is copied into a PHP string with RETVAL_STRINGL(), which takes an explicit length, so no NUL terminator and no extra estrndup() copy is needed. In the FFI case we just return a pointer to a string.

/* {{{ string md4c_toHtml( string $markdown, [ int $flag ] )
 */
PHP_FUNCTION(md4c_toHtml) {	// return HTML string
    char *markdown;
    size_t markdown_len;
    int ret;
    zend_long flag = MD_DIALECT_GITHUB | MD_FLAG_NOINDENTEDCODEBLOCKS;	// Z_PARAM_LONG expects zend_long

    ZEND_PARSE_PARAMETERS_START(1, 2)
        Z_PARAM_STRING(markdown, markdown_len)
        Z_PARAM_OPTIONAL
        Z_PARAM_LONG(flag)
    ZEND_PARSE_PARAMETERS_END();

    if (mbuf.asize == 0) membuf_init(&mbuf,16777216);	// =16MB

    mbuf.size = 0;	// reuse the buffer from the previous call
    ret = md_html(markdown, markdown_len, process_output,
        &mbuf, (MD_SIZE)flag, 0);
    if (ret < 0) {
        // sizeof() counts the trailing NUL, hence the -1
        RETVAL_STRINGL("<br>- - - Error in Markdown - - -<br>\n",sizeof("<br>- - - Error in Markdown - - -<br>\n") - 1);
    } else {
        RETVAL_STRINGL(mbuf.data, mbuf.size);	// copies the bytes into a new PHP string
    }
}
/* }}}*/
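
One detail that is easy to get wrong with length-based macros such as RETVAL_STRINGL(): sizeof on a string literal counts the terminating NUL byte, while the PHP string length should not include it. A minimal sketch in plain C, with ERR_MSG and err_msg_len() as hypothetical names used only for this illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ERR_MSG "<br>- - - Error in Markdown - - -<br>\n"

/* Byte count to hand to a length-based API: the trailing NUL is excluded. */
static size_t err_msg_len(void) {
    size_t len = sizeof(ERR_MSG) - 1;    /* sizeof includes the NUL, so subtract it */
    assert(len == strlen(ERR_MSG));      /* strlen() also excludes the NUL */
    return len;
}
```

Passing sizeof(ERR_MSG) without the subtraction would make the returned PHP string one byte longer, ending in an invisible "\0".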

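The following two PHP-extension-specific functions handle module initialization and shutdown. The diagram from PHP Internals shows the sequence of initialization and shutdown: MINIT (module startup) comes first, then RINIT and RSHUTDOWN for every request, and finally MSHUTDOWN (module shutdown).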

Init: Do nothing.

/* {{{ PHP_MINIT_FUNCTION
 */
PHP_MINIT_FUNCTION(md4c) {	// module initialization
    //REGISTER_INI_ENTRIES();
    //php_printf("In PHP_MINIT_FUNCTION(md4c): module initialization\n");

    return SUCCESS;
}
/* }}} */

Shutdown: Free the buffer, if one was allocated.

/* {{{ PHP_MSHUTDOWN_FUNCTION
 */
PHP_MSHUTDOWN_FUNCTION(md4c) {	// module shutdown
    if (mbuf.data) pefree(mbuf.data,1);
    return SUCCESS;
}
/* }}} */

The following function prints out information when called via phpinfo().

/* {{{ PHP_MINFO_FUNCTION
 */
PHP_MINFO_FUNCTION(md4c) {
    php_info_print_table_start();
    php_info_print_table_row(2, "MD4C", "enabled");
    php_info_print_table_row(2, "PHP-MD4C version", "1.0");
    php_info_print_table_row(2, "MD4C version", "0.5.2");
    php_info_print_table_end();
}
/* }}} */

The output is a small table with three rows: MD4C: enabled, PHP-MD4C version: 1.0, and MD4C version: 0.5.2.

The arginfo structures below describe the argument list of each function.

/* {{{ arginfo
 */
ZEND_BEGIN_ARG_INFO(arginfo_md4c_test, 0)
ZEND_END_ARG_INFO()

ZEND_BEGIN_ARG_INFO_EX(arginfo_md4c_toHtml, 0, 0, 1)	// last argument: number of required arguments
    ZEND_ARG_INFO(0, str)
    ZEND_ARG_INFO_WITH_DEFAULT_VALUE(0, flag, "MD_DIALECT_GITHUB | MD_FLAG_NOINDENTEDCODEBLOCKS")
ZEND_END_ARG_INFO()
/* }}} */

/* {{{ php_md4c_functions[]
 */
static const zend_function_entry php_md4c_functions[] = {
    PHP_FE(md4c_toHtml,	arginfo_md4c_toHtml)
    PHP_FE_END
};
/* }}} */

The zend_module_entry is the usual boilerplate. It ties together everything defined above.

/* {{{ md4c_module_entry
 */
zend_module_entry md4c_module_entry = {
    STANDARD_MODULE_HEADER,
    "md4c",						// Extension name
    php_md4c_functions,			// zend_function_entry
    NULL,	//PHP_MINIT(md4c),	// PHP_MINIT - Module initialization
    PHP_MSHUTDOWN(md4c),		// PHP_MSHUTDOWN - Module shutdown
    NULL,						// PHP_RINIT - Request initialization
    NULL,						// PHP_RSHUTDOWN - Request shutdown
    PHP_MINFO(md4c),			// PHP_MINFO - Module info
    "1.0",						// Version
    STANDARD_MODULE_PROPERTIES
};
/* }}} */

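This innocent-looking statement is important: ZEND_GET_MODULE(md4c) exports the get_module() function, which PHP looks up when it loads the shared library. Without it you will get PHP Startup: Unable to load dynamic library.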

#ifdef COMPILE_DL_MD4C
# ifdef ZTS
ZEND_TSRMLS_CACHE_DEFINE()
# endif
#endif
ZEND_GET_MODULE(md4c)

2. M4 config file. A PHP extension requires a config.m4 file.

dnl config.m4 for php-md4c extension

PHP_ARG_WITH(md4c, [whether to enable MD4C support],
[  --with-md4c[[=DIR]]       Enable MD4C support.
                          DIR is the path to MD4C install prefix])

if test "$PHP_MD4C" != "no"; then

    AC_MSG_CHECKING([for md4c headers])
    for i in "$PHP_MD4C" "$prefix" /usr /usr/local; do
        if test -r "$i/include/md4c-html.h"; then
            PHP_MD4C_DIR=$i
            AC_MSG_RESULT([found in $i])
            break
        fi
    done
    if test -z "$PHP_MD4C_DIR"; then
        AC_MSG_RESULT([not found])
        AC_MSG_ERROR([Please install md4c])
    fi

    PHP_ADD_INCLUDE($PHP_MD4C_DIR/include)
    dnl recommended flags for compilation with gcc
    dnl CFLAGS="$CFLAGS -Wall -fno-strict-aliasing"

    export OLD_CPPFLAGS="$CPPFLAGS"
    export CPPFLAGS="$CPPFLAGS $INCLUDES -DHAVE_MD4C"
    AC_CHECK_HEADERS([md4c.h md4c-html.h], [], [AC_MSG_ERROR([md4c.h or md4c-html.h header not found])])
    #AC_CHECK_HEADER([md4c-html.h], [], AC_MSG_ERROR(['md4c-html.h' header not found]))
    PHP_SUBST(MD4C_SHARED_LIBADD)

    PHP_ADD_LIBRARY_WITH_PATH(md4c, $PHP_MD4C_DIR/$PHP_LIBDIR, MD4C_SHARED_LIBADD)
    PHP_ADD_LIBRARY_WITH_PATH(md4c-html, $PHP_MD4C_DIR/$PHP_LIBDIR, MD4C_SHARED_LIBADD)
    export CPPFLAGS="$OLD_CPPFLAGS"

    PHP_SUBST(MD4C_SHARED_LIBADD)
    AC_DEFINE(HAVE_MD4C, 1, [ ])
    PHP_NEW_EXTENSION(md4c, md4c.c, $ext_shared)
fi

3. Compiling. Run

phpize
./configure
make

The resulting shared library contains the following symbols:

$ nm md4c.so
0000000000002160 r arginfo_md4c_test
0000000000003d00 d arginfo_md4c_toHtml
                 w __cxa_finalize@GLIBC_2.2.5
00000000000040a0 d __dso_handle
0000000000003dc0 d _DYNAMIC
                 U _emalloc
                 U _emalloc_64
                 U _estrndup
00000000000016c8 t _fini
                 U free@GLIBC_2.2.5
00000000000016c0 T get_module
0000000000003fe8 d _GLOBAL_OFFSET_TABLE_
                 w __gmon_start__
00000000000021c8 r __GNU_EH_FRAME_HDR
0000000000001000 t _init
                 w _ITM_deregisterTMCloneTable
                 w _ITM_registerTMCloneTable
0000000000004180 b mbuf
00000000000040c0 D md4c_module_entry
                 U md_html
                 U memcpy@GLIBC_2.14
                 U php_error_docref
                 U php_info_print_table_end
                 U php_info_print_table_row
                 U php_info_print_table_start
0000000000003d60 d php_md4c_functions
                 U php_printf
0000000000001640 t process_output
0000000000001234 t process_output.cold
                 U _safe_malloc
                 U _safe_realloc
                 U __stack_chk_fail@GLIBC_2.4
                 U strlen@GLIBC_2.2.5
0000000000004168 d __TMC_END__
                 U zend_parse_arg_long_slow
                 U zend_parse_arg_str_slow
                 U zend_wrong_parameter_error
                 U zend_wrong_parameters_count_error
                 U zend_wrong_parameters_none_error
. . .
0000000000001380 T zif_md4c_toHtml
00000000000011cf t zif_md4c_toHtml.cold
0000000000001175 T zm_info_md4c
0000000000001350 T zm_shutdown_md4c
00000000000016b0 T zm_startup_md4c

4. Installing on Arch Linux. Copy the md4c.so library to /usr/lib/php/modules as root:

cp modules/md4c.so /usr/lib/php/modules

Finally activate the extension in php.ini:

extension=md4c

5. Notes on Windows. On Linux we use the installed MD4C library. As noted in Installing Simplified Saaze on Windows 10 #2, on Windows it is advisable to amalgamate all MD4C source files into a single file for easier compilation.