Friday, July 4, 2008

Publishing LaTeX to the Web

It's gross.

With that in mind, I'll say a few words about how I got somewhat nice-looking mathematical notation into Blogspot a few days ago. I originally wrote that entry in TeXShop and put it on my website as a PDF file. While this shares it with my fans, PDFs have some disadvantages compared with HTML in a blog entry:

* A PDF sits outside my blog. An HTML entry is integrated into the blog and benefits from search engine crawling, incoming links, tags, and appearing alongside the rest of my recent writing.

* PDFs are less accessible than HTML. A PDF is often more work for the reader to download if the browser doesn't display PDFs inline, and it can be harder to search or to cut and paste from.

Let me mention two of my workflow principles:

* There must be one master version of a document. All edits should be made to this copy, and changes are propagated to all other formats and versions as automatically as possible.

* The master version should be in the format where it is easiest and most natural to express oneself. For example, LaTeX makes it many times easier to create equations such as those appearing in my most recent posting mentioned above.

It was clear I had to do the editing in LaTeX, and develop a way to publish to Blogspot. Below is the Bash script I came up with. It has very rough edges. I believe in programming with a focus on a particular problem. As I use the programs in this entry in the future, they will doubtlessly improve greatly. In the meantime, any comments are welcome. Be warned that Blogspot's narrow text column may be wrapping some of the lines.

#!/bin/bash

tex_file=$1
name=`cut -f1 -d'.' <<< $tex_file`
echo Operating on $name...
rm -f -- $name-trimmed.html.txt temp.html links.sh
latex2html -no_navigation -auto_prefix -split 0 -info "" -noaddress $tex_file
xmllint --format --html $name/index.html > index.html.new
mv -f index.html.new $name/index.html
arg="'s/src="'"'$name'-img/src="http:\/\/jonathanwellons.com\/'$name'\/'$name'-img/g'"'"
echo $arg
echo perl -pi -e $arg $name/index.html >> links.sh
chmod 755 links.sh
./links.sh
purge-html.pl $name/index.html >> $name-trimmed.html.txt
rsync -vaz $name/ jonathan@my_servers_ip:/var/www/jonathan/$name
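
One of the rough edges is that the script never checks its argument. Here is a minimal sketch of a guard that could sit near the top. It assumes the script is saved as publish.sh, which is a hypothetical name, and check_args is a helper introduced here for illustration only:

```shell
#!/bin/bash

# Hypothetical guard: succeed only when exactly one readable .tex file is given.
check_args() {
    if [ "$#" -ne 1 ]; then
        echo "usage: publish.sh file.tex" >&2   # publish.sh is an assumed name
        return 1
    fi
    case $1 in
        *.tex) ;;                               # extension looks right
        *) echo "expected a .tex file, got: $1" >&2; return 1 ;;
    esac
    if [ ! -r "$1" ]; then
        echo "cannot read $1" >&2
        return 1
    fi
}
```

In the script itself this would be called as check_args "$@" || exit 1 before anything is deleted or generated.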


Let's step through the lines one by one.


#!/bin/bash

This tells the system to run the script with bash.

tex_file=$1
name=`cut -f1 -d'.' <<< $tex_file`
echo Operating on $name...

This segment stores the first command-line argument, the name of the LaTeX file to convert, in the variable tex_file. It then keeps everything before the first dot, which trims off the extension, and saves the result as the name of the project.
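
Because cut keeps only the text before the first dot, a file named my.notes.tex would be cut down to my. A minimal sketch of an alternative using bash parameter expansion, assuming the file name ends in a single extension such as .tex; project_name is a hypothetical helper, not part of the script above:

```shell
#!/bin/bash

# Hypothetical helper: derive the project name from a LaTeX file name.
project_name() {
    local tex_file=$1
    # ${var%.*} removes the shortest trailing match of ".*",
    # that is, only the last extension.
    printf '%s\n' "${tex_file%.*}"
}

project_name "barrels-of-gold.tex"   # prints barrels-of-gold
project_name "my.notes.tex"          # prints my.notes
```

In the script this would replace the cut line with name=$(project_name "$tex_file").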

rm -f -- $name-trimmed.html.txt temp.html links.sh
This clears out the temporary files we'll use later.

latex2html -no_navigation -auto_prefix -split 0 -info "" -noaddress $tex_file
This is the most important part. There is a pre-existing program called latex2html that does just what its name suggests. Here I ask it for a single page (-split 0) with no navigation panels (-no_navigation). It doesn't do a perfect job, and Blogspot has a number of idiosyncrasies of its own, so most of the rest of the script accommodates the personality of this wonderful (seriously) program.

xmllint --format --html $name/index.html > index.html.new
mv -f index.html.new $name/index.html

xmllint is an excellent program that cleans up XML and HTML. I use it here to make the HTML much nicer to look at and process later.

arg="'s/src="'"'$name'-img/src="http:\/\/jonathanwellons.com\/'$name'\/'$name'-img/g'"'"
echo $arg
echo perl -pi -e $arg $name/index.html >> links.sh
chmod 755 links.sh
./links.sh

One of Blogspot's weaknesses (to me) is image uploading. It seems that you can only upload five at a time, and even without that restriction, I wouldn't put up with the hassle of uploading the dozens of images needed for a blog entry generated from LaTeX. This is especially true because many of them could change as the source document is corrected or refined. Instead, I host the images on my personal site and change the links produced by latex2html to use these new URLs.

purge-html.pl $name/index.html >> $name-trimmed.html.txt
Many of the HTML tags produced by latex2html are redundant or interact poorly with Blogspot. purge-html.pl is a program I wrote to strip most HTML tags; its source is in the appendix at the end of this entry. I originally set out to use Perl's HTML::Parser module, but it turned out to be easier to do the parsing myself. The result is a file that can be opened in any text editor and copied into the Blogspot editing interface. For my last entry, it was barrels-of-gold-trimmed.html.txt.

rsync -vaz $name/ jonathan@my_servers_ip:/var/www/jonathan/$name
Finally, this pushes the project directory, including all the images produced by latex2html, up to my server.

Any comments or suggestions for improvement are welcome (I am sure there is plenty to improve). In particular, if someone knows a better way to clean up the spacing and get rid of the borders that Blogspot's CSS adds to the images, that would be helpful.

Appendix: purge-html.pl



#!/usr/bin/perl -w

use strict;

my $file = shift;
my $okay_tags = {
    'img' => 1, 'table' => 1,
    'td' => 1, 'i' => 1,
    'tr' => 1, 'div' => 1,
    'a' => 1, 'h2' => 1,
};
my $no_content_tags = {
    'title' => 1, 'h1' => 1,
};

open(my $in, '<', $file) or die "Couldn't open $file: $!";
undef $/;    # slurp mode: read the whole file in one go
my $data = <$in>;
close($in);
$data =~ s/<!--[^>]*>//gs;    # drop HTML comments
my $lines = [ split('<', $data) ];
my $result = '';
foreach my $line (@$lines) {
    $line = '<' . $line;
    if ($line =~ /^<$/) {next;}
    if ($line =~ /^<!/) {next;}

    $line =~ /<(\/?)([^ >]*)([^>]*)>(.*)/s;
    my $end = $1;
    my $tag = $2;
    my $rest = $3;
    my $content = $4;

    my $temp_result = '';

    if ($tag =~ /^a$/) {
        if ($rest =~ /(.*")(.*)(#.*)/) {
            $rest = "$1$3";
        }
        $result =~ s/\n$/ /;
    }

    if ($okay_tags->{$tag}) {
        $temp_result = '<' . $end . $tag . $rest .'>';
    }
    unless($no_content_tags->{$tag}) {
        $temp_result .= $content;
    }
    if ($tag =~ /^a$/ || $content =~ /^$/) {
        $temp_result =~ s/\n//g;
    }
    $result .= $temp_result;
}

$result =~ s/\n\n*/\n/sg;
$result =~ s/^\n//g;
print $result;

2 comments:

Anonymous said...

Dude, one word: MathML.

Do you read Jacques Distler at UT? He's involved a lot in the XHTML/HTML5/Atom scene, and as you can see on his blog, gets away with some pretty sick formulas.

Anonymous said...

Thank you for such an interesting post!