Sunday, July 6, 2008

Wordpress Blog

I'm experimenting with a Wordpress blog. So far the key advantage is that it does not impose the black boxes on my images.

Friday, July 4, 2008

Publishing Latex to the Web

It's gross.

With that in mind, I'll say a few words about how I got somewhat nice-looking mathematical notation into blogspot a few days ago. I originally wrote that entry in TexShop and put in on my website as a PDF file. While this shares it with my fans, there are some disadvantages to PDfs as opposed to HTML in a blog entry:

* As a blog entry, it is intregrated into my blog and it benefits from the search engine crawling, incoming links, tags and appearing with the rest of my recent writings.

* PDFs are less accessible than HTML. It's often more work for the reader to download it if your browser doesn't display PDFs inline and it can be harder to search or cut and paste.

Let me mention two of my workflow principles:

* There must be one master version of a document. All edits should be made to this copy, and changes are propagated to all other formats and versions as automatically as possible.

* The master version should be the format where is the easiest and most natural to express oneself. For example, LaTeX makes it many times easier to create equations such as those appearing in my most recent posting mentioned above.

It was clear I had to do the editing in LaTeX, and develop a way to publish to Blogspot. Below is the Bash script I came up with. It has very rough edges. I believe in programming with a focus on a particular problem. As I use the programs in this entry in the future, they will doubtlessly improve greatly. In the meantime, any comments are welcome. Be warned that Blogspot's narrow text column may be wrapping some of the lines.

#!/bin/bash

tex_file=$1
name=`cut -f1 -d'.' <<< $tex_file`
echo Operating on $name...
rm -f -- $name-trimmed.html.txt temp.html links.sh
latex2html -no_navigation -auto_prefix -split 0 -info "" -noaddress $tex_file
xmllint --format --html $name/index.html > index.html.new
mv -f index.html.new $name/index.html
arg="'s/src="'"'$name'-img/src="http:\/\/jonathanwellons.com\/'$name'\/'$name'-img/g'"'"
echo $arg
echo perl -pi -e $arg $name/index.html >> links.sh
chmod 755 links.sh
./links.sh
purge-html.pl $name/index.html >> $name-trimmed.html.txt
rsync -vaz $name/ jonathan@my_servers_ip:/var/www/jonathan/$name


Let's step through the lines one by one.


#!/bin/bash

This tells the shell to use bash to run the script.

tex_file=$1
name=`cut -f1 -d'.' <<< $tex_file`
echo Operating on $name...

This segment takes in a variable tex_file that holds the name of the Latex file to convert. It then trims off the extension and saves it as the name of the project.

rm -f -- $name-trimmed.html.txt temp.html links.sh
This clears out the temporary files we'll use later.

latex2html -no_navigation -auto_prefix -split 0 -info "" -noaddress $tex_file
This is the most important part. There is actually a pre-existing program called latex2html that does just what it suggests. It doesn't do a perfect job, and blogspot has a number of idiosyncrasies, most of the rest of the script is accommodating the personality of this wonderful (seriously) program.

xmllint --format --html $name/index.html > index.html.new
mv -f index.html.new $name/index.html

xmllint is an excellent program that cleans up XML and HTML. I use it here to make the HTML much nicer to look at and process later.

arg="'s/src="'"'$name'-img/src="http:\/\/jonathanwellons.com\/'$name'\/'$name'-img/g'"'"
echo $arg
echo perl -pi -e $arg $name/index.html >> links.sh
chmod 755 links.sh
./links.sh

One of Blogspot's weaknesses (to me) is image uploading. It seems that you can only upload five at a time, and even without that restriction, I wouldn't put up with the hassle of uplaoding the dozens of images needed for a blog entry generated from Latex. This is especially true because many of them could change as the source document is corrected or refined. Instead, I host the images on my personal site and change the links produced by latex2html to use these new URLs.

purge-html.pl $name/index.html >> $name-trimmed.html.txt
Much of the HTML tags produced by Latex2html are redundant or interact poorly with Blogspot. This is a program I wrote to strip most HTML tags, which can be found at the end of the file. I originally set out to use Perl's HTML::Parser module, but this turned out to be easier to do myself. This leaves a file that can be opened in any text editor and copied into the Blogspot editing interface. For my last entry, it was barrels-of-gold-trimmed.html.txt.

rsync -vaz $name/ jonathan@my_servers_ip:/var/www/jonathan/$name
Finally, push all the images produced by latex2html up to my server.

Any comments or ways to improve are welcome (I am sure there are many). In particular, if someone knows a way to better clean up the spacing and get rid of the borders on the images that Blogspot's CSS adds, that would be helpful.

Appendix: purge-html.pl



#!/usr/bin/perl -w

use strict;
use HTML::Parser;
use Data::Dumper;

my $file = shift;
my $okay_tags = {
    'img' => 1, 'table' => 1,
    'td' => 1, 'i' => 1,
    'tr' => 1, 'div' => 1,
    'a' => 1, 'h2' => 1,
};
my $no_content_tags = {
    'title' => 1, 'h1' => 1,
};

open(IN, "<$file") or die "Couldn't open $file: $!";
undef $/;
my $data = <IN>;
$data =~ s/<!--[^>]*>//gs;
my $lines = [ split('<', $data) ];
my $result = '';
foreach my $line (@$lines) {
    $line = '<' . $line;
    if ($line =~ /^<$/) {next;}
    if ($line =~ /^<!/) {next;}

    $line =~ /<(\/?)([^ >]*)([^>]*)>(.*)/s;
    my $end = $1;
    my $tag = $2;
    my $rest = $3;
    my $content = $4;

    my $temp_result = '';

    if ($tag =~ /^a$/) {
        if ($rest =~ /(.*")(.*)(#.*)/) {
            $rest = "$1$3";
        }
        $result =~ s/\n$/ /;
    }

    if ($okay_tags->{$tag}) {
        $temp_result = '<' . $end . $tag . $rest .'>';
    }
    unless($no_content_tags->{$tag}) {
        $temp_result .= $content;
    }
    if ($tag =~ /^a$/ || $content =~ /^$/) {
        $temp_result =~ s/\n//g;
    }
    $result .= $temp_result;
}

$result =~ s/\n\n*/\n/sg;
$result =~ s/^\n//g;
print $result;

Tuesday, July 1, 2008

Barrels of Gold

Problem


My friend Brian Marion posed a variant of the following problem:
Suppose I just robbed Fort Knox. I was in a rush and I put the gold in what I had available: 3 large barrels. I didn't keep track of how much I put in each one, I just threw bars into whichever was closest until the klaxons drove my nerves over the edge. Then I hammered down their lids and bailed - I can't even be sure that a particular barrel has anything in it.
Since you helped plan the heist, I'll let you have any one of these three sealed barrels that you want. Of course, even an empty barrel is too heavy to lift by hand, so it's hard to tell which has the most gold. The only way you can measure their contents is to see if the forklift can lift them. As a safety feature, our forklift can be calibrated to a certain weight, which it will lift no more than. However we can only set this once because changing it a second time requires retooling and a visit from the manufacturer, which would blow our cover.
We know this: these barrels are holding between 0 and 1000 pounds of gold, with a uniform, random distribution. You get to pick a weight, any weight you'd like and test which barrels are heavier than that, and which are lighter. If just one is heavier and two are lighter - great! You found the one with the most gold. If there's a tie for heaviest, you'll just have to pick one. But hey, you might eliminate the lightest one and that's better than no information, right?
Suppose the forklift's zero calibration is the weight of an empty barrel, so we can ignore that number. Therefore, if you set it to 500 pounds, you have exactly a 1/2 chance of getting a particular barrel too heavy to lift, but a 3/8 chance of getting one too heavy and two too light. On the other hand, you might get two too heavy, and you'll wish you had set it higher. Obviously, there's a tradeoff: higher weights are less likely to have a tie among several barrels being the heaviest, but on the other hand they risk overshooting all the barrels and you'll be stuck just picking one of the three.
What is the weight capacity you should set on the forklift to maximize your average expected gain?

Solution


First we'll solve a more general problem, then plug in the numbers to solve this one. Without loss of generality, say the amount of gold in a barrel ranges from 0 to 1. Suppose there are $b$ barrels and $g$ (for guesses) calibrations available to us. Suppose the set $C$ is the set of calibrations, which we'll write as
$C = \{c_{i} \hspace{.3em} \vert \hspace{.3em} i \in [1,g]\}$. $c_{g}$ will be the biggest one and they descend from there, so this equation is true:
$\displaystyle 0 < c_{1} < ... < c_{g-1} < c_{g} < 1$  
(1)

Finally, add the two constants $c_{g+1} = 1$ and $c_{0} = 0$ to make our formulas cleaner. Each barrel can fall into any of the $g + 1$ ranges: less than the smallest calibration, bigger than the largest, or between any pair of consecutive calibrations.
To find the expected value of a set of calibrations, we can find the expected values of each of these gaps and multiply those values by their probabilities, then sum up the products. The average value of a barrel between $c_{i+1}$ and $c_{i}$ is
$\frac{c_{j+1}+c_{j}}{2}$ and the only way you'll end up with that as the amount of gold you get is if the largest barrel is in that range. Fortunately, that probability is easy to calculate: the chance of all the barrels being less than $c_{i+1}$ is $c_{i+1}^{b}$ and the chance of them all being smaller than $c_{i}$ is $c_{i}^{b}$, so we can simply subtract these two powers to get the probability that all the barrels are less than $c_{i+1}$, but not all are less than $c_{i}$.
Then the expected value is
$\displaystyle E(C)$$\textstyle =$$\displaystyle \sum_{j = 0}^{g}\frac{c_{j+1}+c_{j}}{2}(c_{j+1}^{b} - c_{j}^{b})$
(2)
 $\textstyle =$$\displaystyle \frac{1}{2} \left[\sum_{j = 0}^{g}(c_{j+1}+c_{j})(c_{j+1}^{b} - c_{j}^{b}) \right]$
 
 $\textstyle =$$\displaystyle \frac{1}{2} \left[\sum_{j = 0}^{g}(c_{j+1}^{b+1} - c_{j+1}c_{j}^{b} +c_{j}c_{j+1}^{b} - c_{j}^{b+1}) \right]$
 
 $\textstyle =$$\displaystyle \frac{1}{2} \left[\sum_{j = 0}^{g}(c_{j}c_{j+1}^{b} - c_{j+1}c_{j}^{b}) +\sum_{j = 0}^{g}(c_{j+1}^{b+1}- c_{j}^{b+1}) \right]$
 
 $\textstyle =$$\displaystyle \frac{1}{2} \left[\sum_{j = 0}^{g}(c_{j}c_{j+1}^{b} - c_{j+1}c_{j}^{b}) + 1 \right]$
 

The maximum value of $E(C)$ can be found by taking partial derivatives with respect to each $c_{i} \in C$, setting these expressions to zero, and solving them simultaneously. The constant 1 will not survive differentiation and the $\frac{1}{2}$ is superfluous. A variable $c_{i}$ appears in 4 terms, namely

\begin{eqnarray*}c_{i}c_{i+1}^{b}-c_{i}^{b}c_{i+1} + c_{i-1}c_{i}^{b}-c_{i-1}^{b}c_{i}\end{eqnarray*}

with derivative,

\begin{eqnarray*}c_{i+1}^{b}-bc_{i}^{b-1}c_{i+1} + bc_{i-1}c_{i}^{b-1}-c_{i-1}^{b}\end{eqnarray*}

Thus the solution can be found by solving the equations:
$\displaystyle c_{i+1}^{b}-bc_{i}^{b-1}c_{i+1} + bc_{i-1}c_{i}^{b-1}-c_{i-1}^{b} = 0, \forall i \in [1, g]$  
(3)

Finally, If g = 1 and b = 3


Returning to our original problem, if there are 3 barrels and 1 calibration $C = \{c_{1}\}$, the Equations (3) require that we solve

\begin{eqnarray*}1 - 3c_{1}^{2} & = & 0\end{eqnarray*}

Therefore, the solution is
$c_{1} = \frac{1}{\sqrt{3}} = 0.57735... $, meaning the best calibration is 577 lbs. This has a nice 42% chance of isolating the heaviest barrel and a mere 27% chance of not separating the barrels at all. You must be wondering what your payout is. By Equation 2 it is 748 pounds of Gold.

Another Example, g = 3 and b = 4


As another example, for 3 calibrations
$C = \{c_{1}, c_{2}, c_{3}\}$ and 4 barrels, Equations (3) are

\begin{eqnarray*}c_{2}^{4}-4c_{1}^{3}c_{2} & = & 0 \\c_{3}^{4}-4c_{2}^{3}c_{3......{4} & = & 0 \\1-4c_{3}^{3} + 4c_{2}c_{3}^{3}-c_{2}^{4} & = & 0\end{eqnarray*}

This might be pretty tricky to solve by hand, but the mathematical software Sage gives the numerical solution:

\begin{eqnarray*}c_{3} & = & 0.83463... \\c_{2} & = & 0.64395... \\c_{1} & = & 0.40566...\end{eqnarray*}

In fact, Sage finds 57 solutions but only 5 are real. Of these 5, only 1 satisfies the ordering in Inequalities (1).