Tuesday, June 28, 2005

WEB: Tangle, Weave

I'm old enough to have seen a few things go by in the technology world, some good, some bad. What fascinates me is when younger people accidentally re-invent something that has been done before. This happens a lot more than you might think.

A historical case in point....

A couple of decades ago or so, Donald Knuth, the grandfather of Computer Science and chairman of the CS department at Stanford for many, many years, invented a typesetting language called TeX. It was one of the first markup languages, and was created mostly to typeset mathematics, which is very hard to do. Knuth was one of the first people to take an interest in digital typography and typesetting, and in fact created a Digital Typography degree program at Stanford, which graduated, I think, only about a half dozen people, most of whom I know personally (Dan Mills, Carol Twombly, David Siegel, Cleo Huggins). But I digress...

In about 1982 (10 years before the "world wide web") Knuth created a system he called WEB. It was a programming language (PASCAL, more or less) in which the comments were in fact instructions to a typesetting program that would typeset the text of the program.

The key idea here is that a single text file (a .web file) could be read by two different programs, and each would see something different:

  • the PASCAL compiler would see source code

  • the typesetting program would see typesetting instructions

If I recall correctly, the program that read the "program" code was called Tangle, and the one that read the "typesetting" code was called Weave.
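To make the "one file, two readers" idea concrete, here is a toy sketch in Python. It is not real WEB syntax. The convention it assumes (a whole-line { ... } comment is prose for the typesetter, everything else is code) and the names SOURCE, is_prose, tangle and weave are all made up for illustration.

```python
# Toy illustration only: real WEB has its own control codes, not this format.
# Assumed convention: a line that is entirely a { ... } comment is prose for
# the typesetter; every other line is program code.

SOURCE = """\
{ This routine adds two numbers. }
function add(a, b: integer): integer;
begin
  add := a + b
end;
{ It is called from the main loop. }
"""


def is_prose(line: str) -> bool:
    """True when the whole line is a single brace comment."""
    stripped = line.strip()
    return stripped.startswith("{") and stripped.endswith("}")


def tangle(text: str) -> str:
    """Keep only the program lines, the view a compiler cares about."""
    return "\n".join(line for line in text.splitlines() if not is_prose(line))


def weave(text: str) -> str:
    """Build a document view: prose unwrapped, code indented as a listing."""
    out = []
    for line in text.splitlines():
        if is_prose(line):
            out.append(line.strip()[1:-1].strip())
        else:
            out.append("    " + line)
    return "\n".join(out)


if __name__ == "__main__":
    print(tangle(SOURCE))
    print("-----")
    print(weave(SOURCE))
```

Both functions read exactly the same text. Each one simply ignores the part that was written for the other.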

The notion of "structured comments" may, I think, have started here. By structured comments I mean information that is technically a "comment" (not interpreted as part of the source code) but is structured in such a way that other programs can make sense of it.

I borrowed this idea in 1985 or so to enhance and expand PostScript comments to contain information to be read by document processing systems. Inserting %%Page: 3 into a PostScript program allowed other programs to find the page boundaries.
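Here is roughly what a program on the reading end might do, as a small Python sketch. The function name find_pages and the sample file are invented for illustration, and the sketch assumes each page comment sits on its own line and starts with %%Page: followed by a page label. A real document processor would have to cope with much more than this.

```python
def find_pages(ps_text: str) -> list[tuple[str, int]]:
    """Return (page label, line number) for every %%Page: comment found."""
    marker = "%%Page:"
    pages = []
    for lineno, line in enumerate(ps_text.splitlines(), start=1):
        if line.startswith(marker):
            fields = line[len(marker):].split()
            label = fields[0] if fields else ""  # first field is the label
            pages.append((label, lineno))
    return pages


# Hypothetical two-page file. The drawing commands are stand-ins, since only
# the comment lines matter to this scanner.
SAMPLE = """\
%!PS-Adobe-3.0
%%Pages: 2
%%Page: 1 1
% drawing commands for the first page
showpage
%%Page: 2 2
% drawing commands for the second page
showpage
%%EOF
"""

if __name__ == "__main__":
    for label, lineno in find_pages(SAMPLE):
        print(f"page {label} starts at line {lineno}")
```

The scanner never has to understand PostScript itself. To the PostScript interpreter those lines are ordinary comments, and to the scanner everything else is noise.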

This whole idea is being reinvented to some degree on the WWW now. What began as HTML tags has grown additional appendages or "tags" or "fields" which are used to communicate with other programs that read the HTML pages (search engine crawlers, or XML parsers, or whatever).

There are good and bad things about this. The good ideas are obvious: an HTML page can contain hidden, embedded instructions for other programs to read. The bad ideas are harder to spot, but are real, and insidious. One is this: validating these files is almost impossible. Information in a "comment" cannot be verified by the parser to which it is a comment, by definition. Even JavaScript programs look like "comments" to most HTML parsers, including search engines. You can't spot a JavaScript error except by running it through a JavaScript interpreter, which in turn sees HTML code as "comments".
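A small Python sketch shows the validation problem, using the standard library's html.parser module. The class name OpaqueCollector and the sample page are made up for illustration. The page holds a malformed "structured comment" and a script with a syntax error, and the HTML parser hands both back as plain text without complaint, because checking them is not its job.

```python
from html.parser import HTMLParser


class OpaqueCollector(HTMLParser):
    """Collect comment bodies and script bodies exactly as the parser sees them."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_comment(self, data):
        # The parser passes the comment text through without interpreting it.
        self.comments.append(data)

    def handle_data(self, data):
        # Script content arrives as opaque text, never checked as JavaScript.
        if self._in_script:
            self.scripts.append(data)


# Hypothetical page: the comment has an unclosed brace and the script is
# not valid JavaScript.
PAGE = """<html><body>
<!-- rating: {unclosed -->
<script>var x = ;</script>
<p>Hello</p>
</body></html>"""

collector = OpaqueCollector()
collector.feed(PAGE)
collector.close()

if __name__ == "__main__":
    print("comments:", collector.comments)
    print("scripts: ", collector.scripts)
```

Nothing here raises an error. Catching either mistake takes a second program that knows the comment convention or the JavaScript grammar.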

I chuckle to myself when I hear terms like "micro-formats" and I see entire companies like Technorati being built around the idea of hiding metadata inside HTML comments. It will certainly work for a while, but it's extremely hard to standardize and police, it's trivial for competing standards to pop up, and eventually the whole thing collapses under the weight of micro-variations, competing non-standard implementations, and sometimes, just plain old questionable motives.
