Digital Archeology and Markup Options

I’m currently developing a tool for archivists which will allow them (or indeed anyone) to “publish” a repository of finding aids (XML documents containing metadata about collections of stuff, e.g. the papers of a 19th century lawyer) by, essentially, installing the tool (making some changes to config) and pressing “go”. Actually you don’t need to press go. There are a bunch of similar tools around, but most of them have the dubious virtue of storing and maintaining valuable data inside a proprietary or arbitrary database, and of being a lot more complicated to set up.

E.g. the typical workflow for one of these tools, we’ll call it “Fabio”, goes like this:

  1. Archivist creates a complicated XML file containing metadata about some collection of stuff.
  2. Archivist creates or uses an existing XSL file to transform this into data Fabio understands.
  3. Fabio loads the data, and there are almost certainly errors in it because (a) the XML had mistakes in it and/or (b) the XSL had errors in it.
  4. Archivist discovers some errors and fixes either (a) the XML (and reimports), (b) the XSL (and reimports), or (c) the data in Fabio. Probably (c), because reimporting data into Fabio is a pain, and the whole point of Fabio is that it’s supposed to be an “all in one” solution once you get your data munged into it.
  5. Later, more errors are discovered and fixed by the process listed in 4.

Now, the problem with all of this, as I see it, is that it’s completely nuts.

  • People who already know (to some extent) how to create and maintain XML and XSL files now need to learn how to use Fabio. Knowledge of Fabio is relatively useless knowledge (when you change jobs, the new place probably doesn’t use Fabio, but a completely different but equally stupid product).
  • Corrections may be made in either the raw data (XML), reusable scripts (XSL), or Fabio’s black box database (???). Later corrections can easily overwrite earlier corrections if, for example, one mistake is fixed inside Fabio and then another is fixed in the XML.
  • If Fabio stops being maintained or you just want to stop using it, all (or much) of your valuable data is in Fabio’s black box database. Even if you know how it’s structured you may lose stuff getting your data back out.
  • The XML repository needs separate version control anyway; otherwise, what happens if you change your XSL to fix one import and then need to reimport another file that worked before but doesn’t work now?
  • Data may be lost during the import process and you won’t know.
  • Fabio needs to provide an API to expose the data to third parties. If it doesn’t expose a particular datum (e.g. because it was lost on import, or Fabio’s developers haven’t gotten around to it yet), you’re out of luck.
  • Fabio may have stupid flaws, e.g. provide unstable or ugly URLs — but that’s between you and your relevant committee, right?

My solution is intended to be as thin as possible and do as little as possible. In particular, my solution wraps a user interface around the raw data without changing the workflow you have for creating and maintaining that data. My solution is called metaview (for now) but I might end up naming it something that has an available .com or .net address (metaview.com is taken).

  • I don’t “import” the data. It stays where it is… in a folder somewhere. Just tell me where to find it. If you decide to stop using me tomorrow, your data will be there.
  • If you reorganize your data the URLs remain unchanged (as long as you don’t rename files and files have unique names, for now).
  • Unless you’re maintaining the UI you don’t need to understand it.
  • You fix errors in your data by fixing errors in YOUR data. If you stop using me tomorrow, your fixes are in YOUR data.
  • I present the raw data directly to users if they want it (with the UI wrapped around it) so that there’s no need to wait for an API to access some specific datum, or worry about data loss during import processes.
  • Everything except base configuration (i.e. what’s my DB password and where’s the repository) is automatic.
  • I don’t try to be a “one stop solution”. No I don’t provide tools for cleaning up your data — use the tools you already use, or something else. No I don’t do half-assed version control — use a full-assed version control solution. Etc. I don’t even think I need to do a particularly good job of implementing search — since Google et al do this already. I just need to make sure I expose everything (and I mean everything) to external search engines.
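
The “URLs remain unchanged” point above can be sketched roughly as follows. This is an illustration in Python rather than metaview’s actual PHP, and the function names (build_url_map, resolve) and the URL shape are assumptions made up for the example. The idea is simply to key each finding aid on its file name rather than on its folder, so that moving a file around inside the repository does not change its address.

```python
from pathlib import Path
from typing import Dict, Optional


def build_url_map(repository: Path) -> Dict[str, Path]:
    """Map each XML file's name (without extension) to its current location."""
    url_map: Dict[str, Path] = {}
    for path in sorted(repository.rglob("*.xml")):
        slug = path.stem
        if slug in url_map:
            # The scheme only works while file names are unique.
            raise ValueError(f"duplicate file name: {path} and {url_map[slug]}")
        url_map[slug] = path
    return url_map


def resolve(url_map: Dict[str, Path], url_path: str) -> Optional[Path]:
    """Find the file for a URL such as /finding-aids/smith-papers (hypothetical shape)."""
    slug = url_path.strip("/").rsplit("/", 1)[-1]
    return url_map.get(slug)
```

Rebuilding the map after a reorganization yields the same slugs pointing at the new locations, which is why renaming a file (unlike moving it) would break its URL.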

This has me thinking quite a bit about markup languages and templating systems. At first, I tried to decide what kind of template system to use. The problem for me is that, for any given web development stack, there seems to be a number of templating systems that is some multiple of the nth power of the number of people using that stack and the number of components in the stack. So if you’re looking at PHP (roughly two bazillion users) and MySQL (three bazillion) and Apache (four…) that’s a metric frackton of options, and knowledge of any one of those is way over on the bad side of the Knowledge Goodness Scale.

Aside: my unoriginal Knowledge Goodness Scale. This isn’t original, but I try to acquire “good” knowledge and try to avoid “bad”. The more times, contexts, and situations in which knowledge is accurate, useful, and applicable, the better it is. So knowledge about how to understand and evaluate information (such as basic logic, understanding of probability and statistics, understanding of human nature) is incredibly far over on the good end. Knowledge of how to build templates for a “content management system” that only works with PHP 5.2.7 with MySQL 5.x and Apache 2.x is closer to the bad end.

It follows that if you are going to force your users to learn something, try to make it good stuff, not bad stuff. So, let’s continue…

Archivists try to preserve knowledge and/or create new (preferably good) knowledge. We don’t produce index information or metadata about something because we want to have to do it again some day. The knowledge they’re dealing with is often very specific, but it can still be good, i.e. accurate and applicable across time. Developing an engine based on a bunch of technologies which themselves are unlikely to be useful across a span of time and contexts is not a good start. (Card indexes have lasted a long time. If your electricity goes out or your server crashes you can still use them today.)

So, my solution involves requiring users to change their lives as little as possible, learn as little as possible, and build on their existing good knowledge rather than acquire new bad knowledge. Instead of figuring out the foibles of Fabio, they can learn how to better create and maintain raw data.

So, what’s the approach?

  • XML is rendered using XSL, in the browser or on the server. If you want the XML, click the XML link. (It looks the same in modern browsers; folks with primitive browsers will need server-side rendering.)
  • The templating system is XSL.
  • The database contains index information, but is built dynamically from the repository (as needed).
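
A rough sketch of what “built dynamically from the repository” could look like, again in Python with SQLite standing in for metaview’s PHP and MySQL. The table layout, the refresh_index name, and the assumption that each file has an un-namespaced title element are all hypothetical. The point is only that the database holds nothing that cannot be regenerated from the files, so it can be thrown away and rebuilt at any time.

```python
import sqlite3
import xml.etree.ElementTree as ET
from pathlib import Path


def refresh_index(db: sqlite3.Connection, repository: Path) -> int:
    """Re-index new, moved, or modified XML files; return how many were (re)indexed."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS finding_aids ("
        "slug TEXT PRIMARY KEY, path TEXT NOT NULL, mtime REAL NOT NULL, title TEXT)"
    )
    updated = 0
    for path in sorted(repository.rglob("*.xml")):
        mtime = path.stat().st_mtime
        row = db.execute(
            "SELECT mtime, path FROM finding_aids WHERE slug = ?", (path.stem,)
        ).fetchone()
        if row == (mtime, str(path)):
            continue  # unchanged since the last pass, nothing to do
        # Assumption: a plain <title> element somewhere in the file.
        # A malformed file raises ET.ParseError here; the fix belongs in the file.
        node = ET.parse(path).getroot().find(".//title")
        title = "".join(node.itertext()).strip() if node is not None else None
        db.execute(
            "INSERT OR REPLACE INTO finding_aids (slug, path, mtime, title) "
            "VALUES (?, ?, ?, ?)",
            (path.stem, str(path), mtime, title),
        )
        updated += 1
    db.commit()
    return updated
```

Calling this on each request (or on a timer) keeps the index in step with the folder without any separate import step.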

Of all the different markup languages around, XML is probably the best. It satisfies much of the original intent of HTML: to truly separate intention from presentation (order is still much too important in XML; it’s quite a struggle to reorder XML content via XSL in a nice, flexible way). It’s very widely used and supported. Many people (my target audience in particular) already know it. And it’s not limited to a particular development stack.

XSL is itself written in XML, so it’s easy for people who already use XML to grok, and again it’s not limited to a particular development stack.
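
Browser-side rendering of XML with XSL relies on the standard xml-stylesheet processing instruction, which tells the browser which stylesheet to apply. A minimal sketch of adding that instruction to a raw file as it is served, in Python for illustration; the with_stylesheet name and the stylesheet path in the usage line are made up for the example, and the file on disk is never modified.

```python
import re
from xml.sax.saxutils import quoteattr

# Matches an XML declaration such as <?xml version="1.0"?> at the start of the text.
_XML_DECLARATION = re.compile(r"^\s*<\?xml\s[^>]*\?>")


def with_stylesheet(xml_text: str, href: str) -> str:
    """Return xml_text with an xml-stylesheet instruction pointing at href."""
    instruction = f'<?xml-stylesheet type="text/xsl" href={quoteattr(href)}?>\n'
    match = _XML_DECLARATION.match(xml_text)
    if match:
        # The instruction has to come after the XML declaration, if there is one.
        end = match.end()
        return xml_text[:end] + "\n" + instruction + xml_text[end:].lstrip()
    return instruction + xml_text
```

For example, with_stylesheet(raw_xml, "/xsl/finding-aid.xsl") would produce the text to send to a browser that can apply XSL itself; the server-side path would run the same stylesheet before responding.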

There’s no escaping binding oneself to a development stack for interactivity, so metaview is built using the most common possible free(ish) technologies, i.e. MySQL, PHP, JavaScript, and Apache. Knowledge of these tools is probably close to the least bad knowledge to force on prospective developers/maintainers/contributors.

Less Bad Options

I do have some misgivings about two technical dependencies.

XML has many virtues, but it sucks to write. A lot of what I have to say applies just as much to things like blogs, message boards, and content management systems, but requiring users of your message board to learn XML and XSL is … well … nuts. XML and XSL for blog entries is serious overkill. If I were making a more general version of metaview (e.g. turning it into some kind of content management system, with online editing tools) I’d probably provide alternative markup options for content creators. Markdown has many virtues that are in essence the antithesis of XML’s virtues.

Using Markdown to Clean Up XML

Markdown — even more than HTML — is all about presentation, and only accidentally discloses intention (i.e. the fact you’re making something a heading might lead one to infer that it’s important, etc.). But unlike HTML (or XML) Markdown is easy and intuitive to write (anyone who has edited ASCII files is already 75% of the way there) and the marked up text looks good as is (one of its design features). There are a ton of “similar” markup languages, but they are all either poor clones of HTML (the worst being bbcode) or just horrible to look at (e.g. Wiki syntax). Markdown also lets you insert HTML making it (almost) supremely flexible should the need arise. So, if I wanted to create an alternative method for maintaining content, Markdown seems like a nice option.

Markdown also seems like a nice way of embedding formatted text inside XML without polluting the XML hierarchy. E.g. rather than allowing users to use some toy subset of HTML to do rudimentary formatting within nodes, which makes the DTD and all your XML files vastly more complex, you could simply have <whatever type="text/markdown"> and then have the XSL pass out the text as <pre class="markdown">, which will look fine, but can be made pretty on the client side by a tiny amount of JavaScript. In a sense, this lets you separate meaningful structure from structure that purely serves a presentational goal; in other words, it makes your XML cleaner, easier to specify, easier to write correctly, easier to read, and easier to parse.
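
To make the idea concrete, here is a small sketch of the transformation described above. In metaview this step would be an XSL template; the version below does the equivalent in Python purely for illustration, and the markdown_nodes_to_pre name is hypothetical. Any element marked type="text/markdown" is passed through as a pre element with class "markdown", with its text left untouched for client-side script to prettify.

```python
import xml.etree.ElementTree as ET


def markdown_nodes_to_pre(xml_text: str) -> str:
    """Rewrite every element with type="text/markdown" as <pre class="markdown">."""
    root = ET.fromstring(xml_text)
    for node in root.iter():
        if node.get("type") == "text/markdown":
            node.tag = "pre"
            node.attrib.clear()
            node.set("class", "markdown")
            # The Markdown source itself is left as plain text inside the node.
    return ET.tostring(root, encoding="unicode")
```

Because the Markdown stays as plain text in a single node, the DTD only needs to allow text there, not a nested subset of HTML.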

My other problem is PHP. PHP is popular, free, powerful, and it even scales. It’s quite easy to learn, and it does almost anything I need. I’m tired of the “PHP sucks” bandwagon, but as I’ve been writing more and more code in it I am really starting to wonder “why do I hate it so much?” Well, I won’t go into that topic right now, since others have covered this ground in some detail (e.g. “I’m sorry but PHP sucks” and “What I don’t like about PHP”), but there’s also the fact that it’s popular, free, powerful, and scales. Or to put it another way, PHP sucks, but it doesn’t matter. It would be nice to implement all this in Python, say, but then how many web servers are set up to serve Python-based sites easily? While it may be painful to deal with PHP configuration issues, the problem is fixable without needing root privileges on your server.

So while I think I’m stuck with PHP, I can at least (a) stick to as nice a subset of it as possible (which is hard — I’m already using two different XML libraries), and (b) write as little code as possible. Also I can architect the components to be as independent as possible so that, for example, the indexing component could be replaced with something else entirely without breaking the front end.
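
One way to picture the “independent components” goal is as a narrow interface between the front end and whatever does the indexing. The sketch below is in Python for illustration only; the Index protocol, DictIndex, and render_link are hypothetical names, not metaview’s code. The front end asks one question (“where does this slug live?”) and does not care whether the answer comes from MySQL, a flat file, or something else entirely.

```python
from html import escape
from typing import Dict, Optional, Protocol


class Index(Protocol):
    """Anything that can map a slug to a file location can serve as the index."""

    def lookup(self, slug: str) -> Optional[str]: ...


class DictIndex:
    """Trivial in-memory index; a database-backed one would expose the same method."""

    def __init__(self, entries: Dict[str, str]) -> None:
        self._entries = dict(entries)

    def lookup(self, slug: str) -> Optional[str]:
        return self._entries.get(slug)


def render_link(index: Index, slug: str) -> str:
    """Front-end code depends only on lookup(), not on how the index is stored."""
    safe = escape(slug, quote=True)
    if index.lookup(slug) is None:
        return f'<span class="missing">{safe}</span>'
    return f'<a href="/{safe}">{safe}</a>'
```

Swapping the indexing component then means writing another class with a lookup method, and nothing in the front end has to change.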