ODF vs. OOXML

This is kind of outside my usual bailiwick — or not, I guess — but I read an article on Groklaw concerning the recent brouhaha over Stallman allegedly calling Miguel de Icaza (the guy behind Mono) a traitor for developing Mono. (One-line summary: Stallman didn’t call him a traitor over Mono; he called him a traitor for acting as a shill for Microsoft and in particular OOXML.) If you’re interested in open standards, digital archaeology, how standards processes work, and the whole philosophy of free software, the entire — lengthy — article is very much worth reading. This comment (to Miguel) in particular struck a nerve:

Re:OOXML. (Score:5, Insightful)
by stilborne (85590) on Monday September 10 2007, @10:41PM (#20548069)

> but we have to support them both *anyways*, so its not like its a big deal.

Holy mackerel.

First: I really don’t care to get into a pissing match about the deficiencies of OOXML as a possible standard (they are legion and often fundamental; and whether or not you understand that and/or choose to minimize the severity of these things changes nothing). I will say that I’m very happy to finally see at least *some* open documentation for the new Microsoft Office format; that has to make things easier for the people implementing filters. As such I am completely unsurprised that those people are happier than they were a couple years ago. In fact, I’d be surprised if they weren’t. That part is probably something you and I agree on =)

However the quote above is utterly shocking. Let me explain what I mean:

You are right that we have to support both OOXML and ODF out of practicality. But you know what? That sucks. It would be best for everyone if there was only one format to support. Nobody would lose in that scenario, except perhaps the owners of companies with business models that depend on format variance to sell their product.

In the case of document format storage, a standard is truly important because formats (poor or not) that eventually lose implementations over time carve out blank spaces in our history once we can’t read them properly. These same formats are also the source of certain information inequalities in society (e.g. for those who can’t obtain an implementation for financial, social or political reasons). This may not matter so much for Acme Inc’s quarterly reports, but it sure does for government, health and other socially vital information. Remember when some hurricane Katrina victims couldn’t use the FEMA website because they had slightly older computers? This isn’t a made-up bogeyman; this is stuff that bites us as a society fairly regularly. Now imagine a hundred years from now, when we can still read the constitutions of our countries, research papers, poetry and other examples of humankind’s great literary works that are hundreds or even thousands of years old … but can’t read the documents we’re creating at the start of the 21st century. How will we learn from our history if we can’t study it fully?

Getting proprietary formats out of the way as soon as possible so that we do not extend this mess any further than necessary is absolutely the responsible thing to do in light of our (hopeful) future.

Allowing OOXML to pass from “specification” to “international standard” would be doing exactly that: extending the problem, as it will give years if not decades more life to the format. If OOXML were rationally implementable and properly documented, it wouldn’t be as big an issue. It would be, as you put it, simply suboptimal. The fact of the matter is that OOXML is not rationally implementable and not properly documented. That’s why it lost the recent vote; it wasn’t because of lobbying (and trying to imply that when Microsoft got its hand caught in the cookie jar is pretty ballsy, by the way). Are some interests acting out of concerns for their business models or pet projects when they rally for ODF and against OOXML? I’m sure they are; but that alone isn’t reason to dismiss the fact that OOXML is problematic and that we don’t need two standards (any more than it is to dismiss OOXML just because it comes from Microsoft).

So please, admire OOXML for what it is: a step forward in documenting what historically has been one of the more pernicious sets of file formats we’ve had to deal with; but don’t mistake that for being a reason to make it an international standard which will only prolong the issues that are part and parcel of the Microsoft Office formats, even in this current version of the specification.

I know that having a bunch of people shit on you in public sucks major donkey nuts and certainly would put most rational people into a rather ungracious mood, but please think above that noise and consider with your intelligent mind exactly what you are promoting here by saying “it’d be fine as an ISO standard”.

ODF is currently incomplete (formulas, blah blah blah) but has exactly the right foundations, the right momentum for support across the industry, and the missing parts are being filled in very nicely. Properly, I might add. Those are the attributes that people who care about Freedom should appreciate, respect and support. In this case, that support means being willing to reject a competing specification that is not well suited for such international ratification. And that, in a nutshell, is why this is precisely a “big deal”.

(As an aside, this also shows how intelligent the level of debate on Slashdot can get.)

Digital Archaeology and Markup Options

I’m currently developing a tool for archivists which will allow them — or indeed anyone — to “publish” a repository of finding aids (XML documents containing metadata about collections of stuff, e.g. the papers of a 19th century lawyer) by, essentially, installing the tool, making some changes to config, and pressing “go”. Actually, you don’t even need to press go. There are a bunch of similar tools around, but most of them have the dubious virtue of storing and maintaining valuable data inside a proprietary or arbitrary database, and being a lot more complicated to set up.

E.g. the typical workflow for one of these tools, we’ll call it “Fabio”, goes like this (a sketch of the transform step follows the list):

  1. Archivist creates a complicated XML file containing metadata about some collection of stuff.
  2. Archivist creates or uses an existing XSL file to transform this into data Fabio understands.
  3. Fabio loads the data, and there are almost certainly errors in it because (a) the XML had mistakes in it and/or (b) the XSL had errors in it.
  4. Archivist discovers some errors and fixes either (a) the XML (and reimports), (b) the XSL (and reimports), or (c) the data in Fabio. Probably (c), because reimporting data into Fabio is a pain, and the whole point of Fabio is that it’s supposed to be an “all in one” solution once you get your data munged into it.
  5. Later, more errors are discovered and fixed by the process in step 4.
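For concreteness, here’s roughly what steps 1 and 2 look like. This is only a sketch: the file names and Fabio’s import format are invented, but the PHP pieces (DOMDocument, XSLTProcessor) are the standard ones.

    <?php
    // Hypothetical sketch of steps 1-2: transforming an archivist's finding
    // aid into whatever import format "Fabio" expects. The file names and
    // target format are invented for illustration.

    $xml = new DOMDocument();
    $xml->load('finding-aids/lawyer-papers.xml');   // step 1: the archivist's XML

    $xsl = new DOMDocument();
    $xsl->load('transforms/to-fabio.xsl');          // step 2: the hand-written XSL

    $proc = new XSLTProcessor();
    $proc->importStylesheet($xsl);

    // The output exists only to be swallowed by Fabio's importer (step 3);
    // from here on, corrections tend to happen inside Fabio rather than in
    // the XML, which is where the trouble starts.
    file_put_contents('import/lawyer-papers-fabio.xml', $proc->transformToXML($xml));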

Now, the problem with all of this, as I see it, is that it’s completely nuts.

  • People who already know (to some extent) how to create and maintain XML and XSL files now need to learn how to use Fabio. Knowledge of Fabio is relatively useless knowledge (when you change jobs, the new place probably doesn’t use Fabio, but some completely different yet equally stupid product).
  • Corrections may be made in the raw data (XML), the reusable scripts (XSL), or Fabio’s black box database (???). Later corrections can easily overwrite earlier corrections if, for example, one mistake is fixed inside Fabio and then another is fixed in the XML.
  • If Fabio stops being maintained or you just want to stop using it, all (or much) of your valuable data is in Fabio’s black box database. Even if you know how it’s structured, you may lose stuff getting your data back out.
  • The XML repository needs separate version control anyway; otherwise, what happens if you make a change to your XSL to fix one import and then need to reimport another file that worked before but doesn’t work now?
  • Data may be lost during the import process and you won’t know.
  • Fabio needs to provide an API to expose the data to third parties. If it doesn’t expose a particular datum (e.g. because it was lost on import, or Fabio’s developers haven’t gotten around to it yet), you’re out of luck.
  • Fabio may have stupid flaws, e.g. provide unstable or ugly URLs — but that’s between you and your relevant committee, right?

My solution is intended to be as thin as possible and do as little as possible. In particular, my solution wraps a user interface around the raw data without changing the workflow you have for creating and maintaining that data. My solution is called metaview (for now) but I might end up naming it something that has an available .com or .net address (metaview.com is taken).

  • I don’t “import” the data. It stays where it is… in a folder somewhere. Just tell me where to find it. If you decide to stop using me tomorrow, your data will be there.
  • If you reorganize your data, the URLs remain unchanged (as long as you don’t rename files and files have unique names — for now).
  • Unless you’re maintaining the UI you don’t need to understand it.
  • You fix errors in your data by fixing errors in YOUR data. If you stop using me tomorrow, your fixes are in YOUR data.
  • I present the raw data directly to users if they want it (with the UI wrapped around it) so that there’s no need to wait for an API to access some specific datum, or worry about data loss during import processes.
  • Everything except base configuration (i.e. what’s my DB password and where’s the repository) is automatic (see the sketch after this list).
  • I don’t try to be a “one stop solution”. No, I don’t provide tools for cleaning up your data — use the tools you already use, or something else. No, I don’t do half-assed version control — use a full-assed version control solution. Etc. I don’t even think I need to do a particularly good job of implementing search, since Google et al. do this already. I just need to make sure I expose everything (and I mean everything) to external search engines.
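To make that concrete, here’s a minimal sketch of what “base configuration plus go” might look like. Everything in it (the config keys, the table layout, the titleproper element) is invented; it just illustrates the shape of the idea: the repository stays a plain folder of XML, and the database is a disposable index over it.

    <?php
    // Illustrative sketch only: config keys, schema, and element names are
    // invented. The point is that the repository stays a plain folder of
    // XML files and the database is just a rebuildable index over it.

    $config = [
        'repository' => '/var/archives/finding-aids', // your data, where it already lives
        'db_dsn'     => 'mysql:host=localhost;dbname=metaview',
        'db_user'    => 'metaview',
        'db_pass'    => 'secret',
    ];

    $db = new PDO($config['db_dsn'], $config['db_user'], $config['db_pass']);

    // "Pressing go": (re)build the index from the repository. Dropping this
    // table loses nothing, because it can always be rebuilt from the XML.
    $db->exec('CREATE TABLE IF NOT EXISTS documents (
                   path  VARCHAR(255) PRIMARY KEY,
                   title TEXT
               )');

    $insert = $db->prepare('REPLACE INTO documents (path, title) VALUES (?, ?)');

    foreach (glob($config['repository'] . '/*.xml') as $path) {
        $doc = new DOMDocument();
        $doc->load($path);
        // Pull a human-readable title out of the XML; <titleproper> is a
        // stand-in for whatever the real metadata schema provides.
        $node  = $doc->getElementsByTagName('titleproper')->item(0);
        $title = $node ? $node->textContent : basename($path);
        $insert->execute([$path, $title]);
    }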

This has me thinking quite a bit about markup languages and templating systems. At first, I tried to decide what kind of template system to use. The problem for me is that the number of templating systems for any given web development stack seems to be some multiple of the number of people using that stack raised to the nth power, times the number of components in the stack. So if you’re looking at PHP (roughly two bazillion users) and MySQL (three bazillion) and Apache (four…), that’s a metric frackton of options, and knowledge of any one of those is way over on the bad side of the Knowledge Goodness Scale.

Aside: my (unoriginal) Knowledge Goodness Scale. I try to acquire “good” knowledge and avoid “bad”: the more times, contexts, and situations in which knowledge is accurate, useful, and applicable, the better it is. So knowledge about how to understand and evaluate information (such as basic logic, understanding of probability and statistics, understanding of human nature) is incredibly far over on the good end. Knowledge of how to build templates for a “content management system” that only works with PHP 5.2.7 with MySQL 5.x and Apache 2.x is closer to the bad end.

It follows that if you are going to force your users to learn something, try to make it good stuff, not bad stuff. So, let’s continue…

Archivists try to preserve knowledge and/or create new (preferably good) knowledge. They don’t produce index information or metadata about something because they want to have to do it again some day. The knowledge they’re dealing with is often very specific, but it can still be accurate and applicable across time, and that’s what makes it good. Developing an engine based on a bunch of technologies which themselves are unlikely to be useful across a span of time and contexts is not a good start. (Card indexes have lasted a long time. If your electricity goes out or your server crashes, you can still use them today.)

So, my solution involves requiring users to change their lives as little as possible, learn as little as possible, and build on their existing good knowledge rather than acquire new bad knowledge. Instead of figuring out the foibles of Fabio, they can learn how to better create and maintain raw data.

So, what’s the approach?

  • XML is rendered using XSL — on the browser or the server. If you want the XML, click the XML link. (It looks the same on modern browsers — folks with primitive browsers will need server-side rendering. A sketch of both paths follows this list.)
  • The templating system is XSL.
  • The database contains index information, but is built dynamically from the repository (as needed).
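Here’s a sketch of those two rendering paths. The file names are invented and browser detection is reduced to a stub; assume the real thing inspects the User-Agent or is configured per deployment.

    <?php
    // Sketch of the dual rendering path. File names are invented and
    // client_supports_xslt() is a stub; this is not metaview's actual code.

    function client_supports_xslt() {
        // Stand-in: assume a modern browser. Real detection would inspect
        // the User-Agent or be configured per deployment.
        return true;
    }

    $xmlPath = 'repository/finding-aid.xml'; // hypothetical document
    $xslHref = '/ui/finding-aid.xsl';        // hypothetical stylesheet URL

    $doc = new DOMDocument();
    $doc->load($xmlPath);

    if (client_supports_xslt()) {
        // Client-side path: serve the raw XML with an xml-stylesheet
        // processing instruction and let the browser apply the XSL. The
        // "XML link" can serve the same file without the instruction.
        $pi = $doc->createProcessingInstruction(
            'xml-stylesheet',
            'type="text/xsl" href="' . $xslHref . '"'
        );
        $doc->insertBefore($pi, $doc->documentElement);
        header('Content-Type: application/xml');
        echo $doc->saveXML();
    } else {
        // Server-side path for primitive browsers: same XML, same XSL,
        // transformed here and delivered as HTML.
        $xsl = new DOMDocument();
        $xsl->load(__DIR__ . '/ui/finding-aid.xsl');
        $proc = new XSLTProcessor();
        $proc->importStylesheet($xsl);
        header('Content-Type: text/html');
        echo $proc->transformToXML($doc);
    }

Either way, the XSL is the only template; there’s no second, server-only template language for anyone to learn.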

Of all the different markup languages around, XML is probably the best. It satisfies much of the original intent of HTML — to truly separate intention from presentation (order is still much too important in XML — it’s quite a struggle to reorder XML content via XSL in a nice, flexible way). It’s very widely used and supported. Many people (my target audience in particular) already know it. And it’s not limited to a particular development stack.

XSL is itself written in XML, so it’s easy for people who already use XML to grok, and again it’s not limited to a particular development stack.

There’s no escaping binding oneself to a development stack for interactivity — so metaview is built using the most common possible free(ish) technologies — i.e. MySQL, PHP, JavaScript, and Apache. Knowledge of these tools is probably close to the least bad knowledge to force on prospective developers/maintainers/contributors.

Less Bad Options

I do have some misgivings about two technical dependencies: writing XML by hand, and PHP.

XML has many virtues, but it sucks to write. A lot of what I have to say applies just as much to things like blogs, message boards, and content management systems, but requiring users of your message board to learn XML and XSL is … well … nuts. XML and XSL for blog entries is serious overkill. If I were making a more general version of metaview (e.g. turning it into some kind of content management system, with online editing tools) I’d probably provide alternative markup options for content creators. Markdown has many virtues that are in essence the antithesis of XML’s virtues.

Using Markdown to Clean Up XML

Markdown — even more than HTML — is all about presentation, and only accidentally discloses intention (i.e. the fact you’re making something a heading might lead one to infer that it’s important, etc.). But unlike HTML (or XML), Markdown is easy and intuitive to write (anyone who has edited ASCII files is already 75% of the way there) and the marked-up text looks good as is (one of its design features). There are a ton of “similar” markup languages, but they are all either poor clones of HTML (the worst being bbcode) or just horrible to look at (e.g. Wiki syntax). Markdown also lets you insert HTML, making it (almost) supremely flexible should the need arise. So, if I wanted to create an alternative method for maintaining content, Markdown seems like a nice option.

Markdown also seems like a nice way of embedding formatted text inside XML without polluting the XML hierarchy… e.g. rather than allowing users to use some toy subset of HTML to do rudimentary formatting within nodes, which makes the DTD and all your XML files vastly more complex, you could simply have <whatever type="text/markdown"> and then have the XSL pass the text through as <pre class="markdown">, which will look fine but can be made pretty on the client side by a tiny amount of JavaScript. In a sense, this lets you separate meaningful structure from structure that purely serves a presentational goal — in other words, make your XML cleaner, easier to specify, easier to write correctly, easier to read, and easier to parse.
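Here’s a sketch of that pass-through, using the <whatever> element from above. The stylesheet and sample document are invented; the template simply emits the Markdown text untouched inside a pre element for a client-side script to prettify later.

    <?php
    // Sketch of the Markdown pass-through. The <whatever> element and the
    // "markdown" class come from the text above; the rest is invented glue.

    $xsl = new DOMDocument();
    $xsl->loadXML(<<<'XSL'
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- Markdown-typed nodes pass through as preformatted text; a small
           client-side script can later replace each pre.markdown with
           rendered HTML. The XML hierarchy never sees the formatting. -->
      <xsl:template match="whatever[@type='text/markdown']">
        <pre class="markdown"><xsl:value-of select="."/></pre>
      </xsl:template>
    </xsl:stylesheet>
    XSL
    );

    $doc = new DOMDocument();
    $doc->loadXML(<<<'XML'
    <doc>
      <whatever type="text/markdown">## Scope

    This *entire* block is plain Markdown, not XML structure.</whatever>
    </doc>
    XML
    );

    $proc = new XSLTProcessor();
    $proc->importStylesheet($xsl);
    echo $proc->transformToXML($doc); // emits <pre class="markdown">...</pre>

On the client, a few lines of JavaScript can find each pre.markdown and hand its text to a Markdown renderer; with JavaScript off, the raw Markdown still reads fine, which is one of the reasons for choosing Markdown in the first place.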

My other problem is PHP. PHP is popular, free, powerful, and it even scales. It’s quite easy to learn, and it does almost anything I need. I’m tired of the “PHP sucks” bandwagon, but as I’ve been writing more and more code in it I am really starting to wonder: why do I hate it so much? Well, I won’t go into that topic right now — others have covered this ground in some detail (e.g. I’m sorry but PHP sucks and What I don’t like about PHP) — but there’s also the fact that it’s popular, free, powerful, and scales. Or to put it another way: PHP sucks, but it doesn’t matter. It would be nice to implement all this in Python, say, but then how many web servers are set up to serve Python-based sites easily? While it may be painful to deal with PHP configuration issues, the problem is fixable without needing root privileges on your server.

So while I think I’m stuck with PHP, I can at least (a) stick to as nice a subset of it as possible (which is hard — I’m already using two different XML libraries), and (b) write as little code as possible. I can also architect the components to be as independent as possible so that, for example, the indexing component could be replaced with something else entirely without breaking the front end (something like the sketch below).
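To illustrate that last point, here’s a hypothetical sketch of the kind of seam I mean. None of these names exist in metaview; the idea is just that the front end depends on a small interface, so the MySQL-backed indexer could be swapped for anything else.

    <?php
    // Hypothetical sketch of the "independent components" idea. The names
    // are invented; the point is the seam between the front end and the
    // index, not this particular implementation.

    interface Indexer {
        /** Rebuild the index from the XML repository at $repositoryPath. */
        public function rebuild($repositoryPath);

        /** Return paths of documents matching a plain-text query. */
        public function search($query);
    }

    class MySqlIndexer implements Indexer {
        private $db;

        public function __construct(PDO $db) {
            $this->db = $db;
        }

        public function rebuild($repositoryPath) {
            // Walk the folder and REPLACE INTO the index table, as in the
            // earlier indexing sketch; omitted here for brevity.
        }

        public function search($query) {
            $stmt = $this->db->prepare(
                'SELECT path FROM documents WHERE title LIKE ?');
            $stmt->execute(['%' . $query . '%']);
            return $stmt->fetchAll(PDO::FETCH_COLUMN);
        }
    }

    // The front end only ever sees an Indexer, so swapping the storage
    // (SQLite, flat files, an external search engine) touches one class.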