subversion vs. git

It seems that the indie Mac developer community has spoken, and subversion is the version (or is it “revision”?) control system of choice for now. It’s important not to blindly accept the indie Mac developer community’s word on such matters, or you may find yourself developing exclusively in Ruby on Rails using TextMate. But I’ve begun to wonder whether my “love at first sight” reaction to git was actually justified, given that after setting myself up with git on several projects, I (a) couldn’t figure out how to get remote push (or whatever it’s called) working, (b) didn’t want to have to pay for a hosted solution (for large and/or closed source projects), and consequently (c) found myself not using git after a while (going back to my good old “dated archives of giant folders full of files” system of version control).

I’d looked at subversion before, but not hard enough to really grok it, and at the time I was forced to use perforce by my (then) employer. It had become increasingly annoying that subversion was seamlessly supported by all kinds of tools I used (e.g. Coda) while git pointedly wasn’t. Now there are two excellent graphical clients for subversion — Versions and Cornerstone — as well as integrated support of varying levels of usefulness in Coda, TextMate, Xcode, et al. For git there’s GitX — which has the advantage of being free, and the disadvantage of being nowhere near as nice.

There are two other really important differences between git and subversion (that I can see), one of which — I think — plays in git’s favor, while the other can play in either’s favor (but for me it’s subversion’s).

Files Are Not Atomic in git

The first key difference is that git isn’t fundamentally file-centric, or at least that’s the idea. The core elements are branches, revisions, and blobs of code — the telling consequence being that if you move a function from file A to file B, or rename file A to B, there’s a good chance git will recognize these events for what they are automagically. Other systems, such as subversion, will treat the first event as “something disappeared from A, and something appeared in B”, and the second as “file A was deleted, and file B was created”. If this were git’s only difference from subversion then git would win and subversion would lose, at least for me.
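For example, here’s what the rename case looks like from the command line (a sketch; the file names are invented):

    $ git mv notes.txt chapter-one.txt
    $ git commit -m "rename notes to chapter one"
    $ git log --follow chapter-one.txt   # history continues straight through the rename
    $ git log -M --summary               # the diff summary reports "rename notes.txt => chapter-one.txt"

Notably, git doesn’t record the rename anywhere; it infers it after the fact by comparing content, which is why the same trick works even if you move the file by hand instead of using git mv.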

There Is No Central Repository in git

The second key difference in git is that it’s very democratic. In essence, every user’s repository is “the master” and can merge in other people’s changes, or not (assuming it can find them). Indeed, a core idea of git is that creating and merging branches should be “cheap”, and thus it should be normal to have lots of branches lying around. While this seems like a Good Thing from an idealistic point of view, it doesn’t match my idea of what a “project” is, and that’s probably why I’ve never used git as it appears to have been intended. Maybe in my heart, I’m just not that wild and crazy.
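For what it’s worth, “cheap” really is cheap: the whole experiment below takes a couple of seconds, and nothing gets copied on disk (the branch name is invented):

    $ git checkout -b wild-idea     # create a branch and switch to it, instantly
    (hack away, committing as usual...)
    $ git commit -am "try the wild idea"
    $ git checkout master           # switch back to the main line
    $ git merge wild-idea           # fold the experiment in, or skip this and just delete it
    $ git branch -d wild-idea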

If we are to infer from git’s strengths what Linus Torvalds expected a project to look like, it’s a bunch of independent developers, each with their own unique multi-branched copy of a codebase, exchanging their different takes on the project at arbitrary intervals, folding (or not folding) other developers’ changes into their own branches, and occasionally assembling some kind of coherent “release” version (and presumably closing off and setting aside huge numbers of branches and starting over — to preserve their own sanity if nothing else).

Other Differences

There are other differences between git and subversion that all tend to play in git’s favor:

  • git produces fewer files and smaller repositories
  • git produces a single .git folder for an entire project (instead of scattering .svn folders into every single subfolder the way subversion does)

Neither of these matters much to me since my projects tend to be fairly small, but if I were working on a larger or more complex project the additional complexity of getting git working on my server would probably be justified. Of course I wouldn’t be surprised if subversion switches to a single-folder model at some point, making all our lives easier.

Finally, there’s the question of licensing. Subversion has a liberal BSD/MIT-style license, which means that many subversion clients have subversion built in, requiring no additional installation. Because git is GPL-licensed, shareware vendors can’t wrap a closed source program around it, so you tend to need to install a git client separately from git itself, and then figure out how to get them to talk to each other. (I don’t know why a git client can’t simply do the install for you, though, so this is more likely just a sign of git’s relative novelty.)

Mercurial

Mercurial (it’s hg from the command line — yay a chemistry reference!) is in many ways very similar to git but — as some have put it — more Mac-like. (It’s also written in Python, whereas git is cobbled together out of a bunch of different things — notably C and Perl.) Everything I’ve said about git versus subversion* pretty much applies to Mercurial vs. subversion (complete with the lack of good graphical front ends for Mac OS X and the licensing issues).

Edit: here’s an interesting talk about Mercurial and I stand corrected — it’s 95% Python and 5% C. Of course this means Mercurial vs. git comes down to Python versus Perl — no contest for me!

Another edit: one thing Mercurial has going for it is bitbucket.org, which works beautifully and is (apparently) totally free (for now at least). One nice thing about distributed systems is that there’s much less in the way of “stickiness” since everyone has the whole history. Based on my brief experience with Mercurial I’m starting to strongly favor it over git and subversion — maybe enough to use it from the command line and possibly even enough to write a nice front end for it.
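If you haven’t tried it, the “everyone has the whole history” point looks like this in practice (the repository URL is invented):

    $ hg clone https://bitbucket.org/someone/project   # you get the entire history, not just a checkout
    $ cd project
    (edit files...)
    $ hg commit -m "my changes"     # commits locally, no server round-trip
    $ hg push                       # share them back when, and if, you choose

Since every clone is a complete copy of the repository, walking away from bitbucket.org later would cost you nothing but the URL.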

* It’s kind of sad that the subversion folks haven’t run with the whole “subversion” theme in their naming conventions. So much potential wasted!

All Decisions Are Provisional

My take on it is that git (/mercurial) is the new hotness, which means we’ll see an explosion of git (/mercurial) front ends over the next couple of years. In the meantime, I’m happy to use subversion and migrate when there’s something as good as Cornerstone for git (/mercurial) (or — better yet — Cornerstone adds git (/mercurial) support).

Desktop Development with Java

In my ongoing quest to find a long-term replacement for RealBasic (and preferably as many other random development tools and languages as possible) I’ve been investigating Python + wxPython (which seems to work pretty well but produces remarkably large, sluggish applications), Java + Swing (which, as you might expect of anything developed by Sun and built on top of AWT, produces sluggish, ugly apps — on the Mac at any rate), and Java + SWT + JFace (which actually seems like it might not suck, but for which precious little documentation seems to be available outside of an insanely expensive — and somewhat outdated — book).
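To give a flavor of SWT, here’s the canonical hello-world (a sketch; it assumes the platform-specific swt.jar is on your classpath). Note the explicit event loop, which Eclipse normally runs for you:

    import org.eclipse.swt.SWT;
    import org.eclipse.swt.widgets.Display;
    import org.eclipse.swt.widgets.Label;
    import org.eclipse.swt.widgets.Shell;

    public class HelloSWT {
        public static void main(String[] args) {
            Display display = new Display();   // the connection to the native window system
            Shell shell = new Shell(display);  // a top-level native window
            shell.setText("Hello SWT");
            Label label = new Label(shell, SWT.NONE);
            label.setText("Real native widgets, not emulated ones.");
            label.pack();
            shell.pack();
            shell.open();
            while (!shell.isDisposed()) {      // the standard SWT event loop
                if (!display.readAndDispatch()) display.sleep();
            }
            display.dispose();
        }
    }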

One of the big downsides of anything involving Eclipse is that there seems to be an “easy 35 step installation process”. WTF? It’s like Lotus Notes all over again. Aside from that, Eclipse is a remarkably fine piece of software, and since it’s built on top of SWT — indeed it is SWT’s raison d’être — this seems to me to give it front-runner status. (Indeed, given NetBeans as an exemplar of Swing development, this is one more reason to avoid Swing like the plague.)

Eclipse’s features make Java’s horribleness tolerable… indeed almost not horrible at all. Hover over a class name and there’s its documentation (in useful form). Declare an object of a class you haven’t imported and it will offer to insert the appropriate import for you. Seriously good autocomplete. Seriously good compiler feedback (errors are flagged inline like spelling mistakes in a word processor). You still need to look at the resulting code, but really not bad at all. For all of Python’s virtues, it cannot offer anything like that kind of programming experience right now.

Many have been killed in the attempt…

The fundamental idea behind verification is that one should think about what a program is supposed to do before writing it. Thinking is a difficult process that requires a lot of effort. Write a book based on a selection of distorted anecdotes showing that instincts are superior to rational judgment and you get a best seller. Imagine how popular a book would be that urged people to engage in difficult study to develop their ability to think so they could rid themselves of the irrational and often destructive beliefs they now cherish. So, trying to get people to think is dangerous. Over the centuries, many have been killed in the attempt. Fortunately, when applied to programming rather than more sensitive subjects, preaching rational thought leads to polite indifference rather than violence. However, the small number of programmers who are willing to consider such a radical alternative to their current practice will find that thinking offers great benefits. Spending a few hours thinking before writing code can save days of debugging and rewriting.
— Leslie Lamport (most noted for creating LaTeX) in an interview here.

The interview starts with a lot of failed attempts to summarize the thrust of Lamport’s various papers, but eventually gets quite interesting.

Digital Archeology and Markup Options

I’m currently developing a tool for archivists which will allow them — or indeed anyone — to “publish” a repository of finding aids (XML documents containing metadata about collections of stuff, e.g. the papers of a 19th century lawyer) by, essentially, installing the tool (making some changes to config) and pressing “go”. Actually, you don’t even need to press “go”. There are a bunch of similar tools around, but most of them have the dubious virtue of storing and maintaining valuable data inside a proprietary or arbitrary database, and of being a lot more complicated to set up.

E.g. the typical workflow for one of these tools, we’ll call it “Fabio”, goes like this:

  1. Archivist creates a complicated XML file containing metadata about some collection of stuff.
  2. Archivist creates or uses an existing XSL file to transform this into data Fabio understands.
  3. Fabio loads the data, and there are almost certainly errors in it because (a) the XML had mistakes in it and/or (b) the XSL had errors in it.
  4. Archivist discovers some errors and fixes either (a) the XML (and reimports), (b) the XSL (and reimports), or (c) the data in Fabio. Probably (c), because reimporting data into Fabio is a pain, and the whole point of Fabio is that it’s supposed to be an “all in one” solution once you get your data munged into it.
  5. Later, more errors are discovered and fixed by the process listed in 4.

Now, the problem with all of this, as I see it, is that it’s completely nuts.

  • People who already know (to some extent) how to create and maintain XML and XSL files now need to learn how to use Fabio. Knowledge of Fabio is relatively useless knowledge (when you change jobs, the new place probably doesn’t use Fabio, but a completely different but equally stupid product).
  • Corrections may be made in either the raw data (XML), reusable scripts (XSL), or Fabio’s black box database (???). Later corrections can easily overwrite earlier corrections if, for example, one mistake is fixed inside Fabio and then another is fixed in the XML.
  • If Fabio stops being maintained or you just want to stop using it, all (or much) of your valuable data is in Fabio’s black box database. Even if you know how it’s structured you may lose stuff getting your data back out.
  • The XML repository needs separate version control anyway; otherwise what happens if you make a change to your XSL to fix one import, and then need to reimport another file that worked before but doesn’t work now?
  • Data may be lost during the import process and you won’t know.
  • Fabio needs to provide an API to expose the data to third parties. If it doesn’t expose a particular datum (e.g. because it was lost on import, or Fabio’s developers haven’t gotten around to it yet), you’re out of luck.
  • Fabio may have stupid flaws, e.g. provide unstable or ugly URLs — but that’s between you and your relevant committee, right?

My solution is intended to be as thin as possible and do as little as possible. In particular, my solution wraps a user interface around the raw data without changing the workflow you have for creating and maintaining that data. My solution is called metaview (for now) but I might end up naming it something that has an available .com or .net address (metaview.com is taken).

  • I don’t “import” the data. It stays where it is… in a folder somewhere. Just tell me where to find it. If you decide to stop using me tomorrow, your data will be there.
  • If you reorganize your data the URLs remain unchanged (as long as you don’t rename files and files have unique names — for now).
  • Unless you’re maintaining the UI you don’t need to understand it.
  • You fix errors in your data by fixing errors in YOUR data. If you stop using me tomorrow, your fixes are in YOUR data.
  • I present the raw data directly to users if they want it (with the UI wrapped around it) so that there’s no need to wait for an API to access some specific datum, or worry about data loss during import processes.
  • Everything except base configuration (i.e. what’s my DB password and where’s the repository) is automatic.
  • I don’t try to be a “one stop solution”. No, I don’t provide tools for cleaning up your data — use the tools you already use, or something else. No, I don’t do half-assed version control — use a full-assed version control solution. Etc. I don’t even think I need to do a particularly good job of implementing search, since Google et al. do this already. I just need to make sure I expose everything (and I mean everything) to external search engines.

This has me thinking quite a bit about markup languages and templating systems. At first, I tried to decide what kind of template system to use. The problem for me is that the number of templating systems for any given web development stack seems to be some multiple of the number of people using that stack raised to the power of the number of components in the stack. So if you’re looking at PHP (roughly two bazillion users) and MySQL (three bazillion) and Apache (four…) that’s a metric frackton of options, and knowledge of any one of them is way over on the bad side of the Knowledge Goodness Scale.

Aside: my unoriginal Knowledge Goodness Scale. The idea isn’t new, but here it is: I try to acquire “good” knowledge and avoid “bad”. The more times, contexts, and situations in which knowledge is accurate, useful, and applicable, the better it is. So knowledge about how to understand and evaluate information (such as basic logic, understanding of probability and statistics, understanding of human nature) is incredibly far over on the good end. Knowledge of how to build templates for a “content management system” that only works with PHP 5.2.7 with MySQL 5.x and Apache 2.x is closer to the bad end.

It follows that if you are going to force your users to learn something, try to make it good stuff, not bad stuff. So, let’s continue…

Archivists try to preserve knowledge and/or create new — preferably good — knowledge. They don’t produce index information or metadata about something because they want to have to do it again some day. The knowledge they’re dealing with is often very specific, but to make it good it can still be accurate and applicable across time. Developing an engine based on a bunch of technologies which themselves are unlikely to be useful across a span of time and contexts is not a good start. (Card indexes have lasted a long time. If your electricity goes out or your server crashes you can still use them today.)

So, my solution involves requiring users to change their lives as little as possible, learn as little as possible, and build on their existing good knowledge rather than acquire new bad knowledge. Instead of figuring out the foibles of Fabio, they can learn how to better create and maintain raw data.

So, what’s the approach?

  • XML is rendered using XSL — on the browser or the server. If you want the XML, click the XML link. (It looks the same on modern browsers — folks with primitive browsers will need server-side rendering.)
  • The templating system is XSL.
  • The database contains index information, but is built dynamically from the repository (as needed).
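To make that concrete, here’s a sketch of the browser-side rendering (the element names are mine, invented for illustration; real finding aids would use EAD or something similar). A raw XML file simply points at its stylesheet:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="findingaid.xsl"?>
    <findingaid>
        <title>Papers of Hiram Greel, Attorney, 1832–1871</title>
        <abstract>Correspondence, ledgers, and case files.</abstract>
    </findingaid>

And findingaid.xsl turns it into HTML in the browser, with no server-side code at all:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <!-- render the whole finding aid as a simple HTML page -->
        <xsl:template match="/findingaid">
            <html>
                <body>
                    <h1><xsl:value-of select="title"/></h1>
                    <p><xsl:value-of select="abstract"/></p>
                </body>
            </html>
        </xsl:template>
    </xsl:stylesheet>

Click the XML link and you get exactly the file above; the “user interface” is just a stylesheet wrapped around it.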

Of all the different markup languages around, XML is probably the best. It satisfies much of the original intent of HTML — to truly separate intention from presentation (order is still much too important in XML — it’s quite a struggle to reorder XML content via XSL in a nice, flexible way). It’s very widely used and supported. Many people (my target audience in particular) already know it. And it’s not limited to a particular development stack.
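To illustrate the reordering struggle: XSL templates process nodes in document order by default, so presenting children in a different order means naming each one explicitly (reusing the invented elements from the sketch above):

    <xsl:template match="/findingaid">
        <!-- override document order by selecting children explicitly -->
        <xsl:apply-templates select="abstract"/>
        <xsl:apply-templates select="title"/>
    </xsl:template>

That’s fine for a fixed layout, but it gets clumsy fast when the ordering itself needs to be flexible.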

XSL is itself written in XML, so it’s easy for people who already use XML to grok, and again it’s not limited to a particular development stack.

There’s no escaping binding oneself to a development stack for interactivity — so metaview is built using the most common possible free(ish) technologies — i.e. MySQL, PHP, JavaScript, and Apache. Knowledge of these tools is probably close to the least bad knowledge to force on prospective developers/maintainers/contributors.

Less Bad Options

I do have some misgivings about two technical dependencies.

XML has many virtues, but it sucks to write. A lot of what I have to say applies just as much to things like blogs, message boards, and content management systems, but requiring users of your message board to learn XML and XSL is … well … nuts. XML and XSL for blog entries is serious overkill. If I were making a more general version of metaview (e.g. turning it into some kind of content management system, with online editing tools) I’d probably provide alternative markup options for content creators. Markdown has many virtues that are in essence the antithesis of XML’s virtues.

Using Markdown to Clean Up XML

Markdown — even more than HTML — is all about presentation, and only accidentally discloses intention (i.e. the fact you’re making something a heading might lead one to infer that it’s important, etc.). But unlike HTML (or XML) Markdown is easy and intuitive to write (anyone who has edited ASCII files is already 75% of the way there) and the marked up text looks good as is (one of its design features). There are a ton of “similar” markup languages, but they are all either poor clones of HTML (the worst being bbcode) or just horrible to look at (e.g. Wiki syntax). Markdown also lets you insert HTML making it (almost) supremely flexible should the need arise. So, if I wanted to create an alternative method for maintaining content, Markdown seems like a nice option.
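To show what I mean by “looks good as is”, here’s a scrap of Markdown (the content is invented):

    # Papers of Hiram Greel

    The collection includes:

    * 14 boxes of **correspondence**
    * 3 ledgers and assorted case files

Even if you’ve never seen Markdown before you can read that, and you could probably write it. That’s the whole point.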

Markdown also seems like a nice way of embedding formatted text inside XML without polluting the XML hierarchy… e.g. rather than allowing users to use some toy subset of HTML to do rudimentary formatting within nodes, which makes the DTD and all your XML files vastly more complex, you could simply have <whatever type="text/markdown"> and then you can have the XSL pass out the text as <pre class="markdown"> which will look fine, but can be made pretty on the client-side by a tiny amount of JavaScript. In a sense, this lets you separate meaningful structure from structure that purely serves a presentational goal — in other words, make your XML cleaner, easier to specify, easier to write correctly, easier to read, and easier to parse.
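Here’s a sketch of the whole round trip (the match pattern is mine, and I’m using Showdown, a real JavaScript Markdown implementation, as one plausible choice of renderer). The XSL side just passes the text through:

    <xsl:template match="*[@type='text/markdown']">
        <pre class="markdown"><xsl:value-of select="."/></pre>
    </xsl:template>

And a few lines of client-side JavaScript upgrade those blocks where possible:

    // find every <pre class="markdown"> and replace it with rendered HTML
    var converter = new Showdown.converter();
    var pres = document.getElementsByTagName("pre");
    for (var i = pres.length - 1; i >= 0; i--) {   // iterate backwards: the node list is live
        if (pres[i].className === "markdown") {
            var div = document.createElement("div");
            div.innerHTML = converter.makeHtml(pres[i].textContent || pres[i].innerText);
            pres[i].parentNode.replaceChild(div, pres[i]);
        }
    }

Without JavaScript the user still sees perfectly legible preformatted Markdown, which is a rather graceful way to degrade.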

My other problem is PHP. PHP is popular, free, powerful, and it even scales. It’s quite easy to learn, and it does almost anything I need. I’m tired of the “PHP sucks” bandwagon, but as I’ve been writing more and more code in it I am really starting to wonder “why do I hate it so much?” Well, I won’t go into that topic right now — others have covered this ground in some detail (e.g. I’m sorry but PHP sucks and What I don’t like about PHP) — but there’s also the fact that it’s popular, free, powerful, and scales. Or to put it another way, PHP sucks, but it doesn’t matter. It would be nice to implement all this in Python, say, but then how many web servers are set up to serve Python-based sites easily? While it may be painful to deal with PHP configuration issues, the problem is fixable without needing root privileges on your server.

So while I think I’m stuck with PHP, I can at least (a) stick to as nice a subset of it as possible (which is hard — I’m already using two different XML libraries), and (b) write as little code as possible. Also I can architect the components to be as independent as possible so that, for example, the indexing component could be replaced with something else entirely without breaking the front end.
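As a sketch of how thin the indexing component can be (the element names, file path, and table schema are all invented for illustration):

    <?php
    // Pull index fields straight out of the raw XML. The database row is
    // disposable; it can be rebuilt from the repository at any time.
    $path = 'repository/greel-papers.xml';
    $doc = simplexml_load_file($path);
    if ($doc === false) {
        die("could not parse $path");
    }

    $db = new PDO('mysql:host=localhost;dbname=metaview', 'metaview', 'secret');
    $stmt = $db->prepare('INSERT INTO findingaids (title, abstract, path) VALUES (?, ?, ?)');
    $stmt->execute(array((string) $doc->title, (string) $doc->abstract, $path));
    ?>

If the front end only ever reads from tables built this way, replacing the indexer (or throwing the whole database away) can’t cost you any data.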

Software Abstraction Layers

I’m not doing much more than linking two interesting articles I read today. First I read this short piece on 10 Useful Techniques To Improve Your User Interface Designs (I liked this even though it was mainly typographic advice and really more about web pages than apps) and then this very interesting blog entry on Objective-J and Leaky Abstractions.

I’ve always been a bit leery of plugins and toolbox calls from within a given scripting environment. I would go to great lengths in tools as diverse as Authorware, Director, Visual Basic, HyperCard, and RealBasic to build stuff purely using the built-in functions rather than use a “simple” plugin or API call (Visual Basic and RealBasic both let you call the OS toolbox directly with a little extra work). I’ve never really understood why I’m like this, but I think it’s for much the same reason I always preferred Algebra to Calculus or Analysis. I don’t like hairy, open-ended problems. I like working within a well-defined space and treating its boundaries, even when those boundaries are porous, as being absolute.

Anyway, the second article essentially obviates the first article. The point of Cappuccino (the class library for which Objective-J was initially developed) is to abstract out annoyances like using CSS to correctly place captions inside buttons or make your text fields look right on different browsers. This is how it should be. I like working in a Walled Garden, but only if the Walled Garden is beautiful. I want my buttons to look good, but I want them to be produced automatically.

Oddly enough, this little insight helps explain my tastes in all kinds of things, from software development tools (I can’t stand Runtime Revolution because it produces ugly user interfaces) to operating systems to programming languages to fields of Mathematics.

Anyway, it’s nice to know that there’s a way to suppress OS X’s text field focus highlight using CSS (input:focus { outline: none }). I’ll bear that in mind. And it’s also nice to know that if I learn to build web apps in Cappuccino using Objective-J I won’t have to worry about that kind of stupidity ever again.