Tabs vs. Spaces

I found this article after googling “tabs vs spaces” — and I did that because I was reading an amusing (and apparently very old) treatise on How to Write Unmaintainable Code which seems mostly to be a litany of defects in C, C++, and Java as programming languages. Using tabs instead of spaces is listed as a way of writing unmaintainable code.

Seriously?

I use tabs for indenting, and have never really seen any reason not to. At my last job there was a rather stern rule to use spaces only for indentation, and most of the code was indented in two-space increments (I prefer four). I read Jamie Zawinski’s post in an effort to understand where the “anti-tab” folks were coming from.

I just care that two people editing the same file use the same interpretations, and that it’s possible to look at a file and know what interpretation of the TAB character was used, because otherwise it’s just impossible to read.

–Jamie Zawinski

Actually, I don’t really understand his position at all. Having said this, he then goes into technical minutiae about how to get emacs to automatically replace tabs with spaces when you save, or some such.

Here’s my view. I like to indent stuff fairly clearly (as I said earlier — four spaces). I like to avoid typing more than necessary, and I particularly hate having to precisely type four spaces. I even more particularly hate having to backspace over four spaces. Now, sometimes when I paste code from somewhere else into my own code it will be indented somewhat randomly (e.g. if the person was using spaces for indentation and preferred two or eight space indents). In this case, depending on how fussy I’m feeling, I’ll convert the spaces to tabs or just get it roughly right (so its base indentation level is correct) and leave it alone.

Now there’s a simple test you can apply to either side of a question. What if everyone did things your way? Well, if everyone used tabs then everyone would see code indented just as they liked, code files would be smaller, and removing or adding a level of indentation would always require just one keystroke.

There are cases where tabs may cause misalignment. For example, one post on the subject showed that if you’re breaking out a function’s parameters because it has a lot of them, and you want to line up the second parameter with the first, then using tabs may ruin your alignment (but so does almost everything else), and therefore you should use spaces. That problem can be solved by using a less idiotic indentation scheme: indent all the parameters one level further than the function. And any problems caused by inconsistent tab preferences apply just as much to spaces when variable-width fonts are used, which is quite common these days (oddly enough).

Here’s the (simplified) example:

    foo( int i,
         int j
       );

This (potentially) breaks if it’s indented with tabs (or a mixture of tabs and spaces, but no one is advocating that) and editors with different preferences are used. It breaks in lots of other situations too, which is why I’d write it this way:

    foo(
        int i,
        int j
    );

or if you’re one of those people:

    foo
    (
        int i,
        int j
    );

(I prefer the first option — but I don’t think the second option is stupid.)

The problems all seem to result from mixing spaces and tabs, which can be solved just as easily by converting leading spaces into tabs when you save as by doing the reverse, and converting to tabs keeps all the other advantages listed above.
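
Converting on save is something any decent editor can be configured to do, and the transformation itself is trivial. Here’s a minimal sketch in JavaScript; the function name and the four-space assumption are mine, purely for illustration.

    // Minimal sketch: convert leading runs of spaces to tabs, assuming a
    // four-space indent. A real editor hook would call this on save.
    function spacesToTabs(text, spacesPerTab = 4) {
        return text
            .split("\n")
            .map(line => {
                const match = line.match(/^ +/); // leading spaces only
                if (!match) return line;
                const levels = Math.floor(match[0].length / spacesPerTab);
                const leftover = match[0].length % spacesPerTab;
                return "\t".repeat(levels) + " ".repeat(leftover) + line.slice(match[0].length);
            })
            .join("\n");
    }

    // Two four-space levels become two tabs; the code is otherwise untouched.
    console.log(JSON.stringify(spacesToTabs("        return x;"))); // "\t\treturn x;"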

I really don’t care that much — in my previous job I cheerfully adhered to the spaces not tabs rule. But I still thought it was stupid.

I have a feeling that the roots of this war lie in the whole emacs, vim, pico thing. Basically tabs are a pain in the butt for users of emacs and vim so they’re convinced it’s tabs and not their idiotic text editors that are the problem.

Post Script

One possible reason for such vehement preferences that just occurred to me is that many popular version control systems aren’t smart enough to ignore whitespace, so if Jill uses tabs and Joe uses spaces a whole bunch of badness will ensue. (Again, in my previous job there was similar angst over Mac classic vs. Windows / DOS vs. Unix / Mac OS X newline standards.) But this doesn’t mean one side is right and the other is wrong, just that for a given situation it may be a good idea to pick one or the other and stick with it.

Random Thoughts on Improving Internet Security

Disclaimer

IANACSE (I am not a computer security expert.) But…

When I was in my final year of high school, I had the opportunity to study with the Australian Mathematics Olympiad Squad. I didn’t make it into the team, but a friend of mine and I got close enough to be invited to the training, and be lectured to by Paul Erdos for a week. I’d love to say that this was an inspiring experience, but unfortunately the impenetrability of his accent was exceeded only by the material he was covering. It didn’t help that, alone among the participants, my friend and I hadn’t spent years in gifted programs and/or at universities getting exposed to graduate-level math.

For me, the highlight was a field trip to a presentation at a math conference about cryptography. I only understood the outlines of what the speaker was talking about at the time, but the subject matter was the theory underlying what is now known as public key cryptography. So, attending and not understanding that presentation, and a hazily recalled education in Pure Math, are my only special claims to domain knowledge.

The Problem(s)

The most commonly used internet protocols — http, ftp, and POP/IMAP/SMTP — are hopelessly insecure.

The standard solution for http is to switch to https.

With ftp at least there’s sftp. If your ISP doesn’t support sftp, use another ISP.

Email is pretty much a lost cause because even if your connection to your email provider is secure, the email transmission is not going to be, so anyone who cares about email is encrypting their email content using PGP or something similar.

I haven’t had an email exchange with anyone, ever, that required me to use PGP, so it’s obviously not very popular. By now, PGP should be built into every mail client and operate transparently (it would make spam harder and more expensive to send, too); instead, our best option is usually webmail via https, or a proprietary solution like Microsoft Exchange or Lotus Notes (which has, laudably, been secure from the beginning — it’s a shame it sucks dead dog’s balls in pretty much every other respect).

OK, let’s ignore everything except http

I’ve recently been looking at implementing some kind of security for logging in to websites over http. The usual, simple solution for this is to switch over to https, but the vast majority of the world’s web servers are serving http, and this includes all kinds of services with logins and passwords that people don’t really think too carefully about. How likely is it that some username/password combination a given person uses for an insecure website (e.g. a blog, forum, or whatever) is also used for a secure website somewhere else? Even if https is secure (which is open to doubt), it’s undermined by the insecurity of http.

Here’s the actual form code for Digg.com’s login form (modulo whitespace):

<form method="post" action="/login/prepare/digg">
<div class="form-row">
<label class="dialog-label" for="login-username">Digg Username</label>
<input class="login-digg-username text" name="username" value="" type="text">
</div>
<div class="form-row">
<label class="dialog-label" for="login-password">Digg Password</label>
<input name="password" class="login-digg-password text" type="password"> <a class="dialog-link forgot-link" href="/login">Lost username or password?</a>
</div>
<div>
<input name="persistent" checked="checked" type="checkbox"> <label class="inline"><strong>Keep me logged in on this computer</strong></label><br>
<input value="Login" type="submit">
</div>
</form>

I’m not trying to single out Digg — it’s just an example of a large scale, popular site that requires user logins and offers zero security. Facebook — a very high profile, popular site — is just as stupidly insecure (the relevant code is a bit harder to read). Why isn’t this a scandal?

It seems to me that it’s criminally negligent of the folks running these sites, and the people developing the most popular open source website software — phpbb, wordpress, drupal, etc. — not to have addressed this, when the solutions are so very straightforward and have been publicly and freely available for so long. Apple got into quite a bit of hot water — and rightly so — for (allegedly) not sufficiently securing MobileMe chatter between web apps and servers, but many of us spend a lot of time on all kinds of websites requiring passwords that make no real attempt at keeping our information safe.

  1. By default, during the setup of any of these programs, the admin should be forced to provide an encryption key, or — better — set parameters for automatically generating such a key for the website. Ideally the key would be refreshed periodically (or even created on-the-fly if the horsepower is available). Some security is better than no security at all, so even if the default key is “only” 64-bit this would be very helpful.
  2. The login page (and any other page where the user enters sensitive information) should simply incorporate JavaScript that uses the public key supplied by the server to encrypt the sensitive data before posting it back to the website (see the sketch after this list). Encryption in JavaScript, even on fairly slow machines and browsers, is close to instantaneous, and could be done in the background. If JavaScript is disabled, the code can warn the user and fall back to the usual (insecure) method.
  3. The web server then decrypts your private information using its private key.
  4. All such programs should make it easy for users to have their password sent to them encrypted via a supplied public key. (I.e. tell the user where to go to get crypto software to make their own key, then allow the user to provide a public key (perhaps even store it in their profile) and use it to encrypt password reminders, etc., when necessary.) The same techniques should be used to handle “secret question” transactions and the like (obviously).
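
Here’s a rough sketch of what step 2 might look like in the browser. Everything in it is hypothetical: rsaEncrypt stands in for whatever client-side crypto library the site bundles, SITE_PUBLIC_KEY for the key the server embeds in the page, and the form id is made up for the example.

    // Hypothetical sketch of step 2: encrypt the password field with the
    // site's public key before the form is posted. "rsaEncrypt" and
    // "SITE_PUBLIC_KEY" stand in for a real client-side crypto library and
    // the key the server delivers with the page.
    document.getElementById("login-form").addEventListener("submit", function () {
        var passwordField = this.elements["password"];
        if (typeof rsaEncrypt !== "function") {
            // Crypto library missing: warn and fall back to the usual
            // (insecure) post.
            console.warn("Falling back to a plaintext login");
            return;
        }
        // Replace the cleartext password with its encrypted form; only the
        // server's private key can recover it.
        passwordField.value = rsaEncrypt(SITE_PUBLIC_KEY, passwordField.value);
    });

The same trick extends to any other sensitive field: encrypt it in place and post the form as usual.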

Correction: as Andrew, my loyal reader, points out — servers shouldn’t store passwords at all. The server should store only a hash of the password, and for each login attempt it should ideally provide a one-time “salt” (in effect, a challenge) which the client adds to the hashed password and encrypts before sending. A hacker then probably can’t “replay” the encrypted/hashed username/password combination to break in, since they won’t usually be able to get into the session which issued that particular salt. Even if the server is totally compromised, no cleartext passwords are stored in the system. It follows that users can never have their old passwords sent to them; they can only be given an opportunity to reset their passwords. If a web service offers to send your password to you, avoid it if you can and treat it as utterly insecure otherwise.

The problem is that, in the end, the password restoration process is only as secure as email, so while the server shouldn’t store passwords and should allow resets instead of sending old passwords, ultimately you’ll need some mechanism to restore access, and if it goes over email we’re back to hopeless insecurity.
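
To make the correction concrete, here’s a small sketch of the challenge/response idea in JavaScript (Node-style, using the built-in crypto module). The session object, the getStoredHash lookup, and the exact hashing scheme are assumptions for illustration; the point is only that the server keeps a hash rather than the password, hands out a one-time salt per attempt, and compares hashes.

    const crypto = require("crypto");

    const sha256 = (s) => crypto.createHash("sha256").update(s).digest("hex");

    // When the login form is requested, issue a one-time salt (a challenge)
    // and remember it for this session only.
    function issueChallenge(session) {
        session.loginSalt = crypto.randomBytes(16).toString("hex");
        return session.loginSalt;
    }

    // The client sends sha256(salt + sha256(password)). The server never sees
    // the password, and a captured response can't be replayed because the
    // salt is good for a single attempt.
    function verifyLogin(session, username, clientResponse, getStoredHash) {
        const storedHash = getStoredHash(username); // hypothetical lookup: returns sha256(password)
        const expected = sha256(session.loginSalt + storedHash);
        delete session.loginSalt; // one attempt per challenge
        return clientResponse === expected;
    }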

Steps one to three, of course, are essentially what https does (but applied only to sensitive data, rather than the whole web page), yet this approach has a number of added benefits. It allows reasonable levels of security on commodity http servers. And it makes sites that do use https even more secure, since https is currently a single point of failure. Here are random hackers discussing methods for cracking or spoofing https. (Do you think your local Savings and Loan or Credit Union paid to have any additional security beyond https for its online banking software?) And it will give criminals headaches in trying to deal with a bizarre cornucopia of — possibly layered — security protocols. (It’s better to have ten different and not entirely reliable layers of security than one that you’re convinced is incredibly good — even if it is incredibly good.)

If nothing else, it’s a competitive advantage. After all, no security is impregnable — the trick is to be secure enough that would-be hackers pick an easier target.

Many have been killed in the attempt…

The fundamental idea behind verification is that one should think about what a program is supposed to do before writing it. Thinking is a difficult process that requires a lot of effort. Write a book based on a selection of distorted anecdotes showing that instincts are superior to rational judgment and you get a best seller. Imagine how popular a book would be that urged people to engage in difficult study to develop their ability to think so they could rid themselves of the irrational and often destructive beliefs they now cherish. So, trying to get people to think is dangerous. Over the centuries, many have been killed in the attempt. Fortunately, when applied to programming rather than more sensitive subjects, preaching rational thought leads to polite indifference rather than violence. However, the small number of programmers who are willing to consider such a radical alternative to their current practice will find that thinking offers great benefits. Spending a few hours thinking before writing code can save days of debugging and rewriting.
— Leslie Lamport (most noted for creating LaTeX) in an interview here.

The interview starts with a lot of failed attempts to summarize the thrust of Lamport’s various papers, but eventually gets quite interesting.

Digital Archeology and Markup Options

I’m currently developing a tool for archivists which will allow them — or indeed anyone — to “publish” a repository of finding aids (xml documents containing metadata about collections of stuff, e.g. the papers of a 19th century lawyer) by, essentially, installing the tool (making some changes to config) and pressing “go”. Actually you don’t need to press go. There are a bunch of similar tools around, but most of them have the dubious virtue of storing and maintaining valuable data inside a proprietary or arbitrary database, and being a lot more complicated to set up.

E.g. the typical workflow for one of these tools, we’ll call it “Fabio”, goes like this:

  1. Archivist creates a complicated XML file containing metadata about some collection of stuff.
  2. Archivist creates or uses an existing XSL file to transform this into data Fabio understands.
  3. Fabio loads the data, and there are almost certainly errors in it because (a) the XML had mistakes in it and/or (b) the XSL had errors in it.
  4. Archivist discovers some errors and fixes either (a) the XML (and reimports), (b) the XSL (and reimports), or (c) the data in Fabio. Probably (c), because reimporting data into Fabio is a pain, and the whole point of Fabio is that it’s supposed to be an “all in one” solution once you get your data munged into it.
  5. Later, more errors are discovered and fixed by the process listed in 4.

Now, the problem with all of this, as I see it, is that it’s completely nuts.

  • People who already know (to some extent) how to create and maintain XML and XSL files now need to learn how to use Fabio. Knowledge of Fabio is relatively useless knowledge (when you change jobs, the new place probably doesn’t use Fabio, but a completely different but equally stupid product).
  • Corrections may be made in either the raw data (XML), reusable scripts (XSL), or Fabio’s black box database (???). Later corrections can easily overwrite earlier corrections if, for example, one mistake is fixed inside Fabio and then another is fixed in the XML.
  • If Fabio stops being maintained or you just want to stop using it, all (or much) of your valuable data is in Fabio’s black box database. Even if you know how it’s structured you may lose stuff getting your data back out.
  • The XML repository needs separate version control anyway; otherwise, what happens if you make a change to your XSL to fix one import and then need to reimport another file that worked before but doesn’t work now?
  • Data may be lost during the import process and you won’t know.
  • Fabio needs to provide an API to expose the data to third parties. If it doesn’t expose a particular datum (e.g. because it was lost on import, or Fabio’s developers haven’t gotten around to it yet), you’re out of luck.
  • Fabio may have stupid flaws, e.g. provide unstable or ugly URLs — but that’s between you and your relevant committee, right?

My solution is intended to be as thin as possible and do as little as possible. In particular, my solution wraps a user interface around the raw data without changing the workflow you have for creating and maintaining that data. My solution is called metaview (for now) but I might end up naming it something that has an available .com or .net address (metaview.com is taken).

  • I don’t “import” the data. It stays where it is… in a folder somewhere. Just tell me where to find it. If you decide to stop using me tomorrow, your data will be there.
  • If you reorganize your data the urls remain unchanged (as long as you don’t rename files and files have unique names — for now)
  • Unless you’re maintaining the UI you don’t need to understand it.
  • You fix errors in your data by fixing errors in YOUR data. If you stop using me tomorrow, your fixes are in YOUR data.
  • I present the raw data directly to users if they want it (with the UI wrapped around it) so that there’s no need to wait for an API to access some specific datum, or worry about data loss during import processes.
  • Everything except base configuration (i.e. what’s my DB password and where’s the repository) is automatic.
  • I don’t try to be a “one stop solution”. No I don’t provide tools for cleaning up your data — use the tools you already use, or something else. No I don’t do half-assed version control — use a full-assed version control solution. Etc. I don’t even think I need to do a particularly good job of implementing search — since Google et al do this already. I just need to make sure I expose everything (and I mean everything) to external search engines.

This has me thinking quite a bit about markup languages and templating systems. At first, I tried to decide what kind of template system to use. The problem for me is that, for any given web development stack, there seem to be a number of templating systems proportional to the number of people using that stack raised to the power of the number of components in the stack. So if you’re looking at PHP (roughly two bazillion users) and MySQL (three bazillion) and Apache (four…) that’s a metric frackton of options, and knowledge of any one of those is way over on the bad side of the Knowledge Goodness Scale.

Aside: my unoriginal Knowledge Goodness Scale. This isn’t original, but I try to acquire “good” knowledge and try to avoid “bad”. The more times, contexts, and situations in which knowledge is accurate, useful, and applicable, the better it is. So knowledge about how to understand and evaluate information (such as basic logic, understanding of probability and statistics, understanding of human nature) is incredibly far over on the good end. Knowledge of how to build templates for a “content management system” that only works with PHP 5.2.7 with MySQL 5.x and Apache 2.x is closer to the bad end.

It follows that if you are going to force your users to learn something, try to make it good stuff, not bad stuff. So, let’s continue…

Archivists try to preserve knowledge and/or create new — preferably good — knowledge. They don’t produce index information or metadata about something because they want to have to do it again some day. The knowledge they’re dealing with is often very specific, but to make it good it can still be accurate and applicable across time. Developing an engine based on a bunch of technologies which themselves are unlikely to be useful across a span of time and contexts is not a good start. (Card indexes have lasted a long time. If your electricity goes out or your server crashes you can still use them today.)

So, my solution involves requiring users to change their lives as little as possible, learn as little as possible, and build on their existing good knowledge rather than acquire new bad knowledge. Instead of figuring out the foibles of Fabio, they can learn how to better create and maintain raw data.

So, what’s the approach?

  • XML is rendered using XSL — on the browser or the server (see the sketch after this list). If you want the XML, click the XML link. (It looks the same on modern browsers — folks with primitive browsers will need server-side rendering.)
  • The templating system is XSL.
  • The database contains index information, but is built dynamically from the repository (as needed).
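
Client-side rendering really is trivial: browsers will apply a stylesheet referenced by an xml-stylesheet processing instruction automatically, or you can do it by hand with the standard XSLTProcessor API, roughly as sketched below. The file names are placeholders.

    // Sketch of client-side XSL rendering using standard browser APIs.
    // "finding-aid.xml" and "finding-aid.xsl" are placeholder names for a
    // document in the repository and the stylesheet that renders it.
    async function renderFindingAid(xmlUrl, xslUrl, target) {
        const load = async (url) =>
            new DOMParser().parseFromString(await (await fetch(url)).text(), "application/xml");

        const [xml, xsl] = await Promise.all([load(xmlUrl), load(xslUrl)]);

        const processor = new XSLTProcessor();
        processor.importStylesheet(xsl);

        target.innerHTML = "";
        target.appendChild(processor.transformToFragment(xml, document));
    }

    renderFindingAid("finding-aid.xml", "finding-aid.xsl", document.body);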

Of all the different markup languages around, XML is probably the best. It satisfies much of the original intent of HTML — to truly separate intention from presentation (order is still much too important in XML — it’s quite a struggle to reorder XML content via XSL in a nice, flexible way). It’s very widely used and supported. Many people (my target audience in particular) already know it. And it’s not limited to a particular development stack.

XSL is itself XML, so it’s easy for people who already use XML to grok, and again it’s not limited to a particular development stack.

There’s no escaping binding oneself to a development stack for interactivity — so metaview is built using the most common possible free(ish) technologies — i.e. MySQL, PHP, JavaScript, and Apache. Knowledge of these tools is probably close to the least bad knowledge to force on prospective developers/maintainers/contributors.

Less Bad Options

I do have some misgivings about two technical dependencies.

XML has many virtues, but it sucks to write. A lot of what I have to say applies just as much to things like blogs, message boards, and content management systems, but requiring users of your message board to learn XML and XSL is … well … nuts. XML and XSL for blog entries is serious overkill. If I were making a more general version of metaview (e.g. turning it into some kind of content management system, with online editing tools) I’d probably provide alternative markup options for content creators. Markdown has many virtues that are in essence the antithesis of XML’s virtues.

Using Markdown to Clean Up XML

Markdown — even more than HTML — is all about presentation, and only accidentally discloses intention (i.e. the fact you’re making something a heading might lead one to infer that it’s important, etc.). But unlike HTML (or XML) Markdown is easy and intuitive to write (anyone who has edited ASCII files is already 75% of the way there) and the marked up text looks good as is (one of its design features). There are a ton of “similar” markup languages, but they are all either poor clones of HTML (the worst being bbcode) or just horrible to look at (e.g. Wiki syntax). Markdown also lets you insert HTML making it (almost) supremely flexible should the need arise. So, if I wanted to create an alternative method for maintaining content, Markdown seems like a nice option.

Markdown also seems like a nice way of embedding formatted text inside XML without polluting the XML hierarchy… e.g. rather than allowing users to use some toy subset of HTML to do rudimentary formatting within nodes, which makes the DTD and all your XML files vastly more complex, you could simply have <whatever type="text/markdown"> and then have the XSL pass the text through as <pre class="markdown">, which will look fine as is, but can be made pretty on the client side by a tiny amount of JavaScript. In a sense, this lets you separate meaningful structure from structure that purely serves a presentational goal — in other words, make your XML cleaner, easier to specify, easier to write correctly, easier to read, and easier to parse.
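
That “tiny amount of JavaScript” can be genuinely tiny. Here’s a sketch that assumes a client-side Markdown library has been loaded (Showdown, in this example) and that the XSL emits <pre class="markdown"> elements as described above; without JavaScript the plain Markdown still reads fine.

    // Sketch: progressively enhance <pre class="markdown"> blocks emitted by
    // the XSL into formatted HTML. Assumes the Showdown library is loaded.
    document.addEventListener("DOMContentLoaded", function () {
        var converter = new showdown.Converter();
        document.querySelectorAll("pre.markdown").forEach(function (pre) {
            var rendered = document.createElement("div");
            rendered.className = "markdown-rendered";
            rendered.innerHTML = converter.makeHtml(pre.textContent);
            pre.replaceWith(rendered);
        });
    });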

My other problem is PHP. PHP is popular, free, powerful, and it even scales. It’s quite easy to learn, and it does almost anything I need. I’m tired of the “PHP sucks” bandwagon, but as I’ve been writing more and more code in it I am really starting to wonder “why do I hate it so much?” Well, I won’t go into that topic right now — others have covered this ground in some detail — (e.g. I’m sorry but PHP sucks and What I don’t like about PHP) — but there’s also the fact that it’s popular, free, powerful, and scales. Or to put it another way, PHP sucks, but it doesn’t matter. It would be nice to implement all this in Python, say, but then how many web servers are set up to serve Python-based sites easily? While it may be painful to deal with PHP configuration issues, the problem is fixable without needing root privileges on your server.

So while I think I’m stuck with PHP, I can at least (a) stick to as nice a subset of it as possible (which is hard — I’m already using two different XML libraries), and (b) write as little code as possible. Also I can architect the components to be as independent as possible so that, for example, the indexing component could be replaced with something else entirely without breaking the front end.

Mac Text Editors

One of the most baffling gaps in Mac third-party software is the absence of a programming-oriented text editor with decent autocomplete. Now, obviously there’s XCode, vim, and emacs (which, being free, are all something of a problem for would-be competitors), and a bunch of cross-platform tools like Eclipse and Komodo, but XCode is really oriented towards Cocoa development and is kind of like using Microsoft Word to edit a shopping list, while the others are all non-Mac-like (and kind of suck anyway).

The Mac heavyweight text-editing champs — BBEdit and TextMate — have many great features, but don’t do autocompletion very well or at all, respectively. (There’s some kind of third-party hack add-on for TextMate, but you need to compile it yourself and few seem to be able to get it working.) A lot of Unity developers were using PCs or virtual Windows boxes to run Visual Studio just because it has an autocompleting text editor that doesn’t suck — that’s how bad it was. (Now that Unity runs on Windows their lives have gotten much easier, which is a pretty sad statement.)

Before you scream at me about how, say, Subethaedit has some kind of autocomplete support, or mention Coda (which has OK autocomplete support) or Espresso (which may one day have halfway decent extensible autocomplete support but right now is a work-in-progress), go and try Visual Studio or, if you’re allergic to Windows, how about Realbasic. Realbasic’s built-in text editor has autocomplete that doesn’t suck: it recognizes context, knows about intrinsic language constructs as well as library functions and the stuff you’ve declared, and doesn’t constantly complete things incorrectly so you have to fix them, or fail to offer “function” when I type “fun” in a .js file.

I will say this: TextMate’s macro facility is truly awesome (you can type swi->TAB and get a switch statement template, and then tab through the bits you need to change, almost like a context-sensitive in-place dialog box). If this were paired with a proper autocomplete system (like RealBasic’s) it would be the best I’ve seen and then some — maybe 2.0 will have this — but right now the situation on the Mac remains pretty dismal.