Crawling the Web like it’s 1999

(In which I compare serving web apps to running a burger chain.)

This is a long post. If you just want to know how to very simply get your dynamically-rendered pages indexed properly by Google (et al), just skip to the tl;dr.

I understand the broad strokes of SEO (“search engine optimization”, i.e. the dark art of getting your stuff to show up in search results), and most of it just turns my stomach, so I prefer not to think about such things. Instead I hope and rely on the good folk who do think about them (and I generally ascribe good intentions to the Google Search team and its competitors) to protect us from the bad folk.

But if you skip the dark practices of SEO—like crafting fake landing pages that look hyper-local but are in fact just honey-traps for generic content (see above), or generating enormous link farms, or creating links for boilerplate content before the real stuff is ready because you know that link age matters even if the content is basically junk (see virtually all tech review sites)—the basics are actually pretty straightforward.

  1. Have good content with headings, titles, and so on that are sensibly structured
  2. Use consistent—and ideally meaningful—urls
  3. Have a sitemap (because it turns out crawlers are dumb)
  4. Generate <meta> tags and canonical <link> tags that are consistent, up-to-date and accurate
  5. Oh yeah, and be a static website (or look like one)
  6. Then… try to get people to link to your content for the right reasons.
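Since item 3 is the one that turns out to matter most later in this story, here is a minimal sketch of generating a sitemap in the sitemaps.org 0.9 format. The SitemapEntry shape, the function names, and the example.com origin are all made up for illustration; this is not our actual sitemap code.

```typescript
// Hypothetical shape for a page we want crawlers to find.
interface SitemapEntry {
  path: string;        // site-relative path, e.g. "/blog/my-post"
  lastModified?: Date; // optional; emitted as <lastmod> when present
}

// Escape the five characters that are special in XML.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build a sitemap document: one <url> element per entry.
function buildSitemap(origin: string, entries: SitemapEntry[]): string {
  const urls = entries.map((entry) => {
    const loc = `<loc>${escapeXml(origin + entry.path)}</loc>`;
    // <lastmod> uses the date-only form YYYY-MM-DD (in UTC here).
    const lastmod = entry.lastModified
      ? `<lastmod>${entry.lastModified.toISOString().slice(0, 10)}</lastmod>`
      : "";
    return `  <url>${loc}${lastmod}</url>`;
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...urls,
    "</urlset>",
  ].join("\n");
}
```

Something like buildSitemap("https://example.com", [{ path: "/" }]) would then be served at /sitemap.xml, either as a static file or from a function.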

It turns out that Google and Bing crawl the web in a very simple-minded way.

The reason Google is Google, and Yahoo was lately sold by Verizon to something called Apollo Global Management, is that to get a url indexed by Yahoo you filled in an application form which listed your urls and how you wanted them indexed and then waited for someone to check it. In my case this was “never”; at best it was “eventually”. The index was only as good as the talent and intentions of the humans who filled in and processed the forms (a mixed bag) and Yahoo’s search code (terrible).

Yahoo is an amazing example of how being first mover can overcome utter incompetence, at least for a while and a few tens of billions of dollars. But hey, Douglas Crockford worked there and wrote JavaScript: The Good Parts and came up with JSON.

Meanwhile, Yahoo’s competitors “crawled” the web by checking domains for index pages and crawling them, and then Google wrote by far the best search algorithm for what it found. Also, unlike Yahoo, Lycos, AltaVista, or Excite, “Google” makes a good verb.


Google and Bing “support javascript” in the sense that they allow a little bit of it to run before they process the content of the page—which is basically the only nod they’ve made to the fact that there are virtually no static websites any more and pretty much haven’t been since the 90s. At least until nodejs gained traction the bones of most web pages were server-side rendered and merely “made interactive” with Javascript.

As a result of this, there’s all kinds of hocus-pocus around how to “look like one” (i.e. a static website) and this falls into three basic camps… maybe it would be better described as two camps, one of which has an eviction notice nailed to it, and a forward base camp whose denizens died of exposure, where this journal entry was found…

  • Server-side rendering. When a page is requested, pre-render the HTML on the server-side. This is how the web “used to work”, and it’s how sites written using PHP lightly dusted with Javascript work now. jQuery is ideally suited for this. If you build a site with WordPress, it’s going to do a pretty good job of all this crap for you.
  • “Dynamic rendering”. Not what you think from the title! In a nutshell, pre-render your pages and serve the pre-rendered pages to web-crawlers, but run your site as normal otherwise. This approach was first suggested by Google and is now deprecated.
  • (Re)hydration. This is what Google is recommending now. Um, I think. The article they link is borderline incomprehensible. It’s not so much a “how-to” guide as a meandering think-piece on various approaches that might kind of work perhaps. “Streaming server-side rendering and progressive rehydration”. “Trisomorphic rendering”.

Progressive rehydration is also worth considering, and something that React has landed. With this approach, individual pieces of a server-rendered application are “booted up” over time, rather than the current common approach of initializing the entire application at once. This can help reduce the amount of JavaScript required to make pages interactive, since client-side upgrading of low priority parts of the page can be deferred to prevent blocking the main thread, allowing user interactions to occur sooner after the user initiates them.

Rendering on the Web

What. The. Actual. Fuck?

Server-Side Rendering is Dumb, But it Works

Server-side rendering is something I keep being told is the thing to do by people whose explanations for why it’s the best thing to do are laughably stupid. Sadly, this is a case where the commonly-held conclusion “SSR is a Good Idea, do it” is actually correct, but the underlying reason “pages load faster, it gives a better user experience” is trash. Unfortunately, in the past I tended to dismiss the conclusion because the “reasoning” behind it was trash.

The sad fact is, Google and Bing crawl the web like it’s 1999 and SSR deals with this by giving them faux-static web pages and then all the code and data needed to render them as well, typically actually rendering them a second time. (But hey, if I’m reading this right, React can now just do some of that at a time, so it doesn’t completely block user interaction. Woohoo! Weird, I thought they had previously solved this exact problem the exact same way before…)

Simple example, suppose you have a server-side rendered ToDo list. Your app comprises code that fetches the data, renders the list, handles user interactions, sends updates to the server, and updates the list.

In the SSR world:

  1. The server that is asked for “index.html” (or whatever) runs ToDoApp() (specifically, a version compiled to run in nodejs), “hydrated” with await get /todolist/123abc, inside a function that renders the page boilerplate, and sends that text/html payload to the client. This isn’t really cacheable because the data is volatile. It certainly needs to be generated once per user. It’s also bigger than a static index.html with a script tag pointing to cached javascript.
  2. The client gets the HTML and displays it. The app appears to have loaded and crawlers are happy because they can index what they now see. Meanwhile the browser loads the code. Owing to caching and other magic the browser has probably already started loading the code and maybe finished or even had it in cache, so the code now runs and maybe asks for the freshest data. (Maybe it can do some magic and use data inlined in the HTML if it’s fresh enough, but in practice not so much.)
  3. ToDoApp()—this version compiled to run in a browser—now runs and re-renders everything (with optimizations to avoid actually re-rendering things that don’t need re-rendering… maybe… but also so it can do things like attach bindings to the DOM elements it didn’t actually render so that it can handle keyboard input, mouse clicks, and so on).
  4. The app is now actually ready to interact with the user. Also, owing to imperfections in the universe, maybe the appearance changes subtly as we discover that the server-side rendering wasn’t quite perfectly in tune with the client-side.
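To make the duplication in those four steps concrete, here is a stripped-down sketch of the SSR flow as plain string-building functions. Everything here (ToDoItem, renderToDoList, the __TODO_DATA__ global, /todo-app.js) is a hypothetical stand-in; a real framework does far more, including reconciling against the existing DOM rather than just producing a string.

```typescript
interface ToDoItem {
  id: string;
  text: string;
  done: boolean;
}

function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// The "template". It ships in both the server build and the browser build.
function renderToDoList(items: ToDoItem[]): string {
  const rows = items.map((item) => {
    const cls = item.done ? ' class="done"' : "";
    return `<li data-id="${escapeHtml(item.id)}"${cls}>${escapeHtml(item.text)}</li>`;
  });
  return `<ul id="todo">${rows.join("")}</ul>`;
}

// Step 1 (server): send the rendered list AND the raw data AND a pointer to
// the code that will render it all over again.
function renderPageOnServer(items: ToDoItem[]): string {
  // Escape "<" so the data cannot close the <script> element early.
  const inlinedData = JSON.stringify(items).replace(/</g, "\\u003c");
  return [
    "<!doctype html>",
    "<html><body>",
    renderToDoList(items),
    `<script>window.__TODO_DATA__ = ${inlinedData}</script>`,
    '<script src="/todo-app.js"></script>',
    "</body></html>",
  ].join("\n");
}

// Step 3 (client): the browser build renders the same markup a second time
// before it can attach event handlers.
function rerenderOnClient(inlinedData: ToDoItem[]): string {
  return renderToDoList(inlinedData);
}
```

Note that every item’s text appears twice in the payload from renderPageOnServer: once as markup and once as data.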

In step 1, the server has to do more work. In a “native app” none of this happens. The app is “just code” on the client and just works. In a (in my opinion) sensibly engineered app, this is just cached data living on some anonymous cache on the edge, and your data center only sees these requests once in a blue moon when the cache is invalidated. In SSR you’re actually doing a bunch of work the client will have to do anyway AND you’re sending a “rendered” UI (text not graphics, but still) which is significantly bigger and 100% not cacheable which you compute every time and sending it to the client.

In step 2, the client gets the UI “pre-rendered” and can display it more quickly than it could render the data using the code. But, it would have received that code and data more quickly in most cases and used less bandwidth. This is a minor difference at the margins. But remember, SSR means that the page and app could be found and thus you have a user. This is why SSR has been “the correct answer” for stupid reasons for so long.

And, the template is sent twice: once hydrated with data and the second time embedded in ToDoApp() so that it knows how to render it again. And the data is sent twice, once in the HTML and again either inline or via a later request. And there are two versions of ToDoApp, one that runs in nodejs and the other that runs in the browser. The code is necessarily more complex and has to do things like reconcile the DOM it has been passed with the one it knows how to create or, more likely, lock user interaction until the code is available and just blow away the facade later (this is in practice what usually happens). The server did more work and sent more data. It was less cacheable. The client received more data and did more work. Yes, the user possibly saw the pre-rendered page a little bit quicker, but it wasn’t actually working yet.

So, in a nutshell—or maybe a recycled cardboard clamshell:

If this were a burger chain, in order to give you the burger “quicker”, they make the burger out of fiberglass at the distribution center and ship it to you along with the recipe. But they know you really want a fresh burger, so you can’t and don’t eat this burger. It just looks like a burger! Instead, once you have the recipe you can ask for the ingredients, or perhaps they also ship you the ingredients along with the fiberglass burger, and the recipe comes with instructions for the chef to snatch the burger out of your hands and replace it with an actual edible burger as soon as it’s ready. (In this analogy, you have a chef and a kitchen with you: it’s the browser.)

Also, at the burger company HQ, the burger recipe has to be “compiled” into two different versions, one that produces fiberglass burgers and another that produces actual burgers. It’s a lot easier to test the fiberglass burgers so most of the testing concentrates on those. They’re pretty sure if the fiberglass burger looks good the actual burger is fine. Also, if you have an old recipe but think it’s the latest and try it on the ingredients you receive, bad things happen.

But a web app that is actually a non-interactive facade is less stark than an inedible fiberglass burger. So users don’t mind it as much as they would breaking their teeth on fiberglass.

At this point, the client-side rendered app is done, the customer has a fresh, edible burger (because, in this analogy, the chef can prepare the burger really quick, in fact faster than the distribution center can make the fake one out of fiberglass, because you have a personal chef while the folks at the distribution center have to make custom fiberglass burgers for many, many customers). The UI is visible and interactive.

In the SSR world there’s a thunk. A pretend UI is visible and rendered, but the code isn’t bound to it yet. So the app pretty much locks up and is non-interactive while the code catches up, re-renders the UI in the process of figuring out which element needs to be bound to what event handler, and so on, and by the end of Step 4 you can actually use the app. Meanwhile, all the CPU savings (and more) from getting a pre-rendered UI are wasted and you wasted bandwidth to get here because (a) you sent the data twice, (b) most of it wasn’t cache-friendly, (c) two different versions of the code—both of which were more complex because of all this—had to run to completion, and (d) everything involved is bigger and more complex.

This isn’t theoretical, by the way, e.g. every time I use the HBO—sorry “max”—or Netflix apps on my AppleTV I cringe as the UI renders fairly quickly (but slower than you’d think if there were sensible caching) and then is locked up for a mystery pause, and then unlocks with a slight “judder” because the re-rendered UI doesn’t have the exact same animation state as the pre-rendered UI.

A lot of the apps we’re talking about are a lot more complicated than “browse a bunch of rectangles and click on one” (while insistently playing previews you’re not interested in, on a loop).

I don’t mind the aesthetic twitches—it all still looks pretty slick—but it galls me that I often end up opening the wrong show because the UI isn’t handling events for some random number of seconds after first rendering.

You can eliminate some of this craziness with a LOT of tooling. E.g. you might not fetch fresh data until the user changes something (eliminating the need to send over data that’s already in the DOM), but once you do need to fetch, you probably need to over-fetch and while conceptually this might seem simple-ish, in practice GraphQL (which is designed around this) relies on tight-binding between client and server because the server needs to know exactly what data is needed to render exactly what client state.

OK it’s dumb and inefficient—but it works

Without SSR your customer won’t ever find your app in the first place. That’s why SSR is the “right conclusion” even though it’s technically worse for the users, the programmers, and arguably even Google and Bing even though they’re really the only ones “benefiting” from all this.

Back in 2010, when many mobile browsers were running on single core CPUs, this may have made some sense. But for the last five years, the single core performance of phones has rivaled that of desktops and servers, and almost no-one has a single or even dual core device. And 100% of that CPU is yours, while the server is splitting its CPU among thousands of users. (If you care about carbon footprint, it’s also worse for the environment. CPUs get disproportionately less efficient under load, so one busy CPU that has to do 10 things will use more power than 10 otherwise idle CPUs each doing one of those things for itself.)

All this simply allows crawlers to continue to crawl the web like every page is either static content in a directory or has been engineered to behave exactly as though it were.

Meanwhile, your code is more complex, your infrastructure is more complex, your app is bigger, more bandwidth is being wasted, more data is being consumed, the app is slower to respond…

“Dynamic Rendering” is Dumb, but it works… for now?

So, it turns out Google pretty much relies on your sitemap to find things and all that stuff about following links is for computing page-rank and so on. You can have perfectly crawlable content and Google simply won’t index it if it’s not in a sitemap.

So, the dynamic rendering approach simply says Google will go look someplace else for a given page when crawling if you give it a secret handshake. (And now it says some time in the future it will stop doing the handshake—but then again document.execCommand is deprecated, right?) So all you need to do is, when you generate your sitemap, generate all your pages statically and hand those over when asked by a crawler. (There’s no chance anyone will abuse the hell out of this, right?)

The difference between this and SSR is that you can give actual users dynamic pages that send template, code, and data to the client, save cycles on the server, bandwidth everywhere, and are just better. All you need to do is render all your content as though it’s static and hand that over on demand. You can either do this using SSR to handle crawler requests and turn it off for users (which is tricky but not insurmountable) or, for smallish sites, just batch render the pages and stick them somewhere.
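The “secret handshake” boils down to looking at the User-Agent header. Here is a rough sketch of that decision, assuming the pages have already been batch rendered into a lookup table. The pattern is a deliberately incomplete, illustrative list of crawler tokens, and PageSources and chooseResponse are names I made up.

```typescript
// Illustrative (not exhaustive) user-agent tokens for well-known crawlers.
const CRAWLER_PATTERN = /googlebot|bingbot|duckduckbot|baiduspider|yandexbot/i;

function isCrawler(userAgent: string | undefined): boolean {
  return userAgent !== undefined && CRAWLER_PATTERN.test(userAgent);
}

interface PageSources {
  // path -> static HTML snapshot, batch rendered ahead of time (hypothetical)
  prerendered: Record<string, string>;
  // the normal client-rendered stub that real users get
  appShell: string;
}

// Crawlers get the snapshot when one exists; everyone else gets the app.
function chooseResponse(
  path: string,
  userAgent: string | undefined,
  sources: PageSources
): string {
  const hasSnapshot = Object.prototype.hasOwnProperty.call(sources.prerendered, path);
  if (isCrawler(userAgent) && hasSnapshot) {
    return sources.prerendered[path];
  }
  return sources.appShell;
}
```

It is also easy to see from this how the approach gets abused: nothing forces the snapshot to resemble what users actually receive.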

In Burger Chain terms, you just have a warehouse full of fiberglass burgers. Customers just get the recipe and ingredients, but if someone shows you their frequent crawler membership card, they get a fiberglass burger and—weirdly—they were OK with that, until they realized some burger chains were handing them beautiful fiberglass burgers but giving other customers dog vomit and decaying seaweed, which is bad because their entire business model is based on recommending burger chains to people.

Hydration, Rehydration, Partial Rehydration, Trisomorphic Rendering…

I can’t even.

You lost me at “we can’t load our code asynchronously because 2010”.

But OK, in this case, if you order a meal, they give you a fiberglass burger, a fiberglass coke, and fiberglass fries. They then split up the recipes and ingredients so you can make them in the order you prefer or, maybe, the order they think you prefer.

So, how to fix this mess?

My naive assumption was, as long as the site loads fast and we do all the “right things” in terms of semantic structure, accessibility, and performance, we should be fine and Google will do its thing. Our app architecture was basically:

  • index.html is static, tiny, loads super fast
  • index.js is static, small, loads super fast
  • we load data as fast as we can, and render it really fast (and asynchronously)

My assumption was that, provided we did a solid job, Google would load our page, let the javascript run for a bit, and then index what it saw. Obviously lots of sites work like this and our code runs way faster and lighter than average so no big deal. Right? RIGHT?

Boy was I wrong.

So we spent a lot of time in Google’s Performance Insights and Lighthouse tools (well, when Lighthouse was working)—and I basically live in the Performance tab. I also frequently check everything on my late-model phone, my 2016 tablet, my PC laptop (it’s for gaming so not really a stress test), and even my Raspberry Pi 400. Everything was as small and cache-friendly and low-latency as possible. Our SEO score was 100 everywhere we checked. What could go wrong?

Well, outside the browser, we used SEO tools (which shall remain nameless) to compare the old site (Django / React with SSR, which took ~6s to load a typical page) to the new site, which was basically loading in 0.5-1s and then waiting on data for maybe 2.5s. They showed our “health” ranking going from the mid-60s to 100 with a few minor tweaks. Hurrah. So we went live.

It seems that the SEO tools were showing improvements in our latency and mobile-friendliness in real-time, but relying on previous crawls for the actual SEO stuff. Meanwhile, once we went live Google cheerfully indexed our site and found our placeholder text and spinner and decided the pages had no unique content…

And then our traffic crashed…

Guess I should have paid more attention to that icky SEO stuff.

After thinking I had potentially destroyed the company, and that maybe we could recover if we turned the old site back on and then implemented SSR as quickly as possible, I got very depressed. Then I realized one key thing: Google was seeing our spinner. And our spinner was being rendered asynchronously. So it wasn’t ignoring Javascript, it just wasn’t waiting on our (not slow, but not wicked fast) data fetches. (It had been fine with the old server just taking 6s to send it a wad of HTML. Good grief.) I also knew that Google hosting makes it easy for you to redirect any url to a Cloud Function (which is how our sitemap works). So, I sketched out a plan for the next day, shared it with my fellow engineer, and went to bed.

Now, our home page is minuscule. It’s basically a stub with a link to some javascript. (Our CSS is rendered on-the-fly by the javascript, which allows things like color-computation.) It’s no huge win that it’s cache-friendly, since the html is <2kB uncompressed. So if we could somehow prefetch data and stick it into the home page in response to the initial request, then render it once the javascript arrives (which is what renders the spinner), would Google index us properly?

My intuition was that we’d pay a small up-front latency cost for diverting the index.html request to a function (since we were redirecting from a “dumb” static host to a function server) so the user would get the initial page slower, but not much slower, but then instead of a loading spinner the crawlers would see a “hydrated” page. We’d also be paying for function execution time and not just download bandwidth.

But we’re still rendering client-side with a single code-base, the front-end code hasn’t changed except for having our data request functions check to see if the thing that was requested is already in the prefetched global that has now mysteriously appeared. We’re still using the same data, but now critical stuff is in an inline <script> tag (a trick Facebook was using back in 2016—but it also rendered the whole damn page and inlined all the javascript).
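The client-side change is about this small. What follows is a sketch of the idea rather than our actual service library: the function names are invented, and removing an entry after its first use (so later requests go to the server for fresh data) is an assumption of this sketch, not something described above.

```typescript
// Keyed by request path; populated by the inline <script> in index.html.
type Prefetched = Record<string, unknown>;

// Look for a prefetched answer, removing it so it is only used once.
function takePrefetched(
  prefetched: Prefetched,
  path: string
): { found: boolean; value?: unknown } {
  if (!Object.prototype.hasOwnProperty.call(prefetched, path)) {
    return { found: false };
  }
  const value = prefetched[path];
  delete prefetched[path];
  return { found: true, value };
}

// Wrap the existing request function. Callers still get a Promise either
// way, so nothing else in the app needs to know about prefetching.
function createGetData(
  prefetched: Prefetched,
  fetchFromServer: (path: string) => Promise<unknown>
): (path: string) => Promise<unknown> {
  return async (path) => {
    const hit = takePrefetched(prefetched, path);
    return hit.found ? hit.value : fetchFromServer(path);
  };
}
```

In the browser, the first argument would be whatever global the inline script set (something like window.__PREFETCHED__, falling back to an empty object), and fetchFromServer would be the data request function the app was already using.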

Going back to the ToDo list example, we just serve the dynamic ToDo list, but we insert the data into a script tag in the <head> of the HTML stub. When the dynamic code asks for it, the request function sees it’s already available and just loads it instantly (but asynchronously!). The cost is that index.html isn’t cache-friendly and is rendered by a function vs. just being static data; the benefit is that you have critical data immediately, saving a request / handshake / response loop (and possibly some spinup time in the Firebase client code, which is, by far, the largest part of our javascript payload).
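On the server side, the function that now answers for index.html only has to splice one script tag into the stub. A minimal sketch, assuming the data has already been fetched and keyed by request path; the __PREFETCHED__ global and both function names are hypothetical.

```typescript
// Serialize data so it is safe inside an inline <script> element:
// "<" is escaped so a string containing "</script>" cannot end the element,
// and the two line-separator characters are escaped for older JS parsers.
function serializeForInlineScript(data: Record<string, unknown>): string {
  return JSON.stringify(data)
    .replace(/</g, "\\u003c")
    .replace(/\u2028/g, "\\u2028")
    .replace(/\u2029/g, "\\u2029");
}

// Insert the prefetched data just before </head> in the static stub.
function inlinePrefetchedData(
  htmlStub: string,
  prefetched: Record<string, unknown>
): string {
  const script =
    `<script>window.__PREFETCHED__ = ${serializeForInlineScript(prefetched)};</script>`;
  const headClose = htmlStub.indexOf("</head>");
  if (headClose === -1) {
    throw new Error("HTML stub has no </head> to insert before");
  }
  // Slice and concatenate rather than String.replace, so "$" sequences in
  // the data are never interpreted as replacement patterns.
  return htmlStub.slice(0, headClose) + script + htmlStub.slice(headClose);
}
```

The stub itself can still be the same tiny static file, read once and kept in memory by the function.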

How well does this work?

Well, the simple answer is ~3s for fully-populated pages becomes ~0.6s for populated pages and 1.2s or so for fully-populated (“below the fold” stuff we didn’t prefetch). And Google indexes everything as you’d hope. It turns out there are virtuous circles at play:

Old World

  • index.html requested from static host, it’s <2kB
  • javascript is requested from static host, it’s ~300kB
  • page is rendered without data (“spinner”) — this used to occur at ~500-700ms
  • the Firebase client code spins up and connects and stuff
  • data requested from Firestore
  • page is rendered with data — this used to occur at 2500-3500ms

I should note that we weren’t happy with 2500-3500ms, and planned to address it, but it was 2x faster than the old site, which was displaying a white rectangle for 6s+ and Google’s crawler was just fine with indexing it.

New World

  • index.html is requested from static host—redirects to Cloud Function that fetches and inlines some data (<5kB)
  • javascript is requested from static host, still ~300kB
  • page is rendered with data (this occurs at 500-700ms)

We expected a win. What I didn’t expect was that Firebase can (a) intercept the request for index.html and route it to a cloud-function, (b) have the function fetch data and return the page, and (c) somehow get the browser to start loading the javascript before it has gotten index.html (so in the new world, step 3 is done before step 1 returns, and this is with cache disabled). Our “hydrated” page gets rendered before the fonts arrive… And, if anything, steps (1) and (2) are at least as quick as just returning the (smaller) static index.html in the first place.

Even before we verified that Google’s crawler did in fact now index our pages properly, we were simply giddy with the improved performance.

Two weeks ago, when we were just naively doing dynamic rendering and assuming that Google and Bing would just “do the right thing” if we “did the right thing”, we had some plans to get our initial page render down to 1s on mobile by replacing Firestore data requests with Cloud Functions calls (which do the data calls faster because they’re in the data center and already authenticated, and because cloud functions can finely tune whether or not they care who’s asking). Prefetching data this way got us down to 500-700ms. The cost is rendering index.html but that’s it.

One of the interesting things in my mental model of “how computing stuff works” is realizing that cpu, bandwidth, and storage have different costs, but the hierarchy isn’t obvious. In this case, serving x content statically, possibly from the edge, and y content dynamically after x requests it turns out to be more expensive than serving x+y dynamically, at least where x is small.

When I was working on an ultra-light rewrite of, I could get my version to load my Facebook “wall” on the Facebook campus—from a dedicated 128 core server in Oregon over a T3 connection—in ~300ms (vs. ~1000ms, IIRC, for the then existing version).

These performance figures all come from testing on my (“free with the apartment”) home WiFi in Finland on a very-much-not-dedicated server. And the app isn’t hugely complicated, but it’s a lot more complex than my prototype. Again, you can load a simple page in <200ms, but as soon as you’re handling data requests, auth, and all that stuff you’re up to 400-500ms for “hello world” in ideal conditions.

And, by the way, this is with no real optimization anywhere (beyond the prefetch itself, of course).

So, our burger chain used to hand you the recipe over the counter, then wait for you to order the ingredients, and then send the ingredients from the distribution center.

Now we have the recipe and ingredients sent to you from the distribution center. That’s it. And because we don’t have to consider making fiberglass burgers, our recipes are easier to write and easier to follow.

We don’t make a fiberglass burger, we send less crap to you, you pay less for delivery, and you get to eat sooner.

tl;dr — how to get client-side rendered apps to index properly

So, to summarize, we were able to capture the SEO benefits of SSR (let’s not even discuss the other options) by prefetching data and adding it to our (minimal) HTML payload.

In general, when you make a good architectural decision you see a cascade of wins—a virtuous circle. SSR has one big win, but is otherwise a vicious circle of compromises—more complexity and more code to do the same thing. “Dynamic Rendering” is deprecated, wide open to abuse (probably why it’s deprecated), and dumber than SSR. And Google’s current recommended option is… ¯\_(ツ)_/¯.

So, Google, here’s a fourth option that is simple and works!

  • index.html is no longer a static file but instead has prefetched data inlined in it. Since we’re rendering it anyway, we also render all the stuff SEO people care about and thus can have a “single source of truth” for things like <title>, <link rel="canonical">, and <meta name="description">. And we can be sure they’re what the crawlers see, because they’re there at the start and don’t change.
  • We send the same data as before, only fetched more quickly and earlier.
  • Static content, such as icons and javascript, is still static.
  • We still render (once) on the client.
  • Aside from our service library (which now knows there may be prefetched data) no other client code needed to change, but we did actually strip out lots of complicated stuff designed to make the <title> <link> and <meta> stuff change based on user navigation, and in general it’s easier to prefetch data using our new system than to cleverly parallelize queries on the client to make things load fast, so it’s likely a win going forward. Even in the short term, our code bundle got slightly smaller and simpler. now loads fast and lean, renders client-side, and does everything it's supposed to (except for some tree-shaking)
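The “single source of truth” for the SEO tags can be as simple as one function that the index.html renderer calls with whatever it knows about the requested page. This is a sketch with invented names (PageMeta, renderHeadTags), not a copy of what we run.

```typescript
// Hypothetical per-page metadata, looked up on the server for the
// requested url before index.html is returned.
interface PageMeta {
  title: string;
  canonicalUrl: string;
  description: string;
}

function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render the tags crawlers care about, once, on the server.
function renderHeadTags(meta: PageMeta): string {
  return [
    `<title>${escapeHtml(meta.title)}</title>`,
    `<link rel="canonical" href="${escapeHtml(meta.canonicalUrl)}">`,
    `<meta name="description" content="${escapeHtml(meta.description)}">`,
  ].join("\n");
}
```

The output goes into the <head> of the stub alongside the inlined data, so the client code never has to touch these tags.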

This is like the virtuous circle xinjs (and b8rjs) deliver on the client—write less code, maintain less code, send less code, run less code, get more done. And it also solves the SEO problem client-side rendering approaches such as xinjs (and b8rjs) and web-components without SSR have had until now. now loads in ~550ms on desktop, ~900ms throttled for 4G, has 99-100 SEO scores on all our tooling, and it’s 100% dynamically rendered on the client-side. No thunks. No rehydration. The burger is fresh and juicy.

There is one caveat to add before I finish. Google makes some comments about flattening out the DOM and shadowDOM of web-components before indexing. One of the things that is perhaps unusual about our web-components is that most of them do not use the shadowDOM at all, and in any event all our content is in the “light” (i.e. regular) DOM. xinjs actually has a <xin-slot> custom-element that lets you composite in the light DOM exactly as you would in the shadowDOM without needing a shadowDOM.

So any weirdness caused by content being rendered in the shadowDOM simply doesn’t impact us. If you’re building web-components that ultimately end up rendering everything inside the shadowDOM—and just don’t—you may be in for a nasty surprise. I don’t know.

Time to go to bed.