Auto-refreshing pages and bandwidth optimization

I have the pleasure of managing the technology side of a fairly large-scale Web site for a traditional media company.  We do somewhere around 2.7 million pageviews on a normal day, and as many as 7 or 8 million on a busy day.

With traffic levels like this, optimizing our bandwidth usage is critical.  We use a CDN to deliver “static” content like CSS, JavaScript, and images.  The CDN offers cheaper bandwidth than our hosting provider, so we try to limit what we serve from the hosting provider to content that is not cacheable, like dynamic, customized HTML.

But just because the bandwidth is cheaper at the CDN doesn’t mean that it’s free.  We still want to minimize the amount of data we push through that network.

A smart caching policy is the first place to start.  We use Apache’s mod_expires to set sensible expiration policies on content.  For example, we allow visitors to cache our CSS files for a week before their browser even needs to check whether they have changed.

Despite such efforts, we have had trouble containing the traffic to the site.   When I performed a deep analysis of our CDN traffic and compared it to the number of pageviews and visitors, I found some disturbing results.

We were seeing extremely high numbers of requests for very cacheable content.  As an example, /site.css was being requested on average about once for every 3 pageviews (a request rate of roughly 29%)!  Considering that the average user consumes somewhere in the neighborhood of 7 pageviews in a day, this didn’t make sense.

We have a pretty dedicated audience: a typical visitor probably comes to our site 8 or 9 times each month.  Because our CSS is cacheable for a week, a visitor with a fully functioning browser cache (assuming the cache never has to purge objects) should make only about 4 requests for /site.css in a month.  That works out to roughly 4 requests for every 50-60 pageviews, or about 7-8%, which is much lower than the 29% request rate we were seeing.
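To make that comparison concrete, here is a small back-of-the-envelope calculation in JavaScript.  It uses only the rough figures quoted above (about 4 requests per 50-60 pageviews, versus the 29% we observed); the function and variable names are just for illustration.

```javascript
// Requests for one cacheable file, expressed as a fraction of pageviews.
function requestRate(requests, pageviews) {
  return requests / pageviews;
}

// Rough figures from the text: about 4 requests per 50-60 monthly pageviews.
const expectedHigh = requestRate(4, 50); // 0.08
const expectedLow = requestRate(4, 60);  // about 0.067
const observed = 0.29;                   // the rate we measured

// How many times higher the observed rate is than the expected one.
const excessLow = observed / expectedHigh;  // about 3.6
const excessHigh = observed / expectedLow;  // about 4.35

console.log(
  `expected ${(expectedLow * 100).toFixed(1)}%-${(expectedHigh * 100).toFixed(1)}%, ` +
  `observed ${(observed * 100).toFixed(0)}%, ` +
  `roughly ${excessLow.toFixed(1)}x-${excessHigh.toFixed(1)}x too high`
);
```

In other words, /site.css was being requested roughly four times more often than a healthy browser cache should have allowed.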

It turned out that the culprit was the auto-refresh that’s built into our pages.  We use JavaScript to reload the page every 10 minutes so that timely content, like our current weather conditions and radar images, stays up to date for the visitor.  The mechanism we were using was flooding our CDN with requests for content that should have come straight from the browser cache.

This led to some investigation into various methods for refreshing pages.  There are three “levels” of reload available in JavaScript:

window.location.reload(true)

The Location.reload() function takes an optional argument, forceget, which will cause the page to always be reloaded from the server.  This is exactly like holding Shift while clicking the reload button in Firefox.  Not only will it force the latest version of the page to be reloaded from the server, but every single object on the page will also be reloaded: images, CSS, JavaScript, etc.

This method will cost you tons of bandwidth, because the browser will omit the If-Modified-Since header from its HTTP requests, and the server will return the full content of every object on the page.

window.location.reload()

If you omit the forceget parameter (or specify it as false), the browser will reload the page, and it will send the If-Modified-Since headers in its requests.  This behavior is exactly the same as if the user clicked the reload button in Firefox (without holding the Shift key).  Note that while your browser will still use objects from cache and won’t actually download most of the objects on the page, it will still check the objects.  So if your page has 50 objects, you’re looking at 50 requests to the server, most of which will return 304 (Not Modified).

This can slow down the page reload for the user, and while a 304 response is much smaller than a 200, the bandwidth consumed adds up.

On our site, these requests amounted to about 10 Mbps of billable bandwidth (out of 70 or 80 Mbps).  This is substantial.
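The difference between the two reload() variants comes down to whether the browser sends If-Modified-Since.  The sketch below is a deliberately simplified model of how a server might answer such requests; it is not how Apache or our CDN is actually implemented, and the respond() function, the resource object, and the byte count are all made up for illustration.  Real servers also look at ETag and other headers.

```javascript
// Simplified model of a server answering a (possibly conditional) GET.
// `resource` is a hypothetical object: { lastModified: ms since epoch, sizeBytes }.
function respond(requestHeaders, resource) {
  const since = requestHeaders["if-modified-since"];
  if (since !== undefined) {
    const sinceMs = Date.parse(since);
    // Unchanged since the browser's copy: send headers only, no body.
    if (!Number.isNaN(sinceMs) && resource.lastModified <= sinceMs) {
      return { status: 304, bodyBytes: 0 };
    }
  }
  // No validator (as with reload(true)) or the file changed: send everything.
  return { status: 200, bodyBytes: resource.sizeBytes };
}

// Made-up example resource, last changed on 1 Jan 2009.
const siteCss = { lastModified: Date.UTC(2009, 0, 1), sizeBytes: 40000 };

// reload(true): no If-Modified-Since, so the full file is sent again.
const forced = respond({}, siteCss);

// reload(): the browser revalidates with the date of its cached copy.
const conditional = respond(
  { "if-modified-since": "Thu, 01 Jan 2009 00:00:00 GMT" },
  siteCss
);

console.log(forced.status, forced.bodyBytes);
console.log(conditional.status, conditional.bodyBytes);
```

The forced request gets a 200 with the full body, while the conditional one gets a 304 with no body.  Either way the server still has to handle a request for every object on the page.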

window.location = window.location

Rather than using the reload() function, you can force the browser to “revisit” the current URL.  This is akin to a Firefox user clicking at the end of the URL in the address bar and hitting “return”.  It’s also similar to browsing away from the page and browsing back by clicking a link back to the first page.

If your cache policy on your HTML prevents caching (and if you’re using dynamic, personalized HTML, you probably should be preventing caching), the HTML will be reloaded for the user, but the objects on the page will be pulled straight from cache.  The browser won’t even check for updates on objects that are within their cache periods.
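Here is a minimal sketch of an auto-refresh built on this approach rather than on reload().  The helper name scheduleRevisit is hypothetical, and the 10-minute delay simply mirrors the interval mentioned earlier; this is not the exact code from our site.  The window object is passed in as a parameter so the function can also be exercised outside a browser.

```javascript
// Hypothetical helper: after delayMs, revisit the current URL instead of
// calling reload(), so objects still within their cache period can be
// served from the browser cache.
function scheduleRevisit(win, delayMs) {
  return win.setTimeout(function () {
    // Assigning the current URL back to location triggers a normal navigation.
    win.location.href = win.location.href;
  }, delayMs);
}

// In a browser, refresh every 10 minutes. Each new page load runs this
// script again, which schedules the next refresh.
if (typeof window !== "undefined") {
  scheduleRevisit(window, 10 * 60 * 1000);
}
```

One caveat, as far as I know: if the URL contains a fragment (the part after #), browsers generally treat this assignment as a jump within the page and do not reload it, so this sketch assumes URLs without fragments.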

The bottom line

window.location.reload() is expensive from a bandwidth standpoint no matter what you pass as the forceget option.  Avoid it if you’re sensitive to excessive bandwidth usage, and revisit the current URL with window.location = window.location instead.



Keywords: content distribution network, cost, expensive, shift-reload, auto reload, auto refresh, cache
