A faster web

Author: Stephane Rodriguez
Creation date: April 21, 2001
Topic: better use of the HTML language, part 1
Audience: general readers, programmers
Keywords: HTML language, W3C, tags, optimization, HTTP transport, web publishing, content management
The aim of this article is to explain how to programmatically reduce the size of the HTML pages that travel across the network around the clock, in real time, and that are often blamed for making the web slow to surf.
Every click on a web page fetches new HTML content, so particular attention should be paid to keeping that content as small as possible, making it faster to transfer, parse, and render on screen.
It is a misconception to think that the web is getting slower simply because more people are connecting and surfing. At the same time, ISPs are starting to roll out ADSL connections, which should make room for richer media such as music and video.
The slowness of the web cannot be blamed solely on low-end user modem connections. When you log on to Yahoo!, are its pages that slow? No, they are not. Yahoo! exemplifies a standard, clean, sensible use of web technologies: textual content with real meaning rather than a clutter of useless pictures, not to mention blinking elements and animations. When content is mostly textual, publishers understand that pages should not carry too many words, and split them across several pages instead. The problem arises when publishers cram as much content as possible into portal homepages, believing that people will not explore their sites otherwise. Given the commercial nature of most recent web sites, publishers also pile up ad banners and lose sight of how their own pages look and behave. Bad news: today's surfers will not wait many seconds before going somewhere else, say to a faster site. And a lost surfer is a lost customer.
At the same time, Yahoo! runs a range of load-balanced web servers that ensure that, whatever the content, the servers respond, and respond fast. Costly load-balanced servers are part of an answer to a slow web, but only part of it. If the HTML content keeps getting bigger, load-balanced servers will have to be upgraded to maintain the same quality of service. And upgrading costs a lot of money.
Web publishers already have enough work in their daily content publishing, and they are not necessarily computer specialists. As a result, web content is not "optimized": it does not follow the recommendations of the W3C consortium. If this content were only for corporate use, it would do no harm. The trouble is that millions of people are waiting for it. The internet has no law or enforcement restricting usage to people who understand what should and should not be done. Fair enough, but there is no good reason why software dedicated to content management and web publishing has not focused on optimization. As a consequence, when 10,000 characters would suffice to publish an article on a web page, publishers build pages of 80,000 characters or more to convey the same information. They do not necessarily know they are doing harm. Publishers have high-speed connections, so they do not see the difference. Sometimes they do not even care.
Think about it: that is 70,000 useless characters (almost 90%) that a web server publishes. That is more time to process, and time is money. That is more workload, and workload is money. That is 70,000 characters travelling over the network to every web surfer who wants to see the page, possibly millions of people a day. That is 70,000 more characters that have to arrive before the page starts rendering on screen, which means more time to wait. That is 70,000 characters filling the bandwidth, and bandwidth is money. That money is your money. That is 70,000 more characters that the web parser (a component of your web browser) has to process before it can show the page. Again, wasted CPU time.
Added together, those 70,000 characters look like a total waste. Don't they?
70,000 characters may sound huge and unrealistic. But it is real: that is what you pay for colorful content.
Now consider these as facts, not mere ideas. Imagine that each web page carries as much as 90% useless content, content with no relevance to the meaning of the page.
Also consider that what we have said about textual content is often just as true of pictures, which are not only frequently unnecessary but can be much more costly. We will not detail it in this document, but the 70,000-character case is candy compared to the ugly hundreds of kilobytes behind the pictures on each web page. For those who care, consider unchecking the "load images and animations" option in your Internet settings... and forget about web sites that rely mostly on pictures rather than simple, fast, meaningful text.
What do we mean by "does not meet the recommendations of the W3C"? In the early 90s, the W3C consortium started publishing recommendations on standards such as HTTP and HTML, and successive versions of HTML were proposed to better fit the ever-growing need to publish web content. HTML 4.0, for instance, introduced an enhanced way to write content: it separates content from layout, making both reusable, and, most importantly, it factors declarations out of individual uses to avoid redundancy and duplication. Publishers can thus use one single electronic pen as many times as needed, rather than taking one, using it, and shortly being forced to throw it away and take another. In other words, factorization heavily reduces the length of content, so HTML pages are smaller to publish, send, and browse.
Though HTML 4.0 was introduced in 1997, compliant editing tools have yet to arrive. Is that because achieving factorization is too hard? Is it because publishers regard HTML 4.0 as too difficult to handle? Let's face it: web publishing with HTML 4.0 does not have to be more difficult. Automated software routines can transform and replace old, messy content with optimized HTML 4.0 content. The trick can be done without publishers noticing anything new.
This probably sounds like an overly optimistic view, but it is also a natural one once content is regarded as something processed from A to B, then from B to C, and so on.
As for pictures, it is as if publishers published a 500x500-pixel photograph when the final reader only sees a small fraction of it, say 150x150. The publisher does not have to care, but again this produces side effects: wasted time, bandwidth, and CPU load on both ends. At a time when e-business people talk about customer relationship management (CRM) and personalization, why not also take the time to educate ourselves and switch to better-suited uses of the technology?
The idea is to show how simple manipulations can produce striking effects on content. The benefits are not only in content size, which is the issue here, but also in content reuse. The manipulations can be applied either while producing the content, or at the end of the production line, just before the content is sent out ready to use.
There are many tricks and ideas. We are going to describe a few of them that reduce the overall size of a table, of course without any loss of content or layout.
Let's begin with a raw table made of 2 rows:
<table width="88%" border="0" align="center">
  <tr bgcolor="#000000">
    <td width="36%"><b><font color="#FFFFFF">Site</font></b></td>
    <td width="10%"><b><font color="#FFFFFF">1st byte</font></b></td>
    <td width="11%"><b><font color="#FFFFFF">Last byte</font></b></td>
    <td width="13%"><b><font color="#FFFFFF">Size (bytes)</font></b></td>
    <td width="13%"><b><font color="#FFFFFF">nb images</font></b></td>
    <td width="17%"><b><font color="#FFFFFF">nb anchors</font></b></td>
  </tr>
  <tr>
    <td width="36%" bgcolor="#FFFFCC"><font size="-1">http://www.01net.com</font></td>
    <td width="10%" bgcolor="#FFFFCC"><font size="-1">1.81</font></td>
    <td width="11%" bgcolor="#FFFFCC"><font size="-1">19.11</font></td>
    <td width="13%" bgcolor="#FFFFCC"><font size="-1">80995</font></td>
    <td width="13%" bgcolor="#FFFFCC"><font size="-1">228</font></td>
    <td width="17%" bgcolor="#FFFFCC"><font size="-1">165</font></td>
  </tr>
</table>
This table was obtained with Macromedia Dreamweaver, a WYSIWYG HTML editor. As anyone will see in what follows, WYSIWYG has side effects.
This table looks fine, and indeed it is! It is valid HTML and can be pasted into any content editor. Each row defines the same style for every column. But notice that if, for any reason, we have to update the content and remove all <b> tags, we will have to repeat the removal on many items before everything is clean. And if we have to add a new style, we again have to do one thing to many items: a lot of work, and an easy way to introduce typing bugs.
What if repeated "patterns" could be replaced by a single instance? Wouldn't that reduce the size of the content?
To do this, let's define a style (this requires a 4.0-compliant web browser, that is, a browser released after 1997, which should not be much of a constraint) and use it:
<style>
.text255_bold { color:#FFFFFF; font-weight:bold; }
.fontm1 { font-size:smaller; }
</style>
<table width="88%" border="0" align="center" class="text255_bold">
  <tr bgcolor="#000000">
    <td width="36%">Site</td>
    <td width="10%">1st byte</td>
    <td width="11%">Last byte</td>
    <td width="13%">Size (bytes)</td>
    <td width="13%">nb images</td>
    <td width="17%">nb anchors</td>
  </tr>
  <tr class="fontm1">
    <td width="36%" bgcolor="#FFFFCC">http://www.01net.com</td>
    <td width="10%" bgcolor="#FFFFCC">1.81</td>
    <td width="11%" bgcolor="#FFFFCC">19.11</td>
    <td width="13%" bgcolor="#FFFFCC">80995</td>
    <td width="13%" bgcolor="#FFFFCC">228</td>
    <td width="17%" bgcolor="#FFFFCC">165</td>
  </tr>
</table>
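This first transform can be automated. Here is a minimal Python sketch of the idea; the function name is ours, and the regex-based approach is purely illustrative (a production tool would use a real HTML parser), assuming the cells are wrapped exactly as in the example:

```python
import re

# Minimal sketch, not a robust HTML parser: strip the repeated
# <b><font color="#FFFFFF">...</font></b> wrapper from each cell, on the
# assumption that the styling now comes from a single CSS class declared
# once on the table, as in the example above.
def factor_bold_white(html):
    return re.sub(r'<b><font color="#FFFFFF">(.*?)</font></b>', r'\1', html)

cell = '<td width="36%"><b><font color="#FFFFFF">Site</font></b></td>'
slim = factor_bold_white(cell)
print(slim)                    # <td width="36%">Site</td>
print(len(cell), '->', len(slim))
```

Applied to every cell of a large table, the per-cell saving multiplies accordingly.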
The HTML content already looks thinner. But that is only the first of a series of 4 transforms. Using the same idea, it is possible to reduce the number of bgcolor instances, once we notice that bgcolor can be declared at the row level, within the tr tag. And this transform does not even require a 4.0 browser.
Applying it gives:
<style>
.text255_bold { color:#FFFFFF; font-weight:bold; }
.fontm1 { font-size:smaller; }
</style>
<table width="88%" border="0" align="center" class="text255_bold">
  <tr bgcolor="#000000">
    <td width="36%">Site</td>
    <td width="10%">1st byte</td>
    <td width="11%">Last byte</td>
    <td width="13%">Size (bytes)</td>
    <td width="13%">nb images</td>
    <td width="17%">nb anchors</td>
  </tr>
  <tr class="fontm1" bgcolor="#FFFFCC">
    <td width="36%">http://www.01net.com</td>
    <td width="10%">1.81</td>
    <td width="11%">19.11</td>
    <td width="13%">80995</td>
    <td width="13%">228</td>
    <td width="17%">165</td>
  </tr>
</table>
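This hoisting step can also be sketched programmatically. The following Python function is an illustrative, regex-based sketch (the function name is ours): it checks that every cell of a row carries the same bgcolor, and if so declares it once on the tr tag instead.

```python
import re

# Minimal sketch: if every <td> in a row carries the same bgcolor, hoist it
# onto the <tr> tag and drop it from the cells. Regex-based and illustrative;
# it assumes the attribute is written exactly as bgcolor="...".
def hoist_bgcolor(row):
    colors = set(re.findall(r'<td[^>]*?bgcolor="([^"]+)"', row))
    if len(colors) != 1:
        return row  # nothing to hoist, or the cell colors differ
    color = colors.pop()
    row = row.replace(f' bgcolor="{color}"', '')            # strip from cells
    return row.replace('<tr', f'<tr bgcolor="{color}"', 1)  # declare once

row = '<tr><td bgcolor="#FFFFCC">1.81</td><td bgcolor="#FFFFCC">19.11</td></tr>'
print(hoist_bgcolor(row))
```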
Now for the 3rd transform. When using a table, if the first row clearly defines the widths of all columns, there is no need to repeat the same information in every other row. We can therefore remove the width attribute from all rows except the first. The effect can be huge, since it applies to every row: imagine a table with 150 rows. Let's remove the width attributes:
<style>
.text255_bold { color:#FFFFFF; font-weight:bold; }
.fontm1 { font-size:smaller; }
</style>
<table width="88%" border="0" align="center" class="text255_bold">
  <tr bgcolor="#000000">
    <td width="36%">Site</td>
    <td width="10%">1st byte</td>
    <td width="11%">Last byte</td>
    <td width="13%">Size (bytes)</td>
    <td width="13%">nb images</td>
    <td width="17%">nb anchors</td>
  </tr>
  <tr class="fontm1" bgcolor="#FFFFCC">
    <td>http://www.01net.com</td>
    <td>1.81</td>
    <td>19.11</td>
    <td>80995</td>
    <td>228</td>
    <td>165</td>
  </tr>
</table>
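Here too, a small routine can do the work. The Python sketch below (function name ours, regex-based and illustrative) keeps the first row, which fixes the column grid, and strips the width attributes from every later row:

```python
import re

# Minimal sketch: the first <tr> fixes the column widths, so later rows can
# drop theirs. Splitting on </tr> leaves the <table> tag and the whole first
# row inside rows[0], which is kept untouched. Assumes widths are written
# exactly as width="NN%".
def strip_widths_after_first_row(table):
    rows = table.split('</tr>')
    kept = [rows[0]] + [re.sub(r' width="\d+%"', '', r) for r in rows[1:]]
    return '</tr>'.join(kept)

table = ('<table width="88%"><tr><td width="36%">a</td></tr>'
         '<tr><td width="36%">b</td></tr></table>')
print(strip_widths_after_first_row(table))
```

Note that the width="88%" on the table tag itself also sits in the first fragment, so it survives the transform.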
The HTML content looks thinner again. Now for the last transform: removing all unnecessary line breaks. Why do this? First of all because line breaks are useless for web browsing and only require additional HTML parsing. Although HTML content looks human-readable with line breaks and indentation, that readability is unnecessary: why would the audience see it, and why should the audience even understand it? And if anyone pulls up the HTML source of the page in an editor, why should they be handed content or code that is easy to pick apart? No way. All of this amounts to a reversible mechanism, akin to lossless compression.
The final layout of the equivalent HTML content is:
<style>.text255_bold { color:#FFFFFF; font-weight:bold;}.fontm1 { font-size:smaller;}</style><table width="88%" border="0" align="center" class="text255_bold"><tr bgcolor="#000000"><td width="36%">Site</td><td width="10%">1st byte</td><td width="11%">Last byte</td><td width="13%">Size (bytes)</td><td width="13%">nb images</td><td width="17%">nb anchors</td></tr><tr class="fontm1" bgcolor="#FFFFCC"><td>http://www.01net.com</td><td>1.81</td><td>19.11</td><td>80995</td><td>228</td><td>165</td></tr></table>
Now for the figures:
Stephane Rodriguez, April 21, 2001.
...to be continued on part II.