XML statistics

Help on Nb total elements :
This indicator is the number of Xml elements in the stream. An element is one of the elementary components of an Xml stream : its tags are delimited by < and > and it may have a companion end-tag, as in the example below. Elements have zero, one or more attributes.
In this Xml portion, Bookstore, Book and Title are elements. Bookstore is the root element of the Xml stream.
Hint : the content declared within an element can be empty : in this particular case, one is encouraged to use the empty tag syntax instead of an end-tag, for performance reasons including size and parsing speed. See below. The element hierarchy is comparable to the structure of a database.

Empty tag : <Book Genre="Thriller" In_Stock="Yes"/>
End-tag   : <Book Genre="Thriller" In_Stock="Yes">The Round Door</Book>


Help on Nb total attributes :
This indicator totals the number of attributes in the stream. Attributes can be regarded as details of a given element and are part of its scope. Attributes are stricter in the Xml syntax than they are in the Html syntax in that they must always be assigned a quoted value, even when the value is empty.
Hint : contrary to an element, an attribute can appear at most once in the scope of an element. Attributes may be declared in the DTD (Document Type Definition), but only with a few lexical types (CDATA, ID, NMTOKEN, enumerations and the like) : there is no way to declare a data type such as a number or a date. Attributes take less "space" in the stream than elements because they have no end-tag counterpart, hence attributes are good candidates for 1-cardinality element-to-element relations.
In this Xml portion, Genre and In_Stock are attributes of the Book element instance.
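The two counters above can be reproduced with a few lines of Python. This is only a minimal sketch using the standard xml.etree.ElementTree module; the function name count_nodes is an assumption made for illustration, not the way the report itself computes its figures.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<Bookstore>
   <Book Genre="Thriller" In_Stock="Yes">
      <Title>The Round Door</Title>
   </Book>
</Bookstore>"""

def count_nodes(xml_text):
    """Return (number of elements, number of attributes) in the document."""
    root = ET.fromstring(xml_text)
    elements = list(root.iter())                 # the root plus every descendant element
    attributes = sum(len(elem.attrib) for elem in elements)
    return len(elements), attributes

print(count_nodes(SAMPLE))   # (3, 2): Bookstore, Book, Title / Genre, In_Stock
```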

Help on Nb total comments :
This indicator totals the number of comments in the stream. If there is at least one comment in the stream, the report also gives the share of the total stream size taken by comments, so you can check whether comments are a significant part of it. A comment in an Xml stream follows exactly the same syntax as an HTML comment : it begins with <!-- and ends with -->. See below. Comments are not interpreted as markup, so they are a good place for informal explanations.
Hint : use comments to explain meanings, but don't overuse them, otherwise a significant portion of the stream brings no real value to the client Xml application. This does not mean, however, that there should be, say, 'light Xml streams' stripped of every comment next to 'standard' or intermediate Xml streams that keep comments and a few other things.

Help on Nb total CDATA sections :
This indicator totals the number of CDATA sections in the stream. If there is at least one CDATA section in the stream, the report also gives the share of the total stream size taken by such sections, so you can check whether they are a significant part of it. A CDATA section in an Xml stream looks much like a comment in that its content is not interpreted as markup. The difference is that the content of a CDATA section is character data that belongs to the document, so the parser hands it to the Xml client application, whereas comments are seen but generally not read by client applications. What the application does with it is up to software developers, not Xml itself. The syntax is new to the web developer : it begins with <![CDATA[ followed by any content including carriage returns, and it ends with ]]>. See below.
Hint : use CDATA sections as little as you can. One reason for this is that their content does not follow the Xml syntax, thus does not help to understand the structure.

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Bookstore SYSTEM "bookshop.dtd">
<Bookstore>
   <![CDATA[ J&R Booksellers Database ]]>
   <Book Genre="Thriller" In_Stock="Yes">
      <Title>The Round Door</Title>
   </Book>
</Bookstore>


Help on Nb total process instructions :
This indicator totals the number of process instructions (called processing instructions in the W3C Xml recommendation) in the stream. A process instruction is meant to have a special effect when read by a given Xml client application, hence the words "process instructions". An application that does not recognize the instruction name simply ignores it, much like a comment. The syntax is new to the web developer : it begins with <? followed by an instruction name followed by any content including carriage returns, and it ends with ?>. In the bookstore example, the xml statement in the prolog uses the same syntax : strictly speaking it is the Xml declaration rather than a process instruction, it is part of the Xml standard itself, and it is used here to declare the encoding charset.
Hint : use process instructions as little as you can. In the real world, process instructions are barely used because the specifics should always lie in the client implementation, not in the Xml stream itself.
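As an illustration of the three counters above (comments, CDATA sections, process instructions), here is a minimal sketch built on the handlers of Python's standard xml.parsers.expat module. The function name count_special_nodes and the <?render ...?> instruction in the sample are hypothetical, and the comment size ratio is computed under the assumption that the delimiters count as part of a comment.

```python
import xml.parsers.expat

SAMPLE = b"""<Bookstore>
   <!-- J&R Booksellers Database -->
   <![CDATA[ J&R Booksellers Database ]]>
   <?render mode="compact"?>
   <Book Genre="Thriller" In_Stock="Yes">
      <Title>The Round Door</Title>
   </Book>
</Bookstore>"""

def count_special_nodes(xml_bytes):
    counts = {"comments": 0, "cdata": 0, "pi": 0, "comment_bytes": 0}

    def on_comment(data):
        counts["comments"] += 1
        # Assumption: the <!-- and --> delimiters are counted in the comment size.
        counts["comment_bytes"] += len("<!--") + len(data.encode("utf-8")) + len("-->")

    def on_cdata_start():
        counts["cdata"] += 1

    def on_pi(target, data):
        # The Xml declaration is not reported here, only real instructions.
        counts["pi"] += 1

    parser = xml.parsers.expat.ParserCreate()
    parser.CommentHandler = on_comment
    parser.StartCdataSectionHandler = on_cdata_start
    parser.ProcessingInstructionHandler = on_pi
    parser.Parse(xml_bytes, True)
    counts["comment_ratio"] = counts["comment_bytes"] / len(xml_bytes)
    return counts

print(count_special_nodes(SAMPLE))
```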

Help on Nb total namespaces :
This indicator totals the number of namespaces used in the stream. A namespace shows up as a name prefix on Xml elements, Xml attributes and even content. Namespaces were first meant to solve conflicts between two Xml streams (from two different companies) whose structures used the same element or attribute names. But so far conflicts are not solved, and probably never will be, because if any one company takes the lead on the meaning of given elements, then Xml is no longer a standard : it is just a proprietary format like any closed-source binary format, thus not interoperable.
For instance, instead of declaring book, title and genre elements and so on, an authority can declare once and for all a book prefix, which gives a special meaning to book:book, book:title and book:genre. A namespace is declared by a special attribute of the form xmlns:<something> where <something> is any prefix. To "use" the namespace, you simply prefix elements, attributes and even content with this prefix. See below.
Namespaces can also be seen as a way to add data type definitions to the DTD (Document Type Definition), especially to better handle dates and currencies. But only a dedicated Xml client application can take advantage of a given namespace. There are too many implications to be covered here. Namespaces are a powerful thing, but they undermine the Xml standard itself by hiding meanings behind prefixes. Much of the work since 1997's Xml has been to popularize namespace-based technologies such as Xml schemas, in favor of the biggest companies such as Microsoft or Oracle.
Hint : avoid namespaces where you can, because they have an overhead on the stream size. Each prefix requires a small number of bytes, and summed over the whole stream this can be significant. This is revealed in this report.

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Bookstore SYSTEM "bookshop.dtd">
<Bookstore xmlns:book="http://www.microsoft.com/namespaces/monopoly/book">
   <!-- J&R Booksellers Database -->
   <book:Book book:Genre="Thriller" book:In_Stock="Yes">
      <book:Title>The Round Door</book:Title>
   </book:Book>
</Bookstore>
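
The size overhead mentioned in the hint can be estimated by summing the bytes spent on every "prefix:" in element and attribute names. The sketch below uses Python's standard xml.parsers.expat module without namespace processing, so names are reported exactly as written. The function name namespace_overhead is hypothetical, and the count assumes that every element has an end-tag.

```python
import xml.parsers.expat

SAMPLE = b"""<Bookstore xmlns:book="http://www.microsoft.com/namespaces/monopoly/book">
   <book:Book book:Genre="Thriller" book:In_Stock="Yes">
      <book:Title>The Round Door</book:Title>
   </book:Book>
</Bookstore>"""

def namespace_overhead(xml_bytes):
    """Return (declared prefixes, bytes spent on 'prefix:' in names)."""
    prefixes = set()
    overhead = 0

    def prefix_cost(name):
        # 'book:Title' costs len('book') + 1 bytes for the prefix and its colon.
        prefix, colon, _ = name.partition(":")
        return len(prefix) + 1 if colon else 0

    def on_start(name, attrs):
        nonlocal overhead
        overhead += prefix_cost(name)
        for attr in attrs:
            if attr.startswith("xmlns:"):
                prefixes.add(attr[len("xmlns:"):])   # a declaration, not a use
            else:
                overhead += prefix_cost(attr)

    def on_end(name):
        nonlocal overhead
        overhead += prefix_cost(name)                # assumes a real end-tag exists

    parser = xml.parsers.expat.ParserCreate()        # default: no namespace processing
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(xml_bytes, True)
    return prefixes, overhead

print(namespace_overhead(SAMPLE))   # ({'book'}, 30)
```

The bytes of the xmlns:book declaration itself come on top of this figure.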


Help on Structure Pattern :
The purpose of this indicator is to reveal what the Xml stream is mostly made of. So far, no tool in the software industry has taken such an approach to an Xml structure. Inspecting patterns is a key step in understanding why an Xml stream is so large.
In addition to the structure, this indicator infers the data types associated with elements and attributes. Recognized data types are : float numbers, integral numbers, currencies, dates, urls and emails. An element or attribute with no associated data type is a pure character string. Automatically extracting data types helps, for instance, to : propose a replacement for a date data type whose unnecessary time fields make it large, validate content against a known data type, facilitate Xml transforms (automatic mapping).
Hint : because the pattern is what the Xml stream is mostly made of, special attention should be paid to keeping it as small as possible. The smaller the pattern, the smaller the Xml stream.
Below is a full Xml document sample along with its extracted pattern.

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Bookstore SYSTEM "bookshop.dtd">
<Bookstore>
   <!--J&R Booksellers Database-->
   <Book Genre="" In_Stock="Yes">
      <Title>The Round Door</Title>
      <Author>Tom Evans</Author>
      <Year_Published>1996</Year_Published>
      <ISBN>0-9546-0274-3</ISBN>
      <Price>$23.00</Price>
      <Review>An Intriguing Tale Of A Round Door In A Wall</Review>
   </Book>
   <Book Genre="Non-Fiction" In_Stock="Yes">
      <Title>Creating Real Xml Applications</Title>
      <Author>Bill Eaton</Author>
      <Year_Published>1998</Year_Published>
      <ISBN>7-4562-0167-8</ISBN>
      <Price>$35.00</Price>
      <Review>A Look At How To Build Real Xml Applications</Review>
   </Book>
   <Book Genre="Fiction" In_Stock="No">
      <Title>Over The Hills Of Yukon</Title>
      <Author>Bert Colewell</Author>
      <Year_Published>1993</Year_Published>
      <ISBN>5-6524-3054-1</ISBN>
      <Price>$22.00</Price>
      <Review>A Warm Story About A Man And A Moose In Yukon</Review>
   </Book>
   <Book Genre="Fiction" In_Stock="Yes">
      <Title>The Lion's Gold</Title>
      <Author>Daphne Griswald</Author>
      <Year_Published>1989</Year_Published>
      <ISBN>6-7896-2498-2</ISBN>
      <Price>$15.00</Price>
      <Review>One Of The Most Compelling Books Since "The Tiger's Silver".</Review>
   </Book>
</Bookstore>

Extracted Pattern : (please note end-tags are not shown)

  <Book Genre In_Stock>
   <ISBN(=number)>
   <Year_Published(=number)>
   <Review>
   <Price(=currency)>
   <Title>
   <Author>
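
To give an idea of how such a pattern could be extracted, here is a minimal Python sketch. It assumes that a pattern is the name of a record element together with its attribute names and child element names, and it guesses data types with a few regular expressions. Both the function names and the type rules are illustrative assumptions; the actual rules used by the report are not documented here and may differ (the ISBN column, for instance, is not handled).

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

SAMPLE = """<Bookstore>
   <Book Genre="" In_Stock="Yes">
      <Title>The Round Door</Title>
      <Year_Published>1996</Year_Published>
      <Price>$23.00</Price>
   </Book>
   <Book Genre="Fiction" In_Stock="No">
      <Title>Over The Hills Of Yukon</Title>
      <Year_Published>1993</Year_Published>
      <Price>$22.00</Price>
   </Book>
</Bookstore>"""

def infer_type(value):
    """Very rough data type guess (illustrative rules only)."""
    if re.fullmatch(r"[+-]?\d+", value):
        return "number"
    if re.fullmatch(r"[+-]?\d+\.\d+", value):
        return "float"
    if re.fullmatch(r"\$\d+(\.\d{2})?", value):
        return "currency"
    return None                        # pure character string

def signature(elem):
    """Structure of one element: its name, attribute names and child names."""
    return (elem.tag, tuple(sorted(elem.attrib)), tuple(child.tag for child in elem))

def main_pattern(xml_text):
    root = ET.fromstring(xml_text)
    patterns = Counter(signature(record) for record in root)
    pattern, occurrences = patterns.most_common(1)[0]
    columns = {}
    for record in root:
        if signature(record) == pattern:
            for child in record:
                columns.setdefault(child.tag, []).append((child.text or "").strip())
    types = {}
    for name, values in columns.items():
        kinds = {infer_type(value) for value in values}
        # A column gets a type only when all of its values agree on it.
        types[name] = kinds.pop() if len(kinds) == 1 else None
    return {"pattern": pattern, "occurrences": occurrences,
            "distinct": len(patterns), "types": types}

print(main_pattern(SAMPLE))
```

The "occurrences" and "distinct" entries correspond to the Pattern occurences and Distinct patterns indicators.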


Help on Distinct patterns :
This is the number of distinct patterns in the Xml stream. Most of the time, a single pattern is repeated many times in the stream. More rarely, depending on the design, there may be more than one.
We are dealing with structure patterns, ie combinations of elements and attributes, regardless of the actual content enclosed by the structure. An indicator for the content can be found elsewhere in this report.

Help on Pattern occurences :
This is the number of times the main pattern occurs in the Xml stream. The higher, the more "tabular" the stream is.
By main pattern we mean the most frequent of the patterns that have been detected. Most of the time, the main pattern is simply the single pattern detected in the stream.

Help on Pattern height :
This is the number of lines, top to bottom, of the main pattern. This figure does not include every carriage return, especially those from the content itself, so there may be a slight difference between the number in the report and what you get by counting yourself.
By main pattern we mean the most frequent of the patterns that have been detected. Most of the time, the main pattern is simply the single pattern detected in the stream. Thanks to that, the pattern height times the pattern occurrences comes very close to the total number of lines in the stream (found elsewhere in the report).

Help on Pattern size :
This is the size in bytes of the main pattern, regardless of the content it encloses. By main pattern we mean the most frequent of the patterns that have been detected. Most of the time, the main pattern is simply the single pattern detected in the stream. Why is this figure important ? Because it is the basis for comparison with the potential gain obtained by flattening the pattern structure, which leads to a new flattened pattern size.

Help on Flatten Pattern size :
This is the size in bytes of the main pattern, once flattened, regardless of the content it encloses. By main pattern we mean the most frequent of the patterns that have been detected. Most of the time, the main pattern is simply the single pattern detected in the stream. This figure is compared with the original pattern size in order to show the gain on size, for the pattern instance itself, and of course for the overall stream. Below are explanations on the process of flattening patterns.

Flattening patterns is the process of replacing candidate elements with attributes, in order to avoid the overhead of end-tags in the size of the stream. What are candidate elements ? They are elements with 1-cardinality, that is elements sharing an elementary and single relation with another element or attribute. The Xml document below models a person name, and is subsequently flattened :

Original Xml sample :

 <person>
  <firstname>John</firstname>
  <lastname>Lepers</lastname>
 </person>

Modified Xml :

 <person firstname="John" lastname="Lepers"/>

which leads to a 39% gain on size just for this. When the process is performed on the whole stream, it produces a predictable gain, reported as the flatten pattern gain (see next indicator).
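The flattening step can be sketched in Python as follows. This is a minimal sketch, not the report's own algorithm : it assumes that a candidate element is a child with no attributes and no children of its own, whose name is unique within its parent. The function name flatten is hypothetical. Measured on the sample with whitespace stripped, the gain comes out at about 38% (71 bytes down to 44), in line with the figure quoted above ; the exact value depends on how whitespace is counted.

```python
import xml.etree.ElementTree as ET

def flatten(elem):
    """Turn text-only, attribute-free, uniquely named children into attributes."""
    names = [child.tag for child in elem]
    for child in list(elem):
        is_leaf = len(child) == 0 and not child.attrib
        is_unique = names.count(child.tag) == 1 and child.tag not in elem.attrib
        if is_leaf and is_unique:
            elem.set(child.tag, (child.text or "").strip())
            elem.remove(child)
        else:
            flatten(child)
    if len(elem) == 0 and not (elem.text or "").strip():
        elem.text = None     # drop leftover indentation so an empty tag is emitted
    return elem

original = "<person><firstname>John</firstname><lastname>Lepers</lastname></person>"
flat = ET.tostring(flatten(ET.fromstring(original)), encoding="unicode")
flat = flat.replace(" />", "/>")   # ElementTree writes a space before "/>"
gain = 1 - len(flat) / len(original)

print(flat)              # <person firstname="John" lastname="Lepers"/>
print(f"{gain:.0%}")     # 38%
```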

Help on Flatten Pattern Gain :
This is the gain in percent as a result of flattening all patterns in the stream. A flattened pattern is always smaller than (or in the worst case equal to) the original pattern because by design it removes end-tags. For more details, see Flatten Pattern size above.
The gain obtained by such a process invites criticism. Indeed, the Xml structure is different, so it is likely to break rigid client applications that expect the original structure. Note that this process is not actually performed on the input stream : it is only suggested. If those who design or generate Xml streams pay attention to this particular point, they may benefit in several areas of performance including size and parsing speed.
Also, as a sidenote for all transforms suggested in this report, one can question rigid Xml-based client applications that would stop working if the incoming Xml streams were slightly changed in their structure. The structure is described either by a DTD (Document Type Definition) or by any combination of namespaces, and as long as the client application simply parses keywords and doesn't make assumptions about element hierarchies, it will continue working fine. On the contrary, client applications which have hierarchies hardcoded within C++ code will fail, but that is more a matter of questionable code design on the client side than of Xml overuse.
Otherwise this gain can be added to the other gains described elsewhere in this report.

Help on Structure Depth :
The purpose of this indicator is to reveal how deeply the elements are structured in the stream. One of its key interests is that, instead of focusing on one element such as the deepest in the stream, the indicator provides all relevant statistical figures that help understand the design of the Xml stream without even viewing it on screen. Not only does this kind of indicator fill the gap of statistical figures on Xml hierarchies, it also helps to extract the key design. Automatically building the structure (a kind of reverse engineering) shows how well or how ill-suited an Xml stream is wrt optimization, design, and parsing performance.
The depth itself is an obvious measure. In the bookstore sample, the Bookstore element is at depth 1, while Book is at depth 2 (and so are its attributes), and finally Title is at depth 3.
Hint : the deeper an element is, the more closing tags are needed to go back up to depth 1, hence the bloat in size.

Help on Depth Max :
This indicator shows the highest element depth in the Xml stream. It reveals by itself how "oblique" the structure is, in terms of a pure viewing metaphor. If the highest depth is low, this doesn't always mean that the Xml stream is "vertical", because most of the data may lie very far from the highest depth. That's where the Depth Mean and the Depth Standard Deviation have their say, along with the Depth histogram chart.
On the contrary, if the highest depth is high, over 10 for instance, you know that you are dealing with a very original Xml stream ! At least, the indicator can alert you when the Xml stream processed is quite unstructured, that is not tabular at all. This suggests that the Xml stream does not follow the usual trends and is likely to experience difficulties in being integrated in a supply chain.
In the depth histogram chart, the depth max is along the horizontal axis.

Help on Depth Mean :
This indicator shows the average element depth in the Xml stream. It reveals how hierarchical or tabular the structure of the Xml stream is. If the mean is low, the stream is tabular, ie looks like a standard table. If the mean is high, the stream is likely to be hierarchical. The mean should be read together with the standard deviation : if the mean is high and the standard deviation is low, say below 1, the stream is tabular but has much of its structure at a high depth. If the standard deviation is high, the stream is hierarchical because the distribution is spread uniformly over the depths.
In the depth histogram chart, the depth mean is along the horizontal axis.

Help on Depth Standard Deviation :
This indicator shows how the Xml structure is distributed wrt the depth measure. If the standard deviation is low, say below 1, the structure is compact and stays around the mean depth. If the standard deviation is high, the distribution is more uniform and flat, so the stream looks much more hierarchical.
NB: mathematically speaking, the standard deviation is the square root of the 2nd order central moment (the variance) of the depth distribution.

Help on Occurences of depth of an element :
This indicator shows the number of occurrences of elements at a given depth. If you see 10 for depth 2, it means that out of all elements in the Xml stream, 10 of them are at depth 2. Among these 10, you may find for instance 4 <Book>, 5 <Supplier> and 1 <Order>.
In the depth histogram chart, the depth axis is the horizontal axis. The vertical axis is the number of occurrences of a given depth.
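The depth indicators (max, mean, standard deviation and the occurrences per depth) can be computed together. Below is a minimal sketch with Python's standard library, assuming the root element is at depth 1 and using the population standard deviation ; the function names are hypothetical.

```python
import statistics
import xml.etree.ElementTree as ET
from collections import Counter

SAMPLE = """<Bookstore>
   <Book Genre="Thriller" In_Stock="Yes">
      <Title>The Round Door</Title>
   </Book>
   <Book Genre="Fiction" In_Stock="No">
      <Title>Over The Hills Of Yukon</Title>
   </Book>
</Bookstore>"""

def element_depths(elem, depth=1):
    """Yield the depth of every element ; the root is at depth 1."""
    yield depth
    for child in elem:
        yield from element_depths(child, depth + 1)

def depth_stats(xml_text):
    depths = list(element_depths(ET.fromstring(xml_text)))
    return {
        "max": max(depths),
        "mean": statistics.mean(depths),
        "stdev": statistics.pstdev(depths),     # population standard deviation
        "histogram": dict(sorted(Counter(depths).items())),
    }

print(depth_stats(SAMPLE))
```

For this sample the histogram is {1: 1, 2: 2, 3: 2} : one root, two Book elements and two Title elements.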

Help on Listing of elements at a given depth :
The element names enclosed with < and > chars are those that appear in the Xml stream at a given depth.

Legend :
- horizontal axis : it is the depth axis. It starts with 1 on the left because the root element is always at depth 1, and it ends on the right at the depth given by the max depth indicator.
- vertical axis : it is the number of occurrences of elements at a given depth. In any Xml stream, there are zero, one or more element instances at a given depth. There is exactly one occurrence of a root element, so you will always find 1 at depth 1. There can be any other combinations for other depths. Of course, if there is an element living at depth 5, there must be at least one element living at depth 2, 3 and 4 : depth 5 must be reached! The (max,avg,min) labels are not related to the indicators above : they are figures for the occurrence series, not the depth series.

Help on Structure node naming strategy :
This strategy drastically reduces the size of the structure nodes, hence of the stream itself, and addresses several performance issues including size and parsing speed.
As a definition, a node is either an element or an attribute.
Element and attribute names are usually chosen so they are self-descriptive. While this looks like an advantage over binary formats, it has an overhead on size : even in English, the keywords enclosing content take a statistically significant space and contribute heavily to the overall stream size. This can be avoided by enforcing the naming strategy described below.
The process of choosing names for all nodes of an Xml structure is based on what is allowed by the W3C Xml recommendation itself : a name starts with a letter or an underscore, and continues with letters, digits, hyphens, underscores and periods. With that in hand, why not make these names as short as possible ? Let us take an example with the now well-known bookshop Xml sample.
Let's build a map of name pairs:

 Bookstore  becomes A
 Book       becomes B
 Genre      becomes C
 In_Stock   becomes D
 Title      becomes E

So we get the following equivalent Xml document :

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE A SYSTEM "bookshop_A.dtd">
<A>
   <!-- J&R Booksellers Database -->
   <B C="Thriller" D="Yes">
      <E>The Round Door</E>
   </B>
</A>

which reduces this simple structure by 41 bytes, or 33%.
Again, this transform requires changing the structure of the Xml stream, as explained for other potential gains such as the one obtained by flattening patterns (see above in this report). This gain can be added to the other gains described elsewhere in this report.
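The renaming can be automated. The following Python sketch hands out names A, B, C... in document order and counts the bytes saved ; on the bookshop sample it rebuilds the same map as above and finds the same 41 bytes. The function names are hypothetical, and the count assumes that every element has an end-tag (an element name is therefore counted twice).

```python
import itertools
import string
import xml.etree.ElementTree as ET

SAMPLE = """<Bookstore>
   <Book Genre="Thriller" In_Stock="Yes">
      <Title>The Round Door</Title>
   </Book>
</Bookstore>"""

def short_names():
    """Yield A, B, ..., Z, AA, AB, ... which are all valid Xml names."""
    for length in itertools.count(1):
        for letters in itertools.product(string.ascii_uppercase, repeat=length):
            yield "".join(letters)

def shorten(xml_text):
    root = ET.fromstring(xml_text)
    names = short_names()
    mapping = {}
    saved = 0

    def rename(old):
        if old not in mapping:
            mapping[old] = next(names)
        return mapping[old]

    for elem in root.iter():                       # document order
        old_tag = elem.tag
        elem.tag = rename(old_tag)
        saved += 2 * (len(old_tag) - len(elem.tag))   # start-tag and end-tag
        for old_attr in list(elem.attrib):
            new_attr = rename(old_attr)
            elem.attrib[new_attr] = elem.attrib.pop(old_attr)
            saved += len(old_attr) - len(new_attr)
    return ET.tostring(root, encoding="unicode"), mapping, saved

short_xml, mapping, saved = shorten(SAMPLE)
print(mapping)   # {'Bookstore': 'A', 'Book': 'B', 'Genre': 'C', 'In_Stock': 'D', 'Title': 'E'}
print(saved)     # 41
```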

Help on Nodename minimum size :
This indicator shows the size of the shortest element or attribute name, in bytes (UTF-8 encoding), regardless of the content it encloses. If the value is high, it means that the description is likely to be intuitive and human readable, but the designer didn't care much about the overhead produced on the overall stream.
Hint : do as much as you can to keep this value low. The node naming strategy can help you stick with a minimum of 1 byte, hence promote stream size reduction.

Help on Nodename maximum size :
This indicator shows the size of the longest element or attribute name, in bytes (UTF-8 encoding), regardless of the content it encloses. If the value is high, it means that the description is likely to be intuitive and human readable, but the designer didn't care much about the overhead produced on the overall stream.
Hint : do as much as you can to keep this value low. The node naming strategy can help you stick with a low maximum, hence promote stream size reduction.

Help on Nodename Mean size :
This indicator shows the average size of element and attribute names in the Xml stream. It's a key descriptor of the nodename distribution, along with the standard deviation. If the value is high, bad luck, you have a great overhead in your Xml stream.
Hint : do as much as you can to keep this value low. The node naming strategy can help you stick with a low mean, hence promote stream size reduction.

Help on Nodename Standard Deviation :
This indicator shows how the Xml structure is distributed wrt the size of element and attribute names. It's a key descriptor of the nodename distribution, along with the mean. If the standard deviation is low, say below 1, the structure is compact and stays around the mean. In other words, names are equivalent in size in the overall structure. On the contrary, if the standard deviation is high, the distribution is more uniform and flat, which means that names have quite different sizes. Questionable.
NB: mathematically speaking, the standard deviation is the square root of the 2nd order central moment (the variance) of the name size distribution.

Help on Occurences of Nodenames :
That's how many times elements or attributes whose names have the given length appear in the Xml stream. For instance, that may be how many times the <book> element and the year attribute appear. In this case, we should watch for 'Occurences of nodenames 4-byte sized' in the report.
In front of it is the resulting size fraction of the structure, also presented as a percentage. The size in bytes contributed by an element is not always its number of occurrences times its name size : some elements have end-tags, in which case the name appears twice and its contribution must be doubled.
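The occurrence count and the doubling rule for end-tags can be sketched as follows in Python. The function name name_size_report is hypothetical, and the sketch assumes that every element has an end-tag, so an element name always counts twice while an attribute name counts once.

```python
import statistics
import xml.etree.ElementTree as ET
from collections import Counter

SAMPLE = """<Bookstore>
   <Book Genre="Thriller" In_Stock="Yes">
      <Title>The Round Door</Title>
   </Book>
</Bookstore>"""

def name_size_report(xml_text):
    occurrences = Counter()   # name size -> number of nodes with that name size
    name_bytes = Counter()    # name size -> bytes spent on those names
    for elem in ET.fromstring(xml_text).iter():
        size = len(elem.tag.encode("utf-8"))
        occurrences[size] += 1
        name_bytes[size] += 2 * size          # start-tag plus end-tag
        for attr in elem.attrib:
            size = len(attr.encode("utf-8"))
            occurrences[size] += 1
            name_bytes[size] += size          # attributes have no end-tag
    return occurrences, name_bytes

occurrences, name_bytes = name_size_report(SAMPLE)
sizes = list(occurrences.elements())          # one entry per node
print(min(sizes), max(sizes), statistics.mean(sizes), statistics.pstdev(sizes))
print(dict(occurrences))                      # {9: 1, 4: 1, 5: 2, 8: 1}
print(dict(name_bytes))                       # {9: 18, 4: 8, 5: 15, 8: 8}
```

The 5-byte names illustrate the rule : Genre is an attribute (5 bytes) while Title is an element with an end-tag (10 bytes).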

Help on Listing of elements and attributes with a given name size :
The node names enclosed with ( and ) are those that appear in the Xml stream with a given name size.

Help on Total size of structure :
The total size of the structure is the number of bytes in the Xml stream taken by elements and attributes, regardless of the content itself : in other words, the total size of this structure :

<Book Genre="Thriller" In_Stock="Yes">
  <Title>The Round Door</Title>
</Book>

is the same as the size of this one :

<Book Genre="" In_Stock="">
  <Title></Title>
</Book>

In front of the size in bytes is the equivalent as a percentage of the whole stream size.
Hint : if this percentage is significant, say above 30%, then you may question the structure of the Xml stream wrt its ratio of real information vs garbage.
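One way to measure the structure alone is to blank every value and measure what is left. The Python sketch below does so with the standard xml.etree.ElementTree module ; the function name structure_size is hypothetical and, as an assumption, whitespace between tags is not counted as structure.

```python
import xml.etree.ElementTree as ET

FULL = """<Book Genre="Thriller" In_Stock="Yes">
  <Title>The Round Door</Title>
</Book>"""

EMPTY = """<Book Genre="" In_Stock="">
  <Title></Title>
</Book>"""

def structure_size(xml_text):
    """Return (skeleton, size in bytes) once all content has been blanked."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.text = ""                 # drop element content and indentation
        elem.tail = None
        for name in elem.attrib:
            elem.attrib[name] = ""     # keep the attribute, drop its value
    skeleton = ET.tostring(root, encoding="unicode", short_empty_elements=False)
    return skeleton, len(skeleton.encode("utf-8"))

skeleton, size = structure_size(FULL)
print(skeleton)                        # <Book Genre="" In_Stock=""><Title></Title></Book>
print(size)                            # 49
print(structure_size(EMPTY)[1])        # 49 as well
```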

Help on Nodename Gain :
This is the gain in percent as a result of enforcing the node naming strategy, as described above. In front is the resulting size in bytes of the structure alone, once updated.
This gain can be added to the other gains described elsewhere in this report.

Legend :
- horizontal axis : it is the nodename size axis. It of course starts with 1 on the left because an element or attribute is at least 1 byte. Most Xml streams are within the range [1,12] bytes.
- vertical axis : it is the number of occurrences of elements and attributes with a given name size.

Help on Structure attributes :
The Structure attributes indicator reveals how uniformly attributes are distributed. Besides the standard figures on the number of attributes per element (min, max, mean and standard deviation), it gives the disorder ratio.
The disorder ratio attempts to show whether attributes are listed in the same order from one element instance to the next. It is of course an average, because each element may have any number of attributes. According to the W3C Xml recommendation, the order of attributes is not significant ; it is simply a good habit to always list attributes in the same order.

Help on Min Attributes per element :
This is the smallest number of attributes found on any element in the Xml stream. If at least one element has no attribute, then this is 0.

Help on Max Attributes per element :
This is the largest number of attributes found on any element in the Xml stream.
If this is low, say below 2, that's because the Xml stream is very "vertical" : it uses element hierarchies instead of attributes. This may be ok, but it comes with a serious overhead on stream size.

Help on Mean Attributes per element :
This is the average number of attributes per element in the Xml stream.
If this is low, say below 2, that's because the Xml stream is very "vertical" : it uses element hierarchies instead of attributes. This may be ok, but it comes with a serious overhead on stream size.

Help on Standard Deviation Attributes per element :
This indicator shows how the Xml structure is distributed wrt the number of attributes per element.
This is 0.00 if there is no attribute at all in the Xml stream. If the standard deviation is low, say below 1, it means that attributes are uniformly distributed between elements.
NB: mathematically speaking, the standard deviation is the square root of the 2nd order central moment (the variance) of the attributes-per-element distribution.

Help on Attributes disorder ratio :
This indicator, expressed as a percentage, shows how unordered the attributes are.
The higher, the more unordered. It is 0.00 if there is no attribute at all in the Xml stream.
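The exact formula of the disorder ratio is not given here, so the Python sketch below relies on an assumed definition : for each element name, the most frequent attribute ordering is taken as the reference, and the ratio is the percentage of element instances (with at least two attributes) whose ordering differs from it. The function name disorder_ratio is hypothetical, and with this definition an instance with a different set of attributes also counts as unordered.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SAMPLE = """<Bookstore>
   <Book Genre="Thriller" In_Stock="Yes"/>
   <Book Genre="Fiction" In_Stock="No"/>
   <Book In_Stock="Yes" Genre="Fiction"/>
   <Book Genre="Non-Fiction" In_Stock="Yes"/>
</Bookstore>"""

def disorder_ratio(xml_text):
    """Percentage of element instances whose attribute order is unusual (assumed definition)."""
    orders = {}
    for elem in ET.fromstring(xml_text).iter():
        if len(elem.attrib) >= 2:
            # The attrib dictionary keeps attributes in document order.
            orders.setdefault(elem.tag, []).append(tuple(elem.attrib))
    total = sum(len(seen) for seen in orders.values())
    if total == 0:
        return 0.0
    out_of_order = 0
    for seen in orders.values():
        reference, _ = Counter(seen).most_common(1)[0]
        out_of_order += sum(1 for order in seen if order != reference)
    return 100.0 * out_of_order / total

print(disorder_ratio(SAMPLE))   # 25.0 : one Book out of four lists In_Stock first
```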

Help on Structure Namespaces :
Namespaces are declared by using a special attribute of the form xmlns:supplier="http://www.namespaces.com/supplier", and each one refers to a set of element and attribute names with a dedicated semantic meaning. Elements and attributes that belong to a namespace are prefixed by the namespace prefix, for instance supplier:orderID. Namespaces are not required in Xml streams. They give Xml streams special meanings and may simplify data binding, as long as namespaces and their underlying meanings are made public and available to everyone. Any number of namespaces can be used, not only one. A namespace must always be declared before it is used. The URL used for the declaration here is a fake URL that only serves global uniqueness. Below is a sample for the supplier namespace :

<?xml version="1.0" encoding="ISO-8859-1"?>
<Orders xmlns:supplier="http://www.namespaces.com/supplier">
   <Order date="AA/45/10" supplier:id="UIYBAB47KDIU75">
      <Id>NBZYSJSGSIAUSYGHBXNBJDUIUYE</Id>
  </Order>
</Orders>

Help on Use of namespaces :
This indicator shows the ratio of use of namespaces in element and attribute names. If no namespaces are used, it shows N/A, and there is no accompanying list of namespaces.

Help on List of namespaces :
That's the enumeration of namespaces used in the Xml stream. Because namespaces are not required in Xml streams, this listing may end up empty.

Help on Raw Content :
While the previous sections dealt with structure-only indicators, this section is entirely devoted to the content of the Xml stream, that is the values enclosed by elements, and attribute values.
As content is everywhere in the Xml stream, indicators follow one rule : we mostly compare content values for a given element, resp. for a given attribute. That's what we call a LOV (List Of Values) or simply a column. Below is a sample :

  ...
  <name firstname="John" lastname="Fitzgerald"/>
  <name firstname="Matt" lastname="Kassv"/>
  <name firstname="Steven" lastname="Witcold"/>
  <name firstname="Thomas" lastname="FukSiebl"/>
  <name firstname="John" lastname="Smith"/>
  ...

In this sample, there are two elementary Lists Of Values : the one with all firstname attribute values, and the one with all lastname attribute values. There are obvious relations inside each List Of Values. For instance, in firstname, John appears twice, so we may come up with a value duplication rate. Such figures stress the need to factorize content to avoid size overhead. Of course, indicators do not go down to the level of detail of a single List Of Values : they are averages or ratios over the whole Xml stream.
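Building the Lists Of Values and a duplication rate can be sketched in Python as below. The duplication rate is taken here as one minus the share of distinct values, which is an assumption, since the report's own formula is not given. The function names and the <names> wrapper element are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = """<names>
  <name firstname="John" lastname="Fitzgerald"/>
  <name firstname="Matt" lastname="Kassv"/>
  <name firstname="Steven" lastname="Witcold"/>
  <name firstname="John" lastname="Smith"/>
</names>"""

def lists_of_values(xml_text):
    """Group values by (element name, attribute name) ; None stands for element text."""
    columns = defaultdict(list)
    for elem in ET.fromstring(xml_text).iter():
        for attr, value in elem.attrib.items():
            columns[(elem.tag, attr)].append(value)
        text = (elem.text or "").strip()
        if text:
            columns[(elem.tag, None)].append(text)
    return columns

def duplication_rate(values):
    """Share of values that repeat an earlier one (assumed definition)."""
    return 1 - len(set(values)) / len(values)

columns = lists_of_values(SAMPLE)
rates = {key: duplication_rate(values) for key, values in columns.items()}
print(rates)                  # firstname: 0.25, lastname: 0.0
print(max(rates.values()))    # 0.25, the highest duplication over all columns
```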

Help on Content minimum size :
This is the minimum size in bytes of a value in the Xml stream.
It is measured in the UTF-8 encoding charset, so for mostly ASCII text it is roughly half the size of the UTF-16 Unicode equivalent. If there is at least one occurrence of an empty value (either in an element or in an attribute), then this is 0. If this figure is high, say above 20, then one may question whether Xml is a suitable format.

Help on Content maximum size :
This is the maximum size in bytes of a value in the Xml stream.
It is measured in the UTF-8 encoding charset, so for mostly ASCII text it is roughly half the size of the UTF-16 Unicode equivalent. If this figure is high, say above 20, then one may question whether Xml is a suitable format.

Help on Content mean size :
This is the average size in bytes of values in the Xml stream.
This figure summarizes the distribution of content size in the stream. If this figure is high, say above 20, then one may question whether Xml is a suitable format.

Help on Empty element tags :
This is the ratio of empty element tags in the Xml stream.
An empty element tag is an element with no content inside. There are several syntaxes including :

First syntax for an empty element tag

  <book></book>

Second syntax for an empty element tag

  <book>
   <title>The Round Door</title>
  </book>

Third syntax for an empty element tag

  <book/>

In all three cases, the associated ListOfValues of the book tag is empty.
An empty element tag is not an element that merely misses a value in some occurrences : it is an element with no content in all of its occurrences.
If there is a significant number of empty element tags, then one may question why elements are used instead of simple attributes, which would avoid the overhead in size. However, empty element tags may add some form of value through their relationship with other element tags, especially hierarchy.

Help on Empty attribute tags :
This is the counterpart of Empty element tags. This is the ratio of empty attribute tags in the Xml stream.
An empty attribute tag commonly refers to an attribute for which none of the occurrences has a value. Below is a sample:

Sample of an empty attribute tag, here birthyear

 <person name="Smith" birthyear=""/>
 <person name="Zergov" birthyear=""/>
 <person name="Nelson" birthyear=""/>

Sample of what's NOT an empty attribute tag

 <person name="Smith" birthyear=""/>
 <person name="Zergov" birthyear="1980"/>
 <person name="Nelson" birthyear=""/>

If there are empty attribute tags, then one may question why they are used.
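The two samples above can be checked with a short Python sketch. It assumes that an attribute is identified by the pair (element name, attribute name); the function name empty_attribute_tags and the people wrapper element are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def empty_attribute_tags(xml_text):
    """Return (element, attribute) pairs whose value is empty in all occurrences."""
    has_value = defaultdict(bool)
    for elem in ET.fromstring(xml_text).iter():
        for name, value in elem.attrib.items():
            has_value[(elem.tag, name)] |= bool(value.strip())
    return {key for key, filled in has_value.items() if not filled}

all_empty = """<people>
 <person name="Smith" birthyear=""/>
 <person name="Zergov" birthyear=""/>
 <person name="Nelson" birthyear=""/>
</people>"""

one_filled = """<people>
 <person name="Smith" birthyear=""/>
 <person name="Zergov" birthyear="1980"/>
 <person name="Nelson" birthyear=""/>
</people>"""

print(empty_attribute_tags(all_empty))   # birthyear is empty everywhere
print(empty_attribute_tags(one_filled))  # birthyear has one value: nothing reported
```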

Help on Multiple part values :
This ratio stresses the significance of values in element tags that are split into several parts.
A multiple part value is either a multiline value or a value spread over more than one text fragment of an element. A multiple part value is always related to an element value, not an attribute value (see the W3C Xml specification). Below are two samples of multiple part values :

  <book>
   The name of this book is so inadequate for a general audience
   that it has been decided not to print it.
  </book>
  ...
  <book>The Round Door
   <year>1999</year>
   <price>20$</price>Part II
  </book>

The actual value for the second sample is "The Round Door Part II".
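How the parts of the second sample are reassembled can be sketched in Python with the standard ElementTree module, where the text before the first child is in text and the text after each child is in that child's tail. The helper names value_parts and is_multiple_part are hypothetical, and treating any value that contains a line break as multiline is an assumption.

```python
import xml.etree.ElementTree as ET

def value_parts(elem):
    """Return the non-empty text fragments that belong directly to elem."""
    fragments = [elem.text] + [child.tail for child in elem]
    return [f.strip() for f in fragments if f and f.strip()]

def is_multiple_part(elem):
    """True if the value spans several fragments or several lines (assumed rule)."""
    parts = value_parts(elem)
    return len(parts) > 1 or any("\n" in part for part in parts)

book = ET.fromstring("""<book>The Round Door
   <year>1999</year>
   <price>20$</price>Part II
  </book>""")

parts = value_parts(book)
print(parts)
print(" ".join(parts))
print(is_multiple_part(book))
```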

Help on Content correlation :
The content correlation is an in-depth examination of the Lists Of Values that reveals valuable information. The first indicator relates to duplication, that is, how often the same values appear again and again. The second indicator is a ranking: it shows the most frequent value across all Lists Of Values.

Help on Max Content Duplication :
This ratio is the highest level of value duplication found among the set of Lists Of Values.
If the ratio is high, then at least one List Of Values has highly correlated values and is therefore a candidate for factorization.
The maximum ratio works together with other indicators, such as the Mean and the Standard Deviation, to show the distribution of correlation across the whole set of Lists Of Values.

Help on Mean Content Duplication :
This is the average duplication in the Xml stream. It works with the Standard Deviation to reveal the distribution of correlation in the Lists Of Values.
It is based on the examination of the set of Lists Of Values. A low mean is a fairly good indication that the content is unique and distinct. Conversely, a high mean indicates that several Lists Of Values are candidates for factorization.

Help on Standard Deviation Content Duplication :
This is the standard deviation of correlation across the set of Lists Of Values, and it reveals how correlation is distributed. If this is low, say below 1, that's because only one List Of Values has correlated values. If this is high, that's because the Xml stream is not exactly optimized.

Help on Topmost Duplicated Content :
This is a value extracted from the Xml stream. It is based on the ranking of correlation in the Lists Of Values: the value shown is the first in this ranking.
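The four duplication indicators can be illustrated with a minimal Python sketch. The exact formula used by the tool is not given here, so the sketch assumes that the duplication ratio of a List Of Values is the share of values that repeat an earlier one; the sample data and all names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

def duplication_ratio(values):
    """Share of values that repeat an earlier one (assumed definition)."""
    if not values:
        return 0.0
    return (len(values) - len(set(values))) / len(values)

# Hypothetical Lists Of Values: one per attribute name.
lists_of_values = {
    "firstname": ["John", "Mary", "John", "Paul"],
    "lastname": ["Smith", "Jones", "Brown", "Green"],
}

ratios = {name: duplication_ratio(vals) for name, vals in lists_of_values.items()}
max_ratio = max(ratios.values())
mean_ratio = mean(ratios.values())
std_ratio = pstdev(ratios.values())

# Ranking of values across all Lists Of Values; the first one is the topmost.
ranking = Counter(v for vals in lists_of_values.values() for v in vals)
top_value, top_count = ranking.most_common(1)[0]

print(ratios, max_ratio, mean_ratio, std_ratio)
print(top_value, top_count)
```

With this data, only firstname contains a repeated value, so it alone determines the maximum, and John comes first in the ranking.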

Help on Content spacing :
The Content spacing topic has two indicators. The first is the indentation size ratio, as many Xml streams are gracefully decorated with useless spaces and tabs. The second reveals whether there is a significant amount of multiple spaces between attributes.
Both indentation and multiple spaces are useless.

Help on Content Indentation :
Indentation is decoration: it is meant to make Xml human-readable, but it has a high overhead in size. Let's see the sample below :
This one is 150 bytes. Now let's show the same content without indentation :

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Bookstore SYSTEM "bookshop.dtd">
<Bookstore><!--J&R Booksellers Database--><Book Genre="Thriller" In_Stock="Yes"><Title>The Round Door</Title></Book></Bookstore>

This one reduces the size down to 136 bytes. If we take the whole Bookshop.xml sample, which by the way declares only 4 books, then the gain is 16%. As a rule of thumb, the indentation ratio increases with the Xml stream size and may even exceed the real content size.

Help on Content Indentation Gain :
This is the resulting gain if we remove indentation from the whole Xml stream. The average gain is 10%.
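A rough way to estimate this gain is sketched below in Python. It assumes that indentation means the leading spaces and tabs of each line, and it ignores cases where such white space is significant (mixed content, xml:space="preserve"); the function name indentation_gain and the sample are hypothetical.

```python
def indentation_gain(xml_text):
    """Return (indentation bytes, share of the total stream size)."""
    total = len(xml_text.encode("utf-8"))
    indent = sum(
        len(line) - len(line.lstrip(" \t"))  # leading spaces and tabs only
        for line in xml_text.splitlines()
    )
    return indent, (indent / total if total else 0.0)

sample = (
    "<Bookstore>\n"
    '  <Book Genre="Thriller" In_Stock="Yes">\n'
    "    <Title>The Round Door</Title>\n"
    "  </Book>\n"
    "</Bookstore>\n"
)
indent_bytes, ratio = indentation_gain(sample)
print(f"{indent_bytes} bytes of indentation, {ratio:.2%} of the stream")
```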

Help on white spaces in Content :
This indicator shows whether there is a significant amount of multiple white spaces between attributes in the Xml stream. Of course, multiple white spaces are useless: a single one suffices. Unlike HTML parsers, most Xml parsers require at least one white space between attributes, so most of the time this indicator reports a "not significant" ranking. Below is a sample of multiple white spaces :

  <book year="1999"  price="20$"/>

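Detecting such runs of white space can be sketched with a regular expression in Python. This is a simplified heuristic, not a parser: it assumes an attribute boundary looks like a closing quote, white space, then a name followed by an equals sign, and it can be fooled by attribute values that contain similar text. The names below are hypothetical.

```python
import re

# Assumed attribute boundary: closing quote, 2+ white spaces, then name=
MULTI_SPACE = re.compile(r"""["']\s{2,}(?=[\w:.-]+\s*=)""")
# Start tags and empty tags only (skips comments, declarations, end tags).
START_TAG = re.compile(r"<[^!?/][^<>]*>")

def multiple_spaces_between_attributes(xml_text):
    """Count attribute boundaries made of more than one white space."""
    return sum(len(MULTI_SPACE.findall(tag)) for tag in START_TAG.findall(xml_text))

print(multiple_spaces_between_attributes('<book year="1999"  price="20$"/>'))
print(multiple_spaces_between_attributes('<book year="1999" price="20$"/>'))
```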

Stream Size	8400 bytes
Nb total lines	329
Nb total elements	258
Nb total attributes	0
Nb total comments	0
Nb total CDATA sections	0
Nb total P. Instructions	0
Nb total namespaces used	0

Distinct patterns	1
Pattern occurences	53
Pattern height	5 lines
Pattern size	54 bytes
Flatten Pattern size	27 bytes	(50.00% less)
Gain obtained by flattening patterns	17.04%	(on overall stream size)

Max	4
Mean	3.55
Standard deviation	0.55
Total elements at depth 1	1 / 258	(0.39%)
<REPORTCARD>
Total elements at depth 2	15 / 258	(5.81%)
<STUDENT>
Total elements at depth 3	83 / 258	(32.17%)
<FNAME> <LNAME> <COURSE>
Total elements at depth 4	159 / 258	(61.63%)
<MARK> <SECTION> <COURSENAME>

Nodename Minimum size	4 bytes
Nodename Maximum size	10 bytes
Nodename Mean size	6.57 bytes
Nodename Standard deviation	1.66
Occurences of nodenames 1-byte sized	0 times
Occurences of nodenames 2-byte sized	0 times
Occurences of nodenames 3-byte sized	0 times
Occurences of nodenames 4-byte sized	53 times	(424 bytes, 12.50%)
(MARK)
Occurences of nodenames 5-byte sized	30 times	(300 bytes, 8.84%)
(FNAME, LNAME)
Occurences of nodenames 6-byte sized	53 times	(636 bytes, 18.75%)
(COURSE)
Occurences of nodenames 7-byte sized	68 times	(952 bytes, 28.07%)
(STUDENT, SECTION)
Occurences of nodenames 8-byte sized	0 times
Occurences of nodenames 9-byte sized	0 times
Occurences of nodenames 10-byte sized	54 times	(1080 bytes, 31.84%)
(REPORTCARD, COURSENAME)
Total size of structure	3392 bytes
Gain obtained by reducing nodename sizes	34.24%	(new structure size is 516 bytes)

Minimum Content size	2 bytes
Maximum Content size	9 bytes
Mean Content size	3.72 bytes
Ratio of empty element tags	0 / 258	(0.00%)
Ratio of empty attribute tags	N/A
Ratio of multiple part values	0.00%

Min Attributes per element	0
Max Attributes per element	0
Mean Attributes per element	0.00
Standard Deviation Attributes per element	0.00
Attributes disorder ratio	0.00%

Max content duplication ratio	83.02%
Mean content duplication	22.17%	(content is somewhat correlated)
Standard Deviation content duplication	60.97
Topmost duplicated content	"01"
 
Indentation	2853 bytes	(33.96% of overall stream size)
Gain obtained by removing indentation	33.96%
White spaces between attributes	not significant
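
Several figures in this report can be re-derived from one another. The Python sketch below reproduces them from the numbers printed above. The interpretations are inferred from the arithmetic, not documented by the tool: the unlabeled Mean of 3.55 is read as the mean element depth, each node name is assumed to be counted twice (start tag and end tag), and the reduced structure is assumed to use 1-byte names.

```python
stream_size = 8400  # bytes, from the report

# Elements per depth, from the report.
depth_counts = {1: 1, 2: 15, 3: 83, 4: 159}
total_elements = sum(depth_counts.values())
mean_depth = sum(depth * n for depth, n in depth_counts.items()) / total_elements

# Flattening: 53 pattern occurrences shrink from 54 to 27 bytes each.
pattern_gain = 53 * (54 - 27) / stream_size

# Node names: (name length in bytes, occurrences), from the report.
name_counts = [(4, 53), (5, 30), (6, 53), (7, 68), (10, 54)]
structure_size = sum(2 * length * n for length, n in name_counts)  # start + end tag
reduced_structure = 2 * total_elements  # assumed 1-byte names, counted twice
nodename_gain = (structure_size - reduced_structure) / stream_size

indentation_ratio = 2853 / stream_size

print(total_elements)              # 258
print(f"{mean_depth:.2f}")         # 3.55
print(f"{pattern_gain:.2%}")       # 17.04%
print(structure_size)              # 3392
print(reduced_structure)           # 516
print(f"{nodename_gain:.2%}")      # 34.24%
print(f"{indentation_ratio:.2%}")  # 33.96%
```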