Office 2007 .bin file formatStephane Rodriguez, August 2006 - this document is not endorsed by Microsoft.
Download OLE read/write (C++) - 9 Kb The new Office 2007 file formats are ZIP files that contain parts some of which are XML, some other are native file formats such as JPEG pictures, remaining binary parts end up being referred to as BIN parts. BIN parts are of particular interest for the file format consumer or updater since the underlying file formats are undocumented (at the time of writing, August 10 2006) and are several additional file formats to deal with. BIN parts appear in a number of cases. If you insert a VBA macro or an OLE object in a Word 2007, Excel 2007 or Powerpoint 2007 document, then there will be one or more BIN parts of interest. BIN parts are zip entries consisting of files with extension .BIN, that actually contain their own file format depending on the MIME type defined in the relationships part (xxx.rels) :
Before I get into further details about these two BIN parts, there are also BIN parts introduced by a new variant of the Excel 2007 file format known as Excel binary workbook which is a file ending with .XLSB. Apparently for performance reasons, it was decided to store an Excel file using a number of BIN parts instead of XML parts. Those BIN parts are a subset of the XML parts most affected by performance and scalability issues, most noteworthy each worksheet because of its arbitrary size. For some reason, the workbook, styles and a number of other small parts are also BIN parts despite that those contribute only marginally to the overall processing of the workbook. Again, there are underlying file formats to deal with for both the consumer and the implementer.
In addition to VBA projects and embedded OLE objects, we find BIN parts in Excel 2007 .XLSB files for the following reasons :
Note that the MIME content types don't differentiate whether the actual files are stored in XML or BIN. Out of the full list of parts taken from the Ecma specs, other optional parts of the Excel 2007 .XLSB file format are generally left as XML and thus the same than regular Excel 2007 files (.XLSX, .XLSM, ...) are :
A notable exception pointed above are legacy drawings, stored using the VML file format. While VML is XML markup, it requires an outstanding effort to read, write and possibly render from it. When legacy drawings are dropped in, they may (OLE object for instance) or may not (comment for instance) contain relations markup to other parts. Printer settings in Excel 2007 files are always stored as BIN parts whether it's .XLSB or not.
Reading or updating vbaProject.bin partsIn previous versions of Word, Excel and Powerpoint, VBA projects were stored as an OLE sub-container of the OLE document container. For the record, .doc / .dot / .xls / .xlt / .xlm / .xla / .ppt / .pot / .ppm / .ppa files are OLE document containers. As pictured below, we have created a Word 97 document and added a VBA macro to it. Notice the Macros sub-container. It contains a VBA container, which contains a number of streams, as well as two other streams. To view an OLE document container, you can either use one of the tools part of Visual Studio 6.0 known as the DocFile Viewer, or you can use an OLE viewer freely available here.
If you double-click on a stream, you can see the actual content. Obviously, each streams is a file format by itself, that needs to be paid special attention if you are hoping to read or update it. A basic scenario is to make a replacement of one stream by another, and does not imply you know anything that constitutes the streams themselves. Let's return to Word 2007, Excel 2007 and Powerpoint 2007. What you can do is extract the vbaProject.bin zip entry from a file with an inserted VBA macro and open it in the OLE viewer. What a surprise indeed when you see the following appear :
vbaProject.bin is the content of the Macros container defined above. To read or update vbaProject.bin parts, you need native API calls represented by IStream for streams, and IStorage for containers. As a sidenote, the mandatory reliance on native API calls makes any client code unable to execute in a partial trust environment such as Click-Once : executing native code implies full trust. Below is a sample code that reads an arbitrary OLE container using C++ and C#. Download OLE read/write (C++) - 9 Kb
Reading or updating oleObjectxxx.bin partsMuch like the vbaProject.bin parts, oleObjectxxx.bin parts are OLE sub-containers. You can use the same tool (Doc File viewer or equivalent) to view the content of the file, and you can use the same source code provided in the previous section to read or update that file. As an example, let's create a simple Excel 97 document, insert an OLE object in it (a Wordpad document for instance), then close it and view the resulting .xls file in the OLE viewer. Notice a MBD0032B277 sub-container with two streams inside :
Now let's return to Word 2007, Excel 2007 or Powerpoint 2007. Just extract the oleObjectxxx.bin part from any such file were you have also inserted an OLE object, open it in the OLE viewer, to see something equivalent to :
Download OLE read/write (C++) - 9 Kb What we have figured out so far is that BIN parts in the new file formats contain different underlying structures although they share common interfaces to traverse it (IStream/IStorage). To read and update VBA macros parts and OLE objects parts, you need interfaces to IStream and IStorage, and possibly the knowledge of the underlying content of streams (not a requirement in basic replacement scenarios). That applies equally to Word, Excel and Powerpoint. With Excel 2007 binary workbooks however, other BIN parts don't follow the structure and content.
Reading Excel 2007 BIN partsThe remainder of the article describes some of the BIN parts. Source code is provided in C++ and C# to read (and possibly write) those BIN parts. In addition to VBA project parts and OLE objects parts being documented in the first part of this article, in green are the parts about to get documented :
Those parts in green are sufficient to read and possibly update arbitrary cells complete with associated formatting in an Excel 2007 workbook.
Introducing BIFF12Each BIN part may be made up of its own underlying structure. Fortunately, most BIN parts share a common structure known as BIFF12. This word I have made up based on the name of the binary file format name of older Excel versions, where BIFF stands for Binary Interchange File Format. The history of BIFF is worth an article of its own, but the most interesting thing to know about it is that the latest known major revision was BIFF8, introduced in Excel 97. And the Excel team over at Microsoft were apparently so traumatized by the file format snafu that occured when customers were forced to migrate from Excel 95 to Excel 97, two different file formats, that they have never considered changing the file format version again, despite adding a ton of new records in BIFF8 in Excel 2000, Excel XP and Excel 2003. Excel 2007 continues that trend in the sense that, if you save back an Excel 2007 file as a "Excel97-2003" compatible file, then Excel 2007 will save the bulk of its features in new BIFF8 records, thereby enabling round-tripping scenarios. BIFF is an OLE stream part inside an OLE document container. VBA projects, OLE objects, document summary properties, and XML maps (Excel 2003) are also stored as separate OLE streams. BIFF8 is basically a sequence of records where each record is identified using two bytes, followed by the length of the record also in two bytes, followed by the record content itself. That's all there is. If you'd like to get the last public BIFF8 documentation from Microsoft (bundled with the MSO documentation), consider buying MSDN Library CDs of March 1998. BIFF12 kind of inherits this, but chose to make some interesting changes, like disregarding the existing BIFF8 record identifiers. Just like BIFF8, BIFF12 consists of a sequence of records consisting in an identifier, a length and the record content itself. Where it differs is that both the record identifiers and the record length are encoded using a variable-length technique. It works as follows : the first byte of the record identifier is read. If the most significant bit of that byte is set to 1, then another byte will have to be read, up to a maximum of 4 bytes (i.e. the record identifier can always be stored in a DWORD). This most significant bit is irrelevant and thus appropriate shifting needs to occur to construct a record identifier. So for example if you read byte 0x80, the most significant bit is set, you need to read another byte. Let's assume the other byte is 0x01, the most significant bit is not set, so the record identifier is obtained. Shifting aside, the record identifier is 0x0180. Note that, because you may want to match this record identifier against a number of known record identifiers, you can do the match using the unshifted record identifier, and as a bonus be able to figure out record identifiers in a BIFF12 hexadecimal dump in a straight forward manner. It works the same for the record length. Shifting to the left in the containing DWORD must occur in order to construct a proper length.
Reading BIFF12 recordsNote that, because Windows uses the little Endian notation, record identifiers and any two-byte, 4-byte or 8-byte value must be read right-to-left. If you are using C# and store the BIN part in a byte[] array, then you must provide the appropriate function helpers to decode such structure. Here is how to decode words (2 bytes), dwords (4 bytes), single-precision floats (4 bytes), double-precision floats (8 bytes), and strings : public static UInt16 GetWord(byte[] buffer, UInt32 offset) { UInt16 val = (UInt16) (buffer[offset + 1] << 8); val += (UInt16) (buffer[offset + 0]); return val; } public static UInt32 GetDword(byte[] buffer, UInt32 offset) { return ((UInt32)(buffer[offset + 3]) << 24) + ((UInt32)(buffer[offset + 2]) << 16) + ((UInt32)(buffer[offset + 1]) << 8) + ((UInt32)(buffer[offset + 0])); } public double GetDouble(byte[] buffer, UInt32 offset) { double d = 0; // ReadDouble() can read a IEEE 8-byte double straight from a buffer using (MemoryStream mem = new MemoryStream()) { BinaryWriter bw = new BinaryWriter(mem); for (UInt32 i = 0 ; i < 8; i++) bw.Write(buffer[offset + i]); mem.Seek(0,SeekOrigin.Begin); BinaryReader br = new BinaryReader(mem); d = br.ReadDouble(); br.Close(); bw.Close(); } return d; } public static String GetString(byte[] buffer, UInt32 offset, UInt32 len) { StringBuilder sb = new StringBuilder((int)len); for (UInt32 i = offset; i < offset + 2 * len; i += 2) sb.Append((Char)GetWord(buffer, i)); return sb.ToString(); } public static bool GetRecordID(byte[] buffer, ref UInt32 offset, ref UInt32 recid) { recid = 0; if (offset >= buffer.Length) return false; byte b1 = buffer[offset++]; recid = (UInt32)(b1 & 0x7F); if ((b1 & 0x80) == 0) return true; if (offset >= buffer.Length) return false; byte b2 = buffer[offset++]; recid = ((UInt32)(b2 & 0x7F) << 7) | recid; if ((b2 & 0x80) == 0) return true; if (offset >= buffer.Length) return false; byte b3 = buffer[offset++]; recid = ((UInt32)(b3 & 0x7F) << 14) | recid; if ((b3 & 0x80) == 0) return true; if (offset >= buffer.Length) return false; byte b4 = buffer[offset++]; recid = ((UInt32)(b4 & 0x7F) << 21) | recid; return true; } As a sidenote, the structure of strings stored inside records is the following : 4 bytes for the length (encoded in little Endian) which defines the number of string characters (not bytes) to follow, followed by such number of Unicode characters (2 bytes each, also encoded in little Endian). The strings are never zero-terminated, and I have never encountered in this reverse engineering game strings encoded in something else than Unicode. Once you've got this, you can read a BIFF12 structure with code like this : // A generic BIFF12 part is a sequence of BIFF12 records // A BIFF12 record is a record identifier, followed by a record length, followed by the content itself // The record identifier is stored in variable length // The record length is stored in variable length as well // The record content is arbitrary content whose underlying structure is associated to the // the record identifier, and is defined once for all by the implementers of the file format // It is the responsability of a record handler to parse any underlying record structure UInt32 offset = 0; while (offset < buffer.Length) { UInt32 recid = 0; UInt32 reclen = 0; if (!BaseRecord.GetRecordID(buffer, ref offset, ref recid) || !BaseRecord.GetRecordLen(buffer, ref offset, ref reclen)) { Console.WriteLine("***Damaged buffer***"); break; } // h is a hashTable which registers record handlers BaseRecord recHandler = (BaseRecord) h[recid]; if (recHandler != null) { Console.Write( String.Format("<{0}>\r\n[rec=0x{1:X} len=0x{2:X}]", recHandler.GetTag(), recid, reclen) ); for (int i = 0; i < reclen; i++) { Console.Write( String.Format(" {0:X2}", buffer[offset + i]) ); } Console.WriteLine(); // decode the record content itself, // and possibly the underlying structure if any recHandler.Read(buffer, ref offset, recid, reclen, h, w); if (offset == UInt32.MaxValue) { Console.WriteLine("***Damaged buffer***"); break; } } else { Console.Write( String.Format("[rec=0x{0:X} len=0x{1:X}]", recid, reclen) ); // we have no idea what this thing is, just dump the content in hexa for (int i = 0; i < reclen; i++) { Console.Write( String.Format(" {0:X2}", buffer[offset + i]) ); } Console.WriteLine(); } offset += reclen; Console.WriteLine(); } When applying this code to a worksheet BIN part, it produces the following : // Here is how to read what follows // // For each record known by the BIFF12 reader, we come up with the XML markup tag associated // to the record. This provides clues as how to swap from the XML part to the BIN part // and vice versa and is easier to understand. // // Followed by square brackets enclosing the record identifier and associated length // // Followed by the record content itself (i.e. nothing if the length is zero) // // Anytime the record has an underlying structure (as with <sheetData>, the structure // is decoded, and human readable info is provided). *** Dumping a worksheet part <worksheet> [rec=0x181 len=0x0] <sheetPr> [rec=0x193 len=0xF] C9 04 02 00 40 00 00 00 00 00 00 00 00 00 00 info : <tabColor rgb=.../> info : <outlinePr showOutlineSymbols=.../> info : <pageSetUpPr .../> info : </sheetPr> <dimension> [rec=0x194 len=0x10] 04 00 00 00 04 00 00 00 00 00 00 00 07 00 00 00 info : r1=4, c1=0, r2=4, c2=7 <sheetViews> [rec=0x185 len=0x0] <sheetView> [rec=0x189 len=0x1E] DC 03 00 00 00 00 01 00 00 00 00 00 00 00 40 00 00 00 64 00 00 00 00 00 00 00 00 00 00 00 <selection> [rec=0x198 len=0x24] 03 00 00 00 04 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 04 00 00 00 04 00 00 00 00 00 00 00 FF 3F 00 00 </sheetView> [rec=0x18A len=0x0] </sheetViews> [rec=0x186 len=0x0] <sheetFormatPr> [rec=0x3E5 len=0xC] FF FF FF FF 08 00 2C 01 00 00 00 01 <cols> [rec=0x386 len=0x0] info : colmin=1, colmax=2, width=9,140625, style=1, outline=false, resize=false, hidden=false info : colmin=4, colmax=5, width=9,140625, style=2, outline=false, resize=false, hidden=false info : colmin=6, colmax=6, width=9,140625, style=2, outline=true, resize=true, hidden=false info : colmin=7, colmax=7, width=9,140625, style=0, outline=true, resize=true, hidden=false info : colmin=8, colmax=8, width=0, style=0, outline=false, resize=true, hidden=true info : colmin=9, colmax=9, width=11, style=0, outline=false, resize=true, hidden=false </cols> [rec=0x387 len=0x0] <sheetData> [rec=0x191 len=0x0] info : row=4, height=405, style=0, outline=false, resize=true, hidden=false info : col=0, style=0, v:stringindex=0 v:string=a </sheetData> [rec=0x192 len=0x0] [rec=0x497 len=0x42] 00 00 00 00 00 00 01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 <printOptions> [rec=0x3DD len=0x2] 10 00 <pageMargins> [rec=0x3DC len=0x30] 66 66 66 66 66 66 E6 3F 66 66 66 66 66 66 E6 3F 00 00 00 00 00 00 E8 3F 00 00 00 00 00 00 E8 3F 33 33 33 33 33 33 D3 3F 33 33 33 33 33 33 D3 3F <pageSetup> [rec=0x3DE len=0x22] 01 00 00 00 64 00 00 00 2C 01 00 00 2C 01 00 00 01 00 00 00 01 00 00 00 01 00 00 00 01 00 00 00 00 00 <headerFooter> [rec=0x3DF len=0x1A] 0C 00 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF [rec=0x3E0 len=0x0] </worksheet> [rec=0x182 len=0x0]
BIFF12 records for a number of BIN partsSo what are some of those important records you ask? // Workbook records public const int BIFF12_DEFINEDNAME = 0x27; public const int BIFF12_FILEVERSION = 0x0180; public const int BIFF12_WORKBOOK = 0x0183; public const int BIFF12_WORKBOOK_END = 0x0184; public const int BIFF12_BOOKVIEWS = 0x0187; public const int BIFF12_BOOKVIEWS_END = 0x0188; public const int BIFF12_SHEETS = 0x018F; public const int BIFF12_SHEETS_END = 0x0190; public const int BIFF12_WORKBOOKPR = 0x0199; public const int BIFF12_SHEET = 0x019C; public const int BIFF12_CALCPR = 0x019D; public const int BIFF12_WORKBOOKVIEW = 0x019E; public const int BIFF12_EXTERNALREFERENCES = 0x02E1; public const int BIFF12_EXTERNALREFERENCES_END = 0x02E2; public const int BIFF12_EXTERNALREFERENCE = 0x02E3; public const int BIFF12_WEBPUBLISHING = 0x04A9; // Worksheet records public const int BIFF12_ROW = 0x00; public const int BIFF12_BLANK = 0x01; public const int BIFF12_NUM = 0x02; public const int BIFF12_BOOLERR = 0x03; public const int BIFF12_BOOL = 0x04; public const int BIFF12_FLOAT = 0x05; public const int BIFF12_STRING = 0x07; public const int BIFF12_FORMULA_STRING = 0x08; public const int BIFF12_FORMULA_FLOAT = 0x09; public const int BIFF12_FORMULA_BOOL = 0x0A; public const int BIFF12_FORMULA_BOOLERR = 0x0B; public const int BIFF12_COL = 0x3C; public const int BIFF12_WORKSHEET = 0x0181; public const int BIFF12_WORKSHEET_END = 0x0182; public const int BIFF12_SHEETVIEWS = 0x0185; public const int BIFF12_SHEETVIEWS_END = 0x0186; public const int BIFF12_SHEETVIEW = 0x0189; public const int BIFF12_SHEETVIEW_END = 0x018A; public const int BIFF12_SHEETDATA = 0x0191; public const int BIFF12_SHEETDATA_END = 0x0192; public const int BIFF12_SHEETPR = 0x0193; public const int BIFF12_DIMENSION = 0x0194; public const int BIFF12_SELECTION = 0x0198; public const int BIFF12_COLS = 0x0386; public const int BIFF12_COLS_END = 0x0387; public const int BIFF12_CONDITIONALFORMATTING = 0x03CD; public const int BIFF12_CONDITIONALFORMATTING_END = 0x03CE; public const int BIFF12_CFRULE = 0x03CF; public const int BIFF12_CFRULE_END = 0x03D0; public const int BIFF12_ICONSET = 0x03D1; public const int BIFF12_ICONSET_END = 0x03D2; public const int BIFF12_DATABAR = 0x03D3; public const int BIFF12_DATABAR_END = 0x03D4; public const int BIFF12_COLORSCALE = 0x03D5; public const int BIFF12_COLORSCALE_END = 0x03D6; public const int BIFF12_CFVO = 0x03D7; public const int BIFF12_PAGEMARGINS = 0x03DC; public const int BIFF12_PRINTOPTIONS = 0x03DD; public const int BIFF12_PAGESETUP = 0x03DE; public const int BIFF12_HEADERFOOTER = 0x03DF; public const int BIFF12_SHEETFORMATPR = 0x03E5; public const int BIFF12_HYPERLINK = 0x03EE; public const int BIFF12_DRAWING = 0x04A6; public const int BIFF12_LEGACYDRAWING = 0x04A7; public const int BIFF12_COLOR = 0x04B4; public const int BIFF12_OLEOBJECTS = 0x04FE; public const int BIFF12_OLEOBJECT = 0x04FF; public const int BIFF12_OLEOBJECTS_END = 0x0580; public const int BIFF12_TABLEPARTS = 0x0594; public const int BIFF12_TABLEPART = 0x0595; public const int BIFF12_TABLEPARTS_END = 0x0596; //SharedStrings records public const int BIFF12_SI = 0x13; public const int BIFF12_SST = 0x019F; public const int BIFF12_SST_END = 0x01A0; //Styles records public const int BIFF12_FONT = 0x2B; public const int BIFF12_FILL = 0x2D; public const int BIFF12_BORDER = 0x2E; public const int BIFF12_XF = 0x2F; public const int BIFF12_CELLSTYLE = 0x30; public const int BIFF12_STYLESHEET = 0x0296; public const int BIFF12_STYLESHEET_END = 0x0297; public const int BIFF12_COLORS = 0x03D9; public const int BIFF12_COLORS_END = 0x03DA; public const int BIFF12_DXFS = 0x03F9; public const int BIFF12_DXFS_END = 0x03FA; public const int BIFF12_TABLESTYLES = 0x03FC; public const int BIFF12_TABLESTYLES_END = 0x03FD; public const int BIFF12_FILLS = 0x04DB; public const int BIFF12_FILLS_END = 0x04DC; public const int BIFF12_FONTS = 0x04E3; public const int BIFF12_FONTS_END = 0x04E4; public const int BIFF12_BORDERS = 0x04E5; public const int BIFF12_BORDERS_END = 0x04E6; public const int BIFF12_CELLXFS = 0x04E9; public const int BIFF12_CELLXFS_END = 0x04EA; public const int BIFF12_CELLSTYLES = 0x04EB; public const int BIFF12_CELLSTYLES_END = 0x04EC; public const int BIFF12_CELLSTYLEXFS = 0x04F2; public const int BIFF12_CELLSTYLEXFS_END = 0x04F3; //Comment records public const int BIFF12_COMMENTS = 0x04F4; public const int BIFF12_COMMENTS_END = 0x04F5; public const int BIFF12_AUTHORS = 0x04F6; public const int BIFF12_AUTHORS_END = 0x04F7; public const int BIFF12_AUTHOR = 0x04F8; public const int BIFF12_COMMENTLIST = 0x04F9; public const int BIFF12_COMMENTLIST_END = 0x04FA; public const int BIFF12_COMMENT = 0x04FB; public const int BIFF12_COMMENT_END = 0x04FC; public const int BIFF12_TEXT = 0x04FD; //Table records public const int BIFF12_AUTOFILTER = 0x01A1; public const int BIFF12_AUTOFILTER_END = 0x01A2; public const int BIFF12_FILTERCOLUMN = 0x01A3; public const int BIFF12_FILTERCOLUMN_END= 0x01A4; public const int BIFF12_FILTERS = 0x01A5; public const int BIFF12_FILTERS_END = 0x01A6; public const int BIFF12_FILTER = 0x01A7; public const int BIFF12_TABLE = 0x02D7; public const int BIFF12_TABLE_END = 0x02D8; public const int BIFF12_TABLECOLUMNS = 0x02D9; public const int BIFF12_TABLECOLUMNS_END= 0x02DA; public const int BIFF12_TABLECOLUMN = 0x02DB; public const int BIFF12_TABLECOLUMN_END = 0x02DC; public const int BIFF12_TABLESTYLEINFO = 0x0481; public const int BIFF12_SORTSTATE = 0x0492; public const int BIFF12_SORTCONDITION = 0x0494; public const int BIFF12_SORTSTATE_END = 0x0495; //QueryTable records public const int BIFF12_QUERYTABLE = 0x03BF; public const int BIFF12_QUERYTABLE_END = 0x03C0; public const int BIFF12_QUERYTABLEREFRESH = 0x03C1; public const int BIFF12_QUERYTABLEREFRESH_END = 0x03C2; public const int BIFF12_QUERYTABLEFIELDS = 0x03C7; public const int BIFF12_QUERYTABLEFIELDS_END = 0x03C8; public const int BIFF12_QUERYTABLEFIELD = 0x03C9; public const int BIFF12_QUERYTABLEFIELD_END = 0x03CA; //Connection records public const int BIFF12_CONNECTIONS = 0x03AD; public const int BIFF12_CONNECTIONS_END = 0x03AE; public const int BIFF12_CONNECTION = 0x01C9; public const int BIFF12_CONNECTION_END = 0x01CA; public const int BIFF12_DBPR = 0x01CB; public const int BIFF12_DBPR_END = 0x01CC;
Workbook partHow to proceed for each BIN part of interest is quite straight forward in fact. You can create a regular Excel 2007 file that uses a particular feature, for instance a chart, and save it both as .XLSX and .XLSB. Then you can unzip the content in separate folders, take the parts side-by-side and try to figure out which XML markup corresponds to which BIFF12 record identifier. The XML markup itself is not mandatory at all, it only makes the whole thing approachable for human beings... I have done some of this work for a number of important BIN parts, but I concentrated on what was really needed to make sure I was able to read the content of cells. That's why, while records are for the most part identified and matched with their XML markup siblings, the record content itself is not. It's not either because I was too lazy to do it, or because it does not serve the goal, or because it can't be disambiguated that easily. For instance, if you take a look at a regular workbook BIN part below, you'll notice a few records with no associated XML markup : // This is what a workbook part looks like in XML <?xml version="1.0" encoding="UTF-8" standalone="yes"?> <workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/5/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"> <fileVersion lastEdited="4" lowestEdited="4" rupBuild="4017"/> <workbookPr defaultThemeVersion="123820"/> <bookViews> <workbookView xWindow="360" yWindow="60" windowWidth="11295" windowHeight="5580"/> </bookViews> <sheets> <sheet name="Sheet1" sheetId="1" r:id="rId1"/> <sheet name="Sheet2" sheetId="2" r:id="rId2"/> <sheet name="Sheet3" sheetId="3" r:id="rId3"/> </sheets> <calcPr calcId="122211"/> <webPublishing codePage="1252"/> </workbook> // This is what a workbook part looks like in BIN 83 01 00 80 01 14 04 04 b1 0f 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 99 01 0c 20 00 01 00 ac e3 01 00 00 00 00 00 87 01 00 9e 01 1d 68 01 00 00 3c 00 00 00 1f 2c 00 00 cc 15 00 00 58 02 00 00 00 00 00 00 00 00 00 00 78 88 01 00 8f 01 00 9c 01 28 00 00 00 00 00 00 00 00 01 00 00 00 04 00 00 00 72 00 49 00 64 00 31 00 06 00 00 00 53 00 68 00 65 00 65 00 74 00 31 00 9c 01 28 00 00 00 00 00 00 00 00 02 00 00 00 04 00 00 00 72 00 49 00 64 00 32 00 06 00 00 00 53 00 68 00 65 00 65 00 74 00 32 00 9c 01 28 00 00 00 00 00 00 00 00 03 00 00 00 04 00 00 00 72 00 49 00 64 00 33 00 06 00 00 00 53 00 68 00 65 00 65 00 74 00 33 00 90 01 00 9d 01 19 63 dd 01 00 01 00 00 00 64 00 00 00 fc a9 f1 d2 4d 62 50 3f 01 00 00 00 6a 96 04 06 00 00 00 00 00 00 9a 01 01 00 a9 04 0b 07 00 03 60 00 00 00 e4 04 00 00 9b 01 01 00 84 01 00 // This is how, after breaking the BIN part in records, you can match // records to the XML markup. // Note that record identifiers are the two bytes on the left, the associated record length // is surrounded by parentheses, and it's followed by the record content itself. <workbook> 83 01 (00) <fileVersion lastEdited="4" lowestEdited="4" rupBuild="4017"/> (hint : 4017 = 0x0FB1) 80 01 (14) 04 04 b1 0f 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <workbookPr defaultThemeVersion="123820"/> (hint : 123820 = 0x0001E3AC) 99 01 (0c) 20 00 01 00 ac e3 01 00 00 00 00 00 <bookViews> 87 01 (00) <workbookView xWindow="360" yWindow="60" windowWidth="11295" windowHeight="5580"/> 9e 01 (1d) 68 01 00 00 3c 00 00 00 1f 2c 00 00 cc 15 00 00 58 02 00 00 00 00 00 00 00 00 00 00 78 </bookViews> 88 01 (00) <sheets> 8f 01 (00) <sheet name="Sheet1" sheetId="1" r:id="rId1"/> 9c 01 (28) 00 00 00 00 00 00 00 00 01 00 00 00 sheetid 04 00 00 00 length of string to follow (in characters, not bytes) 72 00 49 00 64 00 31 00 relation identifier 06 00 00 00 length of string to follow (in characters, not bytes) 53 00 68 00 65 00 65 00 74 00 31 00 sheetname <sheet name="Sheet2" sheetId="2" r:id="rId2"/> 9c 01 (28) 00 00 00 00 00 00 00 00 02 00 00 00 sheetid 04 00 00 00 length of string to follow (in characters, not bytes) 72 00 49 00 64 00 32 00 relation identifier 06 00 00 00 length of string to follow (in characters, not bytes) 53 00 68 00 65 00 65 00 74 00 32 00 sheetname <sheet name="Sheet3" sheetId="3" r:id="rId3"/> 9c 01 (28) 00 00 00 00 00 00 00 00 03 00 00 00 sheetid 04 00 00 00 length of string to follow (in characters, not bytes) 72 00 49 00 64 00 33 00 relation identifier 06 00 00 00 length of string to follow (in characters, not bytes) 53 00 68 00 65 00 65 00 74 00 33 00 sheetname </sheets> 90 01 (00) <externalReferences> e1 02 (00) <externalReference r:id="rId4" /> e3 02 (0c) 04 00 00 00 length of string to follow (in characters, not bytes) 72 00 49 00 64 00 34 00 string representing a relation identifier e5 02 (00) ea 02 1c 02 00 00 00 00 00 00 00 ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 </externalReferences> e2 02 (00) <definedName name="externalrange" comment="">[1]Sheet1!$B$3</definedName> 27 (3c) 00 00 grbits 00 00 00 ff ff ff ff nametype 0d 00 00 00 length of string to follow (in characters, not bytes) 65 00 78 00 74 00 65 00 72 00 6e 00 61 00 6c 00 72 00 61 00 6e 00 67 00 65 00 defined name 09 00 00 00 length of formula to follow (in bytes) 3a 01 00 02 00 00 00 01 00 formula 00 00 00 00 00 00 00 00 <definedName name="anotherrange">Sheet1!$B$9:$C$10</definedName> 27 (40) 00 00 grbits 00 00 00 ff ff ff ff nametype 0c 00 00 00 length of string to follow (in characters, not bytes) 61 00 6e 00 6f 00 74 00 68 00 65 00 72 00 72 00 61 00 6e 00 67 00 65 00 defined name 0f 00 00 00 length of formula to follow (in bytes) 3b 00 00 08 00 00 00 09 00 00 00 01 00 02 00 formula 00 00 00 00 ff ff ff ff <definedName name="Database1" localSheetId="1" hidden="1">Sheet2!$A$1:$D$6</definedName> 27 (3a) 01 00 grbits 00 00 00 01 00 00 00 nametype 09 00 00 00 length of string to follow (in characters, not bytes) 44 00 61 00 74 00 61 00 62 00 61 00 73 00 65 00 31 00 defined name 0f 00 00 00 length of formula to follow (in bytes) 3b 01 00 00 00 00 00 05 00 00 00 00 00 03 00 formula 00 00 00 00 ff ff ff ff <definedName name="myrange">Sheet1!$C$2:$D$3</definedName> 27 (36) 00 00 grbits 00 00 00 ff ff ff ff nametype 07 00 00 00 length of string to follow (in characters, not bytes) 6d 00 79 00 72 00 61 00 6e 00 67 00 65 00 defined name 0f 00 00 00 length of formula to follow (in bytes) 3b 00 00 01 00 00 00 02 00 00 00 02 00 03 00 formula 00 00 00 00 ff ff ff ff <definedName name="_xlnm.Print_Area" localSheetId="0">Sheet1!$A$1:$E$7</definedName> 27 (3c) 20 00 grbits 00 00 00 00 00 00 00 nametype 0a 00 00 00 length of string to follow (in characters, not bytes) 50 00 72 00 69 00 6e 00 74 00 5f 00 41 00 72 00 65 00 61 00 defined name 0f 00 00 00 length of formula to follow (in bytes) 3b 00 00 00 00 00 00 06 00 00 00 00 00 04 00 formula 00 00 00 00 ff ff ff ff <definedName name="_xlnm.Print_Titles" localSheetId="0">Sheet1!$2:$3</definedName> 27 (40) 20 00 grbits 00 00 00 00 00 00 00 nametype 0c 00 00 00 length of string to follow (in characters, not bytes) 50 00 72 00 69 00 6e 00 74 00 5f 00 54 00 69 00 74 00 6c 00 65 00 73 00 defined name 0f 00 00 00 length of formula to follow (in bytes) 3b 00 00 01 00 00 00 02 00 00 00 00 00 ff 3f formula 00 00 00 00 ff ff ff ff <definedName name="myrange" hidden="1">Sheet1!$B$2:$B$3</definedName> 27 (36) 01 00 grbits 00 00 00 ff ff ff ff nametype 07 00 00 00 length of string to follow (in characters, not bytes) 6d 00 79 00 72 00 61 00 6e 00 67 00 65 00 defined name 0f 00 00 00 length of formula to follow (in bytes) 3b 00 00 01 00 00 00 02 00 00 00 01 00 01 00 formula 00 00 00 00 ff ff ff ff <calcPr calcId="122211"/> 9d 01 (19) 63 dd 01 00 01 00 00 00 64 00 00 00 fc a9 f1 d2 4d 62 50 3f 01 00 00 00 6a 96 04 (06) 00 00 00 00 00 00 9a 01 (01) 00 <webPublishing codePage="1252"/> a9 04 (0b) 07 00 03 60 00 00 00 e4 04 00 00 9b 01 (01) 00 </workbook> 84 01 (00) As you can see, I could not easily find the markup associated to record identifiers 0x0496, 0x019A and 0x019B above. While it's not blocking at this point, let's take a few moments to discuss the issue.
How to read the workbook BIN part using C# or C++ ? Using the source code provided in attachment to the article, you can easily do that. // // how to read the workbook part in C# // Hashtable h = new Hashtable(); // Register Workbook record handlers h[C.BIFF12_DEFINEDNAME] = new DefinedNameRecord(); h[C.BIFF12_FILEVERSION] = new FileVersionRecord(); h[C.BIFF12_WORKBOOK] = new WorkbookRecord(); h[C.BIFF12_WORKBOOK_END] = new WorkbookEndRecord(); h[C.BIFF12_BOOKVIEWS] = new BookViewsRecord(); h[C.BIFF12_BOOKVIEWS_END] = new BookViewsEndRecord(); h[C.BIFF12_SHEETS] = new SheetsRecord(); h[C.BIFF12_SHEETS_END] = new SheetsEndRecord(); h[C.BIFF12_WORKBOOKPR] = new WorkbookPRRecord(); h[C.BIFF12_SHEET] = new SheetRecord(); h[C.BIFF12_CALCPR] = new CalcPRRecord(); h[C.BIFF12_WORKBOOKVIEW] = new WorkbookViewRecord(); h[C.BIFF12_EXTERNALREFERENCES] = new ExternalReferencesRecord(); h[C.BIFF12_EXTERNALREFERENCES_END] = new ExternalReferencesEndRecord(); h[C.BIFF12_EXTERNALREFERENCE] = new ExternalReferenceRecord(); h[C.BIFF12_WEBPUBLISHING] = new WebPublishingRecord(); // This class can be used to pass objects deeper in the reading flow Workbook w = new Workbook(); // Note that the part is already unzipped and available in a regular folder using (FileStream fs = new FileStream(@"..\..\Excel12_files\Book1.xlsb\xl\workbook.bin", FileMode.Open, FileAccess.Read)) { // use the BCL to load the part in a byte[] buffer byte[] bufferWorkbookPart = new BinaryReader(fs).ReadBytes((int)fs.Length); // use our BIFF12 reader Read(w, h, bufferWorkbookPart); } The workbook provides the list of worksheet references we are interested in. For each worksheet, we can see :
Worksheet partThe worksheet part describes the values in cells, along with formulas whenever it applies, and also describes objects mapping to cells or cell ranges (charts, pivot tables, ...). Here is an example of reverse engineered worksheet which contains a double-precision float, a float formula which happens to return a "division by zero" error, and a copy of that cell elsewhere :
// This is what a worksheet part looks like in XML <?xml version="1.0" encoding="UTF-8" standalone="yes"?> <worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/5/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"> <dimension ref="D3:D6"/> <sheetViews> <sheetView tabSelected="1" workbookViewId="0"> <selection activeCell="D4" sqref="D4"/> </sheetView> </sheetViews> <sheetFormatPr defaultRowHeight="15"/> <cols> <col min="4" max="4" width="12.5703125" customWidth="1"/> </cols> <sheetData> <row r="3" spans="4:4"><c r="D3"><v>2.123456789</v></c></row> <row r="4" spans="4:4"><c r="D4" t="e"><f>5/E3</f><v>#DIV/0!</v></c></row> <row r="6" spans="4:4"><c r="D6" t="e"><v>#DIV/0!</v></c></row> </sheetData> <printOptions/> <pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/> <headerFooter/> </worksheet> // This is how, after breaking the BIN part in records, you can match // records to the XML markup. // Note that record identifiers are the two bytes on the left, the associated record length // is surrounded by parentheses, and it's followed by the record content itself. <worksheet/> 81 01 (00) <sheetPr/> (figured out) 93 01 (0f) c9 04 02 00 40 00 00 00 00 00 00 00 00 00 00 <dimension ref="D3:D6"/> 94 01 (10) 02 00 00 00 05 00 00 00 03 00 00 00 03 00 00 00 <sheetViews> 85 01 (00) <sheetView tabSelected="1" workbookViewId="0"> 89 01 (1e) dc 03 00 00 00 00 00 00 00 00 00 00 00 00 40 00 00 00 64 00 00 00 00 00 00 00 00 00 00 00 <selection activeCell="D4" sqref="D4"/> 98 01 (24) 03 00 00 00 02 00 00 00 02 00 00 00 00 00 00 00 01 00 00 00 02 00 00 00 02 00 00 00 02 00 00 00 02 00 00 00 </sheetView> 8a 01 (00) </sheetViews> 86 01 (00) <sheetFormatPr defaultRowHeight="15"/> e5 03 (0c) ff ff ff ff 08 00 2c 01 00 00 00 00 <cols> 86 03 (00) <col min="4" max="4" width="12.5703125" customWidth="1"/> 3c (12) 03 00 00 00 colmin (0-based) 03 00 00 00 colmax (0-based) 92 0c 00 00 width * 256 00 00 00 00 style (0-based) 02 00 flags </cols> 87 03 (00) <sheetData> 91 01 (00) <row r="3" spans="4:4"></row> 00 (19) 02 00 00 00 00 00 00 00 2c 01 00 00 00 01 00 00 00 03 00 00 00 03 00 00 00 <c r="D3"><v>2.123456789</v></c> 05 (10) 03 00 00 00 col (0-based) 00 00 00 00 style (0-based) 1b cb b9 e9 d6 fc 00 40 float (IEEE 8 bytes) <row r="4" spans="4:4"></row> 00 (19) 03 00 00 00 00 00 00 00 2c 01 00 00 00 01 00 00 00 03 00 00 00 03 00 00 00 <c r="D4" t="e"><f>5/E3</f><v>#DIV/0!</v></c> 0b (1e) 03 00 00 00 col (0-based) 00 00 00 00 style (0-based) 07 boolerr (7 = DIV/0) 00 00 grbits 0b 00 00 00 len of formula to follow in bytes 1e 05 00 44 02 00 00 00 04 c0 06 formula (1E = ptgTokenInt (5) 00 00 00 00 44 = ptgTokenRefV (E3) 06 = ptgTokenDiv) <row r="6" spans="4:4"></row> 00 (19) 05 00 00 00 00 00 00 00 2c 01 00 00 00 01 00 00 00 03 00 00 00 03 00 00 00 <c r="D6" t="e"><v>#DIV/0!</v></c> 03 (09) 03 00 00 00 col 00 00 00 00 style 07 boolerr (7 = DIV/0) </sheetData> 92 01 (00) 97 04 (42) 00 00 00 00 00 00 01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 <conditionalFormatting sqref="B3:C4"> cd 03 (1c) 01 00 00 00 00 00 00 00 01 00 00 00 02 00 00 00 03 00 00 00 01 00 00 00 02 00 00 00 <cfRule type="dataBar" priority="3"> cf 03 (8c) 02 04 00 00 00 03 00 00 00 ff ff ff ff 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 67 00 00 00 67 00 00 00 00 00 00 00 ff ff ff ff <formula>MAX(IF(ISBLANK($B$3:$C$4), "", IF(ISERROR($B$3:$C$4), "", $B$3:$C$4)))</formula> 67 00 00 00 65 02 00 00 00 03 00 00 00 01 00 02 00 61 81 00 19 02 0b 00 19 40 00 01 17 00 00 19 08 43 00 65 02 00 00 00 03 00 00 00 01 00 02 00 61 03 00 19 02 0b 00 19 40 00 01 17 00 00 19 08 1c 00 19 40 00 01 25 02 00 00 00 03 00 00 00 01 00 02 00 19 40 00 01 19 08 03 00 22 03 01 00 19 08 03 00 22 03 01 00 42 01 07 00 00 00 00 00 <formula>MAX(IF(ISBLANK($B$3:$C$4), "", IF(ISERROR($B$3:$C$4), "", $B$3:$C$4)))</formula> 67 00 00 00 65 02 00 00 00 03 00 00 00 01 00 02 00 61 81 00 19 02 0b 00 19 40 00 01 17 00 00 19 08 43 00 65 02 00 00 00 03 00 00 00 01 00 02 00 61 03 00 19 02 0b 00 19 40 00 01 17 00 00 19 08 1c 00 19 40 00 01 25 02 00 00 00 03 00 00 00 01 00 02 00 19 40 00 01 19 08 03 00 22 03 01 00 19 08 03 00 22 03 01 00 42 01 06 00 00 00 00 00 <dataBar> d3 03 (03) 0a 5a 01 <cfvo type="min" val="0" /> d7 03 (18) 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <cfvo type="max" val="0" /> d7 03 (18) 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <color rgb="FF638EC6" /> (hint : color encoding = BGR not RGB) b4 04 (08) 05 ff 00 00 63 8e c6 ff </dataBar> d4 03 (00) </cfRule> d0 03 (00) </conditionalFormatting> ce 03 (00) <conditionalFormatting sqref="B6:C7"> cd 03 (1c) 01 00 00 00 00 00 00 00 01 00 00 00 05 00 00 00 06 00 00 00 01 00 00 00 02 00 00 00 <cfRule type="colorScale" priority="2"> cf 03 (8c) 02 03 00 00 00 02 00 00 00 ff ff ff ff 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 67 00 00 00 67 00 00 00 00 00 00 00 ff ff ff ff <formula>MAX(IF(ISBLANK($B$6:$C$7), "", IF(ISERROR($B$6:$C$7), "", $B$6:$C$7)))</formula> 67 00 00 00 65 05 00 00 00 06 00 00 00 01 00 02 00 61 81 00 19 02 0b 00 19 40 00 01 17 00 00 19 08 43 00 65 05 00 00 00 06 00 00 00 01 00 02 00 61 03 00 19 02 0b 00 19 40 00 01 17 00 00 19 08 1c 00 19 40 00 01 25 05 00 00 00 06 00 00 00 01 00 02 00 19 40 00 01 19 08 03 00 22 03 01 00 19 08 03 00 22 03 01 00 42 01 07 00 00 00 00 00 <formula>MAX(IF(ISBLANK($B$6:$C$7), "", IF(ISERROR($B$6:$C$7), "", $B$6:$C$7)))</formula> 67 00 00 00 65 05 00 00 00 06 00 00 00 01 00 02 00 61 81 00 19 02 0b 00 19 40 00 01 17 00 00 19 08 43 00 65 05 00 00 00 06 00 00 00 01 00 02 00 61 03 00 19 02 0b 00 19 40 00 01 17 00 00 19 08 1c 00 19 40 00 01 25 05 00 00 00 06 00 00 00 01 00 02 00 19 40 00 01 19 08 03 00 22 03 01 00 19 08 03 00 22 03 01 00 42 01 06 00 00 00 00 00 <colorScale> d5 03 (00) <cfvo type="min" val="0" /> d7 03 (18) 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <cfvo type="percent" val="50" /> d7 03 (18) 04 00 00 00 00 00 00 00 00 00 49 40 00 00 00 00 00 00 00 00 00 00 00 00 <cfvo type="max" val="0" /> d7 03 (18) 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <color rgb="FFF8696B" /> (hint : color encoding = BGR not RGB) b4 04 (08) 05 ff 00 00 f8 69 6b ff <color rgb="FFFFEB84" /> (hint : color encoding = BGR not RGB) b4 04 (08) 05 ff 00 00 ff eb 84 ff <color rgb="FF63BE7B" /> (hint : color encoding = BGR not RGB) b4 04 (08) 05 ff 00 00 63 be 7b ff </colorScale> d6 03 (00) </cfRule> d0 03 (00) </conditionalFormatting> ce 03 (00) <conditionalFormatting sqref="B9:C10"> cd 03 (1c) 01 00 00 00 00 00 00 00 01 00 00 00 08 00 00 00 09 00 00 00 01 00 00 00 02 00 00 00 <cfRule type="iconSet" priority="1"> cf 03 (2e) 06 00 00 00 04 00 00 00 ff ff ff ff 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ff ff ff ff <iconSet iconSet="3TrafficLights2"> d1 03 (06) 04 00 00 00 78 00 <cfvo type="percentile" val="0" /> d7 03 (18) 05 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 01 00 00 00 00 00 00 00 <cfvo type="percentile" val="0" /> d7 03 (18) 05 00 00 00 00 00 00 00 00 80 40 40 01 00 00 00 01 00 00 00 00 00 00 00 <cfvo type="percentile" val="0" /> d7 03 (18) 05 00 00 00 00 00 00 00 00 c0 50 40 01 00 00 00 01 00 00 00 00 00 00 00 </iconSet> d2 03 (00) </cfRule> d0 03 (00) </conditionalFormatting> ce 03 (00) <hyperlink ref="C3" r:id="rId1"/> ee 03 (28) 02 00 00 00 02 00 00 00 02 00 00 00 02 00 00 00 04 00 00 00 72 00 49 00 64 00 32 00 00 00 00 00 00 00 00 00 00 00 00 00 <printOptions/> dd 03 (02) 10 00 <pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/> dc 03 (30) 66 66 66 66 66 66 e6 3f 66 66 66 66 66 66 e6 3f 00 00 00 00 00 00 e8 3f 00 00 00 00 00 00 e8 3f 33 33 33 33 33 33 d3 3f 33 33 33 33 33 33 d3 3f <headerFooter/> df 03 (1a) 0c 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff e0 03 (00) </worksheet> 82 01 (00) Above, there are examples of underlying structure, one of involving altered columns, another involving the rows and actual cells. Each cell holds a particular value type which happens to be reflected by the record identifier. And it differs whether it's a raw value, or a value with an associated formula. Hence //Cell records public const int BIFF12_ROW = 0x00; // row info public const int BIFF12_BLANK = 0x01; // empty cell public const int BIFF12_NUM = 0x02; // single-precision float public const int BIFF12_BOOLERR = 0x03; // error identifier public const int BIFF12_BOOL = 0x04; // boolean value public const int BIFF12_FLOAT = 0x05; // double-precision float public const int BIFF12_STRING = 0x07; // string (shared string index) public const int BIFF12_FORMULA_STRING = 0x08; // formula returning a string (inline string) public const int BIFF12_FORMULA_FLOAT = 0x09; // formula returning a double-precision float public const int BIFF12_FORMULA_BOOL = 0x0A; // formula returning a boolean public const int BIFF12_FORMULA_BOOLERR = 0x0B; // formula returning an error identifier //TODO : find the opcode for inline strings (it's probably 0x06). Let's take a moment to discuss how formulas are stored :
Also, strings can be either stored inline or referred to thanks to an index, in which case they are really stored in a separate part known as the shared strings part. Because, as a consumer, you don't control how the string is stored, in practice if you interested in cell values you need to read the shared strings part, build the table of indexed strings, prior reading worksheet parts.
Shared strings partThe shared strings part simply indexes strings as a way to factorize strings that may be used more than once in the worksheets or other parts (chart title for instance). It makes the reading of BIFF12 harder since the shared strings part must be read prior reading other parts such as worksheets otherwise there are chances you won't be able to make sense out of string indexes you'll encounter. The shared strings part is also used to store rich strings, i.e. strings where more than one formatting style is applied, also known as formatting runs. While the XML-way to describe formatting runs is much like in Word, the BIN-way is much like in old BIFF, i.e. the raw string is stored, followed by the formatting runs which are 4-byte pairs {position, style} defining the style to apply from the current position up to the given position in the string (0-based). A consequence is that if you don't care about formatting runs, then reading the string is as easy as if there was not formatting runs at all. Here are examples of shared strings : // shared strings BIN : a - <sst count="1" uniqueCount="1"> 9f 01 (08) 01 00 00 00 01 00 00 00 <si><t>a</t></si> 13 (07) 00 nb formatting runs 01 00 00 00 len of string (number of Unicode characters) 61 00 string "a" </sst> a0 01 (00) // shared strings BIN : a,b, cc - <sst count="3" uniqueCount="3"> 9f 01 (08) 03 00 00 00 03 00 00 00 <si><t>a</t></si> 13 (07) 00 nb formatting runs 01 00 00 00 len of string (number of Unicode characters) 61 00 string "a" <si><t>b</t></si> 13 (07) 00 nb formatting runs 01 00 00 00 len of string (number of Unicode characters) 62 00 string "b" <si><t>cc</t></si> 13 (09) 00 nb formatting runs 02 00 00 00 len of string (number of Unicode characters) 63 00 63 00 string "cc" </sst> a0 01 (00) // shared strings BIN : a,b, cc with string a used twice - <sst count="4" uniqueCount="3"> 9f 01 (08) 04 00 00 00 03 00 00 00 <si><t>a</t></si> 13 (07) 00 nb formatting runs 01 00 00 00 len of string (number of Unicode characters) 61 00 string "a" <si><t>b</t></si> 13 (07) 00 nb formatting runs 01 00 00 00 len of string (number of Unicode characters) 62 00 string "b" <si><t>cc</t></si> 13 (09) 00 nb formatting runs 02 00 00 00 len of string (number of Unicode characters) 63 00 63 00 string "cc" </sst> a0 01 (00) // shared strings BIN : abcd, where bc is in red - <sst count="1" uniqueCount="1"> 9f 01 (08) 01 00 00 00 01 00 00 00 <si><.../t></si> 13 (19) 01 nb formatting runs 04 00 00 00 len of string (number of Unicode characters) 61 00 62 00 63 00 64 00 string "abcd" 02 00 00 00 formatting run 1 {pos=2, style=0} 01 00 01 00 formatting run 2 {pos=1, style=1} 03 00 00 00 formatting run 3 {pos=3, style=0} </sst> a0 01 (00) The following records are involved : //SharedStrings records public const int BIFF12_SI = 0x13; public const int BIFF12_SST = 0x019F; public const int BIFF12_SST_END = 0x01A0;
Styles partThe styles part, much like the shared strings part, is a dictionary of factorized formatting styles represented by indexes. The most basic formatting style is known as a XF style, represented by a 0-based index, which groups a number of formatting such as fill pattern, borders, and so on. Cell styles are variations of XF styles. On top of which lives new concepts in Excel 2007 such as table styles, and themes (themes are defined in a separate part, which is left in XML even in a .XLSB binary file). Here is an example of styles part, in XML, and then in BIN : // This is what a styles part looks like in XML <?xml version="1.0" encoding="UTF-8" standalone="yes"?> <styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/5/main"> <fonts count="1"> <font> <sz val="11"/> <color theme="1"/> <name val="Calibri"/> <family val="2"/> <scheme val="minor"/> </font> </fonts> <fills count="2"> <fill><patternFill patternType="none"/></fill> <fill><patternFill patternType="gray125"/></fill> </fills> <borders count="1"> <border><left/><right/><top/><bottom/><diagonal/></border> </borders> <cellStyleXfs count="1"> <xf numFmtId="0" fontId="0" fillId="0" borderId="0"/> </cellStyleXfs> <cellXfs count="1"> <xf numFmtId="0" fontId="0" fillId="0" borderId="0" xfId="0"/> </cellXfs> <cellStyles count="1"> <cellStyle name="Normal" xfId="0" builtinId="0"/> </cellStyles> <dxfs count="0"/> <tableStyles count="0" defaultTableStyle="TableStyleMedium9" defaultPivotStyle="PivotStyleLight16"/> <colors/> </styleSheet> // This is how, after breaking the BIN part in records, you can match // records to the XML markup. // Note that record identifiers are the two bytes on the left, the associated record length // is surrounded by parentheses, and it's followed by the record content itself. <styleSheet> 96 02 (00) <fonts> e3 04 (04) 01 00 00 00 <font> <sz val="11"/> <color theme="1"/> <name val="Calibri"/> <family val="2"/> <scheme val="minor"/> </font> 2b (27) dc 00 00 00 90 01 00 00 00 02 00 00 07 01 00 00 00 00 00 ff 02 07 00 00 00 43 00 61 00 6c 00 69 00 62 00 72 00 69 00 </fonts> e4 04 (00) <fills count="2"> db 04 (04) 02 00 00 00 <fill><patternFill patternType="none"/></fill> 2d (44) 00 00 00 00 03 40 00 00 00 00 00 ff 03 41 00 00 ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <fill><patternFill patternType="gray125"/></fill> 2d (44) 11 00 00 00 03 40 00 00 00 00 00 ff 03 41 00 00 ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 </fills> dc 04 (00) <borders count="1"> e5 04 (04) 01 00 00 00 <border><left/><right/><top/><bottom/><diagonal/></border> 2e (33) 00 00 00 01 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 </borders> e6 04 (00) <cellStyleXfs count="1"> f2 04 (04) 01 00 00 00 <xf numFmtId="0" fontId="0" fillId="0" borderId="0"/> 2f (10) ff ff 00 00 00 00 00 00 00 00 00 00 10 10 00 00 </cellStyleXfs> f3 04 (00) <cellXfs count="1"> e9 04 (04) 01 00 00 00 <xf numFmtId="0" fontId="0" fillId="0" borderId="0" xfId="0"/> 2f (10) 00 00 00 00 00 00 00 00 00 00 00 00 10 10 00 00 </cellXfs> ea 04 (00) <cellStyles count="1"> eb 04 (04) 01 00 00 00 <cellStyle name="Normal" xfId="0" builtinId="0"/> 30 (18) 00 00 00 00 01 00 00 ff 06 00 00 00 4e 00 6f 00 72 00 6d 00 61 00 6c 00 </cellStyles> ec 04 (00) <dxfs> f9 03 (04) 00 00 00 00 </dxfs> fa 03 (00) <tableStyles count="0" defaultTableStyle="TableStyleMedium9" defaultPivotStyle="PivotStyleLight16"/> fc 03 (50) 00 00 00 00 11 00 00 00 54 00 61 00 62 00 6c 00 65 00 53 00 74 00 79 00 6c 00 65 00 4d 00 65 00 64 00 69 00 75 00 6d 00 39 00 11 00 00 00 50 00 69 00 76 00 6f 00 74 00 53 00 74 00 79 00 6c 00 65 00 4c 00 69 00 67 00 68 00 74 00 31 00 36 00 </tableStyles fd 03 (00) <colors> d9 03 (00) </colors> da 03 (00) </styleSheet> 97 02 (00) The following records are involved : //Styles records public const int BIFF12_FONT = 0x2B; public const int BIFF12_FILL = 0x2D; public const int BIFF12_BORDER = 0x2E; public const int BIFF12_XF = 0x2F; public const int BIFF12_CELLSTYLE = 0x30; public const int BIFF12_STYLESHEET = 0x0296; public const int BIFF12_STYLESHEET_END = 0x0297; public const int BIFF12_COLORS = 0x03D9; public const int BIFF12_COLORS_END = 0x03DA; public const int BIFF12_DXFS = 0x03F9; public const int BIFF12_DXFS_END = 0x03FA; public const int BIFF12_TABLESTYLES = 0x03FC; public const int BIFF12_TABLESTYLES_END = 0x03FD; public const int BIFF12_FILLS = 0x04DB; public const int BIFF12_FILLS_END = 0x04DC; public const int BIFF12_FONTS = 0x04E3; public const int BIFF12_FONTS_END = 0x04E4; public const int BIFF12_BORDERS = 0x04E5; public const int BIFF12_BORDERS_END = 0x04E6; public const int BIFF12_CELLXFS = 0x04E9; public const int BIFF12_CELLXFS_END = 0x04EA; public const int BIFF12_CELLSTYLES = 0x04EB; public const int BIFF12_CELLSTYLES_END = 0x04EC; public const int BIFF12_CELLSTYLEXFS = 0x04F2; public const int BIFF12_CELLSTYLEXFS_END = 0x04F3;
Comments partComments are apparently complicated to store in the workbook model. Although comments have their own part, they are not referenced anywhere directly. Rather the worksheet where one or more comments are supposed to be attached to reference a legacy drawing part, VML markup, which describes a complex graphical construct that, thanks to some weird magic manages to relate the shape to draw the comments. It's unclear to me how not only BIN consumers/implementers but also XML consumers/implementers are expected to work with comments in a meaningful way. Here is an example of comments part in XML and in BIN : // This is what a comment part looks like in XML <?xml version="1.0" encoding="UTF-8" standalone="yes"?> <comments xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/5/main"> <authors> <author>Administrator</author> </authors> <commentList> <comment ref="C3" authorId="0"> <text> <r> <rPr><b/><sz val="8"/><color indexed="81"/><rFont val="Tahoma"/><charset val="1"/></rPr> <t>Administrator:</t> </r> <r> <rPr><sz val="8"/><color indexed="81"/><rFont val="Tahoma"/><charset val="1"/></rPr> <t xml:space="preserve">_x000A_new comment</t> </r> <r> <rPr><sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/> <scheme val="minor"/></rPr> <t/> </r> </text> </comment> <comment ref="C5" authorId="0"> <text> <r> <rPr><b/><sz val="8"/><color indexed="81"/><rFont val="Tahoma"/><charset val="1"/></rPr> <t>Administrator:</t> </r> <r> <rPr><sz val="8"/><color indexed="81"/><rFont val="Tahoma"/><charset val="1"/></rPr> <t xml:space="preserve">_x000A_a</t> </r> <r> <rPr><b/><sz val="8"/><color indexed="10"/><rFont val="Tahoma"/><family val="2"/></rPr> <t>noth</t> </r> <r> <rPr><sz val="8"/><color indexed="81"/><rFont val="Tahoma"/><charset val="1"/></rPr> <t>er</t> </r> <r> <rPr><sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/> <scheme val="minor"/></rPr> <t/> </r> </text> </comment> </commentList> </comments> // This is how, after breaking the BIN part in records, you can match // records to the XML markup. // Note that record identifiers are the two bytes on the left, the associated record length // is surrounded by parentheses, and it's followed by the record content itself. <comments> f4 04 (00) <authors> f6 04 (00) <author>Administrator</author> f8 04 (1e) 0d 00 00 00 41 00 64 00 6d 00 69 00 6e 00 69 00 73 00 74 00 72 00 61 00 74 00 6f 00 72 00 </authors> f7 04 (00) <commentList> f9 04 (00) <comment ref="C3" authorId="0"> fb 04 (14) 00 00 00 00 02 00 00 00 02 00 00 00 02 00 00 00 02 00 00 00 <text ...> fd 04 (49) 01 1a 00 00 00 41 00 64 00 6d 00 69 00 6e 00 69 00 73 00 74 00 72 00 61 00 74 00 6f 00 72 00 3a 00 0a 00 6e 00 65 00 77 00 20 00 63 00 6f 00 6d 00 6d 00 65 00 6e 00 74 00 03 00 00 00 00 00 02 00 0e 00 01 00 1a 00 00 00 </comment> fc 04 (00) <comment ref="C5" authorId="0"> fb 04 (14) 00 00 00 00 04 00 00 00 04 00 00 00 02 00 00 00 02 00 00 00 <text ...> fd 04 (49) 01 16 00 00 00 41 00 64 00 6d 00 69 00 6e 00 69 00 73 00 74 00 72 00 61 00 74 00 6f 00 72 00 3a 00 0a 00 61 00 6e 00 6f 00 74 00 68 00 65 00 72 00 05 00 00 00 00 00 02 00 0e 00 01 00 10 00 03 00 14 00 01 00 16 00 00 00 </comment> fc 04 (00) </commentList> fa 04 (00) </comments> f5 04 (00) The following records are involved : //Comment records public const int BIFF12_COMMENTS = 0x04F4; public const int BIFF12_COMMENTS_END = 0x04F5; public const int BIFF12_AUTHORS = 0x04F6; public const int BIFF12_AUTHORS_END = 0x04F7; public const int BIFF12_AUTHOR = 0x04F8; public const int BIFF12_COMMENTLIST = 0x04F9; public const int BIFF12_COMMENTLIST_END = 0x04FA; public const int BIFF12_COMMENT = 0x04FB; public const int BIFF12_COMMENT_END = 0x04FC; public const int BIFF12_TEXT = 0x04FD;
Drawings partDrawing parts are stored in XML, but there are interesting rules :
Table partThe Table part is a new concept in Excel 2007, adds major improvements to the List object concept introduced in Excel 2003. Whenever a Table object is created, it gets referenced in a worksheet as follows : // Excerpt from a worksheet part ... <tableParts count="1"> 94 05 (04) 01 00 00 00 <tablePart r:id="rId1"/> 95 05 (0c) 04 00 00 00 72 00 49 00 64 00 32 00 </tableParts> 96 05 (00) ... This directly related to a separate part stored in a parent tables folder known thanks to the relationship identifier (rId1) and the content of the relationship file (_rels/sheetxxx.bin.rels). Which brings us to the Table part itself : // This is what a table part looks like in XML <?xml version="1.0" encoding="UTF-8" standalone="yes"?> <table xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/5/main" id="1" name="Table1" displayName="Table1" ref="B2:D4" totalsRowShown="0"> <autoFilter ref="B2:D4"/> <tableColumns count="3"> <tableColumn id="1" name="Column1"/> <tableColumn id="2" name="2003"/> <tableColumn id="3" name="2004"/> </tableColumns> <tableStyleInfo name="TableStyleMedium9" showFirstColumn="0" showLastColumn="0" showRowStripes="1" showColumnStripes="0"/> </table> // This is how, after breaking the BIN part in records, you can match // records to the XML markup. // Note that record identifiers are the two bytes on the left, the associated record length // is surrounded by parentheses, and it's followed by the record content itself. <table xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/5/main" id="1" name="Table1" displayName="Table1" ref="B2:D4" totalsRowShown="0"> d7 02 (64) 01 00 00 00 03 00 00 00 01 00 00 00 03 00 00 00 00 00 00 00 01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 00 00 00 00 ff ff ff ff 06 00 00 00 54 00 61 00 62 00 6c 00 65 00 31 00 00 00 00 00 ff ff ff ff ff ff ff ff ff ff ff ff <autoFilter ref="B2:D4"> a1 01 (10) 01 00 00 00 03 00 00 00 01 00 00 00 03 00 00 00 </autoFilter> a2 01 (00) <tableColumns count="3"> d9 02 (04) 03 00 00 00 <tableColumn id="1" name="Column1"/> db 02 (3e) 01 00 00 00 00 00 00 00 ff ff ff ff ff ff ff ff ff ff ff ff 00 00 00 00 ff ff ff ff 07 00 00 00 43 00 6f 00 6c 00 75 00 6d 00 6e 00 31 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff </tableColumn> dc 02 (00) <tableColumn id="2" name="2003"> db 02 (38) 02 00 00 00 00 00 00 00 ff ff ff ff ff ff ff ff ff ff ff ff 00 00 00 00 ff ff ff ff 04 00 00 00 32 00 30 00 30 00 33 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff </tableColumn> dc 02 (00) <tableColumn id="3" name="2004"/> db 02 (38) 03 00 00 00 00 00 00 00 ff ff ff ff ff ff ff ff ff ff ff ff 00 00 00 00 ff ff ff ff 04 00 00 00 32 00 30 00 30 00 34 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff </tableColumn> dc 02 (00) </tableColumns> da 02 (00) <tableStyleInfo name="TableStyleMedium9" showFirstColumn="0" showLastColumn="0" showRowStripes="1" showColumnStripes="0"/> 81 04 (28) 04 00 11 00 00 00 54 00 61 00 62 00 6c 00 65 00 53 00 74 00 79 00 6c 00 65 00 4d 00 65 00 64 00 69 00 75 00 6d 00 39 00 </table> d8 02 (00) The following records are involved : //Table records public const int BIFF12_TABLE = 0x02D7; public const int BIFF12_TABLE_END = 0x02D8; public const int BIFF12_TABLECOLUMNS = 0x02D9; public const int BIFF12_TABLECOLUMNS_END= 0x02DA; public const int BIFF12_TABLECOLUMN = 0x02DB; public const int BIFF12_TABLECOLUMN_END = 0x02DC; public const int BIFF12_AUTOFILTER = 0x01A1; public const int BIFF12_AUTOFILTER_END = 0x01A2; public const int BIFF12_FILTERCOLUMN = 0x01A3; public const int BIFF12_FILTERCOLUMN_END= 0x01A4; public const int BIFF12_FILTERS = 0x01A5; public const int BIFF12_FILTERS_END = 0x01A6; public const int BIFF12_FILTER = 0x01A7; public const int BIFF12_TABLESTYLEINFO = 0x0481; public const int BIFF12_SORTSTATE = 0x0492; public const int BIFF12_SORTCONDITION = 0x0494; public const int BIFF12_SORTSTATE_END = 0x0495;
Query Table partThe Query Table part defines a data source. A query table is never referenced in a worksheet part. Rather, it's an implicit field in a Table part, and it's only made explicit in the relationships file associated to the table (i.e. tablexxx.bin.rels). Here is how a Query Table part looks like : <queryTable name="Database1" connectionId="1" ...> bf 03 (20) 49 1a 10 00 10 00 01 00 00 00 connectionId 09 00 00 00 length of string to follow (in characters, not bytes) 44 00 61 00 74 00 61 00 62 00 61 00 73 00 65 00 31 00 query table name <queryTableRefresh nextId="5"> c1 03 (0a) 17 00 05 00 00 00 00 00 00 00 <queryTableFields count="4"> c7 03 (04) 04 00 00 00 <queryTableField id="1" name="ID" tableColumnId="1"> c9 03 (14) 10 00 00 00 01 00 00 00 query table field id 01 00 00 00 table column id 02 00 00 00 length of string to follow (in characters, not bytes) 49 00 44 00 query table field name </queryTableField> ca 03 (00) <queryTableField id="2" name="TB_Name" tableColumnId="2"> c9 03 (1e) 10 00 00 00 02 00 00 00 query table field id 02 00 00 00 table column id 07 00 00 00 length of string to follow (in characters, not bytes) 54 00 42 00 5f 00 4e 00 61 00 6d 00 65 00 query table field name </queryTableField> ca 03 (00) <queryTableField id="3" name="TB_AGE" tableColumnId="3"> c9 03 (1c) 10 00 00 00 03 00 00 00 query table field id 03 00 00 00 table column id 06 00 00 00 length of string to follow (in characters, not bytes) 54 00 42 00 5f 00 41 00 47 00 45 00 query table field name </queryTableField> ca 03 (00) <queryTableField id="4" name="TB_COUNTRY" tableColumnId="4"> c9 03 (24) 10 00 00 00 04 00 00 00 query table field id 04 00 00 00 table column id 0a 00 00 00 length of string to follow (in characters, not bytes) 54 00 42 00 5f 00 43 00 4f 00 55 00 4e 00 54 00 52 00 59 00 query table field name </queryTableField> ca 03 (00) </queryTableFields> c8 03 (00) </queryTableRefresh> c2 03 (00) </queryTable> c0 03 (00) The following records are involved : //QueryTable records public const int BIFF12_QUERYTABLE = 0x03BF; public const int BIFF12_QUERYTABLE_END = 0x03C0; public const int BIFF12_QUERYTABLEREFRESH = 0x03C1; public const int BIFF12_QUERYTABLEREFRESH_END = 0x03C2; public const int BIFF12_QUERYTABLEFIELDS = 0x03C7; public const int BIFF12_QUERYTABLEFIELDS_END = 0x03C8; public const int BIFF12_QUERYTABLEFIELD = 0x03C9; public const int BIFF12_QUERYTABLEFIELD_END = 0x03CA;
Connections partThe Connections part is a connection string to a data source. Unlike how Query Table parts relate to Table parts, Connections parts don't relate to Query Table parts by way of relationships (there is no querytablexxx.bin.rels file). Instead, a Query Table part has a connection id attribute in a <queryTable> element (BIFF12_QUERYTABLE record). Connections part files are made available at the workbook level, i.e. shared across all worksheet and related objects, which is the reason why Connections parts relate to the Workbook relationships (workbook.bin.rels). Here is how a Connections part looks like : <connections> ad 03 (00) <connection id="1" sourceFile="C:\Database1.mdb" keepAlive="1" name="Database1" type="5" refreshedVersion="3" background="1" saveData="1"> c9 01 (51) 03 00 02 00 00 00 51 00 09 00 05 00 00 00 01 00 00 00 01 00 00 00 00 10 00 00 00 43 00 3a 00 5c 00 44 00 61 00 74 00 61 00 62 00 61 00 73 00 65 00 31 00 2e 00 6d 00 64 00 62 00 09 00 00 00 44 00 61 00 74 00 61 00 62 00 61 00 73 00 65 00 31 00 <dbPr connection=Provider=Microsoft.ACE.OLEDB.12.0;User ID=Admin;Data Source=C:\Database1.mdb; Mode=Share Deny Write;Extended Properties="";Jet OLEDB:System database="";Jet OLEDB:Registry Path=""; Jet OLEDB:Engine Type=5;Jet OLEDB:Database Locking Mode=0;Jet OLEDB:Global Partial Bulk Ops=2; Jet OLEDB:Global Bulk Transactions=1;Jet OLEDB:New Database Password="";Jet OLEDB:Create System Database=False;Jet OLEDB:Encrypt Database=False;Jet OLEDB:Don't Copy Locale on Compact=False; Jet OLEDB:Compact Without Replica Repair=False;Jet OLEDB:SFP=False;Jet OLEDB: Support Complex Data=False" command="Table1" commandType="3"> cb 01 (81 09) 03 00 00 00 02 34 02 00 00 50 00 72 00 6f 00 76 00 69 00 64 00 65 00 72 00 3d 00 4d 00 69 00 63 00 72 00 6f 00 73 00 6f 00 66 00 74 00 2e 00 41 00 43 00 45 00 2e 00 4f 00 4c 00 45 00 44 00 42 00 2e 00 31 00 32 00 2e 00 30 00 3b 00 55 00 73 00 65 00 72 00 20 00 49 00 44 00 3d 00 41 00 64 00 6d 00 69 00 6e 00 3b 00 44 00 61 00 74 00 61 00 20 00 53 00 6f 00 75 00 72 00 63 00 65 00 3d 00 43 00 3a 00 5c 00 44 00 61 00 74 00 61 00 62 00 61 00 73 00 65 00 31 00 2e 00 6d 00 64 00 62 00 3b 00 4d 00 6f 00 64 00 65 00 3d 00 53 00 68 00 61 00 72 00 65 00 20 00 44 00 65 00 6e 00 79 00 20 00 57 00 72 00 69 00 74 00 65 00 3b 00 45 00 78 00 74 00 65 00 6e 00 64 00 65 00 64 00 20 00 50 00 72 00 6f 00 70 00 65 00 72 00 74 00 69 00 65 00 73 00 3d 00 22 00 22 00 3b 00 4a 00 65 00 74 00 20 00 4f 00 4c 00 45 00 44 00 42 00 3a 00 53 00 79 00 73 00 74 00 65 00 6d 00 20 00 64 00 61 00 74 00 61 00 62 00 61 00 73 00 65 00 3d 00 22 00 22 00 3b 00 4a 00 65 00 74 00 20 00 4f 00 4c 00 45 00 44 00 42 00 3a 00 52 00 65 00 67 00 69 00 73 00 74 00 72 00 79 00 20 00 50 00 61 00 74 00 68 00 3d 00 22 00 22 00 3b 00 4a 00 65 00 74 00 20 00 4f 00 4c 00 45 00 44 00 42 00 3a 00 45 00 6e 00 67 00 69 00 6e 00 65 00 20 00 54 00 79 00 70 00 65 00 3d 00 35 00 3b 00 4a 00 65 00 74 00 20 00 4f 00 4c 00 45 00 44 00 42 00 3a 00 44 00 61 00 74 00 61 00 62 00 61 00 73 00 65 00 20 00 4c 00 6f 00 63 00 6b 00 69 00 6e 00 67 00 20 00 4d 00 6f 00 64 00 65 00 3d 00 30 00 3b 00 4a 00 65 00 74 00 20 00 4f 00 4c 00 45 00 44 00 42 00 3a 00 47 00 6c 00 6f 00 62 00 61 00 6c 00 20 00 50 00 61 00 72 00 74 00 69 00 61 00 6c 00 20 00 42 00 75 00 6c 00 6b 00 20 00 4f 00 70 00 73 00 3d 00 32 00 3b 00 4a 00 65 00 74 00 20 00 4f 00 4c 00 45 00 44 00 42 00 3a 00 47 00 6c 00 6f 00 62 00 61 00 6c 00 20 00 42 00 75 00 6c 00 6b 00 20 00 54 00 72 00 61 00 6e 00 73 00 61 00 63 00 74 00 69 00 6f 00 6e 00 73 00 3d 00 31 00 3b 00 4a 00 65 00 74 00 20 00 4f 00 4c 00 45 00 44 00 42 00 3a 00 4e 00 65 00 77 00 20 00 44 00 61 00 74 00 61 00 62 00 61 00 73 00 65 00 20 00 50 00 61 00 73 00 73 00 77 00 6f 00 72 00 64 00 3d 00 22 00 22 00 3b 00 4a 00 65 00 74 00 20 00 4f 00 4c 00 45 00 44 00 42 00 3a 00 43 00 72 00 65 00 61 00 74 00 65 00 20 00 53 00 79 00 73 00 74 00 65 00 6d 00 20 00 44 00 61 00 74 00 61 00 62 00 61 00 73 00 65 00 3d 00 46 00 61 00 6c 00 73 00 65 00 3b 00 4a 00 65 00 74 00 20 00 4f 00 4c 00 45 00 44 00 42 00 3a 00 45 00 6e 00 63 00 72 00 79 00 70 00 74 00 20 00 44 00 61 00 74 00 61 00 62 00 61 00 73 00 65 00 3d 00 46 00 61 00 6c 00 73 00 65 00 3b 00 4a 00 65 00 74 00 20 00 4f 00 4c 00 45 00 44 00 42 00 3a 00 44 00 6f 00 6e 00 27 00 74 00 20 00 43 00 6f 00 70 00 79 00 20 00 4c 00 6f 00 63 00 61 00 6c 00 65 00 20 00 6f 00 6e 00 20 00 43 00 6f 00 6d 00 70 00 61 00 63 00 74 00 3d 00 46 00 61 00 6c 00 73 00 65 00 3b 00 4a 00 65 00 74 00 20 00 4f 00 4c 00 45 00 44 00 42 00 3a 00 43 00 6f 00 6d 00 70 00 61 00 63 00 74 00 20 00 57 00 69 00 74 00 68 00 6f 00 75 00 74 00 20 00 52 00 65 00 70 00 6c 00 69 00 63 00 61 00 20 00 52 00 65 00 70 00 61 00 69 00 72 00 3d 00 46 00 61 00 6c 00 73 00 65 00 3b 00 4a 00 65 00 74 00 20 00 4f 00 4c 00 45 00 44 00 42 00 3a 00 53 00 46 00 50 00 3d 00 46 00 61 00 6c 00 73 00 65 00 3b 00 4a 00 65 00 74 00 20 00 4f 00 4c 00 45 00 44 00 42 00 3a 00 53 00 75 00 70 00 70 00 6f 00 72 00 74 00 20 00 43 00 6f 00 6d 00 70 00 6c 00 65 00 78 00 20 00 44 00 61 00 74 00 61 00 3d 00 46 00 61 00 6c 00 73 00 65 00 06 00 00 00 54 00 61 00 62 00 6c 00 65 00 31 00 </dbPr> cc 01 (00) </connection> ca 01 (00) </connections> ae 03 (00) The following records are involved : //Connection records public const int BIFF12_CONNECTIONS = 0x03AD; public const int BIFF12_CONNECTIONS_END = 0x03AE; public const int BIFF12_CONNECTION = 0x01C9; public const int BIFF12_CONNECTION_END = 0x01CA; public const int BIFF12_DBPR = 0x01CB; public const int BIFF12_DBPR_END = 0x01CC;
Pivot table partI haven't been very far in the pivot table part area. What I can say however is that :
Printer settings partThe printer settings part is really just a binary dump of the WIN32 DEVMODE structure (see MSDN for more information). This information was stored in earlier Excel versions (97 and above) as the BIFF8 [PLS] record.
Index and Calculation Chain partsThose are caches that only serve the Excel run-time. They contain direct offsets to the content in worksheet BIN parts and are thus unsuitable for programming purposes. Fortunately, you can alter BIN parts without worrying of corrupting the workbook in this particular case, because it does not harm to leave the index and calculation chain parts unsynched.
A word on password-protected documentsWhile not strictly related to BIN file formats, if you happen to password-protect Excel 2007 workbooks, then the resulting file, no matter whether it'a .XLSX or not, will be encrypted using RC4 in an OLE container. The entire file is encrypted in the EncryptedPackage stream. Here is a screen capture of a password-protected workbook as viewed in an OLE document viewer :
Needless to say, password-protected workbooks are not expected to be used programmatically...
In which order should parts be read?
How does one get the value of a cell?When reading a worksheet part, individual cells are part of a block of records inside BIFF12_SHEETDATA and BIFF12_SHEETDATA_END. Excel stores cells row by row, meaning that there is a record which identifies a given row (including information such as the row style, height, whether it's hidden or not, ...), then follows an arbitrary amount of actual cell records identified by the value they store and whether or not their value is governed by a formula. For each cell record, the column is provided (including other informations). If a cell stores a shared string, then the value is obtained from the index by looking up the shared strings table. Objects such as hyperlinks, tables, chart, named ranges, pivot tables are defined on top of these cells and defined elsewhere, either at after the BIFF12_SHEETDATA block of the corresponding worksheet part, or in other parts (named ranges are defined in the workbook part, so that they can be shared across all worksheets, internal and external).
How does one get the style of a cell?As said previously, a cell stores formatting style information. Individual cells store a formatting style, a 0-based index in the styles table. In general, those styles are individual cell styles, and refer to the <cellXfs> collection of individual <xf> styles. In turn, each <xf> has an index to the following collections : number formats, borders, fonts, alignment, and fill pattern. Whenever the cell stores an inline rich string, or has an index to a shared string which in turn is a rich string, then the formatting style of the cell is defined by the formatting runs stored as part of the rich string. Each formatting run defines a style for a fraction of the text.
Final words, and links to the source code, againThe BIFF12 reader presented in this article, provides in the sample code below is a work in progress into the Office 2007 .bin file format which, as we have seen, encompasses a number of underlying file formats. The code provided in C++ and in C# is the basis of a read/write/manipulation library thanks to the fact that record handlers are really entirely responsible for reading/writing/manipulating the corresponding records (and that's why there are so many classes). To turn the existing source code into a real manipulation library, you'll have to create instances of individual records. For instance, in C# instead of doing this : // retrieve the associated record handler to the record identifier (h is a hashtable) BaseRecord recHandler = (BaseRecord) h[recid]; You'll have to do this : // create a record handler instance based on the record identifier (h is a hashtable) BaseRecord rec = (BaseRecord) Activator.CreateInstance(h[recid].GetType()); And of course implement a Write() method for each record handler.
My goal was to decipher most undocumented .bin file formats inside ZIP files, and come up with a way to read the values in the cells of an arbitrary Excel 2007 workbook regardless the file format. If you feel like augmenting the reverse engineering done so far such as the actual deserialization of the less important individual records, then feel free to do so and drop a line (I can merge your work into this source code). Download OLE read/write (C++) - 9 Kb |
Home Blog |