Thursday, October 18, 2007

Open XML File Format

The 2007 Microsoft® Office system introduces a new XML-based file format called the Open XML Formats. Microsoft Office Word 2007, Microsoft Office Excel® 2007, and Microsoft Office PowerPoint® 2007 all use these formats by default. The Open XML Formats are useful for developers because they are an open standard built on two well-known technologies: ZIP and XML. Microsoft provides a library for accessing these files in the System.IO.Packaging namespace, which ships with the .NET Framework 3.0 (formerly WinFX). The Open XML Format SDK is built on top of the System.IO.Packaging API and provides strongly typed part classes for manipulating Open XML documents.

Office Open XML (OpenXML) is an open standard for word-processing documents, presentations, and spreadsheets that can be freely implemented by multiple applications on different platforms. OpenXML is designed to faithfully represent existing word-processing documents, presentations, and spreadsheets that are encoded in the binary formats defined by Microsoft Office applications. The need for OpenXML is simple: billions of documents now exist, but the information in them is tightly coupled to the programs that created them. The purpose of the OpenXML standard is to decouple these documents from the Microsoft Office applications that created them, so that other applications can manipulate them without depending on proprietary formats and without losing data.

Structure of an OpenXML Package

An OpenXML file is stored in a ZIP archive for packaging and compression. You can view the structure of any OpenXML file using a ZIP viewer. An OpenXML document is built of multiple parts. The relationships between the parts are themselves stored in parts. The ZIP format supports random access to each part. For example, an application can move a slide from one Microsoft Office PowerPoint 2007 presentation to another presentation without parsing the slide content. Likewise, an application can strip all of the comments out of a word processing document without parsing any of its contents.
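
Because a package is an ordinary ZIP archive, any ZIP library can list its parts and pull out a single one. The following is a minimal Python sketch using only the standard zipfile module. The helper names list_parts and read_part are made up for this illustration and are not part of any Microsoft API.

```python
import io
import zipfile


def list_parts(package):
    """Return the names of all ZIP entries in an OpenXML package.

    `package` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(package) as zf:
        return zf.namelist()


def read_part(package, part_name):
    """Return the raw bytes of one part.

    Only the requested entry is decompressed; the other parts are
    located through the ZIP central directory but never read.
    """
    with zipfile.ZipFile(package) as zf:
        # Part names start with "/", ZIP entry names do not.
        return zf.read(part_name.lstrip("/"))
```

For example, list_parts("report.docx") would return the entry names of a hypothetical Word 2007 file called report.docx, and read_part("report.docx", "/word/document.xml") would return the bytes of that one part, assuming the file contains it.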

Most parts in an OpenXML package are XML markup. Because XML is structured plain text, you can view the contents of a part in any text editor, or you can query the contents with a technology such as XPath.
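
As an illustration of querying a part, here is a small Python sketch that collects the text of each paragraph from the XML of a Word main document part. It relies on the standard xml.etree.ElementTree module, which supports only a limited subset of XPath. The element names w:p and w:t and the namespace URI come from the WordprocessingML schema rather than from this post, and the function name paragraph_texts is invented for the example.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (conventionally bound to the prefix "w").
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_texts(document_xml):
    """Return the text of each w:p paragraph in a main document part.

    `document_xml` is the XML of the part as a str or bytes object.
    """
    root = ET.fromstring(document_xml)
    ns = {"w": W_NS}
    paragraphs = []
    for p in root.findall(".//w:p", ns):
        # A paragraph's text is spread over one or more w:t elements.
        pieces = [t.text or "" for t in p.findall(".//w:t", ns)]
        paragraphs.append("".join(pieces))
    return paragraphs
```

This sketch ignores tabs, line breaks, and other run content, so it is a starting point for text extraction rather than a complete reader.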

Structurally, an OpenXML document is an Open Packaging Conventions (OPC) package. As stated previously, a package is composed of a collection of parts. Each part has a part name that consists of a sequence of segments forming a path, such as "/word/theme/theme1.xml". The package contains a [Content_Types].xml item that lets you determine the content type of every part in the package. The set of explicit relationships for a source package or part is stored in a relationships part whose name ends with the .rels extension.
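
The content-type lookup and the package-level relationships described above can be sketched in Python as follows. The namespace URIs, the Default and Override element names, and the "_rels/.rels" location are taken from the OPC specification, not from this post. The function names are invented for the example, and the lookup is simplified: OPC treats part names as case-insensitive, which this sketch does not handle.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def content_types(package):
    """Return (defaults, overrides) parsed from [Content_Types].xml.

    defaults maps a file extension to a content type; overrides maps a
    full part name to a content type.
    """
    with zipfile.ZipFile(package) as zf:
        root = ET.fromstring(zf.read("[Content_Types].xml"))
    defaults = {
        d.get("Extension").lower(): d.get("ContentType")
        for d in root.findall(f"{{{CT_NS}}}Default")
    }
    overrides = {
        o.get("PartName"): o.get("ContentType")
        for o in root.findall(f"{{{CT_NS}}}Override")
    }
    return defaults, overrides


def content_type_of(part_name, defaults, overrides):
    """Resolve a part's content type: an Override wins over a Default."""
    if part_name in overrides:
        return overrides[part_name]
    extension = part_name.rsplit(".", 1)[-1].lower()
    return defaults.get(extension)


def package_relationships(package):
    """Return (Id, Type, Target) tuples from the package-level .rels part."""
    with zipfile.ZipFile(package) as zf:
        root = ET.fromstring(zf.read("_rels/.rels"))
    return [
        (r.get("Id"), r.get("Type"), r.get("Target"))
        for r in root.findall(f"{{{REL_NS}}}Relationship")
    ]
```

Following the relationship whose Type ends in "officeDocument" is the usual way to find the main part of a package without hard-coding its path.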

Word 2007 documents are defined using WordprocessingML markup. A document is composed of a collection of stories where each story is one of the following:

  • Main document (the only required story)
  • Glossary document
  • Header and footer
  • Comments
  • Text box
  • Footnote and endnote

PowerPoint 2007 presentations are described by PresentationML markup. Presentation packages can contain the following parts:

  • Slide master
  • Notes master
  • Handout master
  • Slide layout
  • Notes

Excel 2007 workbooks are described by SpreadsheetML markup. Workbook packages can contain:

  • Workbook part (required part)
  • One or more worksheets
  • Charts
  • Tables
  • Custom XML
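
To show how the required workbook part ties a package together, the following Python sketch lists the worksheet names declared in it. The path "xl/workbook.xml" is only the conventional location and is an assumption here; a robust reader would resolve the workbook part through the package relationships instead. The sheets and sheet element names and the namespace URI come from the SpreadsheetML schema, and the function name sheet_names is invented for the example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace.
S_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def sheet_names(package, workbook_part="xl/workbook.xml"):
    """Return worksheet names in the order the workbook part declares them.

    `workbook_part` defaults to the conventional location, which is an
    assumption and not guaranteed by the format.
    """
    with zipfile.ZipFile(package) as zf:
        root = ET.fromstring(zf.read(workbook_part))
    sheets = root.findall("s:sheets/s:sheet", {"s": S_NS})
    return [sheet.get("name") for sheet in sheets]
```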

Reference:

www.msdn2.microsoft.com