UNIVERSITY OF GLASGOW

Effective Records Management Project

Structured Authoring in Microsoft Word

Pete Johnston

May 1998

Abstract

Document structure has traditionally not been made explicit. Rather, it has been conveyed to human readers indirectly, through the use of presentational conventions. Increasingly, there is a demand to be able to manipulate the component parts of documents using software tools, and an explicit description of structure is a prerequisite for such processing.

There exists an open standard, the Standard Generalized Markup Language (SGML), which provides for the description of document structure, and there is an increasing number of "SGML-aware" authoring tools. However, the vast majority of electronic documents continue to be created in proprietary encoded forms using word processing and desktop publishing packages, where authoring has tended to focus on form rather than structure. These proprietary tools do often incorporate features which allow some structural information to be added to a document. The presence of that structural data may facilitate translation to other encoded forms, but it also brings benefits for the manipulation of data within the proprietary authoring environment.

The problems

1. Document structure

A description of the structure of an object identifies its component parts and the nature of the relationships between those parts.[1]

In the case of a document, perhaps the most easily identifiable structural characteristic is that it consists of a number of pages, or a series of volumes each containing a number of pages. This "physical" view of structure is helpful if we are interested in a document as an item to be shelved or bound or boxed, but less so if we view a document as "a collection of information processed as a unit".[2]

The discussion here is concerned primarily with the "logical structure" of a document, its division into components which are identified by their "functional" role in relation to the whole. A book may be divided into chapters and sections, an article into sections, which in turn contain sub-sections, paragraphs, lists, captions, footnotes, and so on. In the terminology of structured information, these "functional" classes of components are referred to as element types, each occurrence of a member of that class (e.g. an individual section or paragraph) being an element. If elements are to be referred to, they require unique identification, in the form of individual naming or numbering: e.g. "Chapter 1", "Section 3.1.3" etc.

Documents frequently display a hierarchical structure, with elements nested within one another. It is a device by which units of information can be created, and subsequently manipulated, at varying levels of "granularity": some processes operate on relatively high-level units like "chapters", others require a more detailed distinction between lower-level components.

Documents themselves can be classified as document types which share a common logical structure: all documents of the same document type contain the same element types, organised in broadly the same relationship to one another, though the content of those elements will be specific to the individual document. For example, all documents of the type "article" may contain a "title", an "author", and a series of "sections", each in turn containing a "heading" followed by a number of "paragraphs". Individual articles will contain variable numbers of sections containing variable numbers of paragraphs, and of course the text contained within those individual elements will be different. Some element types (like "title") may occur within all articles: they are a mandatory component of that document type.
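
As an illustration only (Python is not part of the original discussion, and the class and function names below are assumptions made for the example), the "article" document type described above might be sketched as a small data model. Two articles differ in content and in the number of sections and paragraphs, but share the same element types.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    heading: str
    paragraphs: List[str] = field(default_factory=list)


@dataclass
class Article:
    title: str   # mandatory in this sketch of the "article" type
    author: str
    sections: List[Section] = field(default_factory=list)


def element_types(article: Article) -> List[str]:
    """List the element types that occur in an article, in first-seen order."""
    found = ["title", "author"]
    for section in article.sections:
        for name in ("section", "heading"):
            if name not in found:
                found.append(name)
        if section.paragraphs and "paragraph" not in found:
            found.append("paragraph")
    return found


# Two hypothetical articles with different content and different sizes.
short_article = Article("On Styles", "A. Author",
                        [Section("Introduction", ["One paragraph."])])
long_article = Article("On Templates", "B. Author", [
    Section("Background", ["First.", "Second."]),
    Section("Method", ["Third."]),
])
```

Calling element_types on either article returns the same list of element types, which is the sense in which both belong to one document type.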

2. Structure and form

Clearly, both authors and readers create and recognise these sorts of structural components with ease, yet most documents contain no explicit description of structure: there are (usually) no labels identifying a piece of text as a heading or an author's name. The key to human readers' and authors' capacity to communicate structural information lies in the use of presentational conventions to convey cues about structure, and particularly about the identification of element types. The reader recognises that the piece of text which the author has centred and formatted in a large font size and bold face is the title of the article; the indented block of text is a quotation; the italicised string in a bibliography entry is the title of a cited work, and so on.

Changes in formatting within a document do not occur at arbitrary points. On the contrary, changes in presentation act as cues to the human reader to signal the beginning or end of a structural component of the document.

In simple cases, it may be possible to establish an unambiguous set of correspondences - every piece of text which has been centred and rendered in font Arial and size 18 is a main title; every piece of text in font Arial, size 14 and face bold is a section heading; and so on, to map every combination of formatting characteristics to an element type.

However, in most documents there is no one-to-one correspondence between presentational forms and structural components. For example, a piece of text in a paragraph may be italicised to indicate a technical term, a non-English expression, the title of a book or the title of a journal.

A human reader makes use of other (often quite subtle and complex) contextual information to differentiate between these uses, and so to establish structure from form. It would be almost impossible for a computer to do so. Form is determined by structure, but structure can not be inferred from form.[3]
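
The asymmetry can be sketched in a few lines of Python. This is an illustration under stated assumptions: the formatting tuples and the "MainTitle" and "SectionHeading" labels are invented for the example, and the four italic element types are those named later in this paper. Looking up the form for a given element type always gives one answer; looking up the element type for a given form may give several.

```python
from collections import defaultdict

# Hypothetical (formatting, element type) pairs observed in one document.
observations = [
    (("Arial", 18, "centred"), "MainTitle"),
    (("Arial", 14, "bold"), "SectionHeading"),
    (("italic",), "Term"),
    (("italic",), "Non-English"),
    (("italic",), "BookTitle"),
    (("italic",), "JrnlTitle"),
]


def structure_to_form(pairs):
    """Each element type determines exactly one form."""
    return {element: form for form, element in pairs}


def form_to_structure(pairs):
    """A form may correspond to several candidate element types."""
    candidates = defaultdict(set)
    for form, element in pairs:
        candidates[form].add(element)
    return dict(candidates)


# Forms from which the element type cannot be inferred unambiguously.
ambiguous = {
    form: elements
    for form, elements in form_to_structure(observations).items()
    if len(elements) > 1
}
```

Here only the italic form ends up in ambiguous, with four candidate element types, while the two heading forms map back to a single element type each.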

3. Why does structure matter?

The close association between form and structure was not a problem while the principal goal of creating an electronic document was to render text for output on paper and there was no expectation that there would be any automated processing of components within that document.

However in the publishing industry, a document typically underwent a considerable degree of automated processing between its authorship and its rendition on paper. Now, the volume of documents stored and distributed in electronic forms (especially via the WWW) dictates that there is a growing need to be able to perform more sophisticated processing (indexing, searching, abstracting, rendering in new media and new encoding formats) in ways which require that the processing applications can access structural data about those documents. Even basic navigational tools like tables of contents rely on the capacity to distinguish elements like section headings from surrounding text. Consider the element types noted in the previous section as being identified by italicisation: it may be necessary to compile an index which includes technical terms, but not foreign language expressions, and that can not be achieved on the basis of the form of those elements.

More generally, having been presented with an unparalleled capacity to exchange information in document form, users want the ability to process that information as they would the contents of a database.[4]

4. The creation of structured documents

The structuring of a document requires a particular approach by the document creator. Indeed, it involves some extra thought and work at this stage. The purpose of this extra work, in the form of adding explicit structural information, is to "add value" to the document: it is an investment of labour which should be repaid in the form of making that document's information content more usable.

There exist technologies and tools which are designed for the creation and manipulation of documents in a form which emphasises structure rather than form. Information about the form of text, its presentational characteristics, tends to be proprietary to the particular software with which a document is being edited or viewed. The information which tells WordPerfect to display a piece of text in Arial font size 14 is quite different from that which tells Microsoft Word to do so, for example. In contrast, structure, the description of what an element is (rather than what it looks like), can be described in ways which are not dependent on processing by proprietary software, and there exist standard encoding formats designed for this purpose.[5]

However, it is recognised that the vast majority of electronic documents will continue to be created by simple text editors and proprietary word processing and desktop publishing packages. Within many of these packages, there do exist features which, if applied appropriately, can be used to create a description of structure.

One reason for encouraging a structured approach to document creation within a word processing environment is that it facilitates automated conversion to other digital forms.[6] The principle that documents have structure is the basis of encoding formats such as the HyperText Markup Language (HTML), the publishing language of the World Wide Web, and the generation of an HTML-encoded version of a document is made considerably easier if the source document itself is well structured. Even without such conversion, however, the structuring of a document allows both the author and the user to take advantage of some of the possibilities offered by the presence of explicit structural data within the proprietary authoring environment.[7] The discussion which follows applies to features available within Microsoft Word.[8] It is not intended as a comprehensive account of the techniques available but rather an indication of the possibilities.

5. Word styles and element types

A Word style is a named collection of formatting characteristics such as font, face, paragraph indentation and so on, which can be applied either to a paragraph or to a string within a paragraph. The style feature is designed as a shortcut to enable the document creator to save and reuse such a group of formatting characteristics which is applied repeatedly throughout a document.

However, formatting is not applied to parts of a document arbitrarily, but rather in order to provide a means of identifying the logical components of the document. A document creator applies a particular Word style repeatedly to selected pieces of text because those pieces of text are structurally similar - they are of the same element type. By the simple device of giving the style a name which corresponds to the element type (e.g. "Section Heading") rather than just the presentational characteristics (e.g. "Big and Bold"), the document creator takes an enormous step towards adding useful data about logical structure to the document.

A document which applies Word styles in order to identify logical structure rather than simply as a formatting shortcut will probably employ a larger number of named styles. For example, with the four different structural element types identified by italicised text, the presentational effect could be achieved through the blanket application of a single "Italic Text" style. However, to distinguish structurally the four different element types, four distinct styles would be needed, "Term", "Non-English", "BookTitle" and "JrnlTitle", even though they would all contain exactly the same formatting characteristics.[9]

6. Word templates and document types

Word allows for a collection of styles (together with other Word constructs) to be stored together as a template, so that they can be recalled for reuse in the creation of other documents. If a Word style corresponds to an individual element type, then a template corresponds to a document type. For example, if all "memo" documents contain a standard set of component parts and should be formatted in the same way, then those component parts will correspond to a set of named Word styles ("Addressee", "Sender", "Date", "Title", "Text" etc.). That set of styles can be saved in a template called "Memo" which can then be reused for any subsequent document of that type.[10]

In this way, a template serves not only as a stylistic, presentational model, but as the basis of a structural model also. If a document author finds themselves needing to apply formatting which is not provided for by a style within the template, it perhaps indicates that they have encountered a structural component which was not foreseen by the designers of the template and which requires the addition of a new Word style to accommodate that element type. In the course of composing this document, for example, a new character style was introduced to describe "Non-English" expressions.

7. Benefits of structured authoring in Word

7.1 Consistency of form and of structure

The use of Word templates in general assists in the implementation of institutional standards for the form or "style" of common document types - it makes document creation quicker and provides for consistency of formatting.

The use of style names which identify the components' structural role rather than their appearance facilitates the implementation of structural standards (i.e. the specification of the elements which a document type should contain) as well as presentational ones - though Word can not offer automatic validation of structure.

7.2 Potential for reformatting document components

Using style names to identify element types also introduces a degree of separation between content and form. For example, it allows me to present one of my four italicised element types ("Term", "Non-English", "BookTitle", "JrnlTitle") as bold face instead, simply by changing the specification of that one character style while leaving the others unchanged.
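
A rough model of this separation, written in Python purely for illustration (the dictionary layout and the sample text runs are assumptions, not a description of how Word stores styles), treats each style as a named set of formatting properties and each run of text as carrying only a style name.

```python
# Four character styles named after element types. They start out with
# identical formatting, as described in the text.
styles = {
    "Term": {"italic": True, "bold": False},
    "Non-English": {"italic": True, "bold": False},
    "BookTitle": {"italic": True, "bold": False},
    "JrnlTitle": {"italic": True, "bold": False},
}

# Hypothetical runs of text: each carries a style name, not formatting.
runs = [
    ("Term", "element type"),
    ("Non-English", "ad hoc"),
    ("BookTitle", "The SGML Handbook"),
]


def render(runs, styles):
    """Resolve each run's style name to its current formatting."""
    return [(text, styles[name]) for name, text in runs]


# Redefine one style only: terms become bold instead of italic.
styles["Term"] = {"italic": False, "bold": True}
rendered = render(runs, styles)
```

Because the runs refer to styles by name, redefining "Term" changes every term in the document and nothing else. With a single "Italic Text" style that selective change would not be possible.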

More usefully perhaps, it allows the author to reformat a whole document to give the same content and structure a completely different appearance or "style", simply by attaching a second template. The second template contains a second collection of Word styles with the same style names as the initial one, but with quite different formatting characteristics associated with those style names. Given that one formatting option which can be specified in a Word style is the ability to "hide" the styled text, I could attach a template which renders the document in some sort of "summary form" by displaying only "section headings" and "key points" and hiding "item text".
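
The effect of attaching a second template can be sketched as follows. This is a simplified Python model, assuming hypothetical style names ("SectionHeading", "KeyPoint", "ItemText") and a "hidden" property standing in for Word's hidden-text formatting option.

```python
# Two templates define the same style names with different formatting.
full_template = {
    "SectionHeading": {"bold": True, "hidden": False},
    "KeyPoint": {"bold": False, "hidden": False},
    "ItemText": {"bold": False, "hidden": False},
}
summary_template = {
    "SectionHeading": {"bold": True, "hidden": False},
    "KeyPoint": {"bold": True, "hidden": False},
    "ItemText": {"bold": False, "hidden": True},
}

# The document itself records only content and style names.
document = [
    ("SectionHeading", "Benefits"),
    ("KeyPoint", "Styles separate content from form."),
    ("ItemText", "A longer explanation that a summary can omit."),
]


def visible_text(paragraphs, template):
    """Return the text a reader would see with the given template attached."""
    return [text for style, text in paragraphs
            if not template[style]["hidden"]]
```

The same document yields three visible paragraphs under full_template and two under summary_template, without any change to its content or to the styles applied.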

In the HTML version of this document, the links which follow provide access to two RTF versions of this same document, each having exactly the same content, and with styles of the same name applied to that content, but their appearance is quite different because the styles exist in separate templates and have completely different formatting characteristics associated with them:

7.3 Potential for processing document components

Element-specific processing is now possible. Using the usual Edit-Find menu function within Word, it is possible to search for text in a named style, and since style names reflect the styled text's logical identity rather than its appearance, searching becomes available on the basis of that logical identity. I can search for any text in a named style, and thus have the option of creating element-specific indices (e.g. separate indices for personal names, place names etc.). Any named paragraph styles - any structural components or element types - can be used as the basis of an automatically generated table of contents.
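
The principle behind these features can be sketched outside Word. The Python below is an illustration only: the list of (style name, text) pairs stands in for a styled document, and "PersonName" is a hypothetical character style of the kind suggested above.

```python
# A styled document modelled as (style name, text) pairs.
runs = [
    ("SectionHeading", "Word styles and element types"),
    ("PersonName", "Yuri Rubinsky"),
    ("ItemText", "edited the handbook written by"),
    ("PersonName", "Charles F. Goldfarb"),
    ("SectionHeading", "Word templates and document types"),
    ("PersonName", "Yuri Rubinsky"),
]


def find_by_style(runs, style_name):
    """Return every piece of text carrying the named style, in document order."""
    return [text for style, text in runs if style == style_name]


def build_index(runs, style_name):
    """Return a sorted, de-duplicated index of text in the named style."""
    return sorted(set(find_by_style(runs, style_name)))


table_of_contents = find_by_style(runs, "SectionHeading")
name_index = build_index(runs, "PersonName")
```

Both results depend only on the style names, so a change to the appearance of headings or names would not affect them.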

Much more sophisticated processing could be performed by application-specific Word macros or by application programs reading a Rich Text Format (RTF) version of the document. RTF was designed as a format for transferring data between Microsoft applications, but its specification is made public and so it can be used by user-written application programs. Since the information in an RTF-encoded document includes the names of the styles used (and not just the formatting characteristics associated with those styles), structural data is available to an application.
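
As a minimal sketch of such a user-written program, the Python below pulls style names out of an RTF stylesheet. It assumes a heavily simplified RTF fragment in which paragraph styles are declared as {\sN ... Name;} and character styles as {\*\csN ... Name;}. The sample text is invented for the example, and a real RTF file contains much more (font tables, nested groups, further control words) that this pattern does not attempt to handle.

```python
import re

# Invented, simplified RTF sample. The style names are taken from the
# footnotes of this paper; everything else is an assumption.
SAMPLE_RTF = (
    r"{\rtf1\ansi{\stylesheet"
    r"{\s1\b\fs28 BodyHead;}"
    r"{\s2\fs20 ItemText;}"
    r"{\*\cs10\i Term;}"
    r"{\*\cs11\i Non-English;}}"
    r"\pard\s1\b\fs28 Word styles\par}"
)

# One stylesheet entry: optional \*, then \sN or \csN, then any further
# control words, then the style name terminated by a semicolon.
STYLE_ENTRY = re.compile(
    r"\{(?:\\\*)?\\(cs|s)(\d+)(?:\\[a-z]+-?\d*)* ?([^;{}\\]+);\}"
)


def style_names(rtf):
    """Map (kind, number) to style name; kind is 's' (paragraph) or 'cs' (character)."""
    return {
        (kind, int(number)): name
        for kind, number, name in STYLE_ENTRY.findall(rtf)
    }


names = style_names(SAMPLE_RTF)
```

With the names recovered, an application can translate the \s1 that precedes a paragraph in the body of the file back into "BodyHead" and so treat that paragraph as a heading element.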

Finally, as noted above, styles may provide the basis for an automated conversion to an SGML- or HTML-encoded form of the document, or indeed to some other proprietary encoded form.
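
A conversion of this kind might be sketched as a simple mapping from style names to HTML tags. The Python below is an illustration under assumptions: the style names and the choice of tags are hypothetical, and a real converter would also need to handle character styles, nesting and document-level markup.

```python
from html import escape

# Hypothetical mapping from paragraph style names to HTML element names.
STYLE_TO_TAG = {
    "Title": "h1",
    "SectionHeading": "h2",
    "ItemText": "p",
}


def to_html(paragraphs):
    """Convert (style name, text) pairs to HTML, one element per paragraph."""
    lines = []
    for style, text in paragraphs:
        tag = STYLE_TO_TAG.get(style, "p")  # unknown styles fall back to <p>
        lines.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(lines)
```

The conversion consults only the style name of each paragraph. A document formatted by hand, without styles, would give such a program nothing to work from.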

8. Limitations of Word techniques for structured authoring

8.1 Nesting restrictions

Firstly, Word allows only two levels of "nesting" of styles - a character style may exist within a paragraph style, but character styles themselves can not be nested. In practice, judicious naming of the styles can expand these possibilities so that nesting is at least logically implicit to the reader and to a processing application. For example applying a series of paragraph styles named "ItemHead", "ItemText" and "ItemConclusion" suggests that they are sub-components of an "Item" component, even if the latter does not exist as a Word style.
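
A processing application could recover this implicit nesting from the style names alone. The Python sketch below assumes one particular convention, which is not stated in the text: every style whose name begins with "Item" belongs to an "Item" component, and each "ItemHead" starts a new one.

```python
def nest(paragraphs):
    """Group consecutive Item* paragraphs into implicit "Item" components."""
    tree = []
    current = None  # the Item component being filled, if any
    for style, text in paragraphs:
        if style.startswith("Item"):
            # Assumed convention: an ItemHead always opens a new Item.
            if current is None or style == "ItemHead":
                current = ("Item", [])
                tree.append(current)
            current[1].append((style, text))
        else:
            current = None
            tree.append((style, text))
    return tree


# Hypothetical flat sequence of styled paragraphs, as Word would hold them.
flat = [
    ("BodyHead", "Findings"),
    ("ItemHead", "First item"),
    ("ItemText", "Details of the first item."),
    ("ItemHead", "Second item"),
    ("ItemConclusion", "The second item is resolved."),
    ("BodyHead", "Next steps"),
]
tree = nest(flat)
```

The result contains two "Item" components between the two "BodyHead" paragraphs, although no "Item" style exists in the document.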

8.2 Validation of structure

Perhaps more significantly, Word does not include the facilities to perform rigorous validation of a structured document against a predefined structural model for that document type.[11] So, there is nothing to prevent an author from placing a paragraph styled "Main Heading" and containing the text "The End" at the end of a document simply because they like the presentational results of doing so, even if it contravenes the structural model the template designer may have had in mind (and, in this case, good sense!).

Similarly, in the case of the four distinct uses of italicisation: the same appearance of the document could have been achieved by applying the Word style "Term" to technical terms, non-English expressions and titles of works in the bibliography, but the logical distinction would have been lost.
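
The kind of check which Word lacks can be sketched externally. The Python below is an illustration only: it uses the "memo" style names mentioned earlier and assumes, for the sake of the example, that a valid memo consists of one "Addressee", "Sender", "Date" and "Title" in that order, followed by one or more "Text" paragraphs. It also assumes style names contain no spaces.

```python
import re

# Assumed structural model for the "memo" document type, expressed as a
# pattern over the sequence of paragraph style names.
MEMO_MODEL = re.compile(r"Addressee Sender Date Title( Text)+")


def is_valid_memo(styles_in_order):
    """Check a sequence of paragraph style names against the memo model."""
    return MEMO_MODEL.fullmatch(" ".join(styles_in_order)) is not None


good = ["Addressee", "Sender", "Date", "Title", "Text", "Text"]
bad = ["Addressee", "Sender", "Date", "Text", "Title"]  # Title misplaced
```

Such a check catches elements that are missing or out of order. It cannot catch a style applied to the wrong kind of content, as in the "Term" example above, since that misuse leaves the sequence of style names perfectly valid.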

8.3 Unique identifiers

While the use of styles allows for the "generic identification" of elements as belonging to a particular element type (i.e. this is a "section"), there is no real provision for the unique identification of individual occurrences of an element type, except to include it within the content of the element itself. The only way of labelling a section as "Section 3.2", for example, is to include that information in the text of a section heading. This may be adequate for the human reader, but for processing it is often more useful to store such an identifier as an "attribute" of the element, a value "attached" to it, rather than part of its content.
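
The distinction between an identifier held in content and one held as an attribute can be sketched as follows. The Python below is an illustration under assumptions: the Element class and the lift_identifier function are invented for the example, and the pattern recognises only headings that begin with a dotted number such as "3.2".

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    element_type: str
    content: str
    identifier: Optional[str] = None  # an attribute, separate from content


def lift_identifier(element):
    """Move a leading section number out of the content into the attribute."""
    match = re.match(r"(\d+(?:\.\d+)*)\s+(.*)", element.content)
    if match is None:
        return element
    return Element(element.element_type, match.group(2), match.group(1))


# In Word the number can only live in the heading text itself.
as_styled = Element("SectionHeading", "3.2 Structure and form")
as_attribute = lift_identifier(as_styled)
```

Once the identifier is an attribute, a program can look up or renumber "3.2" without parsing or altering the heading text.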

9. Conclusions

The use of Word for structured document creation does place new responsibilities on the document creator. The introduction of these techniques as part of an information management system must be supported by adequate education and documentation on the part of the system designer, not only in the use of the Word features, but in the principles behind applying them to the structuring of documents in general, and to the document types with which the author is concerned in particular. The distinction between logical components and their physical representation, between structure and form, is perhaps not one which the users of "what-you-see-is-what-you-get"-oriented authoring tools are accustomed to having to make, but it is critical to the successful implementation of these techniques.

I would argue, however, that this idea is not so much new or difficult, as one with which authors are unaccustomed to dealing consciously. Even in an approach to document creation which emphasises presentation rather than structure, presentational techniques are not applied arbitrarily to sections of text. The author, perhaps without recognising that they are doing so, decides first what an element is, and then on that basis decides what it should look like. In effect, the structured approach to document creation asks the author to make only the first of those decisions! Once a template has been created, the author selects the element type and applies the correspondingly named Word style, and as a consequence formatting is applied to that component automatically (and consistently across every occurrence of that element type).

The benefits derived from adopting a structured approach to document creation in Word, however, are not limited to ease and consistency of formatting - though these alone are considerable. The components of the document identified by structural information are now available for processing. Indeed, that structural information is a prerequisite for any such manipulation of a document's information content. That processing may be as simple as applying a new set of formatting characteristics to a selected element type, or it may encompass a complex series of actions to be performed on component parts of a document or a collection of documents, within the Word environment or outside it.

Such processing may be part of a formal document management system where documents are shared amongst a large user group and translated between a variety of delivery formats, but the application of exactly the same principles brings benefits to the document author and user working with perhaps a small set of documents in a single proprietary format, perhaps in relative isolation from other document users.

Bibliography

DeRose, Steven J., "Structured Information: Navigation, Access, Control", in Proceedings of the Berkeley Finding Aid Project Conference. April 4-6, 1995. (Berkeley Finding Aid Project, 1995). Available at http://sunsite.berkeley.edu/FindingAids/EAD/derose.html

Goldfarb, Charles F., The SGML Handbook. Edited and with a foreword by Yuri Rubinsky. (Oxford University Press, 1990).

Johnston, Pete, Document Structure (University of Glasgow Effective Records Management Project Discussion Paper, May 1998). Available at http://www.gla.ac.uk/Staff/Committees/Inf/ERM/Docs/docstr.htm

Johnston, Pete, Structure and Markup (University of Glasgow Effective Records Management Project Discussion Paper, May 1998). Available at http://www.gla.ac.uk/Staff/Committees/Inf/ERM/Docs/strmar.htm

Kennedy, Dianne, "Tales from the Front: Understanding Structured Documents", <TAG> : the SGML Newsletter Vol 9 No 2 (Feb 1996). Available at http://tag.sgml.com/09020601.htm


[1] For a more thorough discussion of document structure, see Johnston, Pete, Document Structure (University of Glasgow Effective Records Management Project Discussion Paper, May 1998).

[2] This is the definition of document employed within the SGML standard. See Goldfarb, Charles F., The SGML Handbook. (Oxford University Press, 1990), p 263.

[3] See DeRose, Steven J., "Structured Information: Navigation, Access, Control", in Proceedings of the Berkeley Finding Aid Project Conference. April 4-6, 1995. (Berkeley Finding Aid Project, 1995).

[4] For more discussion of the benefits of structuring documents, see Johnston, Pete, Document Structure (University of Glasgow Effective Records Management Project Discussion Paper, May 1998).

[5] For more discussion, see the section on the Standard Generalized Markup Language in Johnston, Pete, Structure and Markup (University of Glasgow Effective Records Management Project Discussion Paper, May 1998).

[6] Word styles provide the basis for element identification used by several commercially available tools for conversion from Word documents to SGML-encoded forms, including INSO's DynaTag (part of their DynaText SGML delivery package).

[7] For a distinction between the benefits of structured document creation and the creation of documents in SGML-encoded forms, see Kennedy, Dianne, "Tales from the Front: Understanding Structured Documents", <TAG> : the SGML Newsletter Vol 9 No 2 (Feb 1996).

[8] This discussion is based on features available within Microsoft Word 6.0 and Microsoft Word for Windows95, and should be read in association with the appropriate user guide or online help information. These features are also available in Word97 and I would predict that Microsoft will enhance the functionality for manipulating components within documents in future releases of Word.

[9] If you download the RTF version of this document and view the Styles applied to different parts of the text, you will see that these are the actual Word Character Styles used.

[10] This document was created using a template for a "report" document type, which provides for element types such as "BodyHead", "BiblEnt" (bibliography entry), "Authors", "Publisher", as well as various levels of "ItemHead", "ItemText" etc.

[11] There is an SGML Authoring Add-On available for Microsoft Word, which provides precisely this capability.



Prepared by: Pete Johnston (Effective Records Management Project)

Last modified on: Friday 5 June 1998

Version: 1.0