TEI Header XML tags

The TEI Header provides the critical metadata needed to identify and cite a file. It is composed of five major components. The fileDesc tag provides all the information needed to write a conventional bibliographic citation. In many ways it acts as a bridge from the old, analog, paper world to the digital one. The remaining four tags differ from fileDesc in how they relate to the digital world: they grow out of features that the transition to XML has enabled. A great example is revisionDesc, which records a history of edits to the digital file. This is useful while a file is being transcribed and edited, and even more so for an ongoing project that needs constant iteration. In many ways it is reminiscent of a cut-down version of a GitHub commit history in computer programming. The encodingDesc tag provides background on the editorial decisions made during transcription. It answers questions such as how ambiguities were resolved and whether, and how, the text was normalized. The profileDesc tag describes the non-bibliographic side of the text: its subject matter, the individuals who produced it, and the context in which it was produced. The final tag is xenoData, which lets the transcriber incorporate metadata in formats other than TEI.
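To make the five components concrete, here is a minimal sketch in Python that parses a skeleton header and lists its top-level tags. The skeleton is an illustrative placeholder I made up, not a real project's header, and it has not been validated against the TEI schema. The only things taken from TEI itself are the element names and the namespace http://www.tei-c.org/ns/1.0.

```python
import xml.etree.ElementTree as ET

# Illustrative skeleton only: every value below is a placeholder (assumption),
# and the fragment has not been validated against the TEI schema.
SKELETON = """\
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <titleStmt><title>Placeholder title</title></titleStmt>
    <publicationStmt><p>Placeholder publication details</p></publicationStmt>
    <sourceDesc><p>Placeholder source description</p></sourceDesc>
  </fileDesc>
  <encodingDesc><p>Placeholder editorial decisions</p></encodingDesc>
  <profileDesc>
    <langUsage><language ident="en">English</language></langUsage>
  </profileDesc>
  <xenoData>Placeholder non-TEI metadata</xenoData>
  <revisionDesc>
    <change when="2000-01-01">Placeholder revision note</change>
  </revisionDesc>
</teiHeader>
"""


def component_names(header_xml):
    """Return the local names of the header's direct children, in order."""
    root = ET.fromstring(header_xml)
    # ElementTree reports tags as "{namespace}local"; keep only the local part.
    return [child.tag.split("}", 1)[-1] for child in root]


print(component_names(SKELETON))
# → ['fileDesc', 'encodingDesc', 'profileDesc', 'xenoData', 'revisionDesc']
```

Only fileDesc is required in a real header; the other four are optional.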

The tags included in the William’s Poem header tell us important information, such as who transcribed the poem, where, and under what license. The header even provides an email address so that the document's creator can be easily contacted. It also records the source (a Google Books scan), the original publisher, and a note that the document was partially proofread for accuracy.

The Header provides an easily accessible format for extracting important identifying information about the document. Separating the Header from the body would remove key information, such as what this document actually is. More interestingly, a common criticism of digital texts is that they are relatively generic and devoid of context and history. Headers remedy this by giving the user the context and history of a text's production, adding a facet that is not easily implemented in physical books. I can also imagine that the Header’s ‘strictly typed’ format makes it much easier for data science work to access and manipulate key information, expanding the number of possible use cases and ways to ‘read’ a text.
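As a rough illustration of that last point, the sketch below pulls a few fields out of a header with Python's standard xml.etree.ElementTree module. The sample header, its values, and the summarize function are all hypothetical stand-ins invented for this example, not the header of any real transcription. The element names (titleStmt, respStmt, licence, sourceDesc, change) are standard TEI.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map used in the path expressions below.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample header: all names and values are invented placeholders.
SAMPLE = """\
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <titleStmt>
      <title>An Example Poem</title>
      <respStmt>
        <resp>Transcribed by</resp>
        <name>A. Transcriber</name>
      </respStmt>
    </titleStmt>
    <publicationStmt>
      <availability>
        <licence target="https://example.org/licence">Example open licence</licence>
      </availability>
    </publicationStmt>
    <sourceDesc>
      <p>Scanned copy of a printed edition</p>
    </sourceDesc>
  </fileDesc>
  <revisionDesc>
    <change when="2000-01-01">Partially proofread</change>
  </revisionDesc>
</teiHeader>
"""


def summarize(header_xml):
    """Collect a few citation-style fields from a TEI header string."""
    root = ET.fromstring(header_xml)

    def text(path):
        # Return the trimmed text of the first match, or None if absent.
        node = root.find(path, NS)
        return None if node is None else "".join(node.itertext()).strip()

    return {
        "title": text("tei:fileDesc/tei:titleStmt/tei:title"),
        "transcriber": text(".//tei:respStmt/tei:name"),
        "licence": text(".//tei:licence"),
        "source": text("tei:fileDesc/tei:sourceDesc/tei:p"),
        # Each revision becomes a (date, note) pair.
        "changes": [
            (change.get("when"), (change.text or "").strip())
            for change in root.iterfind("tei:revisionDesc/tei:change", NS)
        ],
    }


for field, value in summarize(SAMPLE).items():
    print(field, "=", value)
```

Because every header uses the same element names, the same few lines could in principle be run over a whole folder of transcriptions to build an index, which is where the upfront typing starts to pay off.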

The question of why the Header needs to be so lengthy presupposes that the Header is long. It really is not. While typing a few more lines requires some additional upfront work, the benefit of making this information easily accessible and indexable is reaped essentially in perpetuity. It comes down to deciding whether the fixed cost is worth paying to get essentially zero marginal cost.

A robust header format has clear advantages. Now that I understand it, I am curious about which projects it has made possible that were not possible before. What has it enabled, and what can I do with it?

More information on this topic can be found here: http://www.tei-c.org/index.xml
