Advantages of Shared Vocabularies

And then my friend was told to use a shared (common, public, standard) vocabulary.

XML grew out of SGML. SGML grew out of an effort to make a universal tag set for typesetting. (Yes, that is an oversimplification, but it is based on truth.) We couldn’t make a universal tag set because we figured out that this would be making a list of everything that ever mattered to anyone, a task that could not be finished. So instead a syntax for grow-your-own-markup-vocabulary was designed. Shortly after that, people started getting together in groups to make markup vocabularies for specific communities or uses. Some of those have been widely adopted (as recommendations and specifications, and even standards) and are at the center of communities of users. Some of the big winners in this game: DITA, HTML, NIEM, JATS, TEI, UBL, HL7, DocBook, NISO STS, BITS,…

There are also thousands of other vocabularies that are not as widely known, discussed, or adopted. Are they failures? Some are. I have written a few vocabularies that were intended to be widely adopted that have no users and never did. Those are failures. But there are others that are at the heart of a dozen or so businesses or activities. Are they failures? Absolutely not. They are enabling their users to create, validate, interchange, and use their content. We don’t need mega-success to have success. You can roll your own successfully. So what do shared vocabularies buy us?

The enthusiasm for some shared markup vocabularies is both charming and (sometimes) absurd. Useful and destructive. The impulse of people with similar concerns and needs to get together to create shared markup vocabularies and shared infrastructure to create, manage, and use documents tagged according to this shared vocabulary is admirable, and is the reason there are so many useful markup applications in use. If every user needed to define their own tag set, document it so it was used consistently among various document creators and over time, and write all of the tools needed to create, validate, edit, display, search, and archive their documents, few would make the investment and fewer would stick with it after the first or second bump in the road. Further, it would be virtually impossible to interchange marked-up documents; each organization that wanted to receive documents would need to develop a custom transformation for each document type they wanted to ingest into their system.

While a world in which each user defined their own optimal tags would be within the definitions of the XML specification, and some people reading the specification might assume that is what was intended, that world would be very inefficient, tag-set heavy, and isolated. Fortunately for all of us (except perhaps those who revel in vocabulary creation and coding transformations), that is not the world we live in.

Markup vocabularies also have traction in communities, partly because using a known vocabulary means that knowledgeable help will be available. If you are a scholar doing literary analysis you probably want to spend your time and energy on the analysis and recording of your documents, not on your infrastructure. You would be well advised to use TEI markup (Text Encoding Initiative academic tag set) partly because it was designed to do what you are doing, but maybe especially because tools, discussion lists, tutorials, and friendly advice are easily available if you do. And because your colleagues, who will review your funding proposals, journal articles, and your promotion and tenure applications, expect you to.