Metadata and Communication

Writing a blog post ‘about metadata’ is like calling a paramecium ‘a hungry slipper with fringe’ - you’re bound to miss out on a lot of detail. Metadata is the thing I got my library degree for, mostly because I think metadata is fascinating, but also because I loathed the reference desk and wanted regular hours. However, as my career went on (and on), I found the problems around metadata multiplying even as the supposed ease of electronic discovery became more universal.

Suddenly, wrong-headed delimiters and rage fits over subject headings were not our biggest problems.

So, for this first post, I’m going to do a short skim over the main issues I think we are facing with regard to discovery and metadata, and maybe write more detailed posts for each issue as Metadata 2020 begins to sail along in earnest.

Metadata is communication. It describes and identifies an item, it defines relationships, it sets parameters for the range of actions something can engage in. It probably crochets copies of that ever-growing Linked Open Data graphic in its spare time and stores them in defunct Yahoo chatrooms, who knows. As the things we read, write, create, use, move around, and engage with become more digital, metadata becomes more important, because without metadata, all that discovery, movement, and interaction could not happen.
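To make “describes, identifies, defines relationships” concrete, here’s a minimal sketch of a descriptive record, loosely modeled on Dublin Core. The field names and values are my own illustration, not drawn from any real catalog.

```python
# A minimal, hypothetical descriptive record, loosely modeled on Dublin
# Core. Field names and values are illustrative, not from a real catalog.
record = {
    "title": "Frankenstein; or, The Modern Prometheus",         # describes
    "creator": "Shelley, Mary Wollstonecraft",                  # identifies who
    "identifier": "local:frankenstein-1818",                    # identifies what
    "subject": ["Monsters -- Fiction", "Scientists -- Fiction"],  # enables discovery
    "relation": "IsVersionOf: the 1818 first edition",          # defines a relationship
}

# Without statements like these, machines cannot find, move, or link the item.
for element, value in record.items():
    print(f"{element}: {value}")
```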

As a metadata librarian who is now a data curator in a non-library setting, enabling discoverability has always been the focus of my work, and yet, discoverability has become a fraught set of ideas that goes far beyond my 1998 job of wrangling MARC 007 fields and applying LCSH or MeSH terms to a record. In what I’m told is an increasingly interdisciplinary world (is that really happening, or is it wishful thinking?), and an academic world where the words “open access” can lead to unseemly and highly amusing verbal brawls, metadata is being asked to become multilingual, multifunctional, and to sustain digital kinship ties that would make anthropologists descend into fits of foulmouthed weeping.

Here’s the problem. Metadata is used for humans to communicate with humans, and for computers to communicate with computers, and for humans to communicate with computers, and vice versa.

Complexities of Language and Controlled Vocabularies

And then there are the semantics. The dirty, dirty semantics that mess us all up.

We humans are nuanced and weird, and computers are straightforward and dumb, and although AI is getting better, it’s still not up to the task of easily identifying complex meanings because we, as humans who build AI, are also pretty terrible at consistently identifying complex meanings. See: Romantic Comedies. In addition, language is continually changing, and our most widely used library and disciplinary controlled vocabularies are slow to catch up. If anybody doesn’t believe that, they should look up some of the old LCSH or MeSH cataloging service bulletins announcing heading changes, because they can be hilarious. Or not. Take, for example, the MeSH vocabulary term “Abnormalities, Severe Teratoid”.

This is the note on changes: “2010; see MONSTERS 1963-2009”

People with severe abnormalities were tagged as ‘monsters’ until 2010. Mary Shelley and Professor X would not approve.
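Keeping older records current against a moving vocabulary is, mechanically, a find-and-replace problem at scale. Here’s a minimal sketch of that remapping step; the MONSTERS-to-“Abnormalities, Severe Teratoid” change is the real MeSH example above, but the lookup table and record format are my own invention.

```python
# Remap deprecated controlled-vocabulary headings to their current terms.
# The MONSTERS -> "Abnormalities, Severe Teratoid" change (2010) is a real
# MeSH example; this mapping table and record format are illustrative only.
DEPRECATED_TO_CURRENT = {
    "Monsters": "Abnormalities, Severe Teratoid",
}

def update_headings(headings):
    """Replace any deprecated heading with its current vocabulary term."""
    return [DEPRECATED_TO_CURRENT.get(h, h) for h in headings]

legacy_record = ["Monsters", "Teratology"]
print(update_headings(legacy_record))
# ['Abnormalities, Severe Teratoid', 'Teratology']
```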

And What About AI?

The promise of AI and automatically generated and applied metadata is there, but there are some issues. Currently, there are tools that use batch comparison, contextual language analysis, pattern recognition, and machine learning to automate metadata creation and assignment, but they operate at a fairly simple level, drawing from a small, prescribed set of origin material, and they often need access to hierarchical ontologies that are well structured according to the OWL standard.
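To show just how simple “fairly simple” can be, here’s a toy sketch of the matching step such tools perform: scan an item’s text for terms from a controlled vocabulary and propose the frequent ones as subjects. Real systems layer machine learning and OWL ontologies on top; the vocabulary, threshold, and sample text here are entirely my own invention.

```python
# Toy automated subject assignment: propose controlled-vocabulary terms
# that appear often in an item's text. Real tools add machine learning,
# synonym handling, and OWL ontologies; this vocabulary, threshold, and
# sample abstract are purely illustrative.
CONTROLLED_VOCAB = {"teratology", "genetics", "embryology", "mutation"}

def suggest_subjects(text, vocab=CONTROLLED_VOCAB, min_hits=2):
    words = text.lower().split()
    # Only suggest terms that occur often enough to look intentional.
    return sorted(t for t in vocab if words.count(t) >= min_hits)

abstract = ("Advances in teratology and genetics show that teratology "
            "research benefits from modern genetics and new genetics databases")
print(suggest_subjects(abstract))  # ['genetics', 'teratology']
```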

The results will only be as good as the metadata that the AI can harvest, and much of that metadata is missing, wrong, or inadequate. Who has the resources to clean up the digity-universe? There has also been research documenting significant racial and gender biases in AI programming, biases that shape how AI perceives digital objects and thus how it describes them.

To be useful, metadata must be open source and open access. If it is not, it cannot do its job (communicate!), and when automated metadata comes into play, it leads to less-than-fully-described objects. Metadata also needs to have some level of documentation for humans to look at for assessment and application purposes, and we all know how enthusiastically everyone dives into documentation work.

The MARC Problem

The problem with MARC: it’s a library-ubiquitous, siloed, terribly human standard written to communicate with 1960s computers. If you are dealing with library metadata in any way, you must confront MARC.

The awesomeness of MARC: we have MarcEdit.

The single most used tool for dealing with MARC is a free, constantly developing program called MarcEdit, created and maintained by Terry Reese of Ohio State University. Over the years, tens of thousands of catalogers all over the world have downloaded it, and many of us, myself included, use it every day. Here is an incomplete list of what MarcEdit can do:

Make MARC. Break MARC. Collate MARC. Prepare batch loads. Perform regular-expression and global edits. Support UTF-8. Translate from other schemas. Generate reports and call numbers. Harvest things. Validate MARC data.
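MarcEdit itself is a desktop application rather than a code library, but for a taste of what “making” and “breaking” MARC looks like programmatically, here’s a small sketch using pymarc, a separate open-source Python library for reading and writing MARC records. The file names are placeholders.

```python
# Read a file of binary MARC records, peek at each title, and write the
# records back out. Uses pymarc (pip install pymarc); file names are
# placeholders, not real data.
from pymarc import MARCReader

with open("records.mrc", "rb") as infile, open("out.mrc", "wb") as outfile:
    for record in MARCReader(infile):
        title_field = record["245"]        # MARC 245: title statement
        if title_field is not None:
            print(title_field["a"])        # subfield $a: the title proper
        outfile.write(record.as_marc())    # re-serialize: "make MARC"
```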

To summarize, one creative, generous guy created this tool in 1999 because he needed to do things with MARC, and now it is what everyone uses because he continues to update it to incorporate developments surrounding the MARC standard (RDA, XML, etc.).

This is how far behind we are in dealing with our siloed, outdated MARC metadata. We’ve spent endless amounts of time and money trying to convert it into Linked Open Data or XML, but still we go back to our MARC and MarcEdit, because in workplaces where cataloging departments are disappearing, finding the time and money to do anything else is all but impossible. Our catalogs are inextricably tied into proprietary ILSes, where the cataloging metadata also drives every other function of the library - acquisitions, circulation, accounting, claims. The workflows are so intertwined that incorporating new metadata, or new ways to acquire it, is daunting.
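To be fair, the serialization half of that conversion is the mechanically easy part. pymarc, for instance, can turn a record into MARCXML in a line or two, as in the sketch below (file name again a placeholder); it’s everything after serialization - the modeling, the linking, the workflow surgery - that eats the time and money.

```python
# Convert binary MARC records to MARCXML with pymarc. This is the easy
# part of the road toward Linked Open Data; the file name is a placeholder.
from pymarc import MARCReader, record_to_xml

with open("records.mrc", "rb") as infile:
    for record in MARCReader(infile):
        print(record_to_xml(record).decode("utf-8"))
```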

Moving outside of libraries and their catalogs, we’ve got the ONIX publishing protocol, the Library of Congress subject headings and authority files, discipline-specific ontologies, VIVO, VIAF, MeSH, schema.org, Wikidata, and all the ISO standards, to name a few. How do we manage all of this on top of the MARC problem to make discovery and identity possible?
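One way to think about managing it all is as a crosswalk problem: the same descriptive assertion lives under a different name in each standard. Here’s a toy sketch; the schema.org property names and MARC field tags are real, but the pairings are simplified and the sample record is invented.

```python
# A toy crosswalk: the same assertions, renamed across standards.
# schema.org property names and MARC tags are real; the pairings are
# simplified and the sample record is invented for illustration.
CROSSWALK = {
    "245$a": "name",    # MARC title proper -> schema.org name
    "100$a": "author",  # MARC main entry (personal name) -> schema.org author
    "020$a": "isbn",    # MARC ISBN -> schema.org isbn
}

marc_like_record = {
    "245$a": "Frankenstein",
    "100$a": "Shelley, Mary Wollstonecraft",
    "020$a": "0000000000",  # placeholder identifier
}

schemaorg_record = {CROSSWALK[tag]: value for tag, value in marc_like_record.items()}
print(schemaorg_record)
# {'name': 'Frankenstein', 'author': 'Shelley, Mary Wollstonecraft', 'isbn': '0000000000'}
```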

And I didn’t even mention the schemas and standards surrounding the A-word (archives).

In my wilder moments I dream of a DMPTool, but for metadata. Pick your schemas, pick your standards, load a file and go. Is that even possible? I have no idea. But I would like one.
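Since the tool is imaginary, its configuration can be too. Here’s what “pick your schemas, pick your standards, load a file and go” might look like; every key, value, and capability below is hypothetical, because no such tool exists.

```python
# Entirely hypothetical configuration for an imaginary "DMPTool for
# metadata". No such tool exists; every key and value here is invented.
dream_tool_config = {
    "input_file": "records.mrc",                      # load a file...
    "source_schema": "MARC21",                        # pick your schemas...
    "target_schemas": ["Dublin Core", "schema.org"],
    "vocabularies": ["LCSH", "MeSH"],                 # pick your standards...
    "identifier_services": ["VIAF", "Wikidata"],
    "output_format": "json-ld",                       # ...and go
}
```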

Until next time, keep your metadata open and your standards high!