JSTOR's Metadata Story

I’ve worked in the content side of scholarly publishing for 25 years, and I cannot count the number of times I’ve been asked or overheard the question, “what is metadata exactly?” I always have a pretty visceral negative reaction to the description of metadata as “data about data.” Not because the definition is wrong (it isn’t), but because it is just not helpful. You’re always left with the follow-up thought, “OK, so what does that mean?”

At its best and most effective, metadata is a description of content and an expression of the goals that someone has for how that content can be used. It can be a lot more than that, but I submit that as a definition reflective of its most important use. Here’s an example:

<issue-article-meta>
    <article-id pub-id-type="doi">10.2307/40666632</article-id>
    <counts>
        <fig-count count="0"/>
        <page-count count="5"/>
        <image-count count="0"/>
    </counts>
</issue-article-meta>

As this example shows, metadata is like an instruction manual for users of the content, many of whom you have not met and probably never will.
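To make the "instruction manual" idea concrete, here is a minimal sketch of how a downstream consumer might read the record above using Python's standard library. The function name `summarize` and the shape of the returned dictionary are assumptions made for illustration; they do not describe how JSTOR or any particular consumer actually processes this metadata.

```python
import xml.etree.ElementTree as ET

# The record from the example above, held as a string.
RECORD = """
<issue-article-meta>
    <article-id pub-id-type="doi">10.2307/40666632</article-id>
    <counts>
        <fig-count count="0"/>
        <page-count count="5"/>
        <image-count count="0"/>
    </counts>
</issue-article-meta>
"""


def summarize(record_xml: str) -> dict:
    """Collect the DOI and the count fields into one dictionary.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    root = ET.fromstring(record_xml.strip())

    # Pick the article-id element whose pub-id-type attribute is "doi".
    doi = next(
        el.text
        for el in root.iter("article-id")
        if el.get("pub-id-type") == "doi"
    )

    # Each child of <counts> carries its value in a "count" attribute.
    counts = {child.tag: int(child.get("count")) for child in root.find("counts")}

    return {"doi": doi, **counts}


summary = summarize(RECORD)
print(summary)
```

A consumer who has never met the content creator can still learn from these few fields that the article is five pages long, has no figures or images, and can be cited by its DOI.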

And therein lies a pretty big challenge. I would guess that most content creators can do a pretty good job describing what is important about their content and its intended use from their viewpoint. But what happens when you combine your content with someone else’s? Or how about when someone is consuming your content with a different perspective? We can mess up pretty badly when we assume that everyone is going to do things exactly as we would like them to, because nearly everyone is different. The aims of the content creator can vary quite significantly from those of their publisher, a platform provider, libraries, end users, or just about anyone else in the workflow.

So the trick with producing good metadata is doing your best to describe your content and all the things you think can be done with it, but not going so far as to limit the cool things that others might want to do with your content. Easy, right?

For JSTOR, a growing digital resource of journals, books, images, and other content, we try to strike the right balance, but we also recognize that our metadata needs to evolve over time.

JSTOR was founded in 1997 as an online archive of journals, and its metadata strategy was very reflective of its reason for existence. The value of JSTOR was providing libraries with a complete digital archive of journals, so the metadata needed to prove that we had every page of every issue of every title, back to the beginning of that journal’s publication. This clarity enabled libraries to confidently deaccession titles, meaning they could remove the print copy from their stacks and trust that it would be available digitally from JSTOR.

To say that this goal drove JSTOR’s metadata strategy might be an understatement. Our metadata was a way of expressing the value of what we were doing, and that value was the trust libraries placed in us to protect that content and keep it accessible, then and into the future.

The realization of that metadata strategy meant that we tagged all the descriptive metadata around articles – volume, issue, title, etc. – with great care and precision. Even when it wasn’t clear from the printed page, we employed Metadata Librarians to determine the best possible way to express this descriptive metadata, and we received appreciation from our library participants for those efforts. From the beginning and to this day, we have approximately 50 metadata fields describing each journal article or book chapter, with additional metadata fields at the title and journal levels.

This aspect of our metadata and our mission is still very important to us, but two decades later, if you don’t mind me saying, we’ve gotten pretty adept at expressing the descriptive value in our metadata beyond this basic set. We like to joke about our content management team “making the donuts.” But we also recognize that those are some pretty important donuts.

How have the world and our metadata evolved? In addition to our early metadata goals, we have added new ones, mostly because of the incredible growth of content on the web (and of best practices for describing it) and the tremendous growth of the JSTOR archive (now more than 10 million journal articles and 60,000 books). The challenge is that most people do not know exactly what they are looking for, and with this much content it is harder for a full-text search engine to find that “just right” article or book chapter to answer a question, back up a research statement, or do any of the other things our users like to do.

So we’ve had to re-think our metadata a bit. Descriptive metadata (in other words, the facts about a piece of content like who the author is and the title) does some of the heavy lifting, but we determined that we needed to add some semantic metadata, or metadata that described what the documents were about, not just what they were. To address this, we built what we call the JSTOR Thesaurus that currently has around 50,000 concepts covering the broad subject areas included in JSTOR, which we have effectively associated with 5 million journal articles, book chapters, and research reports.

The beauty of that approach is that we are able to “normalize” concepts across all of the journals and books (if one journal uses the term “road” and another one uses the term “street,” users should see those results in a common search). It can get more sophisticated than that, and as we have launched specific content collections (around Sustainability and Security Studies, for instance), we have built out the thesaurus in these specific areas, creating richer semantic links to help users navigate this content.
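The normalization idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the tiny `THESAURUS` mapping, the preferred label "Roads", the sample documents, and the helper names are hypothetical and are not taken from the JSTOR Thesaurus or its implementation. The sketch only shows the general technique of mapping different surface terms to one shared concept so that a search on either term returns the same results.

```python
from collections import defaultdict

# Hypothetical thesaurus fragment: surface term -> preferred concept label.
# Illustrative only; not actual JSTOR Thesaurus content.
THESAURUS = {
    "road": "Roads",
    "roads": "Roads",
    "street": "Roads",
    "streets": "Roads",
}

# Hypothetical documents standing in for articles from different journals.
DOCUMENTS = {
    "article-a": "Maintenance costs for each rural road",
    "article-b": "Street design in medieval towns",
    "article-c": "River trade in the nineteenth century",
}


def build_concept_index(documents, thesaurus):
    """Map each concept to the set of documents that mention any of its terms."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            concept = thesaurus.get(word.strip(".,;:"))
            if concept:
                index[concept].add(doc_id)
    return index


def search(term, index, thesaurus):
    """Normalize the query term to its concept, then look up that concept."""
    concept = thesaurus.get(term.lower())
    if concept is None:
        return []
    return sorted(index.get(concept, set()))


INDEX = build_concept_index(DOCUMENTS, THESAURUS)
print(search("street", INDEX, THESAURUS))
print(search("road", INDEX, THESAURUS))
```

Both queries return the same two articles, because "road" and "street" resolve to the same concept before the lookup happens. A production thesaurus would also need to handle multi-word terms, hierarchy, and ambiguity, which this sketch leaves out.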

We’re now expressing through our metadata the additional value JSTOR has delivered to libraries and their users: content discovery and use.

Developing semantic metadata has been a worthwhile next step in our evolution, but we know we are not done, because good metadata, like good content, is constantly evolving. Whether it is richer descriptive metadata from our publisher partners (such as DOIs and ORCID iDs) or user-generated metadata, which is growing in popularity, we are always on the lookout for things that will connect users with what we think is the great content within our archive.

It is also a big reason we are involved with the Metadata2020 collaboration. Good metadata is not something one organization can do on its own. Helping others, learning from others, and talking to everyone up and down the workflow of scholarly content will result in better, richer metadata for everyone involved, which will translate into better user experiences based on that metadata.

So the next time someone asks you what metadata is, please don’t just say it’s data about data.

About the author

Jabin White is Vice President of Content Management for ITHAKA. With a heavy background in XML theory and practice, Jabin has spent most of his career evangelizing the benefits of markup languages and related technologies, including content management, workflow enhancements, and authoring tools.