Week 39: Have you learned to code?

a decidedly generous way of describing it

Dec 02, 2021

Last time I left off by offering that maybe we shouldn’t attempt universality with the vocabularies we create to organize information. Universality has gotten us into some tight corners and it’s not that useful. For example, tonight a friend told me that “everyone watches Seinfeld.” As long-time readers know, I have never watched Seinfeld, prompting me to retort, “who is your everyone?” Ask this question when you encounter sweeping statements and systems, whether they organize information, or allowances, or rules, or policies, or rights, or bodies. Who is your everyone?

Specificity is the alternative to universality, but it requires a bigger investment of resources. To prove that I’m not talking out of my ass and I do understand that building material-specific metadata is a big ask, I am now going to tell you about my own attempts to compose (or, cobble together) a controlled vocabulary for my Digital Libraries final project, which I submitted last night, a whole eight days early (from my two-and-a-half week extension).

The premise of the final project for Digital Libraries was to create a … digital library. There are a lot of technical elements to what actually comprises a digital library (not just image scans, but image scans that have been processed and organized in a specific way). For our purposes, we are concerned with how I’ve developed a controlled vocabulary to describe the materials in my project, creatively titled Another Pandemic Diary: Entries Across the Atlantic, January 2020-March 2021. The core material of the library (a collection) is scanned images of entries from my diary that I wrote during the first fourteen months of the Covid-19 pandemic, accented with photos I took during the same period.

From Bridget Jones's Diary. A shot of Bridget writing in her diary: "Diary of Bridget Jones, Spinster & Loner." — I rewatched *Bridget Jones’s Diary* last week and I don’t think it holds up as a Pride & Prejudice adaptation. The tendrils are only in the betrayal backstory between Colin-Firth-as-Darcy-yet-again/Hugh-Grant’s-character-whose-name-escapes-me. The rest is just…a different story, which is fine. Also, now seems to be as good a time as any to remind everyone that for a not insignificant portion of my life, I thought Hugh Grant *was* David Cameron, likely partially due to the 2003 tragedy, *Love Actually*. Image: giphy.

You might be thinking, “Shira, that seems like relatively low effort, whereas a lot of your classmates worked with materials that have been processed and are actually in libraries. And you just had your dad mail you your own journal?”[1] Yeah, I didn’t have a lot of time this term to do my typical skulking around library holdings and acquaint myself with a new collection (which is unfortunate because just this week I worked with a collection that would’ve made a grand project. Someone remind me to write about Thielman in a few weeks), so I went with what I knew.

And that’s the first step of developing a controlled vocabulary: knowing what you’re working with. Of course, as we discussed in the previous issues, the less time-consuming but ultimately more detrimental approach is to just set categories and then try to fit squares into circles, but that’s not the purpose of this project. The premise was to create digital artifacts that are described with rich, item-specific metadata within a functional web platform.

To develop that vocabulary, I reread my diary four or five times with a critical, thematic eye, which was more illuminating than I would have imagined. It was fascinating to watch cycles of focus begin and end in ways I couldn’t see in the moment: I returned home from London in March 2020 and spent four glorious months with our beloved family cat, Pepper, before she died peacefully at 17 years of age. In May, I start to note her ailing health, and after a period of mourning in July, my attention moves to my own health, which was failing for the third time in a single year due to chronic Lyme (I spoke to my doctor on Tuesday and according to him, I’m doing “fantastically”). As I receive treatment and the worst effects dissipate, the entries recount my excitement to return to London, this time to work in a bookshop. And so on and so forth.

Sometimes, I deemphasize what likely should have been headlining events. Take December 24, 2020, for example:

Stormy tonight, the night before Christmas. Probably should’ve read a Christmas Carol—another time. Mom gave me books & feminist tea towels for our first set of presents.
Dad got hit by a car this morning—he’s okay, with a nasty scar on his forehead, but he’s eating & walking & talking…

There’s something morbidly funny about burying the lead there. For the record, I texted my dad to check that he was okay with me mentioning this, and he responded, “Sure, honored.”

I think this is from A Christmas Carol? Two fancy British men walking into a living room and saying "Merry Christmas to You!" — Basically the police officer when he knocked on our door to let us know my dad was in the hospital. Again, he is doing great. His scar looked like the New York Yankees logo for a while, which was ideal. At Thanksgiving, my cousin asked, “Who punched your dad?” and I was like, “Oh, that’s from the car accident,” and everyone nodded and my cousin was like, “What?? car?? accident??” IDK man! Not everything makes the family email thread. Image: giphy.

So the diary entries cover a wide swath of topics! And while I can easily systematize the date, time, and format of entries, as well as what language they are in and where they were written, the content scope becomes more difficult to qualify. I could not conduct optical character recognition on these entries, which would generate a searchable plain-text transcription of the text that I could merge with the image, because the OCR technology available to me cannot handle my handwriting. I was also unable to produce full-text transcriptions for each entry because I uploaded 98 entries to the site and transcribing each one of them wasn’t going to teach me anything. So I settled on synthesizing the content into keywords to make them searchable in my final website.

What does this mean? As I compiled metadata for each of my entries, I inputted subject headings and a free-text account of the item. The free-text account is a few sentences of explanation of the content and significance of the entry. The summary of subject headings is where it gets harder. Do I pull a point from each sentence? Each paragraph? The general gist? Specific figures? How do I format names? How will I keep track of all of the terms I develop? Moreover, how will users of this site want to search for entries? What entities will they type into the search bar?

After drafting a few lists of subject terms, I ultimately structured my vocabulary by dividing subjects into the following list:

locations and environment (e.g., geographic, architectural, natural),
events (e.g., quarantine, the 2020 election, weather),
figures (e.g., family members, friends, pets, and notable persons),
work (e.g., coursework, professional jobs, personal projects),
the emotional trials of the pandemic (e.g., grief, loss, healing),
activities (e.g., painting, reading, cooking, baking), and
media (e.g., books, movies, television shows).

This is a fancier noun list (person, place, and thing), but there was such a critical mass of work-related and sad-girl-pandemic-related entries that they merited their own categories. Also, I watched a lot of movies. As I read through my scans, I let the entries tell me what to summarize, rather than make a pre-determined list. This means that I input subject terms like “family cat” and “depression” and “crickets” and “insurrection” and “library school” and “Marisa Tomei” (she comes up!) that fall within these categories. The free-flowing approach results in a somewhat long subject list, but I’d rather users have a broader sense of the content of the entries than a smaller, less overwhelming set of tags to work from. After all, how useful would it be for you to search “locations” and have nearly every entry appear with no disambiguation?
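For the code-curious, here is a minimal Python sketch of how a vocabulary like this could be stored and searched. It is an illustration, not the site's actual implementation: the facet names come from the list above, but the entry IDs, the facet each term is filed under, and the function names are all hypothetical.

```python
from collections import defaultdict

# Facet names follow the list above. Which term sits under which facet is
# a guess made for illustration, not taken from the real project.
VOCABULARY = {
    "locations and environment": {"crickets"},
    "events": {"insurrection"},
    "figures": {"family cat", "Marisa Tomei"},
    "work": {"library school"},
    "the emotional trials of the pandemic": {"depression"},
    "activities": set(),
    "media": set(),
}

# Hypothetical entries: made-up IDs, each tagged with subject terms.
ENTRIES = {
    "entry-001": {"family cat", "depression"},
    "entry-002": {"insurrection", "library school"},
    "entry-003": {"Marisa Tomei", "family cat"},
}


def facet_of(term):
    """Return the facet a term is filed under, or None if it is uncontrolled."""
    for facet, terms in VOCABULARY.items():
        if term in terms:
            return facet
    return None


def build_index(entries):
    """Map each subject term to the set of entry IDs tagged with it."""
    index = defaultdict(set)
    for entry_id, terms in entries.items():
        for term in terms:
            if facet_of(term) is None:
                raise ValueError(f"{term!r} is not in the vocabulary")
            index[term].add(entry_id)
    return index


def search_term(index, term):
    """Entries tagged with one specific term."""
    return sorted(index.get(term, set()))


def search_facet(index, facet):
    """Entries tagged with any term in a facet (the broad, noisy search)."""
    hits = set()
    for term in VOCABULARY[facet]:
        hits |= index.get(term, set())
    return sorted(hits)


INDEX = build_index(ENTRIES)
print(search_term(INDEX, "Marisa Tomei"))
print(search_facet(INDEX, "figures"))
```

Searching a whole facet returns everything filed under it, while searching a single term narrows the results, which is the trade-off between broad categories and specific tags in miniature.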

Marisa Tomei in "My Cousin Vinny" courtroom scene saying "No! It is a trick question." — Is knowledge organization inherent? Image: My Cousin Vinny, the greatest courtroom scene in film history, Pinterest.

In short, I’m employing a modified version of faceted classification, which was developed by S.R. Ranganathan, a mathematician who basically overhauled India’s library systems and created the five laws of library science. Ranganathan felt that classification systems should be built from the bottom up by examining recurring sub-categories present in a set of materials, grouping those together, and then building a scaffold around the groups. The benefits of his approach are specificity, flexibility, dynamism, and future-proofing: faceted classification offers infinite hospitality that allows groupings and elements to change over time. If I added more entries to my website, for example, I could add more subject elements to the vocabulary list without disrupting any hierarchical categorization.

But determining the facets in the first place, and putting in the labor required to examine individual entities, is monstrously difficult. Devising thoughtful ways to describe materials as they are, rather than what they resemble, is laborious. It’s nearly the opposite of the purpose of a controlled vocabulary, which is to ease the burden on individual cataloguers of coming up with ways to describe things. And, of course, if every collection is described slightly differently, the descriptions often lack interoperability. Faceted classifications can become cluttered quickly and are not always intuitive to users (e.g., why did I list “fruit” in “locations and environment”?).

Faceted structures are ideal for small collections like mine, which can adapt easily and won’t grow much; for large, sprawling collections, it’s much harder to search and organize faceted structures. I spent hours reading through my own entries (material that had come from my brain!) to find common topics and to determine consistent reference points. Imagine how much more difficult that is with material you didn’t write or know little about. It takes a long time!

On the flip side, I resisted trying to find a controlled subject vocabulary for my materials because I knew no established controlled vocab would be able to talk about my cat and Marisa Tomei in the same breath. How do I talk about a eulogy I kept reading at the beginning of 2021? Or all the instantaneous, gorgeous things I saw on my weekend walks in Regent’s Park? No controlled vocabulary can speak to the tininess and greatness of an individual experience in that horrible first year of the pandemic.

It's also worth noting that for each item in my collection, I employ four different controlled vocabularies: three established ones for date, time, and location, and then my own fourth for subject terms. While controlled vocabularies aren’t usually mixed within a single element (that is, I’m not using two syntaxes to describe the date), a single item has lots of pieces of information in and about it that need to be described in different systematized ways. Ranganathan understood this, which is why he championed faceted classification to describe the infinite facets of an item (and how people interpret them differently). Last time, I pointed out two different kinds of classification for birds: one using taxonomy and one using vocalizations. Both of these schemes could live together in faceted classification.
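As a rough illustration of one item drawing on several vocabularies at once, here is a small Python sketch. The three established vocabularies are not named above, so this sketch assumes ISO 8601 strings for date and time and a tiny stand-in list for place names; the `EntryRecord` class, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date, time

# Stand-ins for illustration only. A real project would draw on a published
# place-name vocabulary and on its own full subject list.
KNOWN_PLACES = {"London, England", "Regent's Park, London"}
SUBJECT_TERMS = {"family cat", "depression", "crickets", "Marisa Tomei"}


@dataclass
class EntryRecord:
    written_on: str  # assumed ISO 8601 date, e.g. "2021-01-09"
    written_at: str  # assumed ISO 8601 time, e.g. "21:30"
    place: str       # must appear in KNOWN_PLACES
    subjects: list = field(default_factory=list)  # must appear in SUBJECT_TERMS

    def problems(self):
        """Return a list of validation problems (empty if the record is clean)."""
        found = []
        try:
            date.fromisoformat(self.written_on)
        except ValueError:
            found.append(f"date not ISO 8601: {self.written_on}")
        try:
            time.fromisoformat(self.written_at)
        except ValueError:
            found.append(f"time not ISO 8601: {self.written_at}")
        if self.place not in KNOWN_PLACES:
            found.append(f"unknown place: {self.place}")
        for term in self.subjects:
            if term not in SUBJECT_TERMS:
                found.append(f"uncontrolled subject term: {term}")
        return found


# Made-up sample records: one clean, one with two problems.
good = EntryRecord("2021-01-09", "21:30", "Regent's Park, London", ["crickets"])
bad = EntryRecord("01/09/2021", "21:30", "London, England", ["my cat"])
print(good.problems())
print(bad.problems())
```

Each field is checked against its own scheme, so no single element mixes two syntaxes, while the record as a whole draws on four.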

Gif of bird entering a bird house on a tree; cuts to a bird pad with a living room inside the tree. — Live together…in a tree pad. Image: Pinterest.

While building metadata structures and controlled vocabularies might not seem like the most glamorous work of librarianship, especially for someone who entered the field through the handling and research of special books, or for someone who wants to work primarily with children’s and teens’ collections, or anyone interested in the public-facing side of librarianship, good metadata strengthens all of that work. Completing this project built a deeper appreciation for how complex and crucial the process of metadata construction is, as well as how it is perhaps the most profound area of opportunity and improvement for just, accurate, and respectful representation of peoples in knowledge organization systems.

I obviously have the major benefit of being the author of these materials and knowing how I want them represented, and I still struggled to do justice to my own work. Choosing descriptors becomes that much more difficult the more distant you are from the text and from how it could be read across people, time, and space. In 50 or 100 or 500 years, will people care if they can search for Marisa Tomei?

housekeeping and birdseeking

house

What I read this week: finally finished Transcendent Kingdom by Yaa Gyasi. Phenom.
What I’m currently reading: Afterparties by Anthony Veasna So. I think I’m going to have to reread this one to catch everything So is doing with the text.
Apologies to my partner, who was rightly disgruntled that I included an embarrassing gif of the Laker man getting booped in the face with the basketball (the Lakers are his favorite team). He said, “If I had a newsletter, I would never include a highlight reel of bad Roxane Gay takes.” Fair.

bird

See above gif.

More later.

[1] I picked my diary because I have access to it 24/7, and lots of people wrote diaries during the pandemic, and my project could end up being a vaguely useful model for how institutions might compile, digitize, and describe a whole collection of pandemic diaries. Also, Daniel Defoe wrote one, so I’m in good company.

Dispatch from the Sunroom
