On Monday we soft-released our new library website, digital repository, discovery layer, and initial library app portfolio. As expected, we are receiving valuable responses and reactions to these new library systems. Some note continuing challenges, such as the discovery layer's tendency to, well, "discover" discrepancies between what our users expect and how we indexed a MARC21 record into our Solr index. One such discrepancy is how an item's format is determined for one of our discovery layer's facets. For example, a microfilm may be incorrectly identified as a book.
Current Solr Indexing Practice
VuFind, Blacklight, and many commercial ILS products use Solr to index MARC records into their discovery layers for searching and accessing bibliographic records. Our discovery layer is built using Django as a web front-end to our Solr index. Aristotle is our discovery layer's open-source project name.
Both VuFind and Blacklight use a Java-based open-source project called Solrmarc to index MARC records into Solr. Solrmarc offers multiple ways to determine the format of an item. In one method, a map properties file, referenced from the indexing configuration, provides keys and values for item formats. For more complex cases, Solrmarc offers a BeanShell script option for conditional processing based on values in the MARC record being indexed. These scripts can be complex. For the first versions of our Solr indexing in Aristotle, I used Solrmarc.
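To make the map-based idea concrete, here is a rough Python analogy. It is not Solrmarc's properties syntax, and the names FORMAT_MAP and format_from_map are hypothetical. The keys are MARC 21 007 "category of material" codes, and the labels are only illustrative (the label for h follows the one used later in this post).

```python
# Hypothetical lookup table: first character of the 007 field -> display format.
# This is a Python analogy for a key/value map file, not Solrmarc's own syntax.
FORMAT_MAP = {
    "h": "Microfilm",        # 007/00 'h' is the microform category
    "m": "Motion picture",
    "s": "Sound recording",
    "v": "Video recording",
}


def format_from_map(field007, default="Unknown"):
    """Return the mapped format for a raw 007 value, or the default."""
    if not field007:
        return default
    return FORMAT_MAP.get(field007[0], default)


print(format_from_map("hd afb---buca"))
```

A flat map like this is easy to read, but it cannot express rules that depend on several fields at once, which is where the scripted approach comes in.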
I take a different approach in the current version of Aristotle. I didn't want to maintain a completely separate codebase, written in a different programming language from our discovery layer. Fortunately, since I originally forked the Kochief codebase for use in Aristotle, I can now do our MARC record indexing in Python.
Indexing with marc.py
marc.py is the name of the Python module I'm currently using to index MARC records into Solr. (If you click on that link, you may notice that it takes you to a different open-source project I recently made available called py3-marc-indexer, a Python 3 version I created in an attempt to deal with Unicode in MARC records, another subject worthy of its own blog posting.)
Determining an item's format is the longest and most heavily customized function in marc.py, and much of the credit for figuring out this sometimes complex conditional logic goes to a member of our cataloging staff, Christine.
Here is a short Python code snippet we use to determine that an item is a microfilm:
if len(leader) > 7:
    if len(field007) > 5:
        if field007[0] == 'a':
            ...  # handling for other 007 values elided
        elif field007[0] == 'h':
            format = 'Microfilm'
So, to translate this pseudo-Python code: if a MARC record's leader is longer than 7 characters AND its 007 field is longer than 5 characters AND the first position of the 007 field is h, then we determine that the item's format is Microfilm. This combination of rules draws on the descriptions of the Leader and the 007 field in the Library of Congress's MARC 21 Format for Bibliographic Data.
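The rule can also be written as a small standalone function. This is a minimal sketch rather than the actual marc.py code: the function name determine_format and the sample leader and 007 values are assumptions made up for illustration, and the leader and 007 field are passed in as plain strings.

```python
def determine_format(leader, field007):
    """Return 'Microfilm' when the rule described above matches, else None.

    Both arguments are plain strings holding the raw leader and 007 data.
    """
    # Same three conditions as the nested ifs: leader longer than 7
    # characters, 007 longer than 5 characters, and 007 position 0 is 'h'.
    if len(leader) > 7 and len(field007) > 5 and field007[0] == "h":
        return "Microfilm"
    return None


# Hypothetical sample values, not taken from a real record.
sample_leader = "00000nam a2200000 a 4500"
sample_007 = "hd afb---buca"
print(determine_format(sample_leader, sample_007))
```

Returning None when the rule does not match leaves room for other format checks to run afterwards.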
While I believe the marc.py code is easier to understand and debug than the alternatives, I am still dissatisfied with the knowledge impedance between catalogers and programmers. I shouldn't expect catalogers to read Python code in order to comment on and contribute to these indexing rules. This is where Gherkin, a domain-specific language (DSL), can help.
Adding Gherkin to the mix
I was first introduced to Gherkin at the 2011 Code4Lib conference in Naomi Dushay's presentation. She explained how Stanford Searchworks evaluates search result relevancy using, in part, Gherkin with BDD testing in their Blacklight Ruby-on-Rails discovery layer. I have already been using small bits of behavior-driven development (BDD) practices in my coding for the FRBR-Redis Datastore project, using Gherkin to create features and scenarios that test various aspects of the project. You can find more information in my 2012 Code4Lib presentation.
Gherkin's natural-language description of test conditions closely matches my desire to make it easier to communicate and collaborate with non-programming staff in the library about the explicit rules we are using to determine such things as an item's format. For example, this scenario, taken from material_format.feature, describes the conditions that must be met before an item is considered a microfilm:
Scenario: The entity is a Microfilm
  Given we have a MARC record
  When "<code>" field "<position>" is "<value>"
    | code | position | value |
    | 007  | 0        | h     |
  Then the entity is a Microfilm
Using this approach, our MARC indexer loads material_format.feature with the Python behave module and then uses these scenarios to determine the format for the Solr index.
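To illustrate the idea without depending on behave, here is a hand-rolled sketch of how the table rows in a scenario could be checked against a record. It does not use behave's API and is not the indexer's real code. The helper names parse_table and matches, and the dictionary used to stand in for a MARC record, are all assumptions.

```python
def parse_table(lines):
    """Turn Gherkin table lines ('| a | b |') into a list of row dicts."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in lines
        if line.strip().startswith("|")
    ]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def matches(record, rules):
    """True when every rule's field holds the value at the given position."""
    for rule in rules:
        data = record.get(rule["code"], "")
        position = int(rule["position"])
        # Slicing avoids an IndexError when the field is missing or too short.
        if data[position:position + 1] != rule["value"]:
            return False
    return True


SCENARIO_TABLE = """
| code | position | value |
| 007  | 0        | h     |
"""

rules = parse_table(SCENARIO_TABLE.splitlines())
record = {"007": "hd afb---buca"}  # hypothetical record: tag -> raw field data
entity_format = "Microfilm" if matches(record, rules) else None
print(entity_format)
```

Because each rule is just a row of code, position, and value, a cataloger can add or correct a condition in the feature file without touching the Python.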
Besides being easier to understand, the same Gherkin feature file can also drive our BDD testing. Gherkin also opens a new way to engineer our batch processing of vendor MARC records, since we can describe the modifications we want to make to a record before importing it into our ILS and, soon, our FRBR-Redis Datastore instance.