Common sense at last?

Good news, everybody! It seems Marc21 is dead (or has at least been told to order its last meal). Last week, the Library of Congress (LOC) announced, drawing on the work of its Working Group on the Future of Bibliographic Control, that:

  the Library community’s data carrier, MARC, is “based on forty-year-old techniques for data management and is out of step with programming styles of today.” [1]  The Working Group called for a format that will “accommodate and distinguish expert-, automated-, and self-generated metadata, including annotations (reviews, comments, and usage data.” [2]  The Working Group agreed that MARC has served the library community well in the pre-Web environment, but something new is now needed to implement the recommendations made in the Working Group’s seminal report. In its recommendations, the Working Group called upon the Library of Congress to take action. In recommendation 3.1.1, the members wrote:

“Recognizing that Z39.2/MARC are no longer fit for the purpose, work with the library and other interested communities to specify and implement a carrier for bibliographic information that is capable of representing the full range of data of interest to libraries, and of facilitating the exchange of such data both within the library community and with related communities.” [3]

…..

With these strong statements from two expert groups, the Library of Congress is committed to developing, in collaboration with librarians, standards experts, and technologists a new bibliographic framework that will serve the associated communities well into the future. Within the Library, staff from the Network Development and Standards Office (within the Technology Policy directorate) and the Policy and Standards Division (within the Acquisitions and Bibliographic Access directorate) have been meeting with Beacher Wiggins (Director, ABA), Ruth Scovill (Director, Technology Policy), and me to craft a plan for proceeding with the development of a bibliographic framework for the future.

Enjoy the whole thing here.

Such news honestly fills me with joy, but I may need to reword some forthcoming talks. Lots of people have been tweeting and blogging, but Roy Tennant at Library Journal is surely allowed to celebrate the most; after all, he called for this nearly ten years ago.

The Last Supper for Marc21, hopefully with no resurrection in sight ...

Marc21 is more than a container format. Along with AACR2 (and RDA, really) it’s a whole set of syntaxes, standards and working practices that represent a ‘transcriptive approach’ to metadata creation designed to generate a card-catalogue record. This approach has never worked satisfactorily in the networked environment and has given modern library programmers and hackers hours of pain.

Some thoughts about what may come to replace it …

Is Linked Data / RDF the right choice?

The LOC statement indicates a preference for Linked Data / RDF, but does not draw a distinction between the two. One is an idea; the other is a data model, with several concrete syntaxes, that can be used to realise that idea. Still, RDF remains the most popular way of producing linked datasets.

Have the Library of Congress made the right choice? Far too early to say. It’s down to them to evaluate the tech, which is why they will be consulting. Some people will say that the LOC is a bit behind the times, and that linked data is a ‘has-been’ technology, a dead duck. They may suggest some popular current tech alternatives, such as:

  • Schema.org / HTML5 microdata formats. Right now, this is not really the same use case as Marc, although a Marc replacement should be able to translate easily into this sphere. In some respects, for cultural heritage and research, what Google is doing is almost immaterial, as the web exists and extends well beyond search and advertising (and IMHO DuckDuckGo is generally a better search engine for many research purposes). Microdata right now is aimed at commercial applications and at getting better sales links out there. A richer academic / cultural heritage application would be useful, but it would need to be widely adopted.
  • NoSQL databases are great for varied types of data and are a natural fit for bib data, but they are just database software, just as plain and simple JSON is a great container format and only that (ditto plain and simple XML). Anyone using such tech as an excuse for unstructured data will find structure inevitably creeps in. One day, they may want to look for a schema or standard to help simplify things… (see the sketch just after this list).
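To make that last point concrete, here is a minimal sketch of how structure creeps in, assuming a local MongoDB instance and the pymongo library; the record and its field names are placeholders, not a proposed schema:

from pymongo import MongoClient  # assumes pymongo is installed and mongod is running locally

# A bibliographic record as plain JSON-style data; every name here is illustrative.
record = {
    "title": "An Example Book",
    "creator": ["Example, Ann", "Example, Bob"],
    "issued": "2011",
    "identifier": {"isbn": "0000000000000"},
}

client = MongoClient("mongodb://localhost:27017")
records = client["catalogue"]["records"]
records.insert_one(record)

# The moment anyone queries by a field name, an implicit schema has crept in:
# every contributing system now has to agree on what "identifier.isbn" means.
print(records.find_one({"identifier.isbn": "0000000000000"})["title"])

The store itself does not care, but the query only works because everyone loading data has quietly agreed on the same field names, which is precisely the job a shared schema or standard does explicitly.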

We in cultural heritage really need some level of schema and data structure to work with from the get-go: a base set of fields that have a well-defined meaning and that are commonly understood by people on opposite sides of the globe doing the same job. We need some defined, controlled way of filling these fields with text. In terms of subject and name authority control on a global scale, linked data has such obvious advantages that it needs serious consideration.

Then we can wrap them up in sexy JSON and load them into our funky MongoDBs. Technology should not dominate the conversation here, but it should be seen in perspective. We have a lot more flexibility, choice and freedom than we did 40 years ago when Marc21 was created.
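As a rough sketch (using rdflib and Dublin Core terms; the record, VIAF and LCSH URIs are all placeholders, not a proposed model), this is roughly what a base set of well-understood fields plus shared authority links could look like:

from rdflib import Graph, URIRef, Literal
from rdflib.namespace import DCTERMS

g = Graph()
book = URIRef("http://example.org/id/book/1")  # placeholder record URI

# A small base set of commonly understood fields...
g.add((book, DCTERMS.title, Literal("An Example Book")))
g.add((book, DCTERMS.issued, Literal("2011")))

# ...with creators and subjects pointing at shared authority URIs rather than
# locally transcribed strings (both URIs below are illustrative placeholders).
g.add((book, DCTERMS.creator, URIRef("http://viaf.org/viaf/000000000")))
g.add((book, DCTERMS.subject, URIRef("http://id.loc.gov/authorities/subjects/sh00000000")))

# The same graph can then be serialised into whatever wrapper we fancy.
print(g.serialize(format="turtle"))

The same description could just as easily be serialised as JSON-LD and dropped into a document store, which is rather the point: the shared fields and authority links are what matter, not the container.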

How does this tie into the major library system vendors?

Details on next-gen LMS systems are thin on the ground. Serials Solutions are apparently building a web-scale management system around linked data. Carl Grant has indicated that Ex Libris Alma has hooks for linked data, presumably URIs for record nodes, which seems a prudent choice. He argues that RDF linked data still needs to find its killer app. Maybe library management is it? Imagine records that catalogue themselves by following links in data to generate new access points …
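To make the self-cataloguing idea a little more concrete, here is a hedged sketch with rdflib: dereference an authority URI already attached to a record and harvest its labels as candidate access points. The URI below is a placeholder; real services such as VIAF or id.loc.gov do publish RDF, but formats and content negotiation vary, so treat this purely as an illustration.

from rdflib import Graph
from rdflib.namespace import SKOS

g = Graph()
# Fetch and parse the RDF behind an authority URI found in the record.
# Placeholder URI: this will only work against a real, resolvable RDF document.
g.parse("http://example.org/authority/000000000.rdf")

# Harvest preferred and alternate labels as candidate access points for the local index.
access_points = set()
for _, _, label in g.triples((None, SKOS.prefLabel, None)):
    access_points.add(str(label))
for _, _, label in g.triples((None, SKOS.altLabel, None)):
    access_points.add(str(label))

print(sorted(access_points))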

OCLC have ideas in this direction and have been experimenting with linked data. Nothing much yet other than data, though.

This announcement may be timely for some development cycles, less so for others. I would suggest that LMS vendor take-up of any new standard, in at least an import/export/creation capacity, will be vital to product success as long as librarians still care about data standards. I could be wrong, though.

UK experience with RDF / Linked Data

The UK has a slight edge over the US here, thanks in part to the initial work of the Discovery programme. The British Library’s BNB is available as linked RDF and would arguably act as an ideal test platform for examining many of the issues that might arise during standards formulation. The Open Bibliography project has led the way in exploring open licensing.
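To illustrate what having the BNB as linked RDF makes possible, here is a minimal query using the SPARQLWrapper library; the endpoint URL and the query shape are assumptions based on the BNB’s published linked data and may well have changed, so check the British Library’s current documentation before relying on them.

from SPARQLWrapper import SPARQLWrapper, JSON

# Endpoint URL is an assumption; the BL's documentation is the authority here.
sparql = SPARQLWrapper("http://bnb.data.bl.uk/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?book ?title WHERE {
        ?book dct:title ?title .
    } LIMIT 5
""")

# Print a handful of resource URIs and their titles.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["book"]["value"], row["title"]["value"])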

That the UK community has largely recognized the need for permissive licensing (CC0 / PDDL) around linked data is perhaps the main thing to shout about at this stage. When navigating links, coming up against a license wall stopping re-use could make life really difficult.

Do we need complexity?

One of the myths we really need to blow open is that libraries need and use rich and complex metadata even for everyday needs. We really don’t.

We need a baseline standard that is easy for staff and readers to understand, and easy to implement and get right. This will be easily shareable and usable outside of ‘libraryland’.

The evidence? According to OCLC Research, only 10% of all Marc tags in Worldcat appear in 100% of all Worldcat records, and 65% of tags appear in less than 1% of records. Basically, most of it is unused. The standard is bloated. Think about all those meaningless icons in MS Word …

Extensible standards such as Dublin Core and flexible RDF vocabularies would allow for complexity to be included when needed and ignored when not, in a way Marc does not. To paraphrase Owen Stephens at a recent JISC event, an attempt to rebuild the Marc tagset in RDF whilst ignoring existing vocabularies would be an abject failure, along the lines of MarcXML (‘the worst of both worlds’).
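A quick sketch of complexity being included when needed and ignored when not: the two descriptions below share the same Dublin Core terms, one bare-bones and one enriched, and both parse happily with rdflib (all URIs and values are placeholders). A consumer that only understands titles simply ignores the rest.

from rdflib import Graph

# One minimal and one richer description, using the same vocabulary;
# consumers ignore the properties they do not understand.
data = """
@prefix dct: <http://purl.org/dc/terms/> .
@prefix ex:  <http://example.org/id/> .

ex:simple a dct:BibliographicResource ;
    dct:title "An Example Book" .

ex:detailed a dct:BibliographicResource ;
    dct:title   "Another Example Book" ;
    dct:issued  "2011" ;
    dct:extent  "320 pages" ;
    dct:subject <http://id.loc.gov/authorities/subjects/sh00000000> .
"""

g = Graph()
g.parse(data=data, format="turtle")
print(len(g), "triples parsed")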

How can we involve others?

Making the standard or approach useful to a wider community beyond ‘libraryland’ will be vital to its success. The statement seems to recognize this, but is it enough to leave its ownership in the hands of librarians and the LOC alone?

Karen Coyle is again the voice of reason, arguing again and again, quite practically, that if the Library of Congress want a truly useful open standard accepted beyond libraries, they need to open up its formulation, management and ownership to a wider body. She draws attention to NISO’s offer to take ownership of the work.

I tend to agree, and hope the LOC steps back here. NISO know standards and how to manage change. Tying into this blog’s emerging wider theme, it’s also a chance for everyone (vendors, libraries and publishers) to bang heads and innovate on the same page. Interesting times ahead.


8 thoughts on “Common sense at last?”

  1. Surely it would be better for W3C to manage the process…or alternatively for libraries to adopt W3C’s standards rather than insisting that their documents on the internet are different and special.

  2. In terms of container syntaxes, totally. No need to fork RDF or even simple JSON (if the W3C ever adopt it). A proprietary standard, or even misuse of a W3C standard, would suck.

    How we use these high level concepts on a data content level needs a bit more thought though.

    As an example, MODS, a more modern library metadata standard, uses W3C technologies (XML and XML Schema), but it also needs its own guidance on how to use that schema and put data into it.

    http://www.loc.gov/standards/mods/.

    The fact that LOC still own MODS may explain its limited take-up outside of the library sphere.

    As I see it, metadata on the web has three components:

    1) The container format that wraps the data (RDF, RDF/XML, XML, JSON)
    2) The field and vocab choices expressed in the syntax (MODS, ONIX, Dublin Core)
    3) The way the content is expressed in that field, usually governed by a set of rules about data creation (RDA, AACR2, ISAD(G)).

    Outside of libraries and archives, most people think a lot about the first layer and a bit about the second, rarely about the third. Full text searching and keyword indexing have diminished its importance to them.

    Part of the problem with Marc21 was that it was an unsatisfactory mix of all three. It was impossible to separate content and vocab from container without a lot of pain, partly because the format relies on punctuation to denote data elements.
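    A rough illustration of the punctuation problem (the data below is made up, and this is nowhere near real cataloguing code): ISBD-style punctuation baked into subfield values has to be stripped or re-interpreted by every consumer before the bare data can be reused.

# Illustrative only: trailing ISBD punctuation inside a fictional 260-style
# field carries meaning, so it has to be stripped before the values are reusable.
subfields = {"a": "London :", "b": "Example Press,", "c": "c2011."}

def strip_isbd(value):
    """Remove trailing ISBD punctuation so the bare value can be reused."""
    return value.rstrip(" :,;/.").strip()

clean = {code: strip_isbd(value) for code, value in subfields.items()}
print(clean)  # {'a': 'London', 'b': 'Example Press', 'c': 'c2011'}

    Even after stripping, oddities like the copyright ‘c’ prefix on the date survive, which is exactly the sort of pain described above.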

    In terms of letting the W3C actually manage it, would they want to? Do they govern how book publishers use ONIX in XML, or how archivists use EAD?

    They’ve already set up an incubator group on libraries and linked data which may well have impacted upon this decision:

    http://www.w3.org/2005/Incubator/lld/

    So in answer to your point, I think the W3C should totally own the top syntax layer (1) and give advice on the others, but it’s down to local communities to deal with (2) and (3). If there was a point to this post, it’s that libraries should not do (2) or (3) in isolation.

    • I take your point — I made a rather vague statement. I think my point was something more in this direction:

      RDF/XML, Turtle and N-Triples are “syntaxes” (or “formats”): (1) in your model.

      RDF is a model; it can be expressed in the different formats. You could also say that the syntaxes “realise” the model. (This is sort of 2–3 in your model as RDF is used both for the definition of descriptive terms and the data itself.)

      You draw a distinction between (2), the descriptive terms, and (3), the expression of content. This can be exemplified:

      :book bibo:pages "320 pp."^^ex:ISBDNumberOfPages

      Personally, I’m not keen on this approach for two reasons: firstly, there is no added value over specifying a new datatype property which states that the value is the number of pages counted in a particular way. Secondly, there is the unnecessary use of a textual value, as in the example above, which is one of the major issues with catalogue data.

      There are two alternatives that I feel fit better with a traditional RDF model:

      :book bibo:pages "320"^^xsd:nonNegativeInteger.

      or

      :book ex:catalogueDataPages "320"^^xsd:nonNegativeInteger.

      You can develop a domain-specific vocabulary (assuming that dcterms:extent or bibo:pages won’t do for you), but I wouldn’t interfere with datatypes for values.

      In the latter case, the community produces a vocabulary which specifies what is being counted and how, but it doesn’t interfere with the formatting of the value. Here, more than anywhere, leave this to the W3C :)

      At the same time, I’d ask whether it is really necessary to express all of this information following ISBD, AACR*, FRBR. I am unconvinced; in fact I think they even harm the usability of the data.

      I’m also of the opinion that the benefits of networked, distributed data bring new possibilities that aren’t part of the conceptual space currently occupied by libraries. This has to change if there is to be any successful collaboration between libraries and the larger semantic web.

      In this connection, it’s interesting to note that you mention record-based formats like ONIX and EAD, as these occupy the non-distributed space…like MARC: heavy, standardized encodings where one size fits all (badly). As a colleague recently pointed out, we need dynamic standards that can be deployed quickly and efficiently. Here, I think RDF and the cluster of semantic web tech (particularly things like content negotiation-based versioning) can really do new, good stuff.

      At the same time, it’s also worth pointing out that we don’t need to choose the same “standard” in order to exchange information efficiently; in fact, I think the fact that anyone ever had the idea of applying MARC to music proves that sometimes we should dare to do things differently.

      That’s a lot of text.

      I suppose my message here is — don’t develop new standards where perfectly good vocabularies exist. Rather follow best practice as it exists, following W3C recommendations. And please, please don’t drag the mistakes made when we moved off catalogue cards into the next iteration of the bibliographic adventure :)

      • Note that there is a task group looking at the treatment of various measures of extent in the newest guidelines, RDA, with an aim to convert the practice from text strings to actionable data. Thus “23 cm.” would presumably specify that “23” represents the height of the item in centimeters using defined properties. This group is called, in the usual obtuse mode, “CC:DA Task Force on Chapter 3.” It’s briefly explained here:
        http://www.libraries.psu.edu/tas/jca/ccda/tf-MRData1.html
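        As a very rough sketch (not the Task Force’s actual proposal, just an illustration), turning a transcribed “23 cm.” into actionable data could look something like this in RDF, using a hypothetical ex: property and a typed value:

from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import XSD

EX = Namespace("http://example.org/terms/")    # hypothetical vocabulary
item = URIRef("http://example.org/id/item/1")  # placeholder item URI

g = Graph()
# "23 cm." as a transcribed string tells a machine very little...
g.add((item, EX.extentStatement, Literal("23 cm.")))
# ...whereas a defined property plus a typed value is actionable data.
g.add((item, EX.heightInCentimetres, Literal(23, datatype=XSD.integer)))

print(g.serialize(format="turtle"))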

  3. http://xkcd.com/927/

    Am in total agreement on best practice in RDF (if they go there) and W3C. I’m interested by your comment:

    “At the same time, I’d ask whether it is really necessary to express all of this information following ISBD, AACR*, FRBR. I am unconvinced; in fact I think they even harm the usability of the data.”

    Basically, 3) in my rough model, the rules governing the field content. What we used to call cataloguing rules.

    I can think of tonnes of examples where you are right: AACR2 allowing ‘.hbk’ to sit alongside an ISBN, or only adding death dates where two versions of a name exist.

    Do we need a newer, lighter set of transcriptive guidelines (beyond RDA) as well? Or should we just chuck the rulebook in the bin and hope that a decently structured container format and well chosen set of vocabularies will see us through?

    Outside of librarianship and archives, few people bother to dream up rules about what exactly to put into fields. They tend to stop at fieldnames and call it a day. They may be right.

    • Karen, thanks for dropping by. Forgive my ignorance on RDA, but that sounds most welcome.

      Is date information considered in its remit? It’s one of my major Marc21 / AACR2 bugbears.

      The ‘circa’ and ‘active around’ annotations to dates (100$d etc.), as well as the use of punctuation to separate birth and death dates, are semantically inadequate and difficult for most developers to work with quickly. They also invite cataloguer error.

      As a result, the rich data often captured there is rendered useless by inadequate markup.

      At the very least, this information should be held in a more granular form.

      • Date is a tricky one, and LC has been working on a new date standard that would include concepts like “approximate” etc. Basically, most of the info that is in the MARC date code or other dates should be expressible in the date standard. ISO 8601 includes intervals and repeating dates, but doesn’t include all of the date “situations” in library data. See: http://www.loc.gov/standards/datetime/spec.html. The draft is recent: Sept. 2011, so this is just proposed and I assume would go to one of the ISO TC-something committees.

        A better way to encode the reality of dates is going to be needed in the “cloud”, especially for things like events. I’d like to see some investigation of the actionability of this proposed standard — it doesn’t look terribly machine friendly to me.

        BTW, it’s not ignorance of RDA: I only learned about the Task Force because I know folks on it, and it’s a very small group (7 people). There is little openness in the RDA standards arena — some of it isn’t kept closed on purpose, but the groups working on it are not used to making their activities known to the wider community. I think there’s an assumption of “expertise” plus the reluctance to manage a broader dialogue. (Most people working on these standards have full time jobs, so they are pressed for time.) We have to find a way to break through that because the “small group of experts” approach just isn’t appropriate any more.

  4. I could get that tweeted and seen by at least a fair few software developers.

    They might need some contextual information explaining the use cases behind unknown dates.

    Or you could target a service like Stack Overflow and ask for feedback?

    Crowd-sourcing a lot of opinions would be as handy as speaking to experts who struggle to find the time.

    Strongly agree on the openness issue.

    With my layman’s eyes, it’s not very machine friendly. Continuing to mix in 19uu-style alphanumeric values for the level 1 + 2 extensions, and to keep varied information in a single field, is going to be a real pain in the bum to read.

    Practically put, someone will need to develop and maintain code libraries to parse all of this, probably using a lot of regular expressions. This will need to happen across all the popular programming languages.
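    As a sketch of the sort of code that implies (illustrative patterns only, covering simple year strings such as 1984, 1984? and 19uu, and nowhere near the full draft standard):

import re

# Illustrative pattern only: a year whose trailing digits may be 'u' (unspecified),
# optionally flagged '?' (uncertain) or '~' (approximate).
DATE_RE = re.compile(r"^(?P<year>[0-9]{1,4}u{0,3})(?P<qualifier>[?~])?$")

def parse_year(value):
    """Return (earliest, latest, qualifier) for simple year strings, else None."""
    match = DATE_RE.match(value)
    if not match:
        return None
    year, qualifier = match.group("year"), match.group("qualifier")
    earliest = int(year.replace("u", "0"))
    latest = int(year.replace("u", "9"))
    return earliest, latest, qualifier

for example in ["1984", "1984?", "19uu", "201u~"]:
    print(example, "->", parse_year(example))

    Multiply that by every date form in the draft and by every programming language in use, and the maintenance burden becomes obvious.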

    This barrier alone will put people off using the data, especially if the communities that would build these libraries are not consulted initially.

    Still, I assume there are no existing standards out there to meet this use case.
