What is metadata?
Metadata is sometimes defined as ‘data about data’. A common analogy is a library (or museum) catalogue card, which holds summary information about an object on a shelf and points to that object’s location. However, while a reader may use the catalogue card as metadata, a librarian may use a collection of such cards as data: for example, to count the number of books on philosophy. Nowadays, metadata may be used to characterise not only datasets but also software services, software modules, workflows, equipment (including computers and sensors), organisations, persons and more. In effect, metadata is becoming the digital representation of a real or digital object. Such metadata records are commonly stored in a digital catalog (note the changed spelling, to differentiate from a manual catalogue).
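The catalogue-card analogy can be made concrete with a minimal sketch (Python is used here as neutral notation; the fields and values are illustrative and follow no particular standard):

```python
# Each "catalogue card" is a metadata record: summary information about an
# object plus a pointer to where the object itself lives.
catalogue = [
    {"title": "Critique of Pure Reason",   "subject": "philosophy", "location": "shelf 12-B"},
    {"title": "On the Origin of Species",  "subject": "biology",    "location": "shelf 03-A"},
    {"title": "Meditations",               "subject": "philosophy", "location": "shelf 12-C"},
]

# A reader uses one card as metadata (to find the object); a librarian uses
# the whole set of cards as data, e.g. to count the books on philosophy.
philosophy_count = sum(1 for card in catalogue if card["subject"] == "philosophy")
print(philosophy_count)  # 2
```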
What are metadata standards?
Communities realised that if they each characterised their real or digital objects of interest differently, then they could not easily gain an overall picture of those objects across collections. Furthermore, if an object were to be transferred (permanently or temporarily) from one collection to another, having different characterisations would be at best inconvenient and at worst erroneous. The same is true of transferring information from one organisation’s IT system to another, whether in the SWIFT interbank network, the many supply chains in manufacturing, or cooperation between law enforcement agencies, among others. In all these cases standards have been defined and agreed. In the digital world, the ability to find, access, interoperate with and reuse digital assets (the FAIR principles) depends on metadata that follows those principles.
It is commonly said that standards are a good thing and that is why there are so many of them! In many areas of commercial, academic and public life there exists a plethora of standards for metadata. Some of the most commonly used are DC (Dublin Core), CKAN (Comprehensive Knowledge Archive Network), DCAT (Data Catalog Vocabulary), ISO19115 (for geospatial information), MARC for libraries and particularly inter-library loan, HL7 for healthcare, and CERIF (Common European Research Information Format). There are multiple standards in each domain of research, business, public services and education. Most have variants or ‘dialects’, as people have tried to improve upon the base standard for local purposes. These variants may be termed ‘application profiles’ and are not unlike a database view. However, they defeat the intent of interoperability unless agreement on a common core is reached.
How is Europe involved?
DC is rooted in the 1994 International WWW Conference held in Chicago, which led to a workshop in Dublin, Ohio in 1995. European researchers, predominantly from the WWW and library domains, contributed, and many conferences and workshops have since been held in Europe.
CKAN was developed by the Open Knowledge Foundation, based in Cambridge, UK, and adopted, for example, by the UK Government.
The original DCAT vocabulary was developed and hosted at the Digital Enterprise Research Institute (DERI) in Ireland. It has strong advocates in Europe and is used particularly for European data portals as DCAT-AP. There are geospatial and statistical variants used for interoperation in those domains.
ISO19115 is – of course – international, but in Europe an EU directive mandates INSPIRE, a variant of ISO19115 (usually encoded as ISO19139, the XML version).
MARC, developed in the USA in the 1960s, became the US standard for library catalogs in 1971 and an international standard in 1973. National variants appeared; MARC21 is used in North America and UNIMARC in Europe.
HL7 dates from 1987 and is a US-based initiative now adopted worldwide. The ‘7’ refers to level 7 of the ISO model for Open Systems Interconnection.
CERIF (as its name implies) is a European initiative dating from the 1980s, revised significantly in the late 1990s to embrace a richer model with formal syntax and declared semantics. It emerged from an EC Expert Group. Although predominantly European, there are installations on every continent except Antarctica.
Why are they important?
To share information across computing systems and their services, the information being shared requires metadata to characterise it. A database schema sitting between a database and the software services accessing it is metadata (and here there are also many standards). This permits generalised, homogeneous software services to be used across a multitude of heterogeneous databases. Metadata standards allow the building of distributed systems of data and software by characterising the information formats, types, default values, constraints and so on. They are used to improve user interfaces by proposing values for input or by applying validation constraints. They may suggest broader or narrower terms for entry. They allow the characterisation and interoperation of workflows and the dynamic deployment of applications or workflows (possibly partitioned or parallelised) across multiclouds, including Fog and Edge computing linked with the IoT (Internet of Things).
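The idea of a schema acting as metadata between data and generic services can be sketched as follows (a minimal illustration only; the field descriptions are hypothetical and follow no particular standard):

```python
# The metadata: formats, types, default values and constraints for a record.
dataset_schema = {
    "title":    {"type": str, "required": True},
    "language": {"type": str, "required": False, "default": "en"},
    "year":     {"type": int, "required": False, "constraint": lambda v: 1900 <= v <= 2100},
}

def validate(record, schema):
    """Generic service: applies defaults and checks types and constraints for
    any record described by such a schema, whichever database it came from."""
    cleaned = {}
    for field, rules in schema.items():
        if field not in record:
            if rules["required"]:
                raise ValueError(f"missing required field: {field}")
            if "default" in rules:
                cleaned[field] = rules["default"]
            continue
        value = record[field]
        if not isinstance(value, rules["type"]):
            raise TypeError(f"{field} must be {rules['type'].__name__}")
        if "constraint" in rules and not rules["constraint"](value):
            raise ValueError(f"{field} fails constraint")
        cleaned[field] = value
    return cleaned

print(validate({"title": "Rainfall 2019", "year": 2019}, dataset_schema))
# {'title': 'Rainfall 2019', 'language': 'en', 'year': 2019}
```

The same validation service works unchanged for any data source whose records are described by such a schema; only the metadata changes.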
The most important feature is that metadata standards allow systems to be built with confidence knowing that the components can interoperate with each other as long as they read, interpret and write metadata according to the standard. This encourages the development of services (open source and commercial) utilising the metadata standard and thus creates a bazaar of offerings for the systems developer.
How are they used?
Some examples of use were given above to justify their importance. Essentially metadata standards:
- Allow decoupling of software services and specific data formats;
- Allow improved user interfaces with defaults and validation;
- Enhance the findability, accessibility, interoperability and reusability of assets (FAIR);
- Allow ‘plug and play’ of services obeying the standard to construct systems;
- Allow dynamic (re-)deployment of workflows or applications across multiclouds;
- Ensure the integrity of systems by characterising the users, organisations, assets and the relationships between them.
What are the problems?
The first problem is to choose an appropriate standard for the requirements. There are so many to choose from. In general, systems developers in a given domain of activity will know the standards used in that domain. A list of known and used metadata schemes for research is available at the RDA (Research Data Alliance) metadata standards catalog.
A second problem is to understand the metadata standard since many are documented in a way that is unhelpful. Commonly there are assumptions and restrictions which are not obvious.
The major problem is that many metadata standards were designed with a human reader rather than a computer system in mind – essentially following the library catalogue card. Typically, they have an identifier for the object being characterised, its location (URL), and then a set of attributes describing it. The problem is that some of the attributes do not depend functionally on the object being characterised. A person as an author is not a dependent attribute of the article or book (unlike the title or the number of pages); she exists independently and has many other relationships, e.g. with other assets, persons and organisations. If there are multiple authors, the problem of referential integrity appears. The key word in the sentence above is relationships. A person is related to a book in the role author. Multiple persons can be related to the same book in the role author (or, more specifically, co-author). The term describing the role (here: author) is a meaningless lexical string of characters to a computer unless the semantics of ‘author’ are defined somewhere; this is where thesauri and domain ontologies are used. There are plenty of homonyms: ‘bond’ may refer to chemistry, finance, sociology or entertainment. Ideally these relationships have temporal semantics, so that the temporal range during which the assertion is true is recorded. This has advantages for temporal queries, such as the departments of an organisation at time t (an example of horizontal time) or the positions occupied by a person over time (vertical time). Furthermore, the inclusion of temporal semantics alongside relationships with rich syntax and semantics provides built-in provenance. An example of a metadata standard providing this is CERIF.
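The relationship-based approach can be sketched as below. This is only an illustration of the idea (independent entities linked by a relationship carrying a role and a temporal range); the class names and query are hypothetical and are not the CERIF schema itself:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Person:      # exists independently of any one book
    id: str
    name: str

@dataclass
class Book:
    id: str
    title: str

@dataclass
class PersonBookRelation:
    person_id: str
    book_id: str
    role: str                   # semantics of the role defined elsewhere, e.g. in a thesaurus or ontology
    start: date                 # temporal range during which the assertion holds
    end: Optional[date] = None  # open-ended if None

relations = [
    PersonBookRelation("p1", "b1", "author", date(2001, 1, 1)),
    PersonBookRelation("p2", "b1", "author", date(2001, 1, 1)),   # co-author: same book, same role
    PersonBookRelation("p1", "b2", "editor", date(2005, 6, 1), date(2007, 6, 1)),
]

def relations_of(person_id: str, at: date):
    """Temporal query: the relationships involving a person that were valid at time t."""
    return [r for r in relations
            if r.person_id == person_id
            and r.start <= at
            and (r.end is None or at <= r.end)]

print([(r.book_id, r.role) for r in relations_of("p1", date(2006, 1, 1))])
# [('b1', 'author'), ('b2', 'editor')]
```

Because each assertion carries its own role and validity period, the record of who did what, and when, is retained: this is the built-in provenance referred to above.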
What is being done to improve?
With experience, users and system designers are realising the problems and addressing them. For example, DC has progressed over about 20 years from textual to HTML representation (the latter providing attribute headings), thence to XML (allowing unqualified relationships but qualified values for attributes) and more recently to RDF (Resource Description Framework), which allows triples representing relationships of the form <subject><predicate><object>, e.g. <Mary><is author of><Book X>. Unfortunately, much use of DC still relies on the older representations.
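A minimal sketch of the RDF representation is shown below (it assumes the rdflib Python library is available; the URIs are hypothetical). Note that Dublin Core would typically express <Mary><is author of><Book X> in the reverse direction, as dcterms:creator from the book to the person:

```python
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import DCTERMS, FOAF

g = Graph()
mary = URIRef("http://example.org/person/mary")
book = URIRef("http://example.org/book/x")

g.add((book, DCTERMS.title, Literal("Book X")))
g.add((book, DCTERMS.creator, mary))           # <Book X><has creator><Mary>
g.add((mary, FOAF.name, Literal("Mary")))      # the person exists as a resource in her own right

print(g.serialize(format="turtle"))
```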
W3C has a Dataset Exchange Working Group (DXWG) improving DCAT. The RDA metadata groups, under the umbrella Metadata Interest Group, are working with groups from various subject domains to define an element set for research metadata in which each element has internal structure, including relationships, and the elements themselves are linked by qualified relationships. This will provide sufficient richness of syntax and semantics, as well as flexibility for present and foreseen requirements.
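For orientation, a minimal DCAT-style dataset record of the kind harvested by data portals might look as follows (a sketch only; it assumes a recent version of the rdflib library, and the URIs and values are hypothetical):

```python
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
ds = URIRef("http://example.org/dataset/rainfall-2019")
dist = URIRef("http://example.org/dataset/rainfall-2019/csv")

g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, DCTERMS.title, Literal("Rainfall observations 2019")))
g.add((ds, DCTERMS.publisher, URIRef("http://example.org/org/met-office")))
g.add((ds, DCAT.distribution, dist))
g.add((dist, RDF.type, DCAT.Distribution))
g.add((dist, DCAT.downloadURL, URIRef("http://example.org/files/rainfall-2019.csv")))

print(g.serialize(format="turtle"))
```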
Of course, metadata standards evolve with time (see DC example above). It is critically important that they preserve backward compatibility (to allow interoperation with legacy systems) and that there is a community supporting and developing the standard.
The topic of big data is of current interest. IEEE set up a group in 2017, led by Wo Chang of NIST, to address the standardisation requirements, including governance and FAIR metadata. Rebecca Koskela of DataONE, University of New Mexico, and the author are in the group, partly to represent RDA.