Introduction

This material is a user's guide to the language data ontology being developed for the Language Data Commons of Australia (LDaCA) project, which includes the Australian Text Analytics Platform (ATAP).

Metadata is often defined as ‘data about data’. High-quality metadata is important in making data FAIR:

  • Findable: Metadata is the starting point for searching data collections. For example, if we want to find data in a particular language, this will only be possible for data that has a language recorded in its metadata.

  • Accessible: Access conditions that apply to data should be part of the associated metadata.

  • Interoperable: Information about the format of data and whether it requires specific software to be usable should be part of the associated metadata.

  • Reusable: All of the aspects of metadata mentioned above contribute to making data reusable. The more we know about some data, the easier it is to know whether it will be useful to us or not.

RO-Crate Profiles

RO-Crates in general have basic metadata requirements, but it is possible to specify a profile for crates for specific purposes. LDaCA is developing such a profile for our data; we are basing this largely on previous work in the area. An important aspect of the RO-Crate approach is that it uses the principles of Linked Open Data. This means that terms used in our metadata will (whenever possible) link to an openly available definition. In developing the profile, we are drawing on existing attempts to provide vocabularies for describing data, particularly language data.

schema.org

Our general approach is informed by the kinds of entities recognised in the ontology documented at schema.org, whose data model is derived from RDF Schema. In particular, we have adopted high-level entities from the schema.org vocabulary, for example CreativeWork and Person.

Open Language Archives Community (OLAC)

OLAC is an international partnership of institutions and individuals. One of its activities is developing consensus on best current practice for the digital archiving of language resources, which includes making recommendations for metadata. The OLAC metadata scheme is based on Dublin Core (DC), a widely used general metadata schema. OLAC has suggested refinements and extensions of the DC base which make it more useful for describing language resources.

Entities in the ontology

Classes (rdfs:Class) are used to classify resources. A resource is declared to be an instance of a class using the predicate rdf:type. For example, we have defined CollectionProtocol as a class, and Man and Tree & Space Games is an instance of this class.

Properties (rdf:Property) are used to add attributes to instances of classes. Just as a resource is declared to be an instance of a class, each property is itself an instance of rdf:Property, and we use it to make statements about a resource. In the example above, we can add the property collectionProtocolType to Man and Tree & Space Games and give it the value ElicitationTask.

ElicitationTask is a DefinedTerm. A DefinedTerm is a 'word, name, acronym, phrase, etc. with a formal definition' and they are 'often used in the context of category or subject classification.' DefinedTerms allow us a) to have accurate definitions of the values we want to give to properties and b) to group such definitions in DefinedTermSets, which can function as controlled vocabularies. In our example, there is a DefinedTermSet CollectionProtocolTypeTerms which includes the DefinedTerm ElicitationTask.
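The relationships in this example can be sketched as a small set of JSON-LD style entities, built here in Python. This is an illustration only, not the LDaCA profile itself: the "#..." identifiers and the resolve helper are hypothetical, and the real vocabulary may use full IRIs or prefixed names. The properties hasDefinedTerm and inDefinedTermSet are taken from schema.org.

```python
import json

# Hypothetical local identifiers ("#...") are used for illustration only.

# The controlled vocabulary that groups the allowed protocol types.
term_set = {
    "@id": "#CollectionProtocolTypeTerms",
    "@type": "DefinedTermSet",
    "name": "CollectionProtocolTypeTerms",
    "hasDefinedTerm": [{"@id": "#ElicitationTask"}],
}

# One formally defined value within that vocabulary.
term = {
    "@id": "#ElicitationTask",
    "@type": "DefinedTerm",
    "name": "ElicitationTask",
    "inDefinedTermSet": {"@id": "#CollectionProtocolTypeTerms"},
}

# An instance of the class CollectionProtocol ("@type" plays the role of rdf:type).
protocol = {
    "@id": "#ManAndTree",
    "@type": "CollectionProtocol",
    "name": "Man and Tree & Space Games",
    "collectionProtocolType": {"@id": "#ElicitationTask"},
}

graph = [term_set, term, protocol]


def resolve(graph, ref):
    """Return the entity in graph whose @id matches the given reference."""
    return next(entity for entity in graph if entity["@id"] == ref["@id"])


# Follow the property from the protocol to its value, then to the term set.
value = resolve(graph, protocol["collectionProtocolType"])
vocabulary = resolve(graph, value["inDefinedTermSet"])
print(json.dumps(graph, indent=2))
```

Because each property value is a reference to another entity rather than a free-text string, the value of collectionProtocolType can always be traced back to a formal definition and to the controlled vocabulary that contains it.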

Here is another example of the relationships between these entities:

Contributors to the current work include: Peter Sefton, Simon Musgrave, Nick Thieberger, Marco La Rosa, River Tae Smith, Maria Weaver, Rosanna Smith
