When looking for a suitable repository, the discipline(s) a repository caters to is one of the most important pieces of information. The re3data Metadata Schema reflects this priority by requiring the element subject for all indexed repositories. However, describing the disciplinary focus of research data repositories is not an easy task. In this blog post, we therefore analyze the status quo of subject classification in re3data, and outline some options for going forward.
Currently, re3data uses an older version of the DFG Subject Areas. This classification comprises four levels. At the top level, there are four disciplines: Humanities and Social Sciences, Life Sciences, Natural Sciences and Engineering Sciences. The granularity increases with each level. Across all levels, there are 275 categories (see Figure 1).
Figure 1: Structure of the re3data subject classification
re3data has used the DFG Subject Areas for subject information for nine years. Subject information in re3data serves two main purposes: describing repositories in adequate depth, enabling re3data users to find a repository that suits their specific needs; and creating separation between repositories, enabling re3data users to distinguish repositories based on their disciplinary focus.
When describing a repository, re3data editors generally select the most specific categories that are applicable. Generalist or multidisciplinary repositories, for example Zenodo, are assigned abstract categories from the top level to indicate their broad disciplinary focus (see Figure 2A). In contrast, specialized disciplinary repositories such as the National Center for Atmospheric Research are described in more depth (see Figure 2B).
Figure 2: Subjects assigned to Zenodo (A) and the National Center for Atmospheric Research (B)
This approach ensures a solid description of the disciplinary affiliation of repositories in re3data: approximately 86% of all indexed repositories are assigned a notation at the third level of the DFG Subject Areas (see Figure 3). If necessary, re3data users can gradually narrow down the results of a search query by selecting increasingly specific notations. Thus, in terms of depth, the DFG Subject Areas is well suited for re3data.
Figure 3: Depth of notations by repository; each level is counted once
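The depth analysis can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the figure. It assumes that the level of a notation can be read from its length (1, 2, 3 or 5 digits for levels one to four), and the repository names and notation lists in `sample` are invented for the example.

```python
# Assumption: the length of a notation encodes its level in the hierarchy
# (1 digit = level 1, 2 digits = level 2, 3 digits = level 3, 5 digits = level 4).
LEVEL_BY_LENGTH = {1: 1, 2: 2, 3: 3, 5: 4}


def notation_level(notation: str) -> int:
    """Return the hierarchy level of a single notation."""
    try:
        return LEVEL_BY_LENGTH[len(notation)]
    except KeyError:
        raise ValueError(f"unexpected notation format: {notation!r}") from None


def levels_covered(notations: list[str]) -> set[int]:
    """Levels at which a repository has a notation; each level counts once."""
    return {notation_level(n) for n in notations}


def share_with_level(repositories: dict[str, list[str]], level: int) -> float:
    """Fraction of repositories with at least one notation at the given level."""
    hits = sum(1 for ns in repositories.values() if level in levels_covered(ns))
    return hits / len(repositories)


# Hypothetical records: repository name -> assigned notations.
sample = {
    "generalist-repo": ["1", "2", "3", "4"],
    "specialized-repo": ["3", "34", "313", "31301"],
    "literature-repo": ["1", "11", "105"],
}

third_level_share = share_with_level(sample, 3)
print(f"{third_level_share:.0%} of sample repositories reach level 3")
```

Run against the full set of re3data records instead of `sample`, the same function would yield the kind of share reported above.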
Using the DFG Subject Areas for describing research data repositories also creates sufficient separation. About two-thirds of all repositories indexed in re3data are assigned notations from just one discipline (the four broadest categories of the classification; see Figure 4). In most cases, users therefore get a good idea of which discipline a repository belongs to. For example, the National Center for Atmospheric Research is only assigned notations from Natural Sciences (see Figure 2B). For generalist or multidisciplinary repositories, which by design span several disciplines, this measure is less meaningful.
Figure 4: Number of disciplines from which a repository is assigned notations; each discipline is counted once
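The separation measure can be sketched in a similar way. The example below assumes that the first digit of a notation identifies its top-level discipline, numbered in the order in which the disciplines are listed in this post; the records in `sample` are invented for illustration.

```python
from collections import Counter

# Assumption: the first digit of a notation identifies the discipline.
DISCIPLINES = {
    "1": "Humanities and Social Sciences",
    "2": "Life Sciences",
    "3": "Natural Sciences",
    "4": "Engineering Sciences",
}


def disciplines_of(notations: list[str]) -> set[str]:
    """Disciplines a repository is assigned to; each discipline counts once."""
    return {DISCIPLINES[n[0]] for n in notations}


def discipline_histogram(repositories: dict[str, list[str]]) -> Counter:
    """Map 'number of disciplines' -> 'number of repositories'."""
    return Counter(len(disciplines_of(ns)) for ns in repositories.values())


# Hypothetical records: repository name -> assigned notations.
sample = {
    "generalist-repo": ["1", "2", "3", "4"],
    "specialized-repo": ["3", "34", "313", "31301"],
    "literature-repo": ["1", "11", "105"],
}

histogram = discipline_histogram(sample)
single_discipline_share = histogram[1] / len(sample)
print(dict(histogram), f"{single_discipline_share:.0%} single-discipline")
```

Counting disciplines as a set rather than a list is what makes each discipline count once, however many notations a repository has within it.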
In terms of depth and separation, the DFG Subject Areas works well for re3data. However, this subject classification was not developed with the description of repositories in mind; its main purpose is the organization of DFG funding. As a result, using the DFG Subject Areas in re3data may sometimes cause friction.
You might already have noticed one of the biggest causes of friction when looking at the structure of the classification in Figure 1: the lack of detail in the Natural Sciences. Most notations re3data editors assign are from the Life Sciences and Natural Sciences (see Figure 5B), likely because most research data repositories cater to these disciplines. However, there are too few categories in the DFG Subject Areas that cover the Natural Sciences, especially compared to other disciplines (see Figure 5A). This mismatch between the available categories and the demand for them is also reflected in unused notations: 19 notations from Engineering Sciences are unused in re3data, whereas every notation from Natural Sciences is in use.
Figure 5: Proportion of notations by discipline of the DFG Subject Areas (A) and in re3data (B)
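Unused notations can be found by comparing the classification with the notations actually assigned. The sketch below is an illustration under two assumptions: the first digit of a notation identifies its discipline, and the small `classification` and `assigned` collections are invented stand-ins for the full DFG Subject Areas and the re3data records.

```python
from collections import Counter


def unused_by_discipline(classification: set[str], assigned: list[str]) -> Counter:
    """Count notations that exist in the classification but are never assigned,
    grouped by the first digit (assumed to identify the discipline)."""
    unused = classification - set(assigned)
    return Counter(notation[0] for notation in unused)


# Hypothetical excerpt of a classification and of the notations in use.
classification = {"3", "31", "301", "4", "41", "401", "402"}
assigned = ["3", "31", "301", "301", "4", "41"]

unused = unused_by_discipline(classification, assigned)
print(dict(unused))
```

A `Counter` returns zero for disciplines without unused notations, so the result can be compared across all four disciplines without special cases.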
Another source of friction is that the original purpose of the DFG Subject Areas classification is to organize funding within the DFG, the largest German research funding organization. Accordingly, the focus of some categories is clearly placed on German and European contexts, for example in Literary Studies. Here, two out of four categories are dedicated to German literature, and only one category to European and American literature (see Figure 6). Other literatures are not explicitly included. This is a challenge for a global registry of research data repositories.
Figure 6: Detail view of 105 Literary Studies and subgroups
We highly value feedback from the repository community and continually improve re3data. Therefore, we are exploring several options for addressing these issues going forward.
Firstly, re3data is already addressing known problems with the help of the keyword property. The lack of detail in Natural Sciences is partially remedied by routinely assigning certain keywords, for example “biodiversity”, “oceanography”, or “astrophysics”. re3data users can find repositories by searching for these keywords or using the filter function in the graphical user interface.
Secondly, we are aware that supplementing re3data records with keywords helps in selected areas, but does not solve the underlying problems. Therefore, as part of the re3data COREF project, we are currently evaluating whether a different classification could be more appropriate for describing repositories in re3data, and whether a transition might be feasible with the given resources. This evaluation is ongoing, and we will share results with the repository community as we continue to develop our approach and vision for re3data.