Methodology for DMPs analysis

Silvio Peroni, Giulia Caldoni, Bianca Gualandi, Mario Marino, Sara Coppini, Francesca Masini

Published: 2023-07-14 DOI: 10.17504/protocols.io.n2bvj87jpgk5/v1

Abstract

All eyes on data: Unleashing the untapped potential of research at the University of Bologna

Led by data stewards at the University of Bologna, this preliminary protocol was developed as part of an analysis of the research data generated and managed within the institution, focusing on differences and commonalities between disciplines and on potential challenges for institutional data support services and infrastructures. We primarily map the type (e.g., image), content (e.g., scan of a manuscript) and format (e.g., .tiff) of the managed data, thus sustaining the value of FAIR data as granular resources.

The analysis is based on data management plans (DMPs) produced by grantees of Horizon Europe and Horizon 2020 funding who are affiliated with the University of Bologna and are either project coordinators or partners in charge of the DMP. We include in the study only the DMPs shared with us between May 2022 (when the team was created) and October 2023.

In short, we have selected 23 variables of interest to serve as headers of a table that is progressively filled with information garnered through a close reading of the DMPs. A computational analysis (R version 4.2.3) of the collected data will produce graphs showing the composition and relationships (stacked bar graphs, pie charts and alluvial/Sankey charts) and the incidences (waterfall graphs) of the different variables. The data and the software used will be published openly.

Before start

Before reusing this methodology, choose:

  1. What you consider as "data", i.e. the object of analysis of this research as described in the data management plans on which it is based. We have chosen to consider as "data" all digital research outputs (thus excluding physical and intangible research outputs) distinct from publications. This choice stems from the source materials on which the research is based: the DMPs of EU competitive projects.
  2. Which taxonomies to use to define the possible values of the fields/variables of the analysis. We tried to reuse existing, generalist taxonomies whenever possible, but for three fields (creator unit, associated project unit, subject area) we chose taxonomies defined for UniBO (the list of departments and disciplinary areas of research).
  3. A computational analysis tool. We chose R, version 4.2.3 (a minimal setup sketch follows this list).
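
For reference, a minimal R setup sketch. The protocol specifies only R 4.2.3; the tidyverse and ggalluvial packages named below are our assumption for the data wrangling and the chart types described in the analysis steps:

# Minimal setup sketch: the protocol specifies only R 4.2.3; the packages below
# (tidyverse, ggalluvial) are assumed here, not prescribed by the protocol.
stopifnot(getRversion() >= "4.2.3")                      # at least the chosen R version

pkgs <- c("tidyverse", "ggalluvial")                     # assumed packages
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

library(tidyverse)
library(ggalluvial)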

For more information on the choices we made, please see section "guidelines".

Steps

Data collection

1.

Using the DMPs and grant agreements (GAs) of European projects as input, we structured the table in which to collect information about the data, using the following variables (fields) with their meanings or accepted values:

  • Project identifier ( project_id ): alphanumeric string to identify the project to which the described data belong
  • Dataset identifier ( dataset_id ): alphanumeric string to identify the dataset to which the described data belong
  • Entry identifier ( entry_id ): alphanumeric string to identify the data category (i.e., file) described in the current row
  • Creator's unit ( creator_unit ): research unit (department, centre, etc.) of the principal investigator who created or reused the dataset (TBD and NA are also accepted)
  • Project unit ( project_unit ): research unit (department, centre, etc.) of the principal investigator of the project
  • Project programme ( project_programme ): HE (Horizon Europe); H2020 (Horizon 2020)
  • Project type ( project_type ): individual; consortium
  • Subject area ( subject_area ): disciplinary or thematic area to which the project belongs
  • Month DMP is delivered ( month_dmp ): e.g., M6 (sixth month), M12 (twelfth month), etc.
  • Public DMP ( public_dmp ): 1 (True), 0 (False)
  • Data type ( data_type ): typology of the data on a formal level, e.g. image
  • Data content ( data_content ): categorization of the data at the content level rather than at the formal level, e.g., scanned image of a medieval manuscript; values are free-text descriptions
  • Format ( format ): the data format, given as the file extension (if a data entry has more than one format, all extensions can be entered, separated by commas, without the dot before the extension name)
  • New data ( new_data ): 1 (True), 0 (False)
  • Contains personal data ( personal_data ): 1 (True), 0 (False)
  • Personal data management strategy ( p_d_strategy ): anonymization, pseudo-anonymization, no strategy
  • Level of access ( access ): open (CC BY or equivalent), controlled (CC BY-SA, CC BY-NC or equivalent), embargoed, unfiled
  • Reason for inaccessibility ( reason_inaccess ): excessive size (i.e., a technical motivation), ethical issues, privacy, IPR (Intellectual Property Rights) issues
  • Size ( size ): orders of magnitude for digital data (Bytes, KB, MB, GB, TB, PB, EB, ZB, YB)
  • Deposited ( deposited ): 1 (True), 0 (False)
  • Chosen repository ( chosen_repo ): alphanumeric string for the name of repository chosen by researchers to deposit data
  • PID ( pid ): alphanumeric string - PID of the deposited entry
  • Associated publication ( associated_pub ): alphanumeric string - PID of the publication associated
  • Notes ( notes ): general notes concerning other unclassified issues
2.

The variables identified form the header of a table, which is filled in with the information from the DMPs and then formalised as a CSV file.

Here is a sample (header plus 12 rows) of a fictional table developed to test the first version of the code, shown in CSV form; the data it contains is fictitious:

project_id,dataset_id,entry_id,creator_unit,project_unit,project_programme,project_type,subject_area,month_DMP,public_DMP,data_type,data_content,format,new_data,personal_data,p_d_strategy,access,reason_inaccess,size,deposited,chosen_repo,pid,associated_pub,notes
100,110,111,DISTAL,DISTAL,H2020,consortium,Social Sciences,M12,1,tabular,revised transcriptions of interviews,csv,1,0,nd,open,nd,KB,0,nd,nd,nd,
100,110,112,DISTAL,DISTAL,H2020,consortium,Social Sciences,M12,1,text,raw transcriptions of interviews,txt,1,0,nd,open,nd,KB,0,nd,nd,nd,
100,120,121,STAT,DISTAL,H2020,consortium,Social Sciences,M12,1,tabular,database of cultivated fields,myd,1,0,nd,embargoed,IPR,GB,0,nd,nd,nd,
100,130,131,BIGeA,DISTAL,H2020,consortium,Social Sciences,M12,1,tabular,plant characteristics 2018-2019,tsv,1,0,nd,open,nd,MB,1,Zenodo,12345678,doi/mlkn123,
100,130,132,DISTAL,DISTAL,H2020,consortium,Social Sciences,M12,1,tabular,plant characteristics 2019-2020,tsv,1,0,nd,open,nd,MB,1,Zenodo,12345679,doi/mlkn123,
200,210,211,nd,FICLIT,HE,individual,Humanities,M6,0,image,facsimiles of primary sources,pdf,0,0,nd,controlled,IPR,GB,0,nd,nd,nd,
200,220,221,FICLIT,FICLIT,HE,individual,Humanities,M6,0,interactive resource,contents of interactive online map of authorial clusters (digital infrastructure),"csv, pdf",1,0,nd,open,nd,KB,1,AMSActa,amsacta123,nd,
200,220,222,FICLIT,FICLIT,HE,individual,Humanities,M6,0,interactive resource,code for interactive online map of authorial clusters (digital infrastructure),"html, css, js",1,0,nd,open,nd,MB,1,AMSActa,amsacta124,nd,
200,230,231,FILCOM,FICLIT,HE,individual,Humanities,M6,0,text,textual corpora of selected sources,xml,1,0,nd,open,nd,MB,0,ILC-CNR for CLARIN-IT,nd,nd,
300,310,311,DIFA,DIFA,HE,individual,Science,M6,1,text,physics and mathematical points for equations,"json, dat",1,0,nd,open,nd,KB,0,nd,nd,nd,
400,410,411,tbd,DIMEC,H2020,consortium,Medicine,M6,0,tabular,genetic data on model organisms,csv,1,0,nd,open,nd,MB,0,"Open neuro, Neuromorpho",nd,pubmed3456,
400,420,421,tbd,DIMEC,H2020,consortium,Medicine,M6,0,text,census of different neural network architectures,json,1,0,nd,open,nd,MB,0,nd,nd,nd,
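
As an illustration, a minimal R sketch (not the published analysis code) that reads such a CSV file and checks that the expected fields are present; the file name dmp_data.csv is a placeholder:

# Read the collected table and verify the expected fields
# (field names follow the sample table header above; the file name is a placeholder).
library(tidyverse)

expected_fields <- c(
  "project_id", "dataset_id", "entry_id", "creator_unit", "project_unit",
  "project_programme", "project_type", "subject_area", "month_DMP", "public_DMP",
  "data_type", "data_content", "format", "new_data", "personal_data",
  "p_d_strategy", "access", "reason_inaccess", "size", "deposited",
  "chosen_repo", "pid", "associated_pub", "notes"
)

dmp <- read_csv("dmp_data.csv", col_types = cols(.default = col_character()))

missing_fields <- setdiff(expected_fields, names(dmp))
if (length(missing_fields) > 0) stop("Missing fields: ", paste(missing_fields, collapse = ", "))

# Quick overview: number of data entries per project and data type
dmp |> count(project_id, data_type, sort = TRUE)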

Data analysis

3.

Taking as input the tabular data structured during the data collection phase, the data analysis consists of descriptive statistics addressing various research questions.

As for the first research question - what types of data are produced and managed by UniBO? - the related analyses are:

1) How often do we find different types of data in the same dataset? Do researchers organise datasets with several files of different formats with similar content or do they prefer the same data type within a single dataset?

2) How much do data types vary within a single project across all datasets produced and reused? How do they vary with respect to the subject area, the project's framework programme or the type of project (monobeneficiary, collaborative)?

3) Are the formats precisely defined at the month 6 DMP? Are they standard and open formats?

4) How many projects include re-used data in the DMP? What is the ratio of new data to re-used data?

To answer these questions, the computational analysis will produce graphs showing composition and relationships: stacked bar graphs, pie charts and alluvial/Sankey charts. In addition, incidences are calculated and represented with a waterfall graph.
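
For illustration, a minimal R sketch of two such graphs (our own example using ggplot2 and ggalluvial, not the authors' published code; dmp is the data frame read from the collected CSV, as in the earlier sketch):

library(tidyverse)
library(ggalluvial)

# Stacked bar chart: composition of data types within each subject area
ggplot(dmp, aes(x = subject_area, fill = data_type)) +
  geom_bar(position = "stack") +
  labs(x = "Subject area", y = "Number of data entries", fill = "Data type")

# Alluvial chart: relationship between framework programme, project type and data type
dmp |>
  count(project_programme, project_type, data_type) |>
  ggplot(aes(axis1 = project_programme, axis2 = project_type, axis3 = data_type, y = n)) +
  geom_alluvium(aes(fill = data_type)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)), size = 3) +
  scale_x_discrete(limits = c("Programme", "Project type", "Data type")) +
  labs(y = "Number of data entries")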

4.

As for the second research question - identifying trends in problems and patterns in order to improve the Data Stewardship service - the related analyses are:

1) How many projects involve the treatment of personal data? How many projects choose to anonymise data and publish them? Which personal data management strategies are preferred?

2) How many datasets are kept closed, and what are the main reasons?

3) Is data size a recurrent issue in choosing a data repository? Might it require infrastructural adjustments?

4) How many researchers make their DMP public?

5) Which kinds of repositories are chosen most often?

To answer these questions, the computational analysis will produce graphs showing composition and relationships: stacked bar graphs, pie charts and alluvial/Sankey charts.
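
Again for illustration, a sketch under the same assumptions (using the dmp data frame introduced above):

library(tidyverse)

# 1) Entries involving personal data and the declared management strategies
dmp |>
  filter(personal_data == "1") |>
  distinct(project_id, p_d_strategy) |>
  count(p_d_strategy, sort = TRUE)

# 2) Reasons for keeping datasets closed, shown as a pie chart
dmp |>
  filter(access != "open", reason_inaccess != "nd") |>
  count(reason_inaccess) |>
  ggplot(aes(x = "", y = n, fill = reason_inaccess)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  labs(x = NULL, y = NULL, fill = "Reason for inaccessibility")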

5.

As for the third research question - is there interdisciplinarity in data production at UniBO? - we consider only the data produced by UniBO, hence the rows of the table where the value of "new_data" is 1. The related analyses are:

1) How often does the project department coincide with the dataset creator's department?

2) Is there interdisciplinarity between the types of data produced by the various departments, or is the type of data produced strictly related to the subject area?

3) Does the diversity between data types and departments of the creators or the project vary according to the project's framework programme?

4) Is there more interdisciplinarity in single-beneficiary projects or in collaborative projects? That is, how does the relationship between the variables data_type, project_unit and creator_unit change depending on the type of project (collaborative or single-beneficiary)?

To answer these questions, the computational analysis will produce graphs showing composition and relationships: stacked bar graphs, pie charts and alluvial/Sankey charts.
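
For illustration, a sketch under the same assumptions, restricted to newly produced data as described above:

library(tidyverse)
library(ggalluvial)

# Keep only newly produced data, as specified in this step
produced <- dmp |> filter(new_data == "1")

# 1) Share of entries where the creator's unit coincides with the project unit
produced |>
  filter(!creator_unit %in% c("nd", "tbd")) |>
  summarise(share_same_unit = mean(creator_unit == project_unit))

# 2) Alluvial view of project unit, creator's unit and data type
produced |>
  count(project_unit, creator_unit, data_type) |>
  ggplot(aes(axis1 = project_unit, axis2 = creator_unit, axis3 = data_type, y = n)) +
  geom_alluvium(aes(fill = data_type)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)), size = 3) +
  scale_x_discrete(limits = c("Project unit", "Creator unit", "Data type"))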

Data publication

6.

The results will be organized in a CSV file, and the graphs derived from the analyses will be saved as image files in non-proprietary open formats. Everything will then be deposited in an appropriate data repository, accompanied by accurate documentation, e.g., a README file specifying the meaning of fields and values.
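
A minimal export sketch (file and directory names are placeholders; PNG is used here as one possible non-proprietary image format):

library(tidyverse)

dir.create("results", showWarnings = FALSE)

# Export a summary table as CSV
summary_table <- dmp |> count(subject_area, data_type)
write_csv(summary_table, "results/data_type_by_subject_area.csv")

# Save a graph as a PNG file (a non-proprietary image format)
p <- ggplot(dmp, aes(x = subject_area, fill = data_type)) +
  geom_bar(position = "stack")
ggsave("results/data_type_by_subject_area.png", plot = p, width = 8, height = 5, dpi = 300)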

We also expect to be able to publish an article on this subject in a suitable journal.
