Methodology for DMPs analysis

Silvio Peroni, Giulia Caldoni, Bianca Gualandi, Mario Marino, Sara Coppini, Francesca Masini

Published: 2023-07-14 DOI: 10.17504/protocols.io.n2bvj87jpgk5/v1

Abstract

All eyes on data: Unleashing the untapped potential of research at the University of Bologna

Led by data stewards at the University of Bologna, this preliminary protocol was developed as part of an analysis of the research data generated and managed within the institution, focusing on differences and commonalities between disciplines and on potential challenges for institutional data support services and infrastructures. We primarily map the type (e.g., image), content (e.g., scan of a manuscript) and format (e.g., .tiff) of the managed data, thus sustaining the value of FAIR data as granular resources.

The analysis is based on data management plans (DMPs) produced by grantees of Horizon Europe and Horizon 2020 funding who are affiliated with the University of Bologna and are either project coordinators or partners in charge of the DMP. We include in the study only the DMPs shared with us between May 2022 (when the team was created) and October 2023.

In short, we have selected 23 variables of interest to serve as headers of a table that is progressively filled with information garnered through a close reading of the DMPs. A computational analysis (R version 4.2.3) of the collected data will produce graphs showing the composition and relationships (stacked bar graphs, pie charts and alluvial/Sankey charts) and the incidences (waterfall graphs) of the different variables. The data and the software used will be published openly.

Before start

Before reusing this methodology, choose:

  1. What you consider as "data", i.e. the object of analysis of this research as described in the data management plans on which it is based. We have chosen to consider as "data" all digital research outputs (thus excluding physical and intangible research outputs) distinct from publications. This choice stems from the source materials on which the research is based: the DMPs of EU competitive projects.
  2. Which taxonomies to use to define the possible values of the fields/variables of the analysis. We tried to reuse existing, generalist taxonomies whenever possible, but for three fields (creator unit, associated project unit, subject area) we chose taxonomies defined for UniBO (the list of departments and disciplinary areas of research).
  3. A computational analysis tool. We chose R, version 4.2.3 (a minimal setup sketch follows this list).
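
For reference, a minimal R setup sketch. The protocol specifies only R 4.2.3; the tidyverse and ggalluvial packages named below are our assumption for the data wrangling and the chart types described in the analysis steps:

# Minimal setup sketch: the protocol specifies only R 4.2.3; the packages below
# (tidyverse, ggalluvial) are assumed here, not prescribed by the protocol.
stopifnot(getRversion() >= "4.2.3")                      # at least the chosen R version

pkgs <- c("tidyverse", "ggalluvial")                     # assumed packages
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

library(tidyverse)
library(ggalluvial)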

For more information on the choices we made, please see section "guidelines".

Steps

Data collection

1.

Using the DMPs and grant agreements (GAs) of European projects as input, we structured the table in which to collect information about the data, using the following variables (fields) with their meanings or accepted values:

  • Project identifier ( project_id ): alphanumeric string to identify the project to which the described data belong
  • Dataset identifier ( dataset_id ): alphanumeric string to identify the dataset to which the described data belong
  • Entry identifier ( entry_id ): alphanumeric string to identify the data category (i.e., file) described in the current row
  • Creator's unit ( creator_unit ): research unit (department, centre, etc.) of the principal investigator who created or reused the dataset (TBD and NA are also accepted)
  • Project unit ( project_unit ): research unit (department, centre, etc.) of the principal investigator of the project
  • Project programme ( project_programme ): HE (Horizon Europe); H2020 (Horizon 2020)
  • Project type ( project_type ): individual; consortium
  • Subject area ( subject_area ): disciplinary or thematic area to which the project belongs
  • Month DMP is delivered ( month_dmp ): e.g., M6 (sixth month), M12 (twelfth month), etc.
  • Public DMP ( public_dmp ): 1 (True), 0 (False)
  • Data type ( data_type ): typology of the data on a formal level, e.g. image
  • Data content ( data_content ): categorization of the data at the content level rather than at the formal level, e.g., scanned image of a medieval manuscript; values are free-text descriptions
  • Format ( format ): the data format, given as the file extension (if a data entry has more than one format, all extensions can be entered, separated by commas, without the dot before the extension name)
  • New data ( new_data ): 1 (True), 0 (False)
  • Contains personal data ( personal_data ): 1 (True), 0 (False)
  • Personal data management strategy ( p_d_strategy ): anonymization, pseudo-anonymization, no strategy
  • Level of access ( access ): open (CC BY or equivalent), controlled (CC BY-SA, CC BY-NC or equivalent), embargoed, unfiled
  • Reason for inaccessibility ( reason_inaccess ): excessive size (i.e., a technical motivation), ethical issues, privacy, IPR (Intellectual Property Rights) issues
  • Size ( size ): orders of magnitude for digital data (Bytes, KB, MB, GB, TB, PB, EB, ZB, YB)
  • Deposited ( deposited ): 1 (True), 0 (False)
  • Chosen repository ( chosen_repo ): alphanumeric string for the name of repository chosen by researchers to deposit data
  • PID ( pid ): alphanumeric string - PID of the deposited entry
  • Associated publication ( associated_pub ): alphanumeric string - PID of the publication associated
  • Notes ( notes ): general notes concerning other unclassified issues
2.

The variables identified form the header of a table, which is filled in with the information from the DMPs and then formalised as a CSV file.

Here is a sample (header plus 12 rows) of a fictional table developed to test the first version of the code, shown in CSV form; the data it contains is fictitious:

project_id,dataset_id,entry_id,creator_unit,project_unit,project_programme,project_type,subject_area,month_DMP,public_DMP,data_type,data_content,format,new_data,personal_data,p_d_strategy,access,reason_inaccess,size,deposited,chosen_repo,pid,associated_pub,notes
100,110,111,DISTAL,DISTAL,H2020,consortium,Social Sciences,M12,1,tabular,revised transcriptions of interviews,csv,1,0,nd,open,nd,KB,0,nd,nd,nd,
100,110,112,DISTAL,DISTAL,H2020,consortium,Social Sciences,M12,1,text,raw transcriptions of interviews,txt,1,0,nd,open,nd,KB,0,nd,nd,nd,
100,120,121,STAT,DISTAL,H2020,consortium,Social Sciences,M12,1,tabular,database of cultivated fields,myd,1,0,nd,embargoed,IPR,GB,0,nd,nd,nd,
100,130,131,BIGeA,DISTAL,H2020,consortium,Social Sciences,M12,1,tabular,plant characteristics 2018-2019,tsv,1,0,nd,open,nd,MB,1,Zenodo,12345678,doi/mlkn123,
100,130,132,DISTAL,DISTAL,H2020,consortium,Social Sciences,M12,1,tabular,plant characteristics 2019-2020,tsv,1,0,nd,open,nd,MB,1,Zenodo,12345679,doi/mlkn123,
200,210,211,nd,FICLIT,HE,individual,Humanities,M6,0,image,facsimiles of primary sources,pdf,0,0,nd,controlled,IPR,GB,0,nd,nd,nd,
200,220,221,FICLIT,FICLIT,HE,individual,Humanities,M6,0,interactive resource,contents of interactive online map of authorial clusters (digital infrastructure),"csv, pdf",1,0,nd,open,nd,KB,1,AMSActa,amsacta123,nd,
200,220,222,FICLIT,FICLIT,HE,individual,Humanities,M6,0,interactive resource,code for interactive online map of authorial clusters (digital infrastructure),"html, css, js",1,0,nd,open,nd,MB,1,AMSActa,amsacta124,nd,
200,230,231,FILCOM,FICLIT,HE,individual,Humanities,M6,0,text,textual corpora of selected sources,xml,1,0,nd,open,nd,MB,0,ILC-CNR for CLARIN-IT,nd,nd,
300,310,311,DIFA,DIFA,HE,individual,Science,M6,1,text,physics and mathematical points for equations,"json, dat",1,0,nd,open,nd,KB,0,nd,nd,nd,
400,410,411,tbd,DIMEC,H2020,consortium,Medicine,M6,0,tabular,genetic data on model organisms,csv,1,0,nd,open,nd,MB,0,"Open neuro, Neuromorpho",nd,pubmed3456,
400,420,421,tbd,DIMEC,H2020,consortium,Medicine,M6,0,text,census of different neural network architectures,json,1,0,nd,open,nd,MB,0,nd,nd,nd,
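
As an illustration, a minimal R sketch (not the published analysis code) that reads such a CSV file and checks that the expected fields are present; the file name dmp_data.csv is a placeholder:

# Read the collected table and verify the expected fields
# (field names follow the sample table header above; the file name is a placeholder).
library(tidyverse)

expected_fields <- c(
  "project_id", "dataset_id", "entry_id", "creator_unit", "project_unit",
  "project_programme", "project_type", "subject_area", "month_DMP", "public_DMP",
  "data_type", "data_content", "format", "new_data", "personal_data",
  "p_d_strategy", "access", "reason_inaccess", "size", "deposited",
  "chosen_repo", "pid", "associated_pub", "notes"
)

dmp <- read_csv("dmp_data.csv", col_types = cols(.default = col_character()))

missing_fields <- setdiff(expected_fields, names(dmp))
if (length(missing_fields) > 0) stop("Missing fields: ", paste(missing_fields, collapse = ", "))

# Quick overview: number of data entries per project and data type
dmp |> count(project_id, data_type, sort = TRUE)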

Data analysis

3.

Taking as input the tabular data structured during the data collection phase, the data analysis consists of descriptive statistics addressing various research questions.

As for the first research question - what types of data are produced and managed by UniBO? - the related analyses are:

1) How often do we find different types of data in the same dataset? Do researchers organise datasets with several files of different formats with similar content or do they prefer the same data type within a single dataset?

2) How much do data types vary within a single project across all datasets produced and reused? How do they vary with respect to the subject area, the project's framework programme or the type of project (monobeneficiary, collaborative)?

3) Are the formats precisely defined at the month 6 DMP? Are they standard and open formats?

4) How many projects include re-used data in the DMP? What is the ratio of new data to re-used data?

To answer these questions, the computational analysis will produce graphs showing composition and relationships: stacked bar graphs, pie charts and alluvial/Sankey charts. In addition, incidences are calculated and represented with a waterfall graph.
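
For illustration, a minimal R sketch of two such graphs (our own example using ggplot2 and ggalluvial, not the authors' published code; dmp is the data frame read from the collected CSV, as in the earlier sketch):

library(tidyverse)
library(ggalluvial)

# Stacked bar chart: composition of data types within each subject area
ggplot(dmp, aes(x = subject_area, fill = data_type)) +
  geom_bar(position = "stack") +
  labs(x = "Subject area", y = "Number of data entries", fill = "Data type")

# Alluvial chart: relationship between framework programme, project type and data type
dmp |>
  count(project_programme, project_type, data_type) |>
  ggplot(aes(axis1 = project_programme, axis2 = project_type, axis3 = data_type, y = n)) +
  geom_alluvium(aes(fill = data_type)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)), size = 3) +
  scale_x_discrete(limits = c("Programme", "Project type", "Data type")) +
  labs(y = "Number of data entries")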

4.

As for the second research question - identifying trends in problems and patterns in order to improve the Data Stewardship service - the related analyses are:

1) How many projects involve the treatment of personal data? How many projects choose to anonymise data and publish them? Which personal data management strategies are preferred?

2) How many datasets are kept closed, and what are the main reasons?

3) Is data size a recurrent issue in choosing a data repository? Might it require infrastructural adjustments?

4) How many researchers make their DMP public?

5) Which kinds of repositories are chosen most often?

To answer these questions, the computational analysis will produce graphs showing composition and relationships: stacked bar graphs, pie charts and alluvial/Sankey charts.
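
Again for illustration, a sketch under the same assumptions (using the dmp data frame introduced above):

library(tidyverse)

# 1) Entries involving personal data and the declared management strategies
dmp |>
  filter(personal_data == "1") |>
  distinct(project_id, p_d_strategy) |>
  count(p_d_strategy, sort = TRUE)

# 2) Reasons for keeping datasets closed, shown as a pie chart
dmp |>
  filter(access != "open", reason_inaccess != "nd") |>
  count(reason_inaccess) |>
  ggplot(aes(x = "", y = n, fill = reason_inaccess)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  labs(x = NULL, y = NULL, fill = "Reason for inaccessibility")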

5.

As for the third research question - is there interdisciplinarity in data production at UniBO? - we consider only the data produced by UniBO, hence the rows of the table where the value of "new_data" is 1. The related analyses are:

1) How often does the project department coincide with the dataset creator's department?

2) Is there interdisciplinarity between the types of data produced by the various departments, or is the type of data produced strictly related to the subject area?

3) Does the diversity between data types and departments of the creators or the project vary according to the project's framework programme?

4) Is there more interdisciplinarity in single-beneficiary projects or in collaborative projects? That is, how does the relationship between the variables data_type, project_unit and creator_unit change depending on the type of project (collaborative or single-beneficiary)?

To answer these questions, the computational analysis will produce graphs showing composition and relationships: stacked bar graphs, pie charts and alluvial/Sankey charts.
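
For illustration, a sketch under the same assumptions, restricted to newly produced data as described above:

library(tidyverse)
library(ggalluvial)

# Keep only newly produced data, as specified in this step
produced <- dmp |> filter(new_data == "1")

# 1) Share of entries where the creator's unit coincides with the project unit
produced |>
  filter(!creator_unit %in% c("nd", "tbd")) |>
  summarise(share_same_unit = mean(creator_unit == project_unit))

# 2) Alluvial view of project unit, creator's unit and data type
produced |>
  count(project_unit, creator_unit, data_type) |>
  ggplot(aes(axis1 = project_unit, axis2 = creator_unit, axis3 = data_type, y = n)) +
  geom_alluvium(aes(fill = data_type)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)), size = 3) +
  scale_x_discrete(limits = c("Project unit", "Creator unit", "Data type"))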

Data publication

6.

The results will be organized in a CSV file, and the graphs derived from the analyses will be saved as image files in non-proprietary open formats. Everything will then be deposited in an appropriate data repository, accompanied by accurate documentation, e.g., a README file specifying the meaning of fields and values.
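
A minimal export sketch (file and directory names are placeholders; PNG is used here as one possible non-proprietary image format):

library(tidyverse)

dir.create("results", showWarnings = FALSE)

# Export a summary table as CSV
summary_table <- dmp |> count(subject_area, data_type)
write_csv(summary_table, "results/data_type_by_subject_area.csv")

# Save a graph as a PNG file (a non-proprietary image format)
p <- ggplot(dmp, aes(x = subject_area, fill = data_type)) +
  geom_bar(position = "stack")
ggsave("results/data_type_by_subject_area.png", plot = p, width = 8, height = 5, dpi = 300)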

We also expect to be able to publish an article on this subject in a suitable journal.
