Coverage of DOAJ journals' citations through OpenCitations - Protocol
Constance Dami, Alessandro Bertozzi, Chiara Manca, Umut Kucuk
Disclaimer
This protocol refers to research conducted for the Open Science course (2021/22) at the University of Bologna.
Abstract
This is the protocol for our research on the coverage of DOAJ journals' citations in OpenCitations.
Our goal is to find out:
- the coverage of articles from open access journals indexed in DOAJ, both as citing and as cited entities,
- how many citations DOAJ journals receive and make, and how many of these citations involve open access articles as both citing and cited entities,
- whether there are trends over time in the availability of citations involving articles published in open access journals indexed in DOAJ.
Our research focuses exclusively on DOAJ journals, using OpenCitations as a tool. Previous research has been carried out on open citations using COCI (Heibi, Peroni & Shotton 2019) and on DOAJ journals' citations (Saadat and Shabani 2012), paving the ground for the present analysis.
After careful consideration of the best way to retrieve data from DOAJ and OpenCitations, we opted for downloading the public data dumps. Using the APIs resulted in excessively long running times, and the same problem arose with the OpenCitations SPARQL endpoint.
Minimal Bibliography
Björk, B.-C.; Kanto-Karvonen, S.; Harviainen, J.T. "How Frequently Are Articles in Predatory Open Access Journals Cited." Publications, 8, 17 (2020). https://doi.org/10.3390/publications8020017
Heibi, I.; Peroni, S.; Shotton, D. "Crowdsourcing open citations with CROCI -- An analysis of the current status of open citations, and a proposal." arXiv:1902.02534 (2019). https://doi.org/10.48550/arXiv.1902.02534
Pandita, R.; Singh, S. "A Study of Distribution and Growth of Open Access Research Journals Across the World." Publishing Research Quarterly, 38(1) (2022): 131–149. https://doi.org/10.1007/s12109-022-09860-x
Saadat, R.; Shabani, A. "Investigating the citations received by journals of Directory of Open Access Journals from ISI Web of Science's articles." International Journal of Information Science and Management (IJISM), 9(1) (2012): 57-74.
Solomon, D.J.; Laakso, M.; Björk, B.-C. "A longitudinal comparison of citation rates and growth among open access journals." Journal of Informetrics, 7(3) (2013): 642-650. https://doi.org/10.1016/j.joi.2013.03.008
Before start
Make sure to have Python 3.9 installed on your device.
All the dependencies of the script can be installed using the requirements.txt file stored in the GitHub repository.
Computer technical specifications:
CPU: Intel(R) Core(TM) i7-9750H @ 2.60 GHz
RAM: 20.0 GB (19.9 GB usable), 2666 MHz
Steps
Data Gathering: DOAJ
Collecting data from DOAJ: we download data about journals and articles from the DOAJ website, and then refine it by excluding all the information we are not interested in.
We download the data dumps from DOAJ in .tar.gz format.
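As an illustration, a minimal sketch of how such a dump can be opened with Python's tarfile and json modules (the file name used here is an assumption, not the actual dump name):

```python
import json
import tarfile

# hypothetical file name for the DOAJ journals dump
with tarfile.open("doaj_journal_data.tar.gz", "r:gz") as tar:
    for member in tar.getmembers():
        if not member.name.endswith(".json"):
            continue
        with tar.extractfile(member) as fh:
            p = json.load(fh)  # "p" is the list of journal records iterated below
```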
From the DOAJ dump, we create a unique key for each journal by concatenating the issn and the eissn, with the following values: the issn (if present), the eissn (if present), the title of the journal, the subject of the journal, and the list of all the articles' DOIs.
After opening the tarfile containing the data, for every journal we extract only the information about issn and eissn, first making sure that at least one of the two is present for each record in the dump:
for journal in p:
    # reset the values so they are not carried over from the previous journal
    journal_issn = ""
    journal_eissn = ""
    try:
        if journal["bibjson"]["pissn"]:
            journal_issn = journal["bibjson"]["pissn"]
    except KeyError:
        journal_issn = ""
    try:
        if journal["bibjson"]["eissn"]:
            journal_eissn = journal["bibjson"]["eissn"]
    except KeyError:
        journal_eissn = ""
We then add our unique identifier, **issn+eissn**, to the set of journals:
key_dict = f"{journal_issn}{journal_eissn}"
journals.add(key_dict)
We extract the data of the articles: we open the file with tarfile, then, for each article, we collect the issn and eissn of the journal publishing it, as well as the DOI of the article:
for article in p:
    # reset the values so they are not carried over from the previous article
    journal_issn = ""
    journal_eissn = ""
    art_doi = ""
    for el in article["bibjson"]["identifier"]:
        if el["type"] == "pissn":
            journal_issn = el["id"]
        if el["type"] == "eissn":
            journal_eissn = el["id"]
        if el["type"] == "doi" or el["type"] == "DOI":
            art_doi = el.get("id", "")
If the article doesn't have any DOI registered, we add it to a list that we will store separately.
Otherwise, we handle the cases where the issn and eissn have been wrongly registered in the articles dump by aligning the data with the set of journals created previously.
if art_doi=="":
art_without_doi.append(article)
else:
journal_title=article["bibjson"]["journal"]["title"]
key_dict = f"{journal_issn}{journal_eissn}"
# if the issn and/or eissn from the articles dump don't match the journals dump
if key_dict not in journals:
# if there is only the issn registered: align with the journals metadata
if journal_issn in journals:
key_dict = journal_issn
# if there is only the eissn registered: align with the journals metadata
elif journal_eissn in journals:
key_dict = journal_eissn
else:
for issn in journals:
if journal_issn != "" and journal_issn in issn:
key_dict = issn
break
elif journal_eissn !="" and journal_eissn in issn:
key_dict = issn
break
We collect the subject of the journal:
journal_subject = article["bibjson"]["subject"]
Once all of this information is collected, we add it to our final JSON, creating a new key for the journal if it doesn't exist yet or appending the DOI to the journal's list of DOIs.
if key_dict in doi_json:
    doi_json[key_dict]["dois"].append(art_doi)
else:
    doi_json[key_dict] = {"title": journal_title, "pissn": journal_issn, "eissn": journal_eissn, "dois": [art_doi], "subject": journal_subject}
An example of an element in the final file:
{
"1779-627X1779-6288": {
"title": "International Journal for Simulation and Multidisciplinary Design Optimization",
"pissn": "1779-627X",
"eissn": "1779-6288",
"subject": [
{"code": "T55.4-60.8", "scheme": "LCC", "term": "Industrial engineering. Management engineering"},
{"code": "T11.95-12.5", "scheme": "LCC", "term": "Industrial directories"}]
"dois": [
"10.1051/ijsmdo:2008025",
"10.1051/smdo/2019012",
"10.1051/smdo/2020004",
"10.1051/smdo/2020001",
"10.1051/smdo/2016003",
...
]
},
...
}
<Note title="Citation" type="success" ><b>doi.json</b> <b>articles_without_dois.json</b> </Note>
To simplify the next steps, we create a file containing a dictionary with all DOAJ articles' DOIs as keys and, as values, the "issn+eissn" identifier of the journal that published them.
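A minimal sketch of how this lookup file can be built from the doi.json structure above (the data.json file name is an assumption; we assume data_json, used in the filtering step below, refers to this DOI-to-journal lookup):

```python
import json

# load the journal-level file produced above
with open("doi.json", encoding="utf-8") as in_file:
    doi_json = json.load(in_file)

# invert it into a {DOI: "issn+eissn"} lookup table
data_json = {doi: journal_key
             for journal_key, journal in doi_json.items()
             for doi in journal["dois"]}

with open("data.json", "w", encoding="utf-8") as out_file:
    json.dump(data_json, out_file)
```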
Data Gathering: OpenCitations
Collecting and filtering data from OpenCitations: we take the data from the download section of the OpenCitations website, and then refine it using the files obtained in the previous step.
Filter Open Citations
We iterate over all the records from the OpenCitations dump that have at least one DOI in either the citing or the cited column. For each directory:
- We unpack all the zipped files into a temporary folder and iterate over the unzipped CSV files (see the sketch after this list):
for csv in iterator:
- We split the CSV file into two dataframes. From each dataframe we delete all the records that have a null value in the cited or the citing column, respectively:
df_cited, df_null_cited = csv_manager.delete_null_values(df, 'cited')
df_citing, df_null_citing = csv_manager.delete_null_values(df, 'citing')
- For each dataframe, we keep only the records that have a DOAJ DOI in the cited or the citing column, respectively:
df_cited = csv_manager.refine(df_cited, ['oci', 'creation', 'cited'], 'cited', data_json)
df_citing = csv_manager.refine(df_citing, ['oci', 'creation', 'citing'], 'citing', data_json)
- We add the journal name that matches the DOI in the citing or cited column. Additionally, we add a column for the cited side (isDOAJ_cited) and one for the citing side (isDOAJ_citing), to identify, for each record, which DOI belongs to DOAJ (only the one in the cited column, only the one in the citing column, or both):
df_cited = csv_manager.add_journal(df_cited, "cited", data_json)
df_citing = csv_manager.add_journal(df_citing, "citing", data_json)
df_result = df_citing.merge(df_cited, how='outer').reset_index(drop=True).convert_dtypes().drop_duplicates()
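For reference, a sketch of how this outer loop can look, assuming the OpenCitations dump is a set of ZIP archives of CSV files with oci, citing, cited and creation columns (paths are assumptions; csv_manager is the helper module from our repository, used as shown above):

```python
import glob
import os
import tempfile
import zipfile

import pandas as pd

# hypothetical location of the downloaded OpenCitations dump
for archive in glob.glob("opencitations_dump/*.zip"):
    # unpack each archive into a temporary folder
    with tempfile.TemporaryDirectory() as tmp_dir:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(tmp_dir)
        # iterate over the unzipped CSV files
        for csv_path in glob.glob(os.path.join(tmp_dir, "**", "*.csv"), recursive=True):
            df = pd.read_csv(csv_path, dtype=str)
            # ... apply the delete_null_values / refine / add_journal steps shown above
```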
<Note title="Citation" type="success" ><b>null_citing</b> : a directory containing all files which have a null value in the citing column.</Note>
<Note title="Citation" type="success" ><b>null_cited</b> : a directory containing all files which have a null value in the cited column.</Note>
<Note title="Citation" type="success" ><b>filtered</b> : a directory containing all files filtered on both the citing and cited columns, which have at least one Dois from DOAJ journals dump.</Note>
Group By Open Citations results
We iterate over each file of the filtered directory and, for each one:
- We transform the creation column into a date format:
df = csv_manager.add_year(df, 'creation')
- We save and discard all the records that don't have a creation date or have a date later than 2024:
df, df_null, df_wrong = csv_manager.save_errors(df, name_file)
- We group the records by year, and by year and journal:
df_normal = csv_manager.groupBy_year(df)
df_by_journal = csv_manager.groupBy_year_and_journal(df)
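For orientation, a rough pandas sketch of the grouping by year, under the assumption that the filtered files carry boolean isDOAJ_cited and isDOAJ_citing columns (the actual groupBy_year implementation in csv_manager may differ):

```python
import pandas as pd

# df: one dataframe read from the "filtered" directory
# turn the (possibly partial) "creation" value, e.g. "2019-03", into a year
df["year"] = pd.to_datetime(df["creation"], errors="coerce").dt.year

# count, per year, how often a DOAJ DOI appears as cited and as citing entity
df_normal = (
    df.groupby("year")
      .agg(cited=("isDOAJ_cited", "sum"), citing=("isDOAJ_citing", "sum"))
      .reset_index()
)
```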
<Note title="Citation" type="success" ><b>normal</b> : a directory where each file matches a file in the filtered directory. Each file inside this repository is a grouped version of the filtered repository ones. <span>These files list the following fields:</span><span>year </span><span>number of citations received by DOAJ inside Open Citations (cited)</span><span>number of citations done by DOAJ inside Open Citations (citing)</span><span>number of citations done to itself by DOAJ inside Open Citations (self-citations)</span></Note>
<Note title="Citation" type="success" ><b>by_journal</b> : a directory where each file matches a file in the filtered directory. Each file inside this repository is a grouped version of the filtered repository ones. These files list the following fields:<span>year</span><span>code of the journal (ISSN + EISSN)</span><span>number of citations received by the DOAJ journal inside Open Citations (cited)</span><span>number of citations done by the DOAJ journal inside Open Citations (citing)</span><span>number of citations done to itself by the DOAJ journal inside Open Citations (self-citations)</span><span>number of citations done by the DOAJ journal to another DOAJ journal inside Open Citations (citations to DOAJ)</span><span>number of citations received by the DOAJ journal from another DOAJ journal inside Open Citations (cited by DOAJ)</span></Note>
<Note title="Citation" type="success" ><b>null_dates</b> : a directory containing all files which have a null date (= Null) in the creation column.<b>wrong_dates</b> : a directory containing all files which have a wrong date (>= 2025) in the creation column.</Note>
Concatenate all results
We concatenate, using the pandas library, all the files in the normal directory and in the by_journal directory, to summarize all values in two dataframes:
df_normal = csv_manager.concat_csv_normal(all_csv_normal)
df_by_journal = csv_manager.concat_csv_journal(all_csv_byJournal)
We add to df_by_journal the fields extracted from DOAJ for each journal, which provide useful information about the journal:
df_by_journal = csv_manager.add_to_journals_DOAJ_descriptions(df_by_journal, df_journals_description)
We also gather the error files produced in the previous steps and summarize their counts in a single dataframe:
df_null_dates = csv_manager.concat_csv(all_csv_null_dates)
df_wrong_dates = csv_manager.concat_csv(all_csv_wrong_dates)
df_null_citing = csv_manager.concat_csv(all_csv_null_citing)
df_null_cited = csv_manager.concat_csv(all_csv_null_cited)
df_articles_without_dois = pd.read_json(all_articles_without_dois, orient='records')
df_errors = pd.DataFrame({
    'type_of_error': ['null_dates', 'wrong_dates', 'null_citing', 'null_cited', 'articles_without_dois'],
    'count': [sum(df_null_dates['oci']), sum(df_wrong_dates['oci']), len(df_null_citing), len(df_null_cited), len(df_articles_without_dois)]
})
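As an illustration, the concatenation of the per-file groupings can be done with plain pandas along these lines (the path and the re-aggregation are assumptions; concat_csv_normal in csv_manager may work differently):

```python
import glob

import pandas as pd

# stack every per-file grouping from the "normal" directory and
# re-aggregate so that each year appears only once
all_csv_normal = glob.glob("normal/*.csv")
df_normal = (
    pd.concat((pd.read_csv(path) for path in all_csv_normal), ignore_index=True)
      .groupby("year", as_index=False)
      .sum()
)
```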
<Note title="Citation" type="success" ><b>normal.json</b> : a file where each record lists the following fields:<span></span><span>year</span><span>cited: total number of citations received by DOAJ in Open Citations</span><span>citing: total number of citations done by DOAJ in Open Citations</span><span>self_citation: total number of citations done by a DOAJ journal to another DOAJ journal</span></Note>
<Note title="Citation" type="success" ><b>by_journal.json</b> : a file where each record lists the following fields:<span></span><span>year</span><span>journal</span><span>cited: total number of citations received by the journal in Open Citations</span><span>citing: total number of citations done by the journal in Open Citations</span><span>self_citation: total number of citations done by a DOAJ journal to another DOAJ journal</span><span>citations_to_DOAJ: total number of citations done by a DOAJ journal to another DOAJ journal</span><span>cited_by_DOAJ: total number of citations received by a DOAJ journal from another DOAJ journal</span></Note>
<Note title="Citation" type="success" ><b>errors.json</b> : a file that summarizes all the errors found during the previous computation:<span></span><span>null_dates</span><span>wrong_dates</span><span>null_citing</span><span>null_cited</span><span>articles_without_dois</span></Note>
Add Ratios to the final results
- We add ratios to the normal.json:
normal_json = csv_manager.make_ratio(normal_json)
- We add ratios to the **by_journal.json**:
by_journal_json = csv_manager.make_ratio_journal(by_journal_json)
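As one possible interpretation of these metrics, a sketch assuming normal_json has been loaded into a pandas DataFrame (the make_ratio functions in csv_manager may define the ratios differently):

```python
# hypothetical derivation of two of the added metrics:
# share of self-citations among the citations done by DOAJ journals, per year
normal_json["self_citation_pcent"] = (
    normal_json["self_citation"] / normal_json["citing"] * 100
).round(2)
# citations done versus citations received, per year
normal_json["citing_cited_ratio"] = normal_json["citing"] / normal_json["cited"]
```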
<Note title="Citation" type="success" ><b>by_journal.json</b> : the same file as the previous one with in addition a ratio within metrics:<span>citing_cited_pcent</span><span>citations_to_DOAJ_pcent</span><span>cited_by_DOAJ_pcent</span><span>self_citation_pcent</span><span>citing_cited_ratio</span><span>citations_to_DOAJ_ratio</span><span>cited_by_DOAJ_r: the same file as the previous one with in addition a ratio within metrics</span><span>self_citation_ratio</span></Note>
<Note title="Citation" type="success" ><b>normal.json</b> : the same file as the previous one with in addition a ratio within metrics.<span>citing_cited_pcent</span><span>self_citation_pcent</span><span>citing_cited_ratio</span><span>self_citation_ratio</span></Note>
Add useful metrics
We add a set of summary metrics to a JSON file, in order to provide an overview of useful research information about the DOIs processed from DOAJ.
Data Visualization
We visualize our results in line, bar and scatter plots using the Plotly Python library.
We load our JSON data from the queried folder into pandas DataFrames.
import pandas as pd
import plotly.express as px
final_df_years = pd.read_json('../../queried/final_output/normal.json')
final_df_journal = pd.read_json('../../queried/final_output/by_journal.json')
errors = pd.read_json('../../queried/final_output/errors.json')
We query the final_df_journal dataframe to find the biggest DOAJ journals in terms of number of citations received, citations done, citations to DOAJ journals, and citations received from DOAJ journals.
group_journals = final_df_journal.groupby(['title'])[['cited','citing','citations_to_DOAJ','cited_by_DOAJ']].sum()
group_journals.idxmax()
final_df_journal_1 = final_df_journal[final_df_journal['title']==group_journals['citing'].idxmax()]
To have a better understanding of our data, we examine the most recurring subjects among DOAJ journals.
final_df_journal['subject'] = final_df_journal['subject'].apply(lambda x: [y['term'] for y in x])
#some journals have several subjects, so we separate them to be able to plot it
exploded = final_df_journal.explode('subject')
#group journals by subject
grouped_df = exploded.groupby(['subject', 'year']).size().reset_index(name="num_journals")
grouped_df = grouped_df.sort_values(['num_journals'], ascending=False)
#selecting journals to plot and limiting the results to the 21st century.
most_journals_by_subject = grouped_df.drop_duplicates(subset=['subject']).head(20)['subject'].tolist()
grouped_df = grouped_df.loc[grouped_df['year']>1999].sort_values(['year','num_journals'], ascending=False)
We represent it with a line plot.
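A sketch of the corresponding Plotly call, using the columns built above (the figure title is ours):

```python
# keep only the 20 most recurring subjects and draw one line per subject
top_subjects = grouped_df[grouped_df["subject"].isin(most_journals_by_subject)]
fig = px.line(top_subjects, x="year", y="num_journals", color="subject",
              title="Timeline of the 20 most recurring subjects among DOAJ journals")
fig.show()
```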
<Note title="Citation" type="success" ><span>Timeline of the 20 most recurring subjects among DOAJ journals</span></Note>
In order to examine the citations made by journals overall, regardless of the year, we group the journals by title and sum the relevant columns.
group_journals = final_df_journal.groupby(['title'], as_index=False).agg({'dois_count':'first', 'subject':'first','cited':'sum', 'citing':'sum', 'self_citation':'sum', 'citations_to_DOAJ':'sum', 'cited_by_DOAJ':'sum'})
We then use bar plots to visualize the citation data about DOAJ journals.
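For example, the first of these bar plots can be produced roughly as follows (the exact styling used in our notebook may differ):

```python
# the 30 DOAJ journals doing the most citations
top_citing = group_journals.sort_values("citing", ascending=False).head(30)
fig = px.bar(top_citing, x="title", y="citing",
             title="The 30 DOAJ journals doing the most citations")
fig.show()
```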
<Note title="Citation" type="success" ><span>Bar plot of the 30 DOAJ journals doing the most citations.</span><span>Bar plot of the 30 DOAJ journals doing the most citations to DOAJ journals.</span><span>Bar plot of the 30 DOAJ journals getting cited the most.</span><span>Bar plot of the 30 DOAJ journals getting cited the most by DOAJ journals.</span></Note>
We repeat step 3.3 with scatter plots, including information about the number of articles per journal.
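A sketch of the scatter variant, using the number of articles per journal (dois_count) as marker size (an assumption about the exact encoding we used):

```python
# same ranking as above, with the article count encoded as marker size
top_citing = group_journals.sort_values("citing", ascending=False).head(30)
fig = px.scatter(top_citing, x="title", y="citing", size="dois_count",
                 title="The 30 DOAJ journals doing the most citations")
fig.show()
```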
We examine the journals doing the most self-citations by year, using a line plot.
self_citations_df = final_df_journal.sort_values(["self_citation"], ascending =False)
list_journals = self_citations_df.drop_duplicates(['journal']).head(20)['title'].tolist()
most_self_cit = self_citations_df.loc[self_citations_df['title'].isin(list_journals)]
most_self_cit = most_self_cit[most_self_cit.year > 1999].sort_values(['year'])
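The plot itself can then be drawn along these lines (a sketch; the title is ours):

```python
# one line per journal, showing its self-citations over time
fig = px.line(most_self_cit, x="year", y="self_citation", color="title",
              title="DOAJ journals doing the most self-citations per year")
fig.show()
```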
To better compare the numbers of citations done and received by DOAJ journals over the last 20 years, we create bar plots that stack the two amounts in the same column.
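A sketch of such a stacked bar chart, built from normal.json (the year cut-off and the reshaping with melt are our choices):

```python
# reshape to long form so that "citing" and "cited" stack in the same column
recent = final_df_years[final_df_years["year"] > 1999]
long_form = recent.melt(id_vars="year", value_vars=["citing", "cited"],
                        var_name="direction", value_name="citations")
fig = px.bar(long_form, x="year", y="citations", color="direction",
             title="Citations done and received by DOAJ journals per year")
fig.show()
```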
We use a bar plot to visualize, over the last 20 years, the number of citations involving DOAJ journals as both citing and cited entities, and their percentage of the total number of citations.
We use a bar plot to show the number of errors we encountered in the project divided by category.
Publishing data
We publish the resulting JSON files on Zenodo and in our GitHub repository (queried folder).