Coverage of DOAJ journals' citations through OpenCitations - Protocol

Constance Dami, Alessandro Bertozzi, Chiara Manca, Umut Kucuk

Published: 2022-12-14 DOI: 10.17504/protocols.io.n92ldz598v5b/v5

Disclaimer

This protocol describes research carried out for the Open Science course (academic year 2021/22) at the University of Bologna.

Abstract

This protocol describes how we measured the coverage of DOAJ journals' citations in OpenCitations.

Our goal is to find out:

  • how well articles from open access journals listed in DOAJ are covered as citing and cited articles,
  • how many citations DOAJ journals receive and make, and how many of these citations involve open access articles as both citing and cited entities,
  • whether the availability of citations involving articles published in DOAJ open access journals shows trends over time.

Our research focuses exclusively on DOAJ journals and uses OpenCitations as its data source. Previous studies have examined open citations using COCI (Heibi, Peroni & Shotton 2019) and the citations received by DOAJ journals (Saadat and Shabani 2012), laying the groundwork for the present analysis.

After weighing the options for retrieving data from DOAJ and OpenCitations, we chose to download the public data dumps. Querying the APIs took far too long to run, and the same problem arose with the OpenCitations SPARQL endpoint.

Minimal Bibliography

Björk, B.-C.; Kanto-Karvonen, S.; Harviainen, J.T. "How Frequently Are Articles in Predatory Open Access Journals Cited." Publications, 8, 17 (2020). https://doi.org/10.3390/publications8020017

Heibi, I.; Peroni, S.; Shotton, D. "Crowdsourcing open citations with CROCI -- An analysis of the current status of open citations, and a proposal." arXiv:1902.02534 (2019). https://doi.org/10.48550/arXiv.1902.02534

Pandita, R.; Singh, S. "A Study of Distribution and Growth of Open Access Research Journals Across the World." Publishing Research Quarterly, 38(1), 131–149 (2022). https://doi.org/10.1007/s12109-022-09860-x

Saadat, R.; Shabani, A. "Investigating the citations received by journals of Directory of Open Access Journals from ISI Web of Science's articles." International Journal of Information Science and Management (IJISM), 9(1), 57-74 (2012).

Solomon, D. J.; Laakso, M.; Björk, B.-C. "A longitudinal comparison of citation rates and growth among open access journals." Journal of Informetrics, 7(3), 642-650 (2013). https://doi.org/10.1016/j.joi.2013.03.008

Before start

Make sure Python 3.9 is installed on your device.

All the dependencies of the scripts can be installed from the requirements.txt file stored in the GitHub repository (for example with pip install -r requirements.txt).

Computer technical specifications:

CPU: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz 2.59 GHz

RAM: 20.0 GB (19.9 GB usable), 2666 MHz

Steps

Data Gathering: DOAJ

1.

Collecting data from DOAJ: we download data about journals and articles from the DOAJ website, then filter out all the information we do not need.

Citation
  • doi.json, a dictionary with every journal as key and, as value, some information about it, including the list of the DOIs of all the articles published in that journal,
  • articles_without_dois.json, containing all the articles excluded from our research because no DOI is provided,
  • dois_articles_journals.pickle, a dictionary with the DOI of every article as key and the unique identifier of the journal publishing it as value.

1.1.

We download the data dumps from DOAJ in .tar.gz format.

Dataset
DOAJ articles public data dump: https://doaj.org/public-data-dump/article

Dataset
DOAJ journals public data dump: https://doaj.org/public-data-dump/journal

Both datasets contain metadata that is not relevant to our research, so we keep only the data we need.
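Reading a dump of this kind can be sketched as follows. This is only an illustration, not the code of the project: the function name iter_dump_records is hypothetical, and the sketch assumes that every member of the archive is a JSON file holding a list of records.

```python
import json
import tarfile
from typing import Iterator


def iter_dump_records(tar_path: str) -> Iterator[dict]:
    """Yield every record found in the JSON files of a .tar.gz dump."""
    with tarfile.open(tar_path, "r:gz") as archive:
        for member in archive:
            # Skip directories and anything that is not a JSON batch file.
            if not member.isfile() or not member.name.endswith(".json"):
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            # Assumption: each batch file is a JSON array of records.
            for record in json.load(handle):
                yield record
```

Streaming the records one batch at a time keeps memory use low, which matters because the articles dump is far larger than the journals dump.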

1.2.

From the DOAJ journals dump, we create a unique key for each journal by concatenating its issn and eissn. The values stored under each key are the issn (if present), the eissn (if present), the title of the journal, the subject of the journal and the list of all its articles' DOIs.

After opening the tarfile containing the data, we extract the issn and eissn of every journal, having first made sure that each record in the dump has at least one of the two:

```python
for journal in p:
    # Default to an empty string, so that a value from the previous
    # record is never carried over when a field is missing or empty.
    bibjson = journal.get("bibjson", {})
    journal_issn = bibjson.get("pissn") or ""
    journal_eissn = bibjson.get("eissn") or ""
```

We then add to the set of journals our unique identifier "**issn+eissn**":

```python
# still inside the loop over the journals
key_dict = f"{journal_issn}{journal_eissn}"
journals.add(key_dict)
```

1.3.

We extract the data of the articles: we open the file with tarfile, then for each article, we collect the information about issn and eissn of the journal publishing it, as well as the DOI of the article:

```python
for article in p:
    # Reset the values for every article, so that nothing is carried
    # over from the previous one when an identifier is missing.
    journal_issn = ""
    journal_eissn = ""
    art_doi = ""
    for el in article["bibjson"]["identifier"]:
        if el["type"] == "pissn":
            journal_issn = el.get("id", "")
        elif el["type"] == "eissn":
            journal_eissn = el.get("id", "")
        elif el["type"] in ("doi", "DOI"):
            art_doi = el.get("id", "")
```

If the article doesn't have any DOI registered, we add it to a list that we store separately.

Otherwise, we handle the cases where the issn and eissn have been wrongly registered in the articles dump by aligning them with the set of journals created previously.

```python
# still inside the loop over the articles
if art_doi == "":
    art_without_doi.append(article)
else:
    journal_title = article["bibjson"]["journal"]["title"]

    key_dict = f"{journal_issn}{journal_eissn}"

    # if the issn and/or eissn from the articles dump don't match the journals dump
    if key_dict not in journals:

        # if there is only the issn registered: align with the journals metadata
        if journal_issn in journals:
            key_dict = journal_issn
        # if there is only the eissn registered: align with the journals metadata
        elif journal_eissn in journals:
            key_dict = journal_eissn
        else:
            # otherwise, look for a journal key that contains the issn or the eissn
            for issn in journals:
                if journal_issn != "" and journal_issn in issn:
                    key_dict = issn
                    break
                elif journal_eissn != "" and journal_eissn in issn:
                    key_dict = issn
                    break
```

We collect the subject of the journal.

```python
journal_subject = article["bibjson"]["subject"]
```


Once all the information has been collected, we add it to our final JSON: we create a new key if the journal is not there yet, otherwise we append the DOI to the journal's list of DOIs.

```python
if key_dict in doi_json:
    doi_json[key_dict]["dois"].append(art_doi)
else:
    doi_json[key_dict] = {
        "title": journal_title,
        "pissn": journal_issn,
        "eissn": journal_eissn,
        "dois": [art_doi],
        "subject": journal_subject,
    }
```

An example of an element in the final file:

```
{
  "1779-627X1779-6288": {
    "title": "International Journal for Simulation and Multidisciplinary Design Optimization",
    "pissn": "1779-627X",
    "eissn": "1779-6288",
    "subject": [
      {"code": "T55.4-60.8", "scheme": "LCC", "term": "Industrial engineering. Management engineering"},
      {"code": "T11.95-12.5", "scheme": "LCC", "term": "Industrial directories"}
    ],
    "dois": [
      "10.1051/ijsmdo:2008025",
      "10.1051/smdo/2019012",
      "10.1051/smdo/2020004",
      "10.1051/smdo/2020001",
      "10.1051/smdo/2016003",
      ...
    ]
  },
  ...
}
```


<Note title="Citation" type="success" ><b>doi.json</b> <b>articles_without_dois.json</b> </Note>

1.4.

We create a file containing a dictionary with the DOI of every DOAJ article as key and the "issn+eissn" identifier of the journal that published it as value, to simplify the next steps.

Citation
dois_articles_journals.pickle
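A minimal sketch of how such a lookup table could be derived from doi.json is given below. The function name build_doi_index is hypothetical, and the choice of keeping the first journal when a DOI appears under several journals is an assumption made for the example, not necessarily the behaviour of the project's script.

```python
import pickle


def build_doi_index(doi_json: dict) -> dict:
    """Map each article DOI to the key of the journal that published it."""
    index = {}
    for journal_key, journal in doi_json.items():
        for doi in journal["dois"]:
            # Assumption: if a DOI is listed under several journals, keep the first one.
            index.setdefault(doi, journal_key)
    return index


doi_json = {
    "1779-627X1779-6288": {"dois": ["10.1051/ijsmdo:2008025", "10.1051/smdo/2019012"]},
}
doi_index = build_doi_index(doi_json)

# Round trip through pickle, the format suggested by the file name above.
restored = pickle.loads(pickle.dumps(doi_index))
```

With this dictionary, checking whether a DOI belongs to DOAJ and finding its journal both become a single lookup, which is what makes the filtering of the OpenCitations dump tractable.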

Data Gathering: OpenCitations

2.

Collecting and filtering data from OpenCitations: we take the data from the download section of the OpenCitations website, then filter it using the files obtained in the previous step.

Citation
by_journal.json: a file containing all the information extracted from OpenCitations about DOAJ journals, divided by year and journal. It contains the following fields:
  • a group of fields describing the selected journal,
  • the code of the journal (obtained by merging the journal's ISSN and EISSN),
  • the year to which the citation metrics belong,
  • the number of citations received,
  • the number of citations made,
  • the ratios between citations made and received,
  • the number of citations received from other DOAJ journals,
  • the number of citations made to other DOAJ journals,
  • the ratios between citations made to and received from DOAJ journals.

Citation
normal.json: a file containing all the information extracted from OpenCitations about DOAJ journals, divided by year only. It contains the following fields:
  • the year to which the citation metrics belong,
  • the number of citations received,
  • the number of citations made,
  • the ratios between citations made and received,
  • the number of self-citations made by DOAJ inside OpenCitations,
  • the ratio between the self-citations and the total citations received and made by DOAJ.

Citation
errors.json: a file containing all the errors found during the computations. It contains the following fields:
  • errors about records that don't have any specified date (null dates),
  • errors about records that have impossible dates (wrong dates),
  • errors about articles that don't have any specified DOI,
  • errors about OpenCitations records that don't have a DOI in the citing or cited field.

Citation
DOAJ_metrics.json: a file containing metrics about DOAJ and OpenCitations obtained from the computations. It contains the following fields:
  • number of journals with DOIs,
  • number of articles processed during the computations,
  • number of used DOIs: all the DOIs (without repetition) used for the add-journal operation in the second pipeline step,
  • number of repeated DOIs: all the DOIs that are repeated inside the same journal or in another journal,
  • number of accepted DOIs: all the articles (with repetition) that have both a defined journal and a defined DOI.

Filter Open Citations

2.1.

We iterate over all the records of the OpenCitations dump that have at least one DOI in either the citing or the cited column. For each directory:

  1. We unpack all the files of the zipped directory into a temporary folder and iterate over the unzipped CSV files:

```python
for csv in iterator:
    ...  # steps 2 to 4 below are applied to each CSV file
```

  2. We split the CSV file into two dataframes. From each dataframe we remove all the records that have a null value in the citing or cited column, keeping them aside:

```python
df_cited, df_null_cited = csv_manager.delete_null_values(df, 'cited')

df_citing, df_null_citing = csv_manager.delete_null_values(df, 'citing')
```

  3. For each dataframe, we keep only the records that have a DOAJ DOI in the citing or in the cited column:

```python
df_cited = csv_manager.refine(df_cited, ['oci', 'creation', 'cited'], 'cited', data_json)

df_citing = csv_manager.refine(df_citing, ['oci', 'creation', 'citing'], 'citing', data_json)
```

  4. We add the name of the journal that matches the DOI in the citing or cited column. Additionally, we add one column for the cited side (isDOAJ_cited) and one for the citing side (isDOAJ_citing), to identify for each record which DOI belongs to DOAJ (only the cited one, only the citing one, or both):

```python
df_cited = csv_manager.add_journal(df_cited, "cited", data_json)

df_citing = csv_manager.add_journal(df_citing, "citing", data_json)

df_result = df_citing.merge(df_cited, how='outer').reset_index(drop=True).convert_dtypes().drop_duplicates()
```


<Note title="Citation" type="success" ><b>null_citing</b> : a directory containing all files which have a null value in the citing column.</Note>

<Note title="Citation" type="success" ><b>null_cited</b> : a directory containing all files which have a null value in the cited column.</Note>

<Note title="Citation" type="success" ><b>filtered</b> : a directory containing all files filtered on both the citing and cited columns, whose records have at least one DOI from the DOAJ journals dump.</Note>
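The filtering logic of this step can be illustrated with plain pandas. The sketch below is a simplified stand-in for the csv_manager helpers, whose implementation is not shown in this protocol: the function flag_doaj_citations, the journal_citing and journal_cited column names and the toy data (including the placeholder oci values) are all assumptions made for the example.

```python
import pandas as pd


def flag_doaj_citations(df: pd.DataFrame, doi_index: dict) -> pd.DataFrame:
    """Keep the citations that touch a DOAJ DOI and record which side is in DOAJ."""
    doaj_dois = set(doi_index)
    # Drop the rows with a missing DOI on either side.
    df = df.dropna(subset=["citing", "cited"]).copy()
    df["isDOAJ_citing"] = df["citing"].isin(doaj_dois)
    df["isDOAJ_cited"] = df["cited"].isin(doaj_dois)
    # Journal key for each side (missing when the DOI is not in DOAJ).
    df["journal_citing"] = df["citing"].map(doi_index)
    df["journal_cited"] = df["cited"].map(doi_index)
    return df[df["isDOAJ_citing"] | df["isDOAJ_cited"]].reset_index(drop=True)


citations = pd.DataFrame({
    "oci": ["a", "b", "c", "d"],  # placeholder identifiers, not real OCIs
    "citing": ["10.1/x", "10.2/y", None, "10.9/z"],
    "cited": ["10.2/y", "10.9/z", "10.1/x", "10.8/w"],
    "creation": ["2019-03", "2020", "2018", "2021-01-05"],
})
doi_index = {"10.1/x": "J1", "10.2/y": "J2"}
filtered = flag_doaj_citations(citations, doi_index)
```

In the toy data, record c is dropped because its citing DOI is missing and record d because neither DOI belongs to DOAJ. Unlike this sketch, the pipeline described above stores the records with null values in the null_citing and null_cited directories rather than discarding them silently.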

Group By Open Citations results

2.2.

We iterate over each file of the filtered directory and, for each one:

  1. We convert the creation column into a date format:

```python
df = csv_manager.add_year(df, 'creation')
```

  2. We save separately and discard all the records that don't have a creation date or have a date later than 2024:

```python
df, df_null, df_wrong = csv_manager.save_errors(df, name_file)
```

  3. We group the remaining records by year, and by year and journal:

```python
df_normal = csv_manager.groupBy_year(df)

df_by_journal = csv_manager.groupBy_year_and_journal(df)
```


<Note title="Citation" type="success" ><b>normal</b> : a directory where each file matches a file in the filtered directory. Each file in this directory is a grouped version of the corresponding file in the filtered directory. <span>These files list the following fields:</span><span>year </span><span>number of citations received by DOAJ inside Open Citations (cited)</span><span>number of citations done by DOAJ inside Open Citations (citing)</span><span>number of citations done to itself by DOAJ inside Open Citations (self-citations)</span></Note>

<Note title="Citation" type="success" ><b>by_journal</b> : a directory where each file matches a file in the filtered directory. Each file in this directory is a grouped version of the corresponding file in the filtered directory. These files list the following fields:<span>year</span><span>code of the journal (ISSN + EISSN)</span><span>number of citations received by the DOAJ journal inside Open Citations (cited)</span><span>number of citations done by the DOAJ journal inside Open Citations (citing)</span><span>number of citations done to itself by the DOAJ journal inside Open Citations (self-citations)</span><span>number of citations done by the DOAJ journal to another DOAJ journal inside Open Citations (citations to DOAJ)</span><span>number of citations received by the DOAJ journal from another DOAJ journal inside Open Citations (cited by DOAJ)</span></Note>

<Note title="Citation" type="success" ><b>null_dates</b> : a directory containing all files which have a null date (= Null) in the creation column. <b>wrong_dates</b> : a directory containing all files which have a wrong date (>= 2025) in the creation column.</Note>
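The year extraction, the error handling and the grouping of this step could look roughly like the sketch below. It is not the csv_manager code: the function names are hypothetical, and it assumes that creation dates are strings of the form YYYY, YYYY-MM or YYYY-MM-DD and that the isDOAJ_citing and isDOAJ_cited flags from step 2.1 are present.

```python
import pandas as pd

MAX_YEAR = 2024  # later years are treated as wrong dates, as in the step above


def split_by_year_validity(df: pd.DataFrame):
    """Return (valid, null_dates, wrong_dates) after extracting a year column."""
    df = df.copy()
    # Assumption: the year is given by the first four characters of the date.
    df["year"] = pd.to_numeric(df["creation"].str.slice(0, 4), errors="coerce")
    null_dates = df[df["year"].isna()]
    wrong_dates = df[df["year"] > MAX_YEAR]
    valid = df[df["year"] <= MAX_YEAR].astype({"year": int})
    return valid, null_dates, wrong_dates


def count_by_year(valid: pd.DataFrame) -> pd.DataFrame:
    """Count, for each year, the citations received and made by DOAJ articles."""
    flags = valid.assign(
        cited=valid["isDOAJ_cited"].astype(int),
        citing=valid["isDOAJ_citing"].astype(int),
    )
    return flags.groupby("year", as_index=False)[["cited", "citing"]].sum()


sample = pd.DataFrame({
    "oci": ["a", "b", "c", "d"],  # placeholder identifiers
    "creation": ["2019-03", "2019", None, "2030-01-01"],
    "isDOAJ_citing": [True, False, True, True],
    "isDOAJ_cited": [True, True, False, False],
})
valid, null_dates, wrong_dates = split_by_year_validity(sample)
by_year = count_by_year(valid)
```

Here record c ends up among the null dates and record d among the wrong dates, while the two valid records are counted under 2019.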

Concatenate all results

2.3.

Using the pandas library, we concatenate all the files in the normal directory and all the files in the by_journal directory, to summarize all the values in two dataframes:

```python
df_normal = csv_manager.concat_csv_normal(all_csv_normal)

df_by_journal = csv_manager.concat_csv_journal(all_csv_byJournal)
```

We add to df_by_journal the group of fields extracted from DOAJ for each journal, which provides useful information about the journal:

```python
df_by_journal = csv_manager.add_to_journals_DOAJ_descriptions(df_by_journal, df_journals_description)
```

We also concatenate the files that collect the errors and count the records for each type of error:

```python
df_null_dates = csv_manager.concat_csv(all_csv_null_dates)

df_wrong_dates = csv_manager.concat_csv(all_csv_wrong_dates)

df_null_citing = csv_manager.concat_csv(all_csv_null_citing)

df_null_cited = csv_manager.concat_csv(all_csv_null_cited)

df_articles_without_dois = pd.read_json(all_articles_without_dois, orient='records')

df_errors = pd.DataFrame({
    'type_of_error': ['null_dates', 'wrong_dates', 'null_citing', 'null_cited', 'articles_without_dois'],
    'count': [
        sum(df_null_dates['oci']),
        sum(df_wrong_dates['oci']),
        len(df_null_citing),
        len(df_null_cited),
        len(df_articles_without_dois),
    ],
})
```


<Note title="Citation" type="success" ><b>normal.json</b> : a file where each record lists the following fields:<span></span><span>year</span><span>cited: total number of citations received by DOAJ in Open Citations</span><span>citing: total number of citations done by DOAJ in Open Citations</span><span>self_citation: total number of citations done by a DOAJ journal to another DOAJ journal</span></Note>

<Note title="Citation" type="success" ><b>by_journal.json</b> : a file where each record lists the following fields:<span></span><span>year</span><span>journal</span><span>cited: total number of citations received by the journal in Open Citations</span><span>citing: total number of citations done by the journal in Open Citations</span><span>self_citation: total number of citations done by the journal to itself</span><span>citations_to_DOAJ:  total number of citations done by a DOAJ journal to another DOAJ journal</span><span>cited_by_DOAJ: total number of citations received by a DOAJ journal from another DOAJ journal</span></Note>

<Note title="Citation" type="success" ><b>errors.json</b> : a file that summarizes all the errors found during the previous computation:<span></span><span>null_dates</span><span>wrong_dates</span><span>null_citing</span><span>null_cited</span><span>articles_without_dois</span></Note>
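Since the same year can occur in several partial files, concatenating them is not enough on its own: the counts also have to be added up again. The sketch below shows one way to do this with pandas. The function concat_yearly_counts is hypothetical and the numbers are invented for the example.

```python
import pandas as pd


def concat_yearly_counts(frames):
    """Stack the per-file yearly counts and add them up year by year."""
    combined = pd.concat(frames, ignore_index=True)
    columns = ["cited", "citing", "self_citation"]
    return combined.groupby("year", as_index=False)[columns].sum()


# Toy partial results, standing in for two files of the normal directory.
part_1 = pd.DataFrame({"year": [2019, 2020], "cited": [3, 1], "citing": [2, 0], "self_citation": [1, 0]})
part_2 = pd.DataFrame({"year": [2020, 2021], "cited": [4, 2], "citing": [5, 1], "self_citation": [2, 1]})
total = concat_yearly_counts([part_1, part_2])
```

The year 2020 appears in both partial files, so its counts are summed in the result.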

Add Ratios to the final results

2.4.
  1. We add ratios to normal.json:

```python
normal_json = csv_manager.make_ratio(normal_json)
```

  2. We add ratios to **by_journal.json**:

```python
by_journal_json = csv_manager.make_ratio_journal(by_journal_json)
```


<Note title="Citation" type="success" ><b>by_journal.json</b> : the same file as in the previous step, with the following ratio metrics added:<span>citing_cited_pcent</span><span>citations_to_DOAJ_pcent</span><span>cited_by_DOAJ_pcent</span><span>self_citation_pcent</span><span>citing_cited_ratio</span><span>citations_to_DOAJ_ratio</span><span>cited_by_DOAJ_ratio</span><span>self_citation_ratio</span></Note>

<Note title="Citation" type="success" ><b>normal.json</b> : the same file as in the previous step, with the following ratio metrics added:<span>citing_cited_pcent</span><span>self_citation_pcent</span><span>citing_cited_ratio</span><span>self_citation_ratio</span></Note>
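The protocol does not spell out how these ratios are computed, so the sketch below is only one plausible reading of the field descriptions: citing_cited_ratio is taken as citations made divided by citations received, self_citation_ratio as self-citations divided by the sum of citations made and received, and each _pcent field as the corresponding ratio multiplied by 100. All of these formulas, and the function name add_ratios, are assumptions.

```python
import pandas as pd


def add_ratios(df: pd.DataFrame) -> pd.DataFrame:
    """Add ratio and percentage columns under the assumed definitions."""
    df = df.copy()
    # Zero denominators become missing values instead of producing inf.
    cited = df["cited"].where(df["cited"] != 0)
    total = df["cited"] + df["citing"]
    total = total.where(total != 0)
    df["citing_cited_ratio"] = df["citing"] / cited
    df["self_citation_ratio"] = df["self_citation"] / total
    df["citing_cited_pcent"] = 100 * df["citing_cited_ratio"]
    df["self_citation_pcent"] = 100 * df["self_citation_ratio"]
    return df


# Invented counts for two years.
yearly = pd.DataFrame({
    "year": [2019, 2020],
    "cited": [4, 0],
    "citing": [4, 3],
    "self_citation": [2, 0],
})
with_ratios = add_ratios(yearly)
```

Guarding against zero denominators matters in practice, because a journal or a year may have citations made but none received.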

Add useful metrics

2.5.

We add the following metrics to a JSON file, to provide a summary of useful information about the DOIs processed from DOAJ.

Citation
DOAJ_metrics.json: a file where the researcher can find some information about the DOIs processed from DOAJ:
  • number of journals with DOIs,
  • number of articles processed during the computations,
  • number of used DOIs: all the DOIs (without repetition) used for the add-journal operation in the second pipeline step,
  • number of repeated DOIs: all the DOIs that are repeated inside the same journal or in another journal,
  • number of accepted DOIs: all the articles (with repetition) that have both a defined journal and a defined DOI.
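Counts of this kind can be obtained from doi.json with a Counter, as in the sketch below. The key names of the returned dictionary are hypothetical, and "repeated DOIs" is interpreted here as the number of distinct DOIs that occur more than once, which is an assumption about the intended definition.

```python
from collections import Counter


def doi_metrics(doi_json: dict) -> dict:
    """Summarise the DOIs of a doi.json-like dictionary."""
    counts = Counter(doi for journal in doi_json.values() for doi in journal["dois"])
    return {
        "journals_with_dois": sum(1 for journal in doi_json.values() if journal["dois"]),
        "accepted_dois": sum(counts.values()),  # with repetition
        "used_dois": len(counts),               # without repetition
        # Assumption: distinct DOIs occurring more than once, in any journal.
        "repeated_dois": sum(1 for n in counts.values() if n > 1),
    }


# Toy data: DOI "a" is repeated within J1, DOI "b" across J1 and J2.
toy_doi_json = {
    "J1": {"dois": ["a", "b", "a"]},
    "J2": {"dois": ["b", "c"]},
}
metrics = doi_metrics(toy_doi_json)
```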

Data Visualization

3.

We visualize our results as line, bar and scatter plots using the plotly Python library.

We load our JSON data from the queried folder into pandas DataFrames.

```python
import pandas as pd
import plotly.express as px

final_df_years = pd.read_json('../../queried/final_output/normal.json')
final_df_journal = pd.read_json('../../queried/final_output/by_journal.json')
errors = pd.read_json('../../queried/final_output/errors.json')
```

3.1.

We query the final_df_journal data frame to find the largest DOAJ journal in terms of citations received, references made, citations to DOAJ journals and citations from DOAJ journals.

```python
# Select the columns with a list: selecting several columns of a groupby
# with a bare tuple is no longer supported by recent pandas versions.
group_journals = final_df_journal.groupby(['title'])[['cited', 'citing', 'citations_to_DOAJ', 'cited_by_DOAJ']].sum()

group_journals.idxmax()
```

Citation
cited                PLoS ONE
citing               PLoS ONE
citations_to_DOAJ    PLoS ONE
cited_by_DOAJ        PLoS ONE

We create the final_df_journal_1 data frame with the result of the query.

```python
final_df_journal_1 = final_df_journal[final_df_journal['title'] == group_journals['citing'].idxmax()]
```
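For readers less familiar with idxmax, the toy example below shows what the query of this step returns: for every column, the index label (here the journal title) of the row holding the largest total. The journal names and counts are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Journal A", "Journal A", "Journal B"],
    "cited": [5, 7, 20],
    "citing": [9, 9, 3],
})
# Total per journal over all years, then the title with the maximum per column.
totals = toy.groupby("title")[["cited", "citing"]].sum()
leaders = totals.idxmax()
```

In this toy data Journal B leads for citations received and Journal A for citations made, so a single journal topping every column, as PLoS ONE does in our results, is not a given.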
3.2.

To get a better understanding of our data, we examine the most recurring subjects among DOAJ journals.

```python
final_df_journal['subject'] = final_df_journal['subject'].apply(lambda x: [y['term'] for y in x])

# some journals have several subjects, so we separate them to be able to plot them
exploded = final_df_journal.explode('subject')
# count the journals for each subject and year
grouped_df = exploded.groupby(['subject', 'year']).size().reset_index(name="num_journals")
grouped_df = grouped_df.sort_values(['num_journals'], ascending=False)
# select the 20 subjects to plot and limit the results to the years from 2000 onwards
most_journals_by_subject = grouped_df.drop_duplicates(subset=['subject']).head(20)['subject'].tolist()
grouped_df = grouped_df.loc[grouped_df['year'] > 1999].sort_values(['year', 'num_journals'], ascending=False)
```

We represent it with a _line_ plot.



<Note title="Citation" type="success" ><span>Timeline of the 20 most recurring subjects among DOAJ journals</span></Note>

3.3.

To examine the citations made by journals overall, regardless of the year, we group the journals by title and sum the relevant columns.

```python
group_journals = final_df_journal.groupby(['title'], as_index=False).agg({
    'dois_count': 'first',
    'subject': 'first',
    'cited': 'sum',
    'citing': 'sum',
    'self_citation': 'sum',
    'citations_to_DOAJ': 'sum',
    'cited_by_DOAJ': 'sum',
})
```

We then use _bar_ plots to visualize the citation data about DOAJ journals.



<Note title="Citation" type="success" ><span>Bar plot of the 30 DOAJ journals doing the most citations.</span><span>Bar plot of the 30 DOAJ journals doing the most citations to DOAJ journals.</span><span>Bar plot of the 30 DOAJ journals getting cited the most.</span><span>Bar plot of the 30 DOAJ journals getting cited the most by DOAJ journals.</span></Note>

3.4.

We repeat step 3.3 with scatter plots, including information about the number of articles per journal.

Citation
  • Scatter plot of the 30 DOAJ journals doing the most citations, with size by number of articles.
  • Scatter plot of the 30 DOAJ journals doing the most citations to DOAJ journals, with size by number of articles.
  • Scatter plot of the 30 DOAJ journals getting cited the most, with size by number of articles.
  • Scatter plot of the 30 DOAJ journals getting cited the most by DOAJ journals, with size by number of articles.

3.5.

We examine the journals doing the most self-citations by year, using a line plot.

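```python
# journals sorted by their highest yearly number of self-citations
self_citations_df = final_df_journal.sort_values(["self_citation"], ascending=False)
# titles of the 20 journals at the top of that ranking
list_journals = self_citations_df.drop_duplicates(['journal']).head(20)['title'].tolist()
most_self_cit = self_citations_df.loc[self_citations_df['title'].isin(list_journals)]
# keep the years from 2000 onwards, in chronological order
most_self_cit = most_self_cit[most_self_cit.year > 1999].sort_values(['year'])
```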

Citation
Timeline of journals doing the most self-citations since 2000.

3.6.

To better compare the citing and cited counts of DOAJ journals over the last 20 years, we draw bar plots that stack the two amounts in the same column.

Citation
  • Timeline of comparison between citing and cited of all DOAJ journals in the last 20 years,
  • Timeline of comparison between citations and references of the biggest DOAJ journal in the last 20 years,
  • Timeline of comparison between the number of citations and references from the biggest DOAJ journal to DOAJ journals in the last 20 years,
  • Timeline of comparison between the percentage of citations and references from the biggest DOAJ journal to DOAJ journals in the last 20 years.

3.7.

We use bar plots to visualize, over the last 20 years, the number of citations involving DOAJ journals as both citing and cited entities, and their percentage of the overall number of citations.

Citation
  • Timeline of number of citations both coming from and going to DOAJ journals in the last 20 years,
  • Timeline of percentage of citations both coming from and going to DOAJ journals in the last 20 years,
  • Timeline of percentage of citations going to DOAJ journals from the biggest DOAJ journal in the last 20 years.

3.8.

We use a bar plot to show the number of errors we encountered in the project divided by category.

Citation
Types of errors and their count.

Publishing data

4.

We publish the following JSON files on Zenodo and in our GitHub repository (queried folder).
