Harnessing the 3D-Beacons Network: A Comprehensive Guide to Accessing and Displaying Protein Structure Data

Paulyna Magaña, Sreenath Nair, Mihaly Varadi, Sameer Velankar

Published: 2024-05-08 DOI: 10.1002/cpz1.1047

Abstract

Recent advancements in protein structure determination and especially in protein structure prediction techniques have led to the availability of vast amounts of macromolecular structures. However, the accessibility and integration of these structures into scientific workflows are hindered by the lack of standardization among publicly available data resources. To address this issue, we introduced the 3D-Beacons Network, a unified platform that aims to establish a standardized framework for accessing and displaying protein structure data. In this article, we highlight the importance of standardized approaches for accessing protein structure data and showcase the capabilities of 3D-Beacons. We describe four protocols for finding and accessing macromolecular structures from various specialist data resources via 3D-Beacons. First, we describe three scenarios for programmatically accessing and retrieving data using the 3D-Beacons API. Next, we show how to perform sequence-based searches to find structures from model providers. Then, we demonstrate how to search for structures and fetch them directly into a workflow using JalView. Finally, we outline the process of facilitating access to data from providers interested in contributing their structures to the 3D-Beacons Network. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC.

Basic Protocol 1 : Programmatic access to the 3D-Beacons API

Basic Protocol 2 : Sequence-based search using the 3D-Beacons API

Basic Protocol 3 : Accessing macromolecules from 3D-Beacons with JalView

Basic Protocol 4 : Enhancing data accessibility through 3D-Beacons

INTRODUCTION

Emerging from the confluence of breakthroughs in experimental methods, such as cryo-electron microscopy and the dawn of AI-based structure prediction tools exhibiting unparalleled accuracy, the sources of macromolecular structures have proliferated (Lin et al., 2023; Varadi, Anyango, et al., 2022). This growth has resulted in a dynamic landscape of available information, leading to a need for common data standards and a mechanism for unified access to the multitude of molecular structure resources.

Publicly available data resources for protein structures often provide diverse data access mechanisms, hindering seamless access and integration of the data into scientific workflows. Each resource follows its own rules for storing and presenting protein structures, resulting in a fragmented landscape that poses challenges for researchers. This fragmentation becomes particularly apparent when dealing with different types of structures, ranging from experimentally determined to computationally predicted models. For example, whereas the Protein Data Bank (Velankar et al., 2021) contains over 200,000 entries, many of which are macromolecular assemblies, the AlphaFold Protein Structure Database (AlphaFold DB) (Varadi, Anyango, et al., 2022) contains over 214 million predictions for single polypeptide chains. On the other hand, AlphaFill (Hekkelman et al., 2021) has expanded predicted structures by adding known ligands. Other, more specialized data resources, like Isoforms.io (Sommer et al., 2022) and the ABC family transporter dataset of the HegeLab (Tordai et al., 2022), provide smaller but functionally important model datasets. Additionally, including data from the Small-Angle Scattering Biological Data Bank (Kikhney et al., 2020) and Protein Ensemble Database (PED) (Ghafouri et al., 2024) further adds to the complexity of effectively accessing and utilizing protein structure information with low-resolution structural envelopes and highly diverse conformational ensembles.

The existence of multiple, non-standardized approaches to accessing protein structure data slows the pace of scientific advancement. With the sudden influx of hundreds of millions of new macromolecular models, it became crucial to establish a standardized framework encompassing various protein structures and providing a unified interface for their retrieval. Such a standardized approach would streamline data access, enable efficient data integration into scientific workflows, and foster collaboration across research communities. The genomics data domain already has infrastructure to tackle this problem, the ELIXIR Beacon network (Rambla et al., 2022), which not only allows FAIR (Findable, Accessible, Interoperable, and Reusable) data access but also addresses data confidentiality while handling sensitive variants data. Based loosely on the same concepts, the 3D-Beacons Network (Varadi, Nair, et al., 2022) established an open collaboration among providers of macromolecular structure models to present model coordinates and meta-information in a standardized data format from all participating data resources on a unified platform.

By highlighting the importance of a standardized approach and showcasing the capabilities of 3D-Beacons, we aim to promote the adoption of unified data access mechanisms in structural biology to improve the findability and accessibility of structure data, making it FAIRer (Wilkinson et al., 2016). Establishing a standardized framework for accessing protein structure data will enhance scientific collaborations and accelerate discoveries in areas such as protein function elucidation, drug design, and understanding of the underlying mechanisms of complex biological processes.

Through the following protocols, we present 3D-Beacons as a solution to the challenges posed by the disparate nature of publicly available protein structure datasets. By leveraging the power of 3D-Beacons, researchers gain access to a standardized and comprehensive platform encompassing a wide range of structure types, from experimentally determined to predicted models. Importantly, the network provides access not only to the model files but also to essential metadata, such as confidence metrics. Furthermore, integrating data from diverse resources, including AlphaFold DB, Protein Data Bank in Europe (PDBe) (Armstrong et al., 2020), SWISS-MODEL (Waterhouse et al., 2018), PED, and other data resources, ensures that researchers have a holistic view of protein structures and can make informed decisions in their investigations. Basic Protocol 1 shows how to access and retrieve data and summaries programmatically using the 3D-Beacons API. Basic Protocol 2 demonstrates how to find macromolecular structures using the sequence search functionality of 3D-Beacons. Basic Protocol 3 describes how to search for structures and fetch them straight into a workflow using JalView. Finally, Basic Protocol 4 highlights how to facilitate access to data from providers interested in making their structures available through the 3D-Beacons Network. Extensive documentation is available online on the 3D-Beacons repository (https://github.com/3D-Beacons). In addition, we offer a complementary resource in the form of a notebook (https://colab.research.google.com/github/3D-Beacons/3D-Beacons/blob/main/Tutorials/Harnessing_3DBeaconsAPI.ipynb) with Python scripts to navigate 3D-Beacons, featuring a similar structure and protocols as the current article.

Basic Protocol 1: PROGRAMMATIC ACCESS TO THE 3D-BEACONS API

This protocol introduces the basic usage of the 3D-Beacons Hub API. The 3D-Beacons platform (https://3d-beacons.org) offers programmatic access through its REST API, enabling users to retrieve individual entries and perform database searches. The comprehensive documentation of the 3D-Beacons Hub API, available at https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/#/, follows the OpenAPI 3 specifications and is presented in a Swagger representation. This documentation provides detailed information and guidelines for using the 3D-Beacons API effectively. For more information and access to the sample codes, a notebook is available at https://colab.research.google.com/github/3D-Beacons/3D-Beacons/blob/main/Tutorials/Harnessing_3DBeaconsAPI.ipynb.

Necessary Resources

Hardware

A computer capable of running Python code and with a stable Internet connection

Software

1.Open a terminal and install the necessary Python libraries:

  • pip install ijson wget

2.To get all macromolecular structures for a single entity, create a new Python file, add the following sample code, and save the file.

  • import ijson
  • from urllib.request import urlopen
  • Uniprot_ID = "P04637"
  • WEBSITE_API = "https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/uniprot/summary/"
  • # Parse the JSON summary and collect every item of the "structures" array
  • r = ijson.parse(urlopen(f"{WEBSITE_API}{Uniprot_ID}.json"))
  • structures = list(ijson.items(r, "structures.item", use_float=True))
  • for structure in structures:
  •     print(structure)

Note
The above syntax retrieves a single entry from 3D-Beacons, using https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/uniprot/summary/{identifier}.json, where the “identifier” is a UniProt accession. The example here uses UniProt accession P04637, corresponding to the cellular tumor antigen p53. Running the code retrieves all the structures for this protein in JSON format and prints the metadata in the terminal.
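
The request shown above raises an exception when the accession cannot be retrieved or the network is unavailable. The following is a minimal sketch of a reusable helper, not part of the published sample code. It assumes only what the sample code already relies on (a top-level "structures" array whose items carry a "summary" object); the function names are hypothetical, and the standard-library json module is used in place of ijson for brevity.

```python
import json
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

API_BASE = "https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/uniprot/summary/"


def summary_url(accession):
    """Build the summary URL for a UniProt accession."""
    return f"{API_BASE}{accession.strip()}.json"


def parse_structures(payload):
    """Return the list stored under 'structures' in a raw JSON payload."""
    document = json.loads(payload)
    return document.get("structures", [])


def fetch_structures(accession, timeout=30):
    """Fetch and parse one entry; return an empty list on HTTP or network errors."""
    try:
        with urlopen(summary_url(accession), timeout=timeout) as response:
            return parse_structures(response.read())
    except (HTTPError, URLError) as error:
        print(f"Could not retrieve {accession}: {error}")
        return []


# Offline demonstration with a hand-written payload (illustrative values only).
sample = b'{"structures": [{"summary": {"model_identifier": "demo_1", "provider": "PDBe"}}]}'
for structure in parse_structures(sample):
    print(structure["summary"]["model_identifier"])
```

Calling fetch_structures("P04637") would perform the same request as the sample code; the demonstration at the end runs without network access.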

3.To perform a model filter, create a new Python file, copy the sample code below, and save the file.

  • import ijson
  • from urllib.request import urlopen
  • WEBSITE_API = "https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/uniprot/summary/"
  • Uniprot_ID = "P04637"
  • model = "TEMPLATE-BASED"
  • r = ijson.parse(urlopen(f"{WEBSITE_API}{Uniprot_ID}.json"))
  • structures = list(ijson.items(r, "structures.item", use_float=True))
  • # Print only the structures whose model category matches the filter
  • for structure in structures:
  •     model_category = structure.get("summary", {}).get("model_category")
  •     if model_category == model:
  •         print(structure)

Note
Structures can be filtered according to model category. The 3D-Beacons Network classifies the models as experimentally determined, conformational ensemble, template-based, and ab-initio.

Note
The sample code above demonstrates how to perform a filtered search using a model category. Multiple filter parameters can be combined. As written, with model set to TEMPLATE-BASED, the sample code retrieves all the template-based models for the cellular tumor antigen p53 (UniProt: P04637) in JSON format; setting model to CONFORMATIONAL ENSEMBLE instead retrieves the conformational ensembles from data providers such as the PED. Note that the allowed model categories are EXPERIMENTALLY DETERMINED, CONFORMATIONAL ENSEMBLE, TEMPLATE-BASED, and AB-INITIO. More information can be found in the data specification document (https://3dbeacons.docs.apiary.io/#).
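
One way to combine several filter parameters is sketched below. The helper name filter_structures and the min_coverage threshold are hypothetical; the field names ("model_category", "provider", "coverage") are those used in the sample codes of this protocol, and the records are invented for illustration, so provider strings should be checked against real API output.

```python
def filter_structures(structures, model_category=None, provider=None, min_coverage=None):
    """Keep structures whose summary satisfies every criterion that is not None."""
    selected = []
    for structure in structures:
        summary = structure.get("summary", {})
        if model_category is not None and summary.get("model_category") != model_category:
            continue
        if provider is not None and summary.get("provider") != provider:
            continue
        # A missing or null coverage is treated as 0 for the threshold test.
        if min_coverage is not None and (summary.get("coverage") or 0) < min_coverage:
            continue
        selected.append(structure)
    return selected


# Invented records standing in for the items of the "structures" array.
sample = [
    {"summary": {"model_identifier": "m1", "provider": "PDBe",
                 "model_category": "EXPERIMENTALLY DETERMINED", "coverage": 0.56}},
    {"summary": {"model_identifier": "m2", "provider": "SWISS-MODEL",
                 "model_category": "TEMPLATE-BASED", "coverage": 0.80}},
    {"summary": {"model_identifier": "m3", "provider": "SWISS-MODEL",
                 "model_category": "TEMPLATE-BASED", "coverage": 0.30}},
]

hits = filter_structures(sample, model_category="TEMPLATE-BASED", min_coverage=0.5)
print([s["summary"]["model_identifier"] for s in hits])
```

Each criterion is optional, so the same function covers a category-only filter, a provider-only filter, or any combination of the three.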

4.Retrieve and rank non-PDBe models based on average confidence scores using the following code:

  • import ijson
  • from urllib.request import urlopen
  • WEBSITE_API = "https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/uniprot/summary/"
  • Uniprot_ID = "P04637"
  • provider_filterout = "PDBe"
  • r = ijson.parse(urlopen(f"{WEBSITE_API}{Uniprot_ID}.json"))
  • structures = list(ijson.items(r, "structures.item", use_float=True))
  • # Keep non-PDBe structures that report an average local confidence score
  • filtered_structures = []
  • for structure in structures:
  •     provider = structure.get("summary", {}).get("provider")
  •     if provider != provider_filterout:
  •         if structure.get("summary", {}).get("confidence_avg_local_score") is not None:
  •             filtered_structures.append(structure)
  • # Rank by average local confidence score, highest first
  • sorted_structures = sorted(filtered_structures, key=lambda x: x.get("summary", {}).get("confidence_avg_local_score"), reverse=True)
  • top5_structures = sorted_structures[:5]
  • for structure in top5_structures:
  •     print(structure)

Note
Validation metrics provide objective measures to assess the accuracy, precision, and overall quality of predicted protein models. In the realm of protein structure prediction, these metrics are crucial for evaluating model reliability. 3D-Beacons simplifies this process by offering a standardized and unified format for accessing and comparing validation metrics from different providers.

Note
The above code retrieves protein models, excluding those from the PDBe, from the 3D-Beacons Network. Next, the models are sorted based on their average confidence local score, which provides an indication of the model's quality. The higher the average score, the more reliable the model is likely to be. The sorting is done in descending order, with the highest scores appearing first. Finally, the code displays the top five models with the highest average confidence local scores. These models represent the most promising candidates in terms of their overall quality and suitability for further analysis or research.

Note
In sum, the code retrieves, filters, sorts, and presents the protein models from the 3D-Beacons Network in JSON format, providing valuable insights into models that are not from PDBe and highlighting the top candidates based on their average confidence local scores.
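
The ranking logic described in these notes can also be packaged as a function and checked offline. This is a sketch under the same assumptions as the sample code (a "summary" object with "provider" and "confidence_avg_local_score" fields); the function name and the records are hypothetical.

```python
def top_models_by_confidence(structures, n=5, exclude_provider=None):
    """Return the n structures with the highest average local confidence score."""
    scored = []
    for structure in structures:
        summary = structure.get("summary", {})
        # Models without a score cannot be ranked, so they are skipped.
        if summary.get("confidence_avg_local_score") is None:
            continue
        if exclude_provider is not None and summary.get("provider") == exclude_provider:
            continue
        scored.append(structure)
    scored.sort(key=lambda s: s["summary"]["confidence_avg_local_score"], reverse=True)
    return scored[:n]


# Invented records for illustration only.
sample = [
    {"summary": {"model_identifier": "a", "provider": "PDBe", "confidence_avg_local_score": None}},
    {"summary": {"model_identifier": "b", "provider": "AlphaFold DB", "confidence_avg_local_score": 75.1}},
    {"summary": {"model_identifier": "c", "provider": "SWISS-MODEL", "confidence_avg_local_score": 0.82}},
    {"summary": {"model_identifier": "d", "provider": "AlphaFold DB", "confidence_avg_local_score": 91.4}},
]

ranked = top_models_by_confidence(sample, n=2, exclude_provider="PDBe")
print([s["summary"]["model_identifier"] for s in ranked])
```

The invented scores deliberately mix scales: if providers report confidence on different scales, a single ranking across providers is only as meaningful as those scales are comparable, so it may be safer to rank within one provider at a time.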

5.Perform a model filter, sort results by coverage, and fetch the model with the highest coverage using the following code:

  • import ijson, wget
  • from urllib.request import urlopen
  • WEBSITE_API = "https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/uniprot/summary/"
  • Uniprot_ID = "P04637"
  • model = "TEMPLATE-BASED"
  • response = urlopen(f"{WEBSITE_API}{Uniprot_ID}.json")
  • r = ijson.parse(response)
  • structures = list(ijson.items(r, "structures.item", use_float=True))
  • # Sort by coverage, highest first
  • structures.sort(key=lambda x: x.get("summary", {}).get("coverage", 0), reverse=True)
  • # The first structure of the requested category is the one with the highest coverage
  • highest_coverage_structure = None
  • for structure in structures:
  •     model_category = structure.get("summary", {}).get("model_category")
  •     if model_category == model:
  •         highest_coverage_structure = structure
  •         break
  • if highest_coverage_structure is not None:
  •     print(highest_coverage_structure)
  •     model_download = highest_coverage_structure.get("summary", {}).get("model_identifier")
  •     for structure in structures:
  •         model_identifier = structure.get("summary", {}).get("model_identifier")
  •         if model_identifier == model_download:
  •             model_url = structure.get("summary", {}).get("model_url")
  •             wget.download(model_url)

Note
All Beacons provide data according to the 3D-Beacons data specifications. For more information, visit the specifications document (https://github.com/3D-Beacons/3d-beacons-specifications/blob/main/oas3.yaml).

Note
The sample code above demonstrates how to sort the models according to coverage, defined as the fraction, in the range [0, 1], of the UniProt sequence covered by the model; this is calculated as (uniprot_end - uniprot_start + 1) / uniprot_sequence_length. The code selects the structure with the highest coverage in the chosen model category, prints its metadata, and saves the model file in the working directory.

Note
As written, with model set to TEMPLATE-BASED, the code retrieves all the available structures for the cellular tumor antigen p53 (UniProt: P04637) in JSON format and saves the template-based model with the highest coverage in the working directory in the coordinate format supported by the individual beacon. To work with conformational ensembles instead, set model to CONFORMATIONAL ENSEMBLE.
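
The coverage formula quoted above can be worked through directly. The sketch below implements that formula as stated; the function name, the input checks, and the example values (a hypothetical model spanning residues 94 to 312 of a 393-residue sequence) are illustrative assumptions.

```python
def coverage(uniprot_start, uniprot_end, uniprot_sequence_length):
    """Fraction of the UniProt sequence covered by a model, in the range [0, 1]."""
    if uniprot_sequence_length <= 0:
        raise ValueError("sequence length must be positive")
    if not 1 <= uniprot_start <= uniprot_end <= uniprot_sequence_length:
        raise ValueError("expected 1 <= start <= end <= sequence length")
    return (uniprot_end - uniprot_start + 1) / uniprot_sequence_length


# (312 - 94 + 1) / 393 = 219 / 393
print(round(coverage(94, 312, 393), 3))
```

The "+ 1" makes both end residues count, so a model spanning the whole sequence has a coverage of exactly 1.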

6.To filter by provider and fetch the highest-resolution experimental structures from the PDB, create a new Python file, copy the sample code below, and run it.

  • import ijson, wget
  • from urllib.request import urlopen
  • WEBSITE_API = "https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/uniprot/summary/"
  • Uniprot_ID = "P04637"
  • provider_search = "PDBe"
  • resolution_search = 2
  • r = ijson.parse(urlopen(f"{WEBSITE_API}{Uniprot_ID}.json"))
  • structures = list(ijson.items(r, "structures.item", use_float=True))
  • high_resolution_structures = []
  • for structure in structures:
  •     provider = structure.get("summary", {}).get("provider")
  •     resolution = structure.get("summary", {}).get("resolution")
  •     if provider == provider_search and resolution is not None and resolution < resolution_search:
  •         # Append the structure to the list (append modifies the list in place)
  •         high_resolution_structures.append(structure)
  • for structure in high_resolution_structures:
  •     model_url = structure.get("summary", {}).get("model_url")
  •     print("Downloading:", model_url)
  •     wget.download(model_url)

Note
This code filters for experimentally determined structures of the cellular tumor antigen p53 (UniProt: P04637) provided by PDBe and downloads the models with a resolution better than 2 Å (i.e., a reported resolution value below 2).

Note
Running the code downloads every structure that passes this filter into the working directory.

7.Retrieve Ensembl summary via 3D-Beacons by creating a new Python file, copying the sample code below, and running it.

  • import ijson
  • from urllib.request import urlopen
  • WEBSITE_API = "https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/ensembl/summary/"
  • ENSEMBL_ID = "ENSG00000288864"
  • r = ijson.parse(urlopen(f"{WEBSITE_API}{ENSEMBL_ID}.json"))
  • # Iterate over the UniProt mappings reported for the gene
  • ensembls = ijson.items(r, "uniprot_mappings.item", use_float=True)
  • for ensembl in ensembls:
  •     print(ensembl)

Note
The 3D-Beacons Network provides a resource for retrieving protein structures and associated summary data to explore the diverse transcript variants associated with a given gene. Ensembl provides information about the different transcript variants for a given gene that map to a UniProt ID. In the JSON format, each transcript variant is assigned a distinct identifier known as “transcript_id”. The information associated with each transcript variant includes the genomic coordinates (seqRegionStart and seqRegionEnd) of the transcript and the available models for each transcript.

Note
The sample code above shows how to query 3D-Beacons with an identifier from the Ensembl genome browser (Cunningham et al., 2022); this functionality requires an Ensembl gene ID. The sample code uses the gene ID ENSG00000288864; to retrieve the summary data for lysine acetyltransferase 6A (KAT6A), set ENSEMBL_ID to ENSG00000083168.

Note
The code example generates a list, in JSON format, of all the available transcripts with a UniProt identifier and retrieves the corresponding structure models.

Basic Protocol 2: SEQUENCE-BASED SEARCH USING THE 3D-BEACONS API

The 3D-Beacons Network has introduced Sequence Similarity Search functionality, which allows users to query the network using the amino acid sequence of a protein. It is important to note that the Sequence Similarity Search option only accepts standard amino acids and does not support DNA or RNA sequences. The Sequence Similarity Search option available through the network uses the Basic Local Alignment Search Tool (BLAST) (Altschul et al., 1990) to find regions of sequence similarity by aligning them with a query sequence. This alignment process allows for the statistical assessment of the degree of similarity between the query sequence and sequences in the network. By evaluating the match between the network and query sequence, valuable insights into the structure, function, and evolutionary aspects can be obtained, thus facilitating targeted and systematic exploration of protein structures.
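
Because only standard amino acids are accepted, it can help to check a sequence locally before submitting it. The following is a minimal sketch; the function name is hypothetical, and stripping whitespace and upper-casing the input are convenience assumptions rather than documented API behavior.

```python
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def invalid_residues(sequence):
    """Return the sorted characters of a sequence that are not standard amino acids."""
    cleaned = "".join(sequence.split()).upper()
    return sorted(set(cleaned) - STANDARD_AMINO_ACIDS)


print(invalid_residues("MNMLVINGTPRKHGRTRIAASYIAALYHTA"))  # no offending characters
print(invalid_residues("MKXB*"))  # X, B and * are not standard one-letter codes
```

A check of this kind cannot tell a nucleotide sequence from a protein sequence, because A, C, G, and T are also valid one-letter amino acid codes.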

The protocol presented below illustrates the process of performing a sequence-based query on the 3D-Beacons Network, employing the POST and GET methods. In this protocol, the POST method is used to transmit data from the client to the server, whereas the GET method is employed to obtain the results from the server.

Necessary Resources

  • See Basic Protocol 1.

1.Open a terminal and install the necessary Python libraries:

  • pip install ijson requests

2.Get all the models in the 3D-Beacons Network that align with the query by creating a new Python file, copying the sample code below, and running it.

  • import requests
  • import ijson
  • POST_WEBSITE = "https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/sequence/search"
  • GET_WEBSITE = "https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/sequence/result"
  • query_sequence = {"sequence": "MNMLVINGTPRKHGRTRIAASYIAALYHTA"}
  • response = requests.post(POST_WEBSITE, json=query_sequence)
  • if response.status_code == 200:
  •     print("POST request successful")
  •     job_id = response.json()["job_id"]
  • else:
  •     print(f"POST request failed with status code {response.status_code}")
  •     exit()
  • response = requests.get(f"{GET_WEBSITE}?job_id={job_id}")
  • if response.status_code == 200:
  •     for item in ijson.items(response.content,"item"):
  •         print(item)
  • else:
  •     print(f"GET request failed with status code {response.status_code}")

Note
The sample code above demonstrates how to perform a sequence-based search in the 3D-Beacons Network. The sample code retrieves all the models from different providers that are aligned to the query_sequence = MNMLVINGTPRKHGRTRIAASYIAALYHTA.

Note
The code prints all the available structures in the network from different providers in JSON format for downstream analysis.
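
As described in the Troubleshooting section, the API answers with response code 202 while a sequence search job is still running, and a run can take 5-10+ min, so a single GET request issued right after submission may not return results yet. A minimal polling sketch follows; the function name, the 30-s interval, and the limit of 40 attempts are hypothetical choices, and the request itself is passed in as a function so that the loop can be tried without network access.

```python
import time


def poll_until_ready(fetch, interval_seconds=30, max_attempts=40, sleep=time.sleep):
    """Call fetch() repeatedly while it reports HTTP 202 (job accepted or running).

    fetch must return a (status_code, body) tuple. The first non-202 result is
    returned; TimeoutError is raised if every attempt returns 202.
    """
    for attempt in range(1, max_attempts + 1):
        status_code, body = fetch()
        if status_code != 202:
            return status_code, body
        if attempt < max_attempts:
            sleep(interval_seconds)
    raise TimeoutError(f"Job still running after {max_attempts} attempts")


# With the variables of the sample code above, fetch could be written as:
#     def fetch():
#         response = requests.get(f"{GET_WEBSITE}?job_id={job_id}")
#         return response.status_code, response.content

# Offline demonstration: a stand-in server that answers 202 twice, then 200.
replies = iter([(202, b""), (202, b""), (200, b"[]")])
status, body = poll_until_ready(lambda: next(replies), sleep=lambda seconds: None)
print(status)
```

Any status other than 202 ends the loop, so error codes are handed back to the caller rather than retried.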

Basic Protocol 3: ACCESSING MACROMOLECULES FROM 3D-BEACONS WITH JalView

This protocol shows how to search for 3D-Beacons models through JalView (Procter et al., 2021). JalView is a versatile and accessible program for Multiple Sequence Alignment (MSA) editing, visualization, and analysis. By integrating the 3D-Beacons Hub API with JalView, its users can explore and discover 3D models for protein alignments sourced from UniProt.

Necessary Resources

Hardware

A computer capable of supporting a web browser and an Internet connection

Software

JalView v2.11+ (https://www.jalview.org/download/) installed locally

1.Launch JalView.

2.Select sequence ID to find available structures on 3D-Beacons. To view available 3D structures for the currently selected set of sequences, open the pop-up menu of the Sequence ID panel and choose the “3D Structure Data…” option (Fig. 1).

Note
In this example, an MSA of cyclin-dependent kinases has been prepared for demonstration purposes and is available in the 3D-Beacons repository (https://raw.githubusercontent.com/3D-Beacons/3D-Beacons/main/Tutorials/AA_MSAkinase.fasta). In order to use the MSA file on JalView, navigate to the top menu File > Input Alignment > from URL.

Figure 1. The JalView software interface, showing the drop-down options displayed when selecting a sequence.

3.Query the 3D-Beacons Network.

Note
After selecting the option “3D Structure Data…”, JalView queries all available structures from different data providers. In order to search the 3D-Beacons Network, a UniProt accession ID must be available. If structures are available, the “Search 3D-Beacons” button will be shown at the top of the Structure Chooser window (Fig. 2).

Figure 2. The JalView software interface after selecting “3D Structure Data…,” showcasing the search functionality within the 3D-Beacons Network.

4.View and set filtering options for structures in 3D-Beacons using the Structure Chooser.

Note
When using the Structure Chooser to browse 3D-Beacons structures, users have several options for selecting structures. These options can be accessed through the drop-down filter located at the top of the Structure Chooser window (Fig. 3). The available options include the following:

Note
Best 3D-Beacons Coverage: This option selects one structure for each sequence from the available structures based on the highest quality. It prioritizes experimental structures or highly accurate predictions that cover the maximum number of residues in the alignment.

Note
3DB Provider (3D-Beacons Provider): This option allows you to select the best-quality structure for each sequence from a specific resource, such as PDB, AlphaFoldDB, or SwissModel. It provides structures specifically from the chosen provider.

Note
Multiple 3D-Beacons Coverage: This option makes use of a heuristic approach to identify structures that offer the best-quality structure data for every position in each sequence. However, it is important to note that using this option may result in failed structure superpositions because the selected structures for each sequence may not overlap.

Figure 3. Screenshot of the JalView software interface displaying the available filtering options for structures from the 3D-Beacons Network. The Structure Chooser in JalView software offers three distinct options for filtering structures from the 3D-Beacons Network. The “Best 3D-Beacons Coverage” option prioritizes high-quality structures, including experimental ones and accurate predictions, that cover a maximum number of residues in the alignment. The “3DB Provider” option allows users to select the top-quality structure for each sequence from a specific provider, such as PDB, AlphaFoldDB, or SwissModel. Finally, the “Multiple 3D-Beacons Coverage” option identifies structures that provide the best-quality data for every position in each sequence.

5.Retrieve structure by selecting the desired structure from the Structure Chooser and then pressing the “Enter” key to retrieve and open it in JalView.

Note
This will allow you to view and analyze the selected structure in the context of your workflow.

Basic Protocol 4: ENHANCING DATA ACCESSIBILITY THROUGH 3D-BEACONS

This protocol will introduce the basic navigational techniques needed to browse the 3D-Beacons website. It outlines the recommended steps for pushing and making data accessible through the 3D-Beacons Client server. This protocol serves as a guide for researchers and data providers to effectively contribute their data to the 3D-Beacons ecosystem. By adhering to this protocol, users can ensure seamless integration and discoverability of their datasets within the 3D-Beacons platform. The protocol covers the processes of data preparation, metadata description, data formatting, and the actual data upload to the 3D-Beacons Client server. It also highlights the recommended practices for ensuring data accessibility, including the use of standardized file formats, providing comprehensive metadata, and complying with data sharing policies. Following this protocol will enable data providers to maximize the visibility and impact of their 3D spatial datasets within the 3D-Beacons Network, fostering collaboration and knowledge exchange in the field of spatial biology and beyond.

To successfully process a model, two files are required: a coordinate file in PDB, PDBx/mmCIF, or ModelCIF format, and a corresponding JSON file containing metadata that maps the model to a UniProt entry. The related files must have identical base names, such as “HAT_1.pdb” and “HAT_1.json”. For this protocol, one model dataset is provided within the repository: P38398_1jm7.1.A_1_103.pdb and P38398_1jm7.1.A_1_103.json.

Data providers who are interested in making their macromolecule structures available through the 3D-Beacons Network should contact the consortium to have their models added to the 3D-Beacons registry. For more details and step-by-step instructions, please refer to this documentation: https://github.com/3D-Beacons/3d-beacons-registry.

Necessary Resources

Hardware

A computer capable of supporting a web browser and an Internet connection

Software

1.To obtain the complete infrastructure to make structural models available, clone the 3D-Beacons Client repository with the following command and navigate to the working directory:

  • git clone https://github.com/3D-Beacons/3d-beacons-client.git
  • cd 3d-beacons-client

2.Generate the necessary directories:

  • mkdir -p ./data/{pdb,mmcif,metadata,index}
  • cp tests/data/pdb/P38398_1jm7.1.A_1_103.pdb ./data/pdb/
  • cp tests/data/metadata/P38398_1jm7.1.A_1_103.json ./data/metadata/

Note
Deposited models need a PDB or PDBx/mmCIF file and a JSON file that contains the metadata. Both files must have the same base name. For this example, the file names are P38398_1jm7.1.A_1_103.pdb and P38398_1jm7.1.A_1_103.json. The commands above create the necessary directories (pdb, mmcif, metadata, and index) within the working directory and then copy the example files into the appropriate locations.
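
Because each model file must share its base name with a metadata file, a quick check before processing can catch mismatches. The sketch below is not part of the 3D-Beacons Client; the function name is hypothetical, and the ".cif" extension for PDBx/mmCIF files is an assumption.

```python
import tempfile
from pathlib import Path

MODEL_SUFFIXES = {".pdb", ".cif"}  # ".cif" is assumed for PDBx/mmCIF files


def unmatched_models(model_dir, metadata_dir):
    """List model files that have no metadata JSON file with the same base name."""
    metadata_names = {path.stem for path in Path(metadata_dir).glob("*.json")}
    return sorted(
        path.name
        for path in Path(model_dir).iterdir()
        if path.suffix.lower() in MODEL_SUFFIXES and path.stem not in metadata_names
    )


# Demonstration in a temporary directory, so no real data are touched.
with tempfile.TemporaryDirectory() as tmp:
    pdb_dir = Path(tmp, "pdb")
    metadata_dir = Path(tmp, "metadata")
    pdb_dir.mkdir()
    metadata_dir.mkdir()
    (pdb_dir / "P38398_1jm7.1.A_1_103.pdb").touch()
    (metadata_dir / "P38398_1jm7.1.A_1_103.json").touch()
    (pdb_dir / "HAT_1.pdb").touch()  # deliberately left without metadata
    demo_result = unmatched_models(pdb_dir, metadata_dir)

print(demo_result)
```

In the working directory of this protocol, the call would be unmatched_models("data/pdb", "data/metadata"); an empty list means that every model has its metadata file.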

3.Set up the local environment:

  • a. Copy the provided example file to the working directory:
  • cp .env.example .env
  • b. Open the file and update the variables “MONGO_PASSWORD” and “PROVIDER”:
  • nano .env

Note
The PROVIDER is the name that was given when registering with the 3D-Beacons Network. The MONGO_PASSWORD is the password used for connecting to the MongoDB instance.

Note
Setting up the local environment is necessary to ensure that you have all the required software and dependencies installed to run the code and access the necessary resources. It provides a consistent and controlled environment for executing the code and ensures that you have the necessary tools and configurations to replicate the experiments or workflows described in the article.

4.Start docker containers:

  • docker-compose up -d

Note
Starting docker containers is essential, as it ensures reproducibility, isolation, portability, and efficient resource management. The containers isolate the application and its dependencies, preventing conflicts with the host system and enabling the application to run independently. Docker containers provide a consistent and controlled environment, allowing the application or service to run in the same conditions as intended by the authors of the article.

5.Process the model PDB files:

  • docker-compose exec cli snakemake --cores 2

Note
To facilitate the data processing pipeline, the above command processes the PDB files. Firstly, it converts PDB files into PDBx/mmCIF files. It then converts both PDBx/mmCIF and metadata files into JSON index files. Finally, it loads the generated JSON index files into the MongoDB database. This command streamlines the conversion and indexing process, enabling seamless integration and efficient storage of the data within the MongoDB database.

6.Perform database verification:

Note
To ensure the integrity of the files, a formal approach involves querying the database through the API to verify the files. This process involves accessing the data stored within the database and cross-referencing the data with the corresponding files.

COMMENTARY

Background Information

3D-Beacons is an open collaboration that addresses the challenges of finding, accessing, and integrating all relevant macromolecular structure models from diverse providers. By establishing a standardized framework, 3D-Beacons offers researchers a unified platform for accessing meta-information and model coordinates from experimentally determined structures, ab-initio models, template-based models, and conformational ensembles. The network links data from multiple providers (Table 1). This collaborative effort ensures that a wide range of protein structure data is available in a standardized format, facilitating seamless integration into scientific workflows.

Table 1. Members of the 3D-Beacons Network as of March 2024

| Data provider | Model category | Number of structures |
|---|---|---|
| AlphaFill | Template based | 995,411 |
| AlphaFold DB | Ab initio | 214,684,311 |
| HegeLab | Ab initio | 18 |
| isoform.io | Ab initio | 237,275 |
| ModelArchive | Ab initio/template based | 616,917 |
| PDBe | Experimentally determined | 217,387 |
| PED | Conformational ensembles | 305 |
| SASBDB | Experimentally determined | 4,073 |
| SWISS-MODEL repository | Template based | 2,570,296 |

Through 3D-Beacons, researchers gain access to a comprehensive repository that combines the expertise and resources of multiple providers. For instance, experimentally determined structures offer valuable insights into the three-dimensional arrangements of proteins, and ab-initio models provide predictions based on computational algorithms. In addition, conformational ensembles capture the flexibility and dynamics of protein structures, enhancing our understanding of their functional properties. By incorporating data from diverse providers, 3D-Beacons offers a rich and varied collection of structure models, enabling researchers to explore different perspectives and uncover novel insights into protein structure and function.

Critical Parameters

The 3D-Beacons Network supports API endpoints keyed on the following information:

  • UniProt accessions
  • Protein sequences (i.e., sequences of one-letter amino acid codes)
  • Ensembl identifiers (IDs start with ENS for Ensembl and then a G for gene).
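Before sending a request, it can help to check which of these three input types a query string represents. The following Python sketch illustrates one way to do this. The UniProt pattern is the accession format published by UniProt. The Ensembl pattern (ENS, an optional species code, G, and 11 digits, without a version suffix) and the restriction to the 20 standard amino acid codes are assumptions made for this example, and the function name is hypothetical.

```python
import re

# Accession format published by UniProt.
UNIPROT_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Assumption: ENS + optional species code + G + 11 digits, no version suffix.
ENSEMBL_GENE_RE = re.compile(r"ENS[A-Z]*G[0-9]{11}")
# Assumption: only the 20 standard one-letter amino acid codes are accepted.
SEQUENCE_RE = re.compile(r"[ACDEFGHIKLMNPQRSTVWY]+")


def classify_query(query):
    """Guess which kind of 3D-Beacons input a query string represents."""
    text = query.strip().upper()
    if UNIPROT_RE.fullmatch(text):
        return "uniprot_accession"
    if ENSEMBL_GENE_RE.fullmatch(text):
        return "ensembl_gene"
    if SEQUENCE_RE.fullmatch(text):
        return "protein_sequence"
    return "unknown"


for example in ("P38398", "ENSG00000139618", "MKTAYIAKQR", "12345"):
    print(example, "->", classify_query(example))
```

Accessions and Ensembl identifiers are tested first because both contain digits, so they can never be mistaken for a sequence of one-letter amino acid codes.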

Troubleshooting

Table 2 displays the response codes of the API and actions that can be taken to mitigate their effects. Requests for clarifications or reporting new errors can be made by contacting pdbekb_help@ebi.ac.uk.

Table 2. Troubleshooting Guide for API Response Codes

| Problem/response code | Possible cause | Solution |
|---|---|---|
| 202 | 1. Accepted request - The sequence search submission was correct, and the job has been assigned a job identifier. 2. Accepted request - The sequence search job is currently running. | Please wait until the sequence search run completes. It can take 5-10+ min. |
| 400 | 1. Bad request - Malformed UniProt accession. 2. Bad request - Possibly invalid sequence. 3. Bad request - Job identifier not found. | 1. Please check that the input UniProt accession is correct. 2. Please check your input sequence and retry the submission. 3. Please check if your job identifier is correct. |
| 404 | Not found - No results found for the given request. | There may be no results for a specific UniProt accession or protein sequence. |
| 500 | Internal server error. | This error might be due to scheduled maintenance or, rarely, technical issues. Please try again later. If the issue persists, please email pdbekb_help@ebi.ac.uk. |
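The guidance in Table 2 can be mirrored in client code. The Python sketch below maps each response code to a suggested next step. The action labels and the function name are hypothetical, and the treatment of 200 as success is an assumption based on standard HTTP semantics rather than on Table 2.

```python
def next_action(status_code):
    """Translate an API response code into a suggested next step."""
    actions = {
        200: "parse_results",   # assumption: standard HTTP success code
        202: "wait_and_retry",  # job accepted or still running
        400: "check_input",     # malformed accession, sequence, or job ID
        404: "no_results",      # nothing found for this query
        500: "retry_later",     # maintenance or a technical issue
    }
    return actions.get(status_code, "unexpected_code")


for code in (202, 400, 404, 500, 418):
    print(code, "->", next_action(code))
```

Codes that are not listed fall through to "unexpected_code", so that a new or undocumented response is reported rather than silently ignored.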

Understanding Results

The 3D-Beacons Hub API responses return JSON objects that all modern programming and scripting languages can parse. Throughout the examples presented here, we demonstrate how to parse the JSON responses using Python.

To more easily understand the JSON response, we advise reviewing the 3D-Beacons API specification available at Apiary: https://3dbeacons.docs.apiary.io/#. This interactive documentation shows the latest released specification and defines every field, including their types, ranges, and examples. Previous versions of the specification are available from GitHub: https://github.com/3D-Beacons/3d-beacons-specifications.

The key information in each response is the URL of the model coordinate file, captured in the “model_url” field. All other fields describe the metadata associated with the model, from quality metrics to species and sequence information.
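As an illustration, the following Python sketch collects the “model_url” values from a response. It assumes the summary layout described in the API specification, in which a "structures" list holds items with a "summary" object. The sample JSON, including the provider name and URL, is hypothetical and is not real API output.

```python
import json


def extract_model_urls(response_text):
    """Return (provider, model_url) pairs found in a JSON response."""
    data = json.loads(response_text)
    pairs = []
    for item in data.get("structures", []):
        summary = item.get("summary", {})
        url = summary.get("model_url")
        if url:  # skip entries that carry no coordinate file URL
            pairs.append((summary.get("provider"), url))
    return pairs


# Hypothetical response body, trimmed to the fields used here.
sample_text = json.dumps({
    "structures": [
        {"summary": {"provider": "Example provider",
                     "model_url": "https://example.org/model_0001.cif"}},
        {"summary": {"provider": "Example provider"}},
    ]
})
for provider, url in extract_model_urls(sample_text):
    print(provider, url)
```

The returned URLs can then be passed to any HTTP client to download the coordinate files.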

Time Considerations

Response times of the 3D-Beacons Hub API vary with the input type. Generally, API endpoints keyed on unique identifiers, such as UniProt accessions, return responses in seconds, whereas a sequence-based search might take up to 15 min.
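Because a sequence-based search can run for several minutes, client code usually needs to poll for the result. The Python sketch below shows one possible polling loop. It assumes that a running job is signalled by response code 202, as listed in Table 2. The function name, the default interval and timeout, and the "fetch" callable (which should return a status code and a payload) are hypothetical and can be backed by any HTTP client.

```python
import time


def poll_until_ready(fetch, interval_s=30.0, timeout_s=1200.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until it stops returning 202 or the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        status, payload = fetch()
        if status != 202:  # finished, or failed with another code
            return status, payload
        if clock() >= deadline:
            raise TimeoutError("sequence search did not finish in time")
        sleep(interval_s)


# Demonstration with a stand-in for the real request: two "still running"
# answers followed by a final result. No network access or waiting occurs.
fake_answers = iter([(202, None), (202, None), (200, {"job": "done"})])
result = poll_until_ready(lambda: next(fake_answers), sleep=lambda s: None)
print(result)
```

Passing the request as a callable keeps the waiting logic separate from the HTTP details, which also makes the loop easy to test without contacting the server.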

Acknowledgments

The 3D-Beacons infrastructure was initially funded by the BBSRC grant BB/S020071/1, and its continued development and maintenance are funded by Wellcome Trust 223739/Z/21/Z. We also acknowledge funding from Google DeepMind, which supports the creation of training materials.

Open access funding enabled and organized by Projekt DEAL.

Author Contributions

Paulyna Magaña: Software; visualization; writing – original draft; writing – review and editing. Sreenath Nair: Software; writing – review and editing. Mihaly Varadi: Project administration; supervision; writing – original draft; writing – review and editing. Sameer Velankar: Conceptualization; funding acquisition; writing – review and editing.

Conflict of Interest

The authors declare no conflicts of interest.

Open Research

Data Availability Statement

Documentation of the 3D-Beacons Hub API is available at https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/. The specification of the data exchange format is available at https://3dbeacons.docs.apiary.io/#. The code base of the 3D-Beacons client, shown in Basic Protocol 4, is available at https://github.com/3D-Beacons/3d-beacons-client. The Jupyter notebook accompanying the protocols shown here is available at https://colab.research.google.com/github/3D-Beacons/3D-Beacons/blob/main/Tutorials/Harnessing_3DBeaconsAPI.ipynb. The MSA for use on JalView is available at https://raw.githubusercontent.com/3D-Beacons/3D-Beacons/main/Tutorials/AA_MSAkinase.fasta.

Literature Cited

  • Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403–410. https://doi.org/10.1016/S0022-2836(05)80360-2
  • Armstrong, D. R., Berrisford, J. M., Conroy, M. J., Gutmanas, A., Anyango, S., Choudhary, P., Clark, A. R., Dana, J. M., Deshpande, M., Dunlop, R., Gane, P., Gáborová, R., Gupta, D., Haslam, P., Koča, J., Mak, L., Mir, S., Mukhopadhyay, A., Nadzirin, N., … Velankar, S. (2020). PDBe: Improved findability of macromolecular structure data in the PDB. Nucleic Acids Research, 48(D1), D335–D343. https://doi.org/10.1093/nar/gkz990
  • Cunningham, F., Allen, J. E., Allen, J., Alvarez-Jarreta, J., Amode, M. R., Armean, I. M., Austine-Orimoloye, O., Azov, A. G., Barnes, I., Bennett, R., Berry, A., Bhai, J., Bignell, A., Billis, K., Boddu, S., Brooks, L., Charkhchi, M., Cummins, C., da Rin Fioretto, L., … Flicek, P. (2022). Ensembl 2022. Nucleic Acids Research, 50(D1), D988–D995. https://doi.org/10.1093/nar/gkab1049
  • Ghafouri, H., Lazar, T., del Conte, A., Tenorio Ku, L. G., PED Consortium, Tompa, P., Tosatto, S. C. E., & Monzon, A. M. (2024). PED in 2024: Improving the community deposition of structural ensembles for intrinsically disordered proteins. Nucleic Acids Research, 52(D1), D536–D544. https://doi.org/10.1093/nar/gkad947
  • Hekkelman, M. L., de Vries, I., Joosten, R. P., & Perrakis, A. (2021). AlphaFill: Enriching the AlphaFold models with ligands and co-factors (p. 2021.11.26.470110). bioRxiv. https://doi.org/10.1101/2021.11.26.470110
  • Kikhney, A. G., Borges, C. R., Molodenskiy, D. S., Jeffries, C. M., & Svergun, D. I. (2020). SASBDB: Towards an automatically curated and validated repository for biological scattering data. Protein Science, 29(1), 66–75. https://doi.org/10.1002/pro.3731
  • Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y., dos Santos Costa, A., Fazel-Zarandi, M., Sercu, T., Candido, S., & Rives, A. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123–1130. https://doi.org/10.1126/science.ade2574
  • Procter, J. B., Carstairs, G. M., Soares, B., Mourão, K., Ofoegbu, T. C., Barton, D., Lui, L., Menard, A., Sherstnev, N., Roldan-Martinez, D., Duce, S., Martin, D. M. A., & Barton, G. J. (2021). Alignment of biological sequences with Jalview. Methods in Molecular Biology, 2231, 203–224. https://doi.org/10.1007/978-1-0716-1036-7_13
  • Rambla, J., Baudis, M., Ariosa, R., Beck, T., Fromont, L. A., Navarro, A., Paloots, R., Rueda, M., Saunders, G., Singh, B., Spalding, J. D., Törnroos, J., Vasallo, C., Veal, C. D., & Brookes, A. J. (2022). Beacon v2 and Beacon networks: A ‘lingua franca’ for federated data discovery in biomedical genomics, and beyond. Human Mutation, 43(6), 791–799. https://doi.org/10.1002/humu.24369
  • Sommer, M. J., Cha, S., Varabyou, A., Rincon, N., Park, S., Minkin, I., Pertea, M., Steinegger, M., & Salzberg, S. L. (2022). Structure-guided isoform identification for the human transcriptome. eLife, 11, e82556. https://doi.org/10.7554/eLife.82556
  • Tordai, H., Suhajda, E., Sillitoe, I., Nair, S., Varadi, M., & Hegedus, T. (2022). Comprehensive collection and prediction of ABC transmembrane protein structures in the AI era of structural biology. International Journal of Molecular Sciences, 23(16), 8877. https://doi.org/10.3390/ijms23168877
  • Varadi, M., Anyango, S., Deshpande, M., Nair, S., Natassia, C., Yordanova, G., Yuan, D., Stroe, O., Wood, G., Laydon, A., Žídek, A., Green, T., Tunyasuvunakool, K., Petersen, S., Jumper, J., Clancy, E., Green, R., Vora, A., Lutfi, M., … Velankar, S. (2022). AlphaFold protein structure database: Massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research, 50(D1), D439–D444. https://doi.org/10.1093/nar/gkab1061
  • Varadi, M., Nair, S., Sillitoe, I., Tauriello, G., Anyango, S., Bienert, S., Borges, C., Deshpande, M., Green, T., Hassabis, D., Hatos, A., Hegedus, T., Hekkelman, M. L., Joosten, R., Jumper, J., Laydon, A., Molodenskiy, D., Piovesan, D., Salladini, E., … Velankar, S. (2022). 3D-Beacons: Decreasing the gap between protein sequences and structures through a federated network of protein structure data resources. GigaScience, 11, giac118. https://doi.org/10.1093/gigascience/giac118
  • Velankar, S., Burley, S. K., Kurisu, G., Hoch, J. C., & Markley, J. L. (2021). The protein data bank archive. Methods in Molecular Biology, 2305, 3–21. https://doi.org/10.1007/978-1-0716-1406-8_1
  • Waterhouse, A., Bertoni, M., Bienert, S., Studer, G., Tauriello, G., Gumienny, R., Heer, F. T., de Beer, T. A. P., Rempfer, C., Bordoli, L., Lepore, R., & Schwede, T. (2018). SWISS-MODEL: Homology modelling of protein structures and complexes. Nucleic Acids Research, 46(W1), W296–W303. https://doi.org/10.1093/nar/gky427
  • Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., … Mons, B. (2016). The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3(1), 160018. https://doi.org/10.1038/sdata.2016.18
