Informatics Scientific Software and Resources

Software

Informatics scientists develop software and utilities for a wide variety of scientific applications, from data management to high-throughput data analysis. They also support the back-end databases and services on which many tools and applications depend.

Hilbert-Curve Assisted Structure Embedding (HCASE)

https://github.com/ncats/hcase.git

The Hilbert-Curve Assisted Structure Embedding (HCASE) method creates a 2-D map of molecules that medicinal chemists can interpret intuitively. Unlike maps produced by other methods, HCASE maps place molecules in specific segments according to a well-defined ordering of scaffolds, which reflects the medicinal chemist’s thought process.
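The role of the Hilbert curve in such a layout can be illustrated with a minimal Python sketch. It assumes the scaffolds have already been sorted into a linear order and uses the standard index-to-coordinate Hilbert mapping. The scaffold names and the helper `place_on_grid` are hypothetical, and the sketch is not the HCASE implementation itself.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve to (x, y) on a 2**order by 2**order grid."""
    x = y = 0
    t = d
    s = 1
    side = 1 << order
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the quadrant so consecutive indices stay adjacent.
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def place_on_grid(ordered_scaffolds, order):
    """Assign scaffolds, in their given order, to consecutive cells of the curve."""
    side = 1 << order
    if len(ordered_scaffolds) > side * side:
        raise ValueError("grid too small for the number of scaffolds")
    return {name: hilbert_d2xy(order, i) for i, name in enumerate(ordered_scaffolds)}


# Hypothetical, pre-sorted scaffold names (illustration only).
scaffolds = ["benzene", "pyridine", "indole", "quinoline"]
layout = place_on_grid(scaffolds, order=1)
```

Because consecutive indices on a Hilbert curve always land in neighboring cells, items that are close in the linear ordering stay close on the 2-D map.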

Integration of -Omics Data Through Linear Modeling (IntLIM)

Code: https://github.com/ncats/IntLIM

Interface: https://intlim.ncats.io/

Integration of -omics data through LInear Modeling (IntLIM) is a user-friendly app that supports integration of multi-omics data (e.g., gene expression and metabolomics data). The software identifies analyte relationships, such as gene-metabolite associations, that are specific to a given phenotype (e.g., cancer vs. non-cancer).
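One way to picture a phenotype-specific association is a linear model with an interaction term, m = b0 + b1*g + b2*p + b3*(g*p), where g is a gene level, m is a metabolite level and p is a binary phenotype. The model form, function names and data below are assumptions made for illustration and are not taken from the IntLIM code. With a binary p, the least-squares estimate of b3 equals the difference between the gene-metabolite slopes fitted separately in each phenotype group, which is what this sketch computes.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def interaction_coefficient(gene, metabolite, phenotype):
    """Difference in gene-metabolite slope between phenotype 1 and phenotype 0."""
    groups = {0: ([], []), 1: ([], [])}
    for g, m, p in zip(gene, metabolite, phenotype):
        groups[p][0].append(g)
        groups[p][1].append(m)
    return slope(*groups[1]) - slope(*groups[0])


# Made-up values: the association is positive in group 0 and negative in group 1.
gene = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
metabolite = [2.0, 4.0, 6.0, 8.0, 5.0, 4.0, 3.0, 2.0]
phenotype = [0, 0, 0, 0, 1, 1, 1, 1]
b3 = interaction_coefficient(gene, metabolite, phenotype)
```

A coefficient far from zero suggests that the gene-metabolite relationship differs between the two phenotypes; a real analysis would also test its statistical significance.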

Layered Chemical Identifier (LyChI)

https://github.com/ncats/lychi

The Layered Chemical Identifier (LyChI) is a chemical standardization tool that generates a unique, layered hash for each chemical. The hash supports quick fuzzy uniqueness checks and searches. A unique feature of the LyChI hash keys is that they are, to a certain extent, lexicologically meaningful.
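A layered hash can be pictured as a sequence of short digests, each computed from a progressively more detailed description of the structure, so that two records can be compared layer by layer. The Python sketch below only illustrates that idea. The layer contents, the digest length and the function names are all hypothetical and do not reproduce LyChI's actual algorithm or key format.

```python
import hashlib


def layered_key(layers, digest_chars=8):
    """Build a key with one short digest per layer; each digest covers all layers so far."""
    parts = []
    cumulative = ""
    for layer in layers:
        cumulative += "|" + layer
        digest = hashlib.sha1(cumulative.encode("utf-8")).hexdigest()
        parts.append(digest[:digest_chars])
    return "-".join(parts)


def shared_layers(key_a, key_b):
    """Count how many leading layers two keys have in common."""
    count = 0
    for a, b in zip(key_a.split("-"), key_b.split("-")):
        if a != b:
            break
        count += 1
    return count


# Hypothetical layer descriptions: skeleton, atom labels, stereo, salt form.
record_a = layered_key(["C-C-C", "C-C-O", "R", "free base"])
record_b = layered_key(["C-C-C", "C-C-O", "S", "free base"])
depth = shared_layers(record_a, record_b)
```

Two records that agree on the first layers but differ later can then be treated as a fuzzy match at the coarser level, while a full-key match indicates an exact match.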

MolVec

Code: https://github.com/ncats/molvec

Interface: https://molvec.ncats.io

MolVec is optical chemical structure recognition software that converts images into structured data for computation. The software takes images of chemical renderings in a variety of formats (e.g., PNG, TIFF, GIF) as input and produces vectorized 2-D formats (e.g., SDF) that faithfully reconstruct the drawn structures. MolVec currently is considered one of the most accurate open-source tools for this task.

Molwitch

https://github.com/ncats/molwitch

Molwitch is a cheminformatics bridge layer application programming interface (API) that allows users to switch the underlying cheminformatics library, such as JChem, CDK or Indigo, without having to recompile their code.

Molwitch-renderer

https://github.com/ncats/molwitch-renderer

Molwitch-renderer takes a chemical structure in molfile or SMILES format and produces a rendered image of that structure. The software uses the Molwitch library.

Scaffold Hopper

https://github.com/ncats/scaffold-hopper

Scaffold Hopper allows a user to “hop” between related contexts (i.e., structures, documents, targets, MeSH terms) with a single click. A novel feature of this software is that it can automatically perceive R-group decomposition.

Stitcher

https://github.com/ncats/stitcher

Stitcher provides a graph-based approach to entity stitching and resolution using clique detection. This software currently is used to support work on providing reference data sets for drugs and rare diseases.
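Clique detection for entity resolution can be sketched as follows, assuming each record is a node and an edge means two records share an identifier. The record names are hypothetical, and the basic Bron-Kerbosch enumeration used here is a textbook technique chosen for illustration, not necessarily the algorithm Stitcher uses.

```python
def build_adjacency(edges):
    """Turn a list of undirected edges into a node -> neighbors mapping."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def maximal_cliques(adjacency):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for node in list(candidates):
            neighbors = adjacency[node]
            expand(current | {node}, candidates & neighbors, excluded & neighbors)
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return cliques


# Hypothetical records; an edge means the two records share an identifier.
edges = [("rec_A", "rec_B"), ("rec_B", "rec_C"), ("rec_A", "rec_C"), ("rec_C", "rec_D")]
cliques = maximal_cliques(build_adjacency(edges))
```

Records in the same clique are mutually linked and are therefore strong candidates for merging into one entity, whereas a record attached by a single edge (such as `rec_D` here) would need further evidence.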

Structure Indexer

https://github.com/ncats/structure-indexer

Structure Indexer is an inverted index data structure that supports fast structure searching. The implementation is based on Apache Lucene. The software can be used standalone or embedded within a service. It currently is used by the Global Substance Registration System (G-SRS) software.
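The inverted-index idea can be shown with a small in-memory Python sketch: each fragment key maps to the set of structures that contain it, and a query is screened by intersecting those sets. The class, the fragment keys and the structure identifiers are hypothetical. This is not the Lucene-based implementation, and a real substructure search would still verify each candidate with graph matching.

```python
from collections import defaultdict


class InvertedIndex:
    """Map each fragment key to the set of structure IDs that contain it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, structure_id, fragment_keys):
        for key in fragment_keys:
            self._postings[key].add(structure_id)

    def candidates(self, query_keys):
        """Return IDs of structures that contain every key in the query."""
        posting_sets = [self._postings.get(key, set()) for key in query_keys]
        if not posting_sets:
            return set()
        return set.intersection(*posting_sets)


# Hypothetical fragment keys for three structures.
index = InvertedIndex()
index.add("s1", ["aromatic_ring", "hydroxyl"])
index.add("s2", ["aromatic_ring", "amine"])
index.add("s3", ["aromatic_ring", "hydroxyl", "amine"])
hits = index.candidates(["aromatic_ring", "hydroxyl"])
```

Screening by set intersection discards most non-matching structures cheaply, so the expensive exact comparison only runs on the short candidate list.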

NCATS Biomedical Data Translator

https://github.com/NCATSTranslator

The Biomedical Data Translator program is a consortium of NCATS and extramural data science researchers. It supports the integration of existing medical and biological data sources to produce tools that help researchers understand the pathophysiology of human disease and that augment human reasoning and inference. The informatics backbone of this effort is the development of community standards for data reuse, including Biolink as a semantic standard, Smart-API for discoverability and the Reasoner API as a communication standard.

Resources

NCATS’ informatics scientists have produced a wide array of publicly available resources that support various areas of translational research.

Acute Toxicity Data Set

https://github.com/ncats/ld50-multitask

This repository contains curated multispecies acute toxicity data, focusing primarily on endpoints such as lethal dose 50, lethal dose low and toxic dose low. The data were obtained from ChemIDplus. NCATS developed multitask prediction models using random forests, deep neural networks and graph-based neural networks.

ADME@NCATS in the Open Data Portal

https://opendata.ncats.nih.gov/adme

ADME@NCATS provides computational models for predicting absorption, distribution, metabolism and excretion (ADME) endpoints that are potentially useful in early drug discovery for structure optimization. The data set used for modeling was obtained from in vitro assays performed at NCATS.

BioPlanet

https://tripod.nih.gov/bioplanet/

BioPlanet is a comprehensive, publicly accessible informatics resource that catalogues annotations and relationships between pathways, healthy and disease states, and targets. BioPlanet integrates pathway annotations from publicly available, manually curated sources; the annotations have been cross-evaluated for redundancy and consistency through extensive manual curation. The browser supports interactive browsing; retrieval and analysis of pathways; exploration of pathway connections; and pathway search by gene targets, category and availability of bioactivity assays.

COVID-19 in the Open Data Portal

https://opendata.ncats.nih.gov/covid19

The COVID-19 Open Data Portal provides open access to screening data, animal model data and multi-omics data. Informatics scientists at NCATS are contributing to back-end and front-end development of the website and are helping coordinate the data ingestion into the site. This resource enables a variety of drug repurposing activities and allows researchers to formulate hypotheses, prioritize research opportunities, and speed the search for effective therapies against SARS-CoV-2 and COVID-19.

CURE ID

https://cure.ncats.io/

CURE ID is a collaboration between the U.S. Food and Drug Administration (FDA) and NCATS that gives the global clinical community the opportunity to report novel uses of existing drugs for patients with difficult-to-treat infectious diseases. Reports can be submitted through a website or through a mobile app on a smartphone or other mobile device. CURE ID was developed with support from the Infectious Diseases Society of America, the Centers for Disease Control and Prevention, and the World Health Organization.

Genetic and Rare Diseases Information Center (GARD)

https://rarediseases.info.nih.gov

The Genetic and Rare Diseases Information Center (GARD) provides the public with access to current, reliable and easy-to-understand information about rare or genetic diseases. Informatics scientists organize, synthesize and update the backbone database for GARD. The program is funded by NCATS and the National Human Genome Research Institute.

Global Substance Registration System (G-SRS)

https://gsrs.ncats.nih.gov

G-SRS provides a common identifier for all of the substances used in medicinal products, using consistent definitions of substances globally, including active substances under clinical investigation, consistent with the ISO 11238 standard. G-SRS is a collaborative effort between NCATS and the FDA.

hERG Models and Data Set

https://github.com/ncats/herg-ml

The repository contains curated bioactivity data for hERG channel inhibition; the data were obtained from ChEMBL and integrated with NCATS’ in-house data from a thallium-flux assay, a high-throughput assay for measuring hERG channel activity. NCATS provides prediction models built on the integrated data set using both classical and modern AI approaches.

Matrix Client

https://tripod.nih.gov/matrix-client

NCATS’ Matrix client is a publicly available web-based interface for the analysis of and access to NCATS combination screening data.

Thousands of combination pairs can be analyzed with this platform in a robust and cost- and time-effective way, enabling NCATS scientists to quickly narrow down a long list of drug pairs to identify the most effective drug combinations for follow-up and clinical studies.

NCATS Inxight Drugs

https://drugs.ncats.io

Inxight Drugs incorporates and unifies manually curated data supplied by the FDA and private companies, and it provides marketing and regulatory status, rigorous drug ingredient definitions, information about biological activity and clinical use, and more. NCATS has developed Inxight data resources to facilitate translational research.

Pharos

https://pharos.nih.gov/

Pharos is a comprehensive, integrated knowledge base for drug discovery and target validation. It was created to help illuminate the uncharacterized or poorly annotated portion of the genome. Pharos is the user interface to the Knowledge Management Center for the Illuminating the Druggable Genome (IDG) program funded by the NIH Common Fund.

Learn more: Pharos: Collating Protein Information to Shed Light on the Druggable Genome.

Predictor

https://predictor.ncats.io/

NCATS Predictor provides chemical structure-property and structure-activity models for drug discovery and development. The models are developed from NCATS’ in-house data sets and published literature, and they use innovative deep-learning approaches to make predictions.

Learn more: Novel Consensus Architecture to Improve Performance of Large-Scale Multitask Deep Learning QSAR Models.

Relational Database of Metabolic Pathways (RaMP)

Code: https://github.com/ncats/RaMP-DB

Interface: https://rampdb.ncats.io

The Relational Database of Metabolic Pathways (RaMP) is a publicly available relational database that integrates multiple sources of biological, chemical and analyte (metabolite, protein, gene) annotations. The source code for building the database is available, and a user-friendly application to query the database is provided. RaMP also supports pathway enrichment analysis for multi-omics data input.
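Pathway enrichment is commonly framed as an over-representation test: given a background of analytes, how surprising is the overlap between a user's list and a pathway? The Python sketch below computes a one-sided hypergeometric p-value for that overlap. The choice of test, the function name and the counts are assumptions for illustration and are not taken from the RaMP source code.

```python
from math import comb


def enrichment_p_value(background, in_pathway, selected, overlap):
    """Probability of seeing at least `overlap` pathway members among `selected`
    analytes drawn without replacement from `background` analytes, of which
    `in_pathway` belong to the pathway (hypergeometric upper tail)."""
    total = comb(background, selected)
    upper = min(in_pathway, selected)
    favorable = sum(
        comb(in_pathway, i) * comb(background - in_pathway, selected - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / total


# Made-up counts: 20 background analytes, 5 in the pathway,
# 5 in the user's list, 3 of which fall in the pathway.
p_value = enrichment_p_value(background=20, in_pathway=5, selected=5, overlap=3)
```

In practice such a test is run once per pathway, so the resulting p-values would normally be corrected for multiple testing before being interpreted.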

SmartGraph

https://smartgraph.ncats.io

SmartGraph is a predictive network-pharmacology platform that integrates drug-target and protein-protein interactions. Investigators can analyze the perturbation of the protein interaction network caused by single or multiple (small-molecule) agents. SmartGraph enables researchers without an in-depth informatics background to perform analyses by providing powerful network visualization and user-friendly “single click” options to perform complex workflows.
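The kind of question such a network supports, for example how a compound connects to a protein of interest through its targets and their interaction partners, can be sketched as a breadth-first search over a directed graph. The node names, edges and function name below are hypothetical placeholders, and the sketch does not reflect SmartGraph's actual data model or algorithms.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Return one shortest path from start to goal as a list of nodes, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical network: one drug-target edge followed by protein-protein edges.
network = {
    "drug_X": ["protein_1"],
    "protein_1": ["protein_2", "protein_3"],
    "protein_3": ["protein_4"],
}
route = shortest_path(network, "drug_X", "protein_4")
```

Here `route` lists the chain of interactions linking the compound to the protein of interest; with several compounds, the same search can be repeated from each starting node.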

Tox21 Gateway

https://tripod.nih.gov/tox

The Tox21 Gateway is a web-based interface for the analysis of and access to Tox21 screening data. The Tox21 10K compound library has been screened against approximately 70 cell-based assays in qHTS format, generating approximately 100 million data points. The Gateway contains a suite of tools, including a public website for browsing and downloading Tox21 assay data and compound library annotations, such as analytical QC results; an assay tracking system that stores assay metadata and detailed experimental conditions; a structure-activity analysis tool; and links to Tox21 publications and presentations.