Translational Data Analytics
Our research activities involve developing novel methodologies for analyzing various types of research data and applying these methodologies, as well as existing ones, to translational research projects.
We provide informatics support for conducting consortium-wide projects and analyses.
NCATS Biomedical Data Translator
https://github.com/NCATSTranslator
https://ui.transltr.io/demo
The Biomedical Data Translator program is a consortium of NCATS and extramural data science researchers that integrates existing medical and biological data sources into tools for understanding the pathophysiology of human disease and for augmenting human reasoning and inference. The informatics backbone of this effort is the development of community standards for data reuse, including the Biolink Model as a semantic standard, SmartAPI for discoverability, and the Translator Reasoner API (TRAPI) as a communication standard.
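As a minimal illustration of the communication standard, the sketch below posts a one-hop TRAPI query (which chemical entities are asserted to treat a given disease?) to a Translator service. The endpoint URL is a placeholder, not an official Translator address, and the query-graph layout follows the TRAPI 1.x schema.

    # Minimal sketch of a Translator Reasoner API (TRAPI) one-hop query.
    # The endpoint URL is a placeholder; real Translator services expose
    # their own TRAPI /query routes (discoverable via SmartAPI).
    import requests

    TRAPI_URL = "https://example-translator-service.org/query"  # hypothetical endpoint

    # Ask: which chemical entities are asserted to treat a given disease?
    query = {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"ids": ["MONDO:0005148"], "categories": ["biolink:Disease"]},
                    "n1": {"categories": ["biolink:ChemicalEntity"]},
                },
                "edges": {
                    "e0": {"subject": "n1", "object": "n0", "predicates": ["biolink:treats"]},
                },
            }
        }
    }

    response = requests.post(TRAPI_URL, json=query, timeout=60)
    response.raise_for_status()
    results = response.json()["message"].get("results", [])
    print(f"{len(results)} results returned")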
Cheminformatics and Other Utilities
Molwitch
https://github.com/ncats/molwitch
Molwitch is a cheminformatics bridge-layer application programming interface (API) that allows users to switch the underlying cheminformatics library, such as JChem, CDK or Indigo, without having to recompile their code.
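The bridge idea can be sketched in a language-agnostic way: callers depend only on a toolkit-neutral interface, and each underlying library is wrapped by an adapter that can be swapped out. The Python sketch below is purely conceptual (Molwitch itself is a Java API); the interface and adapter names are hypothetical, and RDKit merely stands in for one possible backend.

    # Conceptual sketch of a bridge-layer API (not Molwitch's actual Java interface):
    # callers depend only on a toolkit-neutral interface, and each underlying
    # cheminformatics library is wrapped by an adapter that can be swapped out.
    from abc import ABC, abstractmethod


    class ChemicalToolkit(ABC):
        """Hypothetical toolkit-neutral interface."""

        @abstractmethod
        def canonical_smiles(self, smiles: str) -> str: ...

        @abstractmethod
        def molecular_weight(self, smiles: str) -> float: ...


    class RdkitToolkit(ChemicalToolkit):
        """Adapter delegating to RDKit, standing in for one possible backend."""

        def canonical_smiles(self, smiles: str) -> str:
            from rdkit import Chem
            return Chem.MolToSmiles(Chem.MolFromSmiles(smiles))

        def molecular_weight(self, smiles: str) -> float:
            from rdkit import Chem
            from rdkit.Chem import Descriptors
            return Descriptors.MolWt(Chem.MolFromSmiles(smiles))


    def describe(toolkit: ChemicalToolkit, smiles: str) -> str:
        # Caller code never touches the underlying library directly.
        return f"{toolkit.canonical_smiles(smiles)} ({toolkit.molecular_weight(smiles):.1f} g/mol)"


    print(describe(RdkitToolkit(), "CC(=O)Oc1ccccc1C(O)=O"))  # aspirin, via whichever backend is plugged in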
OpenData Renderer
https://github.com/ncats/molwitch-renderer
https://opendata.ncats.nih.gov/renderer/render(N%5BC@@H%5D(CC1%3DCC(I)%3DC(OC2%3DCC(I)%3DC(O)C(I)%3DC2)C(I)%3DC1)C(O)%3DO)?size=450
https://opendata.ncats.nih.gov/resolver/_options
Molwitch-renderer takes a chemical structure in molfile or SMILES format and produces a rendered image of that structure. The software is built on the Molwitch library.
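A minimal sketch of calling the render endpoint shown above: URL-encode a SMILES string, request the rendered image, and save the raw response bytes. The structure is the one from the example URL, and the image format is taken from the response's Content-Type header rather than assumed.

    # Sketch of a call to the OpenData renderer endpoint shown above.
    from urllib.parse import quote
    import requests

    smiles = "N[C@@H](CC1=CC(I)=C(OC2=CC(I)=C(O)C(I)=C2)C(I)=C1)C(O)=O"  # structure from the example URL
    encoded = quote(smiles, safe="()@")  # mirrors the encoding used in the example URL
    url = f"https://opendata.ncats.nih.gov/renderer/render({encoded})?size=450"

    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    ext = "svg" if "svg" in resp.headers.get("Content-Type", "") else "png"
    with open(f"structure.{ext}", "wb") as fh:
        fh.write(resp.content)
    print(f"saved structure.{ext} ({len(resp.content)} bytes)")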
Stitcher
https://github.com/ncats/stitcher
Stitcher provides a graph-based approach to entity stitching and resolution using clique detection. The software is currently used to support reference data sets for drugs and rare diseases.
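A minimal sketch of the clique-based stitching idea (not Stitcher's actual implementation): records become graph nodes, any shared identifier creates an edge, and maximal cliques of mutually agreeing records are resolved to a single entity. The toy records and keys below are illustrative only.

    # Toy illustration of clique-based entity stitching using networkx.
    import networkx as nx

    # Toy records: record id -> set of stitching keys (CAS numbers, UNIIs, names, ...)
    records = {
        "A": {"cas:50-78-2", "name:aspirin"},
        "B": {"cas:50-78-2", "unii:R16CO5Y76E"},
        "C": {"unii:R16CO5Y76E", "name:aspirin"},
        "D": {"name:ibuprofen"},
    }

    G = nx.Graph()
    G.add_nodes_from(records)
    ids = list(records)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if records[a] & records[b]:        # any shared key creates an edge
                G.add_edge(a, b)

    # Maximal cliques: groups in which every pair of records shares evidence.
    for clique in nx.find_cliques(G):
        print(sorted(clique))   # e.g. ['A', 'B', 'C'] stitched together; ['D'] left alone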
Structure Indexer
https://github.com/ncats/structure-indexer
Structure Indexer is an inverted-index data structure that supports fast chemical structure searching. The implementation is based on Apache Lucene. The software can be used standalone or embedded within a service, and it is currently used by the Global Substance Registration System (G-SRS) software.
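The screen-then-verify idea behind such an index can be sketched as follows. This Python/RDKit version only illustrates the concept (the actual implementation is Java on Lucene), and the small SMILES library is illustrative.

    # Illustrative fingerprint prescreen for substructure search: an inverted index
    # maps each fingerprint bit to the molecules containing it, so only candidates
    # carrying every bit of the query need the exact substructure match.
    from collections import defaultdict
    from rdkit import Chem

    library = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(O)=O", "CCN"]

    index = defaultdict(set)   # fingerprint bit -> set of molecule ids containing it
    mols = {}
    for mol_id, smi in enumerate(library):
        mol = Chem.MolFromSmiles(smi)
        mols[mol_id] = mol
        for bit in Chem.RDKFingerprint(mol).GetOnBits():
            index[bit].add(mol_id)

    def substructure_search(query_smiles):
        query = Chem.MolFromSmiles(query_smiles)
        bits = list(Chem.RDKFingerprint(query).GetOnBits())
        # Screen: candidates must contain every bit set in the query fingerprint.
        candidates = set(mols) if not bits else set.intersection(*(index[b] for b in bits))
        # Verify: run the exact substructure match only on surviving candidates.
        return [m for m in candidates if mols[m].HasSubstructMatch(query)]

    print(substructure_search("c1ccccc1"))  # ids of library molecules containing a benzene ring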
Support for NCATS Scientific Computing
The Informatics (IFX) Core produces customized computational workflows to enable and streamline the analysis of data obtained from novel technologies (e.g., metabolomics, RASL-seq). These workflows are embedded within the NCATS scientific computing environment to meet the needs of the Division of Preclinical Innovation (DPI), but they could readily be embedded in other environments as well.
Examples of customized computational workflows include bulk and single-cell RNA sequencing pipelines, high-throughput screening analyses using Spotfire, compound registration and management, and comprehensive metabolomic profiling.
Collaborative Research Efforts
The IFX Core applies state-of-the-art analysis methodologies, some developed by our group, to large molecular and -omics data sets collected in translational research. Generally, we aim to identify molecules (e.g., DNA, RNA, proteins, metabolites) that characterize cellular and disease states and to facilitate interpretation of these complex data, furthering our knowledge of the biological and cellular mechanisms underlying disease.
Single-Cell Sequencing Techniques to Gain Insights into Small-Molecule Chemical Biology
- Evaluate stem cell differentiation through single- and multi-compound studies to optimize for the intended cellular fate, including cell type classification and tracking marker gene sets through differentiation time courses (see the sketch after this list).
- Evaluate cellular response to small molecules in cancer models to understand cell type–specific responses and response heterogeneity.
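A minimal sketch of the marker-gene tracking mentioned above, using Scanpy; the input file, the "timepoint" metadata field and the marker genes are placeholders, not a specific NCATS data set or pipeline.

    # Minimal sketch: score a marker gene set per cell and summarize the score
    # across differentiation time points. File, field and gene names are placeholders.
    import scanpy as sc

    adata = sc.read_h5ad("differentiation_timecourse.h5ad")  # hypothetical input

    # Standard normalization before scoring.
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

    # Score each cell for a marker gene set of the intended fate (placeholder genes).
    cardiac_markers = ["TNNT2", "MYH6", "NKX2-5"]
    sc.tl.score_genes(adata, gene_list=cardiac_markers, score_name="cardiac_score")

    # Summarize how the score shifts across collection time points.
    print(adata.obs.groupby("timepoint")["cardiac_score"].mean())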
Machine Learning and Descriptive Modeling
The IFX Core engages in various research efforts in which machine learning (ML) methods are applied in new ways or novel ML approaches are developed. Our goal is to systematically translate big biomedical data into knowledge that supports clinical and preclinical research, using state-of-the-art computational techniques. Our approaches allow us to collect, integrate and analyze large and diverse bodies of preclinical and clinical data, which can help researchers prioritize therapeutic hypotheses and reveal hidden relations between drugs, targets and diseases.
Prediction of Chemical Properties
- Building machine learning models to predict a wide range of ADME (Absorption, Distribution, Metabolism and Excretion) properties (e.g., rat liver microsomal stability, parallel artificial membrane permeability, kinetic aqueous solubility, cytochrome P450-mediated metabolism) using the ADME database (a minimal modeling sketch follows this list).
- Curated multispecies acute toxicity data, focusing on endpoints such as median lethal dose (LD50), lethal dose low and toxic dose low. The data were obtained from ChemIDplus. We developed multitask prediction models using random forests, deep neural networks and graph-based neural networks.
- Curated bioactivity data for hERG channel inhibition; the data were obtained from ChEMBL and integrated with NCATS’ in-house data from a thallium-flux assay, a high-throughput assay for measuring hERG channel activity. We provide prediction models built on the integrated data set using both classical and modern AI approaches.
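A minimal sketch of the kind of model described above, assuming RDKit Morgan fingerprints and a scikit-learn random forest; the CSV file and column names are placeholders and do not refer to NCATS' actual data sets or pipelines.

    # Sketch: random forest on Morgan fingerprints for a binary endpoint
    # (e.g., hERG inhibition). File and column names are hypothetical.
    import numpy as np
    import pandas as pd
    from rdkit import Chem
    from rdkit.Chem import AllChem
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    data = pd.read_csv("herg_training_set.csv")  # hypothetical: columns "smiles", "active"

    def featurize(smiles):
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return None
        return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048))

    feats = data["smiles"].map(featurize)
    mask = feats.notna()
    X = np.stack(feats[mask].to_list())
    y = data.loc[mask, "active"].to_numpy()

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
    model = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
    model.fit(X_tr, y_tr)
    print("test ROC AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))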