Translational Data Analytics

Informatics (IFX) Core scientists in NCATS’ Division of Preclinical Innovation (DPI) take part in highly collaborative translational research projects that span bioinformatics, cheminformatics, multi-omics, data science, software engineering, biology and chemistry. Our research involves developing novel analysis methodologies for various types of research data (e.g., high-throughput screening, high-throughput sequencing and metabolomics/proteomics) and applying these novel methodologies, or existing ones, to translational research projects.

Read about our research activities:

Machine Learning and Descriptive Modeling

The IFX Core engages in various research efforts where machine learning (ML) methods are applied in new ways or novel ML approaches are developed. Our goal is to use state-of-the-art computational techniques to systematically translate large-scale biomedical data into knowledge that supports clinical and preclinical research. Our approaches allow us to collect, integrate and analyze large and diverse bodies of preclinical and clinical data, which can help researchers prioritize therapeutic hypotheses and reveal hidden relations between drugs, targets and diseases.

Prediction of Chemical Properties

  • Building machine learning models to predict a wide range of absorption, distribution, metabolism and excretion (ADME) properties (e.g., rat liver microsomal stability, permeability measured by parallel artificial membrane permeability assays, kinetic aqueous solubility and cytochrome P450-mediated metabolism) using the ADME database
  • Curated multispecies acute toxicity data, focusing primarily on endpoints such as lethal dose 50 (LD50), lethal dose low (LDLo) and toxic dose low (TDLo). The data were obtained from ChemIDplus. NCATS developed multitask prediction models using random forests, deep neural networks and graph-based neural networks.
  • Curated bioactivity data for hERG channel inhibition; the data were obtained from ChEMBL and integrated with NCATS’ in-house data from a thallium-flux assay, a high-throughput assay for measuring hERG channel activity. NCATS provides prediction models built on the integrated data set using both classical and modern AI approaches.
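
To illustrate what "multitask" means for data like the multispecies toxicity set above, the sketch below pivots long-format records into a compound-by-task table, leaving unmeasured pairs empty so that a model can skip them rather than treat them as zeros. It is a minimal illustration under stated assumptions, not NCATS' pipeline: the compound IDs, task names and values are hypothetical placeholders, and `build_multitask_matrix` and `coverage` are names invented for this example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record type: (compound_id, task_name, value).
Record = Tuple[str, str, float]


def build_multitask_matrix(records: List[Record]):
    """Pivot long-format records into a compound x task matrix.

    Cells with no measurement are None, so a multitask model can mask
    them out of its loss instead of reading them as zeros.
    """
    compounds = sorted({c for c, _, _ in records})
    tasks = sorted({t for _, t, _ in records})
    lookup: Dict[Tuple[str, str], float] = {(c, t): v for c, t, v in records}
    matrix: List[List[Optional[float]]] = [
        [lookup.get((c, t)) for t in tasks] for c in compounds
    ]
    return compounds, tasks, matrix


def coverage(matrix: List[List[Optional[float]]]) -> float:
    """Return the fraction of cells that hold a measured value."""
    cells = [v for row in matrix for v in row]
    if not cells:
        return 0.0
    return sum(v is not None for v in cells) / len(cells)


# Placeholder data for illustration only; these are not real measurements.
records = [
    ("cmpd_1", "rat_oral_LD50", 2.1),
    ("cmpd_1", "mouse_oral_LD50", 2.4),
    ("cmpd_2", "rat_oral_LD50", 3.0),
    ("cmpd_3", "mouse_ip_LDLo", 1.2),
]
compounds, tasks, matrix = build_multitask_matrix(records)
for name, row in zip(compounds, matrix):
    print(name, row)
print("coverage:", round(coverage(matrix), 3))
```

Tables like this are typically sparse, because most compounds are measured in only a few species and endpoints; sharing one model across tasks is one way to make use of the partial overlap.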

Extracting Knowledge From Data in Rare Diseases

  • Development of natural language processing (NLP)-based approaches to systematically analyze PubMed abstracts, social media and NIH funding pertinent to rare diseases. These analyses help identify gaps and scientific challenges that remain unaddressed in rare disease research.
  • Development of a computational approach to support data harmonization and data interoperability with existing standardized terminologies and ontologies for NCATS’ Genetics and Rare Diseases Information Center (GARD). One outcome from these efforts, the GARD Data Tree, has facilitated curation efforts (Zhu, et al., JMIR Med Inform. Oct 2020).
  • Provided source code for a blackboard system that supports building knowledge bases (Blackboard).
  • Developed a clinical decision support prototype, based on network theory and information retrieval over disease and phenotype ontologies, that takes as input a list of clinical features and provides a ranked list of relevant rare diseases (Zebra Rank).
  • Developed a comprehensive knowledge base of diseases based on semantic concept filtering of PubMed (NCATS Knowledge Base).
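
The idea of turning a list of clinical features into a ranked list of candidate diseases can be sketched with a simple information-retrieval score. The example below weights each matched feature by its rarity across diseases (an inverse-document-frequency weight) and sorts diseases by total weight. This is an assumption-laden toy, not the Zebra Rank algorithm: it ignores the ontology structure and network methods mentioned above, and the disease names, feature names and the `rank_diseases` function are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple


def rank_diseases(
    query: List[str],
    disease_features: Dict[str, Set[str]],
) -> List[Tuple[str, float]]:
    """Rank diseases by the summed rarity weight of matched query features."""
    n = len(disease_features)

    # Count how many diseases list each feature.
    doc_freq: Dict[str, int] = {}
    for feats in disease_features.values():
        for f in feats:
            doc_freq[f] = doc_freq.get(f, 0) + 1

    def idf(feature: str) -> float:
        # Smoothed inverse document frequency: rarer features weigh more.
        return math.log((1 + n) / (1 + doc_freq.get(feature, 0))) + 1.0

    scored = []
    for disease, feats in disease_features.items():
        score = sum(idf(f) for f in set(query) if f in feats)
        if score > 0:
            scored.append((disease, score))

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Placeholder associations for illustration only; not clinical data.
disease_features = {
    "disease_A": {"feature_1", "feature_2", "feature_3"},
    "disease_B": {"feature_1", "feature_4"},
    "disease_C": {"feature_2", "feature_3", "feature_5"},
}
ranked = rank_diseases(["feature_2", "feature_3", "feature_5"], disease_features)
for disease, score in ranked:
    print(disease, round(score, 3))
```

Here disease_C ranks first because it matches all three query features, including the rarer feature_5, while disease_B is dropped because it matches none.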

Molecular Profiling and Multi-Omic Methods

The IFX Core is actively developing algorithms and tools to help interpret omic and multi-omic data. These efforts are highly collaborative and involve investigators within NCATS’ DPI and beyond.

  • Development of multi-omic (e.g., metabolomic, proteomic, transcriptomic) integration methods, both pathway-based and numerical
  • Development of methods for the analysis of dose-response transcriptomic profiles
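
As a concrete picture of what analyzing a dose-response profile can involve, the sketch below models one gene's expression change with a four-parameter Hill curve and recovers the half-maximal dose (EC50) by a coarse grid search. The Hill model is a common choice that is assumed here for illustration; the text above does not say which models the IFX Core uses. The doses, parameter values and function names are hypothetical, and the responses are simulated rather than measured.

```python
from typing import List


def hill(dose: float, bottom: float, top: float, ec50: float, slope: float) -> float:
    """Four-parameter Hill curve; dose and ec50 must be positive."""
    return bottom + (top - bottom) / (1.0 + (ec50 / dose) ** slope)


def fit_ec50(
    doses: List[float],
    responses: List[float],
    bottom: float,
    top: float,
    slope: float,
) -> float:
    """Grid-search EC50 on a log scale, holding the other parameters fixed."""
    # Candidate EC50 values from 1e-3 to 1e3, ten per decade.
    grid = [10 ** (k / 10) for k in range(-30, 31)]

    def sse(ec50: float) -> float:
        # Sum of squared errors between the curve and the observations.
        return sum(
            (hill(d, bottom, top, ec50, slope) - r) ** 2
            for d, r in zip(doses, responses)
        )

    return min(grid, key=sse)


# Simulated profile for one hypothetical gene (not real data).
doses = [0.01, 0.1, 1.0, 10.0, 100.0]
responses = [hill(d, 0.0, 2.0, 1.0, 1.5) for d in doses]

best_ec50 = fit_ec50(doses, responses, bottom=0.0, top=2.0, slope=1.5)
print("estimated EC50:", best_ec50)
```

In practice all four parameters would be fitted jointly, per gene, with a proper optimizer and noise model; the grid search only keeps this example dependency-free.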