Informatics Research

Informatics scientists in NCATS’ Division of Preclinical Innovation (DPI) are involved in highly collaborative translational research projects that span bioinformatics, cheminformatics, multi-omics, data science, software engineering, biology and chemistry. Our research activities involve developing novel analysis methodologies for various types of research data (e.g., high-throughput screening, high-throughput sequencing and metabolomics/proteomics) and applying these novel methodologies, or existing ones, to translational research projects.

Read about our research activities:

Applied Research 

Applied research involves the application of state-of-the-art analysis methodologies, some of which are developed by our group, to large molecular and -omics data sets collected in translational research. Generally, we aim to identify molecules (e.g., DNA, RNA, proteins and metabolites) that characterize cellular and disease states and to facilitate interpretation of these complex data to further our knowledge of the biological mechanisms underlying disease and cellular processes.

Metabolomics and Multi-omics Profiling to Identify Putative Biomarkers and Elucidate Disease Processes

  • Use comprehensive metabolomic and lipidomic characterization of dedifferentiated liposarcoma cell lines to identify MDM2-dependent molecular rewiring that underlies chemoresistance.
  • Evaluate metabolomic and proteomic profiles in 2-D and 3-D lung models to understand cellular responses to infection.
  • Conduct metabolomic analysis of human plasma samples in a prospective study of COVID-19 patients to identify markers of disease severity.
  • Characterize the effects of diet and prebiotic supplementation on the microbiome and metabolome that lead to the development of aberrant crypt foci and behavioral changes, respectively. 

Single-Cell Sequencing Techniques to Gain Insights into Small-Molecule Chemical Biology

  • Evaluate stem cell differentiation through single- and multi-compound studies to optimize for the intended cellular fate, including cell type classification and tracking marker gene sets through differentiation time courses.
  • Evaluate cellular response to small molecules in cancer models to understand cell type–specific responses and response heterogeneity.


Bioinformatics

Bioinformatics research at NCATS tackles a variety of challenges within translational research. Much of our research focuses on data harmonization to draw biological insights from multi-omics data sets. By applying analytical techniques to transcriptomics, proteomics and metabolomics data sets, we aim to reveal the mechanisms behind disease processes, support inferences in therapeutic target assessment, and understand the complex effects of small-molecule interventions on cell physiology and phenotypes.

  • Aggregate various biological and chemical knowledge sources and -omic measurements to infer biological meaning from multi-omics data sets using graph-based analysis methods and visualization (e.g., RaMP, Stitcher, SmartGraph).
  • Develop pipelines to preprocess and integrate various types of -omic data (e.g., metabolomics, transcriptomics, proteomics).
  • Organize, warehouse and provide access to raw and processed -omic data sets, including related metadata and results, to enhance use, reuse, searchability, integration and visualization of in-house and publicly available data.
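As an illustration of the preprocessing and integration step described in the list above, the following is a minimal sketch, not the NCATS pipeline. It assumes each data type arrives as a table of positive intensities keyed by feature and sample ID. The function names (`zscore_log2`, `integrate`), the sample IDs and the toy values are hypothetical.

```python
import math
from statistics import mean, pstdev


def zscore_log2(values):
    """Log2-transform positive intensities, then center and scale them."""
    logged = [math.log2(v) for v in values]
    mu, sigma = mean(logged), pstdev(logged)
    if sigma == 0:
        return [0.0] * len(logged)
    return [(x - mu) / sigma for x in logged]


def integrate(tables):
    """Merge feature tables on the samples shared by every feature.

    tables maps a data-type label (e.g. "metabolomics") to a table of
    {feature name -> {sample ID -> raw intensity}}.
    Returns {sample ID -> {"label:feature" -> z-score}}.
    """
    # Find the sample IDs measured for every feature in every table.
    shared = None
    for table in tables.values():
        for by_sample in table.values():
            ids = set(by_sample)
            shared = ids if shared is None else shared & ids
    shared = sorted(shared or [])
    if not shared:
        return {}

    # Normalize each feature across the shared samples, then merge.
    merged = {sample: {} for sample in shared}
    for label, table in tables.items():
        for feature, by_sample in table.items():
            scores = zscore_log2([by_sample[s] for s in shared])
            for sample, z in zip(shared, scores):
                merged[sample][f"{label}:{feature}"] = z
    return merged


# Toy values for illustration only (hypothetical measurements).
example = integrate({
    "metabolomics": {"lactate": {"S1": 100.0, "S2": 400.0, "S3": 200.0}},
    "transcriptomics": {"MDM2": {"S1": 8.0, "S2": 32.0, "S4": 16.0}},
})
for sample, features in example.items():
    print(sample, features)
```

Restricting to shared samples before scaling keeps every feature on the same set of observations. A real pipeline would also need to handle missing values, batch effects and identifier mapping.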


Cheminformatics

Cheminformatics research by informatics scientists in NCATS DPI focuses on a wide range of topics, such as molecular encoding, recognition, modeling and visualization. We are particularly interested in developing practical, open-source solutions that can benefit the scientific community.

  • Develop an open-source library to extract chemical structures from images (MolVec).
  • Develop cheminformatics utilities that support standardization, indexing, retrieval and comparison of annotations related to chemical structures (LyChI, Molwitch, Scaffold Hopper, Structure Indexer).
  • Develop methods and models to predict chemical structure-property and structure-activity relationships and to cluster and visualize chemical structures (NCATS Predictor, HCASE).
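To make the idea of comparing and clustering chemical structures concrete, here is a minimal sketch using the Tanimoto coefficient and greedy leader clustering. It is a generic textbook approach and is not how NCATS Predictor or HCASE works. Fingerprints are assumed to be plain sets of "on" bit indices, which in practice would come from a cheminformatics toolkit. The molecule names and bit values are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient between two sets of fingerprint bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def leader_cluster(fingerprints, threshold=0.5):
    """Greedy leader clustering.

    Each molecule joins the first cluster whose leader is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    fingerprints maps molecule name -> set of bit indices.
    """
    clusters = []  # list of (leader name, [member names])
    for name, bits in fingerprints.items():
        for leader, members in clusters:
            if tanimoto(fingerprints[leader], bits) >= threshold:
                members.append(name)
                break
        else:
            # No existing leader was similar enough.
            clusters.append((name, [name]))
    return [members for _, members in clusters]


# Toy fingerprints for illustration only (hypothetical bit sets).
toy = {
    "mol_a": {1, 2, 3, 4},
    "mol_b": {1, 2, 3, 5},
    "mol_c": {7, 8, 9},
}
print(leader_cluster(toy))
```

Leader clustering depends on input order and on the chosen threshold, so it is best read as a quick way to group near-duplicates rather than as a definitive partition of chemical space.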

Computational Chemistry and Molecular Modeling

Computational chemistry and molecular modeling methods used and developed by informatics scientists are applied to the modeling and simulation of small molecules and biological systems to understand and predict their behavior at the molecular level. These approaches accelerate cost-efficient hit discovery and hit-to-lead optimization in NCATS projects.

  • Deploy hit discovery efforts that involve molecular docking and pharmacophore-based methods to prioritize ligands for high-throughput screening assays. These efforts often involve screening large internal and external compound libraries to identify the compounds that are most likely to bind to a drug target.
  • Develop molecular dynamics simulations and free energy calculations for assessing the stability of predicted ligand binding poses along with QSAR modeling for lead optimization.
  • Use homology modeling to predict the structure and key residues of targets with unknown structures and function annotations and to build antigen-binding loops for antibody design.

Data Science 

Data science efforts by informatics scientists in NCATS DPI encompass diverse areas of artificial intelligence (AI), machine learning (ML), natural language processing (NLP), network theory, data integration, molecular docking, homology modeling and semantic data modeling. In silico models developed at NCATS facilitate the analysis of high-content imaging screens, predict activity trends that accelerate target-based and phenotypic screening processes, and provide explanatory models for the action of compounds from phenotypic screens. Modern NLP approaches and semantic modeling techniques developed at NCATS allow us to collect, integrate and analyze the large and diverse body of preclinical and clinical data, which can help researchers prioritize therapeutic hypotheses and reveal hidden relations between drugs, targets and diseases.

  • Develop ML models and apply AI approaches to predict bioactivity, toxicity, pharmacokinetics, physicochemical properties and drug synergy. The developed models help researchers navigate the chemical space with the goal of improving hit-to-lead optimization, especially when the target of the hit compound is unknown.
  • Create a tool that collects and identifies congruent lines of evidence across different knowledge domains to augment researchers’ own knowledge and reasoning (Translator).
  • Develop web services and tools to disseminate ML-ready data sets, models and methods to the scientific community (NCATS Predictor, OpenData Portal and GitHub).
  • Produce resources to support the use of in silico models (e.g., metabolism prediction, stability of compounds, absorption and mutagenicity) that guide the scientific community in selecting the best clinical candidate for efficient and safe clinical trials.
  • Develop a clinical decision support prototype, based on network theory and information retrieval over disease and phenotype ontologies, that takes as input a list of clinical features and provides a ranked list of relevant rare diseases (Zebra Rank).
  • Develop a comprehensive knowledge base of diseases based on semantic concept filtering of PubMed (NCATS Knowledge Base), and provide the source code to build a blackboard system to support building such knowledge bases (Blackboard).
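The clinical decision support idea in the list above (a list of clinical features in, a ranked list of rare diseases out) can be sketched with a simple information-retrieval scoring scheme. This is an illustrative assumption and not the Zebra Rank algorithm: diseases are scored by the overlap between the query and their annotated features, with rarer features weighted more heavily. The function name `rank_diseases`, the disease labels and the feature terms are hypothetical.

```python
import math


def rank_diseases(query, annotations):
    """Rank diseases by inverse-frequency-weighted overlap with the query.

    query is a list of clinical feature terms.
    annotations maps disease name -> set of annotated feature terms.
    Features annotated to fewer diseases contribute more to the score.
    Returns [(disease, score)] sorted from best to worst match.
    """
    n = len(annotations)
    doc_freq = {}
    for terms in annotations.values():
        for term in terms:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    weight = {t: math.log(1 + n / df) for t, df in doc_freq.items()}

    scores = []
    for disease, terms in annotations.items():
        matched = set(query) & terms
        score = sum(weight[t] for t in matched)
        if score > 0:
            scores.append((disease, score))
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


# Toy annotations for illustration only (hypothetical diseases and terms).
toy_annotations = {
    "disease_A": {"seizures", "ataxia", "hearing loss"},
    "disease_B": {"seizures", "short stature"},
    "disease_C": {"short stature", "cataract"},
}
for disease, score in rank_diseases(["seizures", "ataxia"], toy_annotations):
    print(disease, round(score, 3))
```

A flat term overlap like this ignores the hierarchy of disease and phenotype ontologies. A network-based method could additionally credit matches between related terms.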