LitCoin Natural Language Processing (NLP) Challenge
- Key Dates
- Background
- The Problem
- Challenge Goals
- Statutory Authority to Conduct the Challenge
- Rules and Submission Requirements
- Judging Criteria
- Prizes
- How to Enter
- Acknowledgements
- Point of Contact
Key Dates
Note: Dates are subject to change as necessary.
- September 20, 2021: Challenge announcement
- November 9, 2021: Competition launch
- December 23, 2021: End of first challenge phase
- December 27, 2021: Start of second challenge phase
- February 28, 2022: End of second challenge phase
- March 11, 2022: Final source code submission deadline
- April 12, 2022: Winners announced
Background
With an ever-growing number of scientific studies in various subject domains, there is a vast landscape of biomedical information which is not easily accessible in open data repositories to the public. Open scientific data repositories can be incomplete or too vast to be explored to their potential without a consolidated linkage map that relates all scientific discoveries. This massive amount of medical knowledge can often be computationally transformed into knowledge graphs that can be used in an open data repository and has the potential to assist in identifying gaps in medical research and accelerating research for unexplored medical domains through scientific investigations.
However, open medical data on its own is not enough to deliver its full potential for public health. By engaging technologists, members of the scientific and medical community and the public in creating tools with open data repositories, funders can exponentially increase the utility and value of those data to help solve pressing national health issues. The LitCoin Natural Language Processing (NLP) Challenge seeks to spur innovation by rewarding the most creative and high-impact uses of free text from biomedical publications to create knowledge graphs that can link concepts within existing research to allow researchers to find connections that may have been difficult to discover without them. This challenge is part of a broader conceptual initiative at NCATS to change the “currency” of biomedical research. NCATS held a Stakeholder Feedback Workshop in June 2021 to solicit feedback on this concept and its implications for researchers, publishers and the broader scientific community.
This challenge brings together government, medical research communities and data scientists to create data-driven knowledge graphs that consolidate medical scientific data across domains. With an approximately four (4)-month development cycle for the challenge, data scientists will be challenged to develop NLP systems with the ability to identify concepts from a biomedical publication and link them together into relationships to create well-linked and carefully defined knowledge graphs for each publication. To learn more about the LitCoin NLP Challenge and to sign up as a participant, visit https://bitgrit.net/competition/13?utm_source=NCATS&utm_medium=organic&utm_campaign=litcoin.
The Problem
Biomedical researchers need to be able to use open scientific data to create new research hypotheses and lead to more treatments for more people more quickly. Reading all of the literature that could be relevant to their research topic can be daunting or even impossible, and this can lead to gaps in knowledge and duplication of effort. Transforming knowledge from biomedical literature into knowledge graphs can improve researchers’ ability to connect disparate concepts and build new hypotheses, and can allow them to discover work done by others which may be difficult to surface otherwise.
To advance some of the most promising technology solutions built with knowledge graphs, the National Institutes of Health (NIH) and its collaborators are launching the LitCoin NLP Challenge. This challenge aims to (1) help data scientists better deploy their data-driven technology solutions towards accelerating scientific research in medicine and (2) ensure that data from biomedical publications can be maximally leveraged and reach a wide range of biomedical researchers; together this will drive toward solutions for the critical problems these scientists aim to solve.
Challenge Goals
The challenge will spur the creation of innovative strategies in NLP by allowing participants across academia and the private sector to participate in teams or in an individual capacity. Prizes will be awarded to the top-ranking data science contestants or teams that create NLP systems that accurately capture the information denoted in free text and provide output of this information through knowledge graphs.
NCATS will share with the participants an open repository containing abstracts derived from published scientific research articles and knowledge assertions between concepts within these abstracts. The participants will use this data repository to design and train their NLP systems to generate knowledge assertions from the text of abstracts and other short biomedical publication formats. Other open biomedical data sources may be used to supplement this training data at the participants’ discretion. In addition to creating these assertions, successful participants’ NLP systems should be able to recognize which assertions are novel findings that represent the fundamental reason that the manuscript was published, as opposed to background or ancillary assertions that can be found elsewhere.
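As a rough illustration of what such output could look like, the sketch below models the knowledge graph for one abstract as a set of concepts plus assertions between them, with a flag marking novel findings. It is a minimal sketch only: the class names, field names, concept types, and relation label are hypothetical and are not the challenge's actual data schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Concept:
    """A biomedical concept found in an abstract (hypothetical schema)."""
    concept_id: str
    text: str
    concept_type: str  # illustrative labels such as "Gene" or "Disease"


@dataclass(frozen=True)
class Assertion:
    """A knowledge assertion linking two concepts (hypothetical schema)."""
    subject_id: str
    relation: str
    object_id: str
    novel: bool  # True if the assertion is a novel finding of the publication


@dataclass
class AbstractGraph:
    """Knowledge graph for a single abstract: concepts are nodes, assertions are edges."""
    abstract_id: str
    concepts: Dict[str, Concept] = field(default_factory=dict)
    assertions: List[Assertion] = field(default_factory=list)

    def add_concept(self, concept: Concept) -> None:
        self.concepts[concept.concept_id] = concept

    def add_assertion(self, assertion: Assertion) -> None:
        # Both endpoints must already be registered as concepts.
        for cid in (assertion.subject_id, assertion.object_id):
            if cid not in self.concepts:
                raise KeyError(f"unknown concept id: {cid}")
        self.assertions.append(assertion)

    def novel_assertions(self) -> List[Assertion]:
        # Keep only assertions flagged as novel findings.
        return [a for a in self.assertions if a.novel]


# Illustrative usage with made-up identifiers.
graph = AbstractGraph(abstract_id="example-abstract")
graph.add_concept(Concept("C1", "BRCA1", "Gene"))
graph.add_concept(Concept("C2", "breast cancer", "Disease"))
graph.add_assertion(Assertion("C1", "associated_with", "C2", novel=False))
print(len(graph.novel_assertions()))  # prints 0: the only assertion is background
```

Keeping the novelty flag on each assertion, rather than in a separate list, makes it straightforward to filter a graph down to the findings that motivated the publication.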
Statutory Authority to Conduct the Challenge
NCATS is conducting this challenge under the America Creating Opportunities to Meaningfully Promote Excellence in Technology, Education, and Science (COMPETES) Reauthorization Act of 2010, as amended [15 U.S.C. § 3719].
NCATS was established to coordinate and develop resources that leverage basic research in support of translational science and to develop partnerships and work cooperatively to foster synergy in ways that do not create duplication, redundancy and competition with industry activities. This challenge will spur innovation in NLP to advance the field and allow the generation of more accurate and useful data from biomedical publications, which will enhance the ability of data scientists to create tools to foster discovery and generate new hypotheses. This promotes the development of resources for basic science research, as well as the development of partnerships with software designers in the NLP space.
Rules and Submission Requirements
Who Can Participate
This challenge is open to all U.S. citizens and permanent residents and to U.S.-based private entities. Private entities not incorporated in or maintaining a primary place of business in the U.S. and non-U.S. citizens and non-permanent residents can either participate as a member of a team that includes a citizen or permanent resident of the U.S., or they can participate on their own. However, such non-U.S. entities, citizens, and non-permanent residents are not eligible to win a monetary prize (in whole or in part). Their participation as part of a winning team, if applicable, may be recognized when the results are announced. Similarly, if participating on their own, they may be eligible to win a non-cash recognition prize.
What to Submit
Participants must sign up for this competition through a joint page created by the challenge administrator, CrowdPlat, and its partner, bitgrit.
To sign up, participants must use the following link: https://bitgrit.net/competition/13?utm_source=NCATS&utm_medium=organic&utm_campaign=litcoin.
PART 1: Participant Information
For individuals:
Provide contact information, including name, phone number, email address, and employer or academic institution (if applicable).
For participants signing up as a team or for private entities:
Lead Individual: Provide contact information, including name, phone number, email address, and employer or academic institution (if applicable) for the individual who will serve as Team Lead.
Contributors: List the individuals that will contribute to the technical team that builds the final product. For each individual, please provide their name, email address, employer or academic institution (if applicable), and role in developing the product.
PART 2: Project Description
Challenge Participants who are in the top fifteen (15) participants on the public leaderboard at the conclusion of the challenge period must submit a model summary through a webform which will be provided by bitgrit at the time that the participant is requested to submit their source code for evaluation. This summary will include the following sections:
- Background: Provide your professional/academic background and any prior experience that your team had which may have helped you succeed in this competition.
- Summary of Submission: Provide a short summary of how your tool works and how you feel it may impact the field of natural language processing in biomedical science.
- Data Processing: Provide what data, if any, were used to train your system other than those provided by the challenge management team, why you felt that these data would help improve your tool’s performance, and details on how you processed these data for incorporation.
- Features Selection: Provide what you feel are the most important features of your tool, and what you believe will distinguish it from other competitors’ tools.
- Training Method: Provide what training methods you used here and any important weights and parameters which went into this training.
- User Interaction: Did your team include any user-centered design in your development process, and if so, how was this accomplished?
- Interesting Findings: Provide any interesting and/or innovative aspects of your tool that may set it apart from others in the competition, and any interesting relationships within the data as you were training the system.
- Model Execution Time: Provide information on how long it takes to train your system and how long it takes to generate results.
- References: Include citations to articles, websites, blog posts, and any other appropriate external sources of information.
- Future Steps: If you were to continue work on this problem, what would you do next?
Additional Submission Guidelines and Requirements
- Applications must be written in English and follow all page limits and documentation specifications (as noted above, for Part 3, this must be single-spaced, minimum 11-point font, 1-inch margins, page dimensions of 8.5 x 11 inches, submitted as a .doc/.docx or PDF file).
- Any material that does not follow the submission guidelines provided may not be considered.
- Applicants must submit a working natural language processing system that uses open biomedical data to address the challenge described above.
- “Working product” indicates that the intended features of the product have been created and currently are functional and usable by NCATS upon delivery of the final product.
- The product must be capable of being installed successfully and running consistently on the platform for which it is intended and must function as depicted or expressed in the text description.
Applicants must:
- Demonstrate participation eligibility at the time of signing up for the challenge. (For eligibility criteria, see the LitCoin NLP Challenge Terms and Conditions page, linked below.)
- Provide final products that use provided competition data, data schema, or data standards. Submissions that fail to do so are not eligible for this competition.
Read the LitCoin NLP Challenge Terms and Conditions.
Judging Criteria
This competition will run in two phases, with a defined task for each phase. The first phase will focus on the annotation of biomedical concepts from free text, and the second phase will focus on creating knowledge assertions between annotated concepts. During the competition, each submission will be tested using an automated custom evaluator which will compare the accuracy of results from provided test data with the results from industry standard natural language processing applications to create an accuracy score. This score will be continually updated on a public scoreboard during the challenge period, as participants continue to refine their software to improve their scores. At the end of the challenge period, participants will submit their final results and transfer the source code, along with a functional, installable copy of their software, to the challenge vendor for adjudication.
Source code submissions will be evaluated for compliance to competition rules by the CrowdPlat and bitgrit team on the following criteria:
- Quality of code
- Score reproducibility
- Appropriate code documentation
- No code was plagiarized from another group
Scores from these two phases will be combined into a weighted average in order to determine the final winning submissions, with phase 1 contributing 30% of the final score, and phase 2 contributing 70% of the final score. Submissions with the highest final scores at the end of the challenge that pass the above quality control evaluations will be directly evaluated by judges from federal agencies as well as potentially from non-governmental organizations, with expertise in technology, open data, product development, community engagement and user-centered design. These judges will evaluate the submissions for originality, innovation, and practical considerations of design, and will determine the winners of the competition accordingly.
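The 30%/70% weighting can be sketched as a short calculation. This is a minimal sketch, assuming both phase scores are reported on the same scale (for example, 0 to 1); the function name and the sample scores are hypothetical, and the official evaluator may compute or round scores differently.

```python
PHASE1_WEIGHT = 0.30  # phase 1: annotation of biomedical concepts
PHASE2_WEIGHT = 0.70  # phase 2: knowledge assertions between concepts


def final_score(phase1_score: float, phase2_score: float) -> float:
    """Weighted average of the two phase scores (hypothetical helper)."""
    return PHASE1_WEIGHT * phase1_score + PHASE2_WEIGHT * phase2_score


# Made-up example: 0.80 in phase 1 and 0.60 in phase 2.
example = final_score(0.80, 0.60)
print(round(example, 2))  # prints 0.66 (0.30 * 0.80 + 0.70 * 0.60)
```

Because phase 2 carries more than twice the weight of phase 1, a given improvement in the phase 2 score moves the final score more than the same improvement in phase 1.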
Prizes
Total Cash Prize Pool
This is a single-phase competition in which up to $100,000 will be awarded by NCATS directly to participants whose NLP systems receive the highest scores in the evaluation of accuracy of assertions.
Prize Breakdown
A total of up to $100,000 will be awarded by NCATS to the top performers of this challenge.
At this stage, NCATS anticipates that cash prizes will be awarded to seven (7) of the top performing NLP systems as follows:
First prize: $35,000
Second prize: $25,000
Third prize: $20,000
Four runner-up prizes: $5,000 each
In the case that a team, entity or individual who does not qualify to win a cash prize is selected as a prize winner, NCATS will award said winner a recognition-only prize.
NCATS may choose to award different cash prize amounts, or no prize at all, at its discretion.
Cash prizes awarded under this challenge will be paid by electronic funds transfer and may be subject to federal income taxes. The U.S. Department of Health and Human Services and NIH will comply with the Internal Revenue Service withholding and reporting requirements, where applicable.
How to Enter
Teams or individuals interested in signing up for the challenge must sign up through our challenge platform partners, CrowdPlat and bitgrit, using the URL that is provided above. The challenge platform will provide full competition details including but not limited to:
- Challenge Description
- Participation Rules
- Challenge Timeline
- Prize Structure
- Submission Guidelines
- Evaluation Criteria
- Algorithm Selection Parameters
Once the competition is complete, some participants will be required to submit their source code through the platform for evaluation.
Acknowledgements
We would like to thank Dr. Zhiyong Lu, Senior Investigator at the National Library of Medicine (NLM) Intramural Research Program and his entire research team, as well as their collaborator Dr. Cecilia Arighi, Research Associate Professor in the Department of Computer and Information Sciences at the University of Delaware, for providing annotated data sets for use in the execution of this challenge.
Point of Contact
Have feedback or questions about this challenge? Please reach out to Tyler Beck at litcoin-questions@mail.nih.gov.