1st SciNLP: Natural Language Processing and Data Mining for Scientific Text

Talk recordings are now live on YouTube!

Transcripts of talk & panel Q&A are available here.

The 1st SciNLP workshop at AKBC 2020 has concluded. Thanks to all who participated (invited speakers, poster presenters, attendees)!

Please take some time to fill out our Feedback Form. We’ll use this to improve our processes & gauge interest in future SciNLP events.

Contact

Feel free to contact us at scinlp@googlegroups.com or on Twitter via #SciNLP!

Join the mailing list to receive announcements.

Organizers:

Kyle Lo @kylelostat, Allen Institute for AI
Iz Beltagy @i_beltagy, Allen Institute for AI
Arman Cohan @armancohan, Allen Institute for AI
Keith Hall, Google Research
Yi Luan, Google Research
Lucy Lu Wang @lucyluwang, Allen Institute for AI

Schedule

Date / Time: June 25th from 8AM - 1PM PDT (UTC-7).

Estimated total: 300 minutes (5 hrs). All times are in PDT (UTC-7):

8:00 - START
8:00-8:25 - Ryan McDonald (25 min)
8:25-8:50 - Vivi Nastase (25 min)
8:50-9:15 - David Jurgens (25 min)
9:15-9:40 - Robert Stojnic (25 min)
9:40-10:25 - Poster sessions (45 min)
10:25-10:50 - Iris Shen (25 min)
10:50-11:15 - Doug Downey (25 min)
11:15-11:40 - Hoifung Poon (25 min)
11:40-12:05 - Asma Ben Abacha (25 min)
12:05-12:30 - Byron Wallace (25 min)
12:30-1:00 - Panel discussion and Q&A (30 min)
1:00 - END

Each invited talk is roughly 20 min (plus 5 min after for Q&A and buffer time for transitions).
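The slot arithmetic above can be checked with a short script. This is only an illustrative sketch: the slot labels and durations are copied from the schedule, while the names `SLOTS` and `build_schedule` are hypothetical helpers introduced here, and the time zone (PDT) is left implicit rather than modeled.

```python
from datetime import datetime, timedelta

# (label, minutes) pairs copied from the schedule above
SLOTS = [
    ("Ryan McDonald", 25),
    ("Vivi Nastase", 25),
    ("David Jurgens", 25),
    ("Robert Stojnic", 25),
    ("Poster sessions", 45),
    ("Iris Shen", 25),
    ("Doug Downey", 25),
    ("Hoifung Poon", 25),
    ("Asma Ben Abacha", 25),
    ("Byron Wallace", 25),
    ("Panel discussion and Q&A", 30),
]


def build_schedule(start, slots):
    """Return (label, start, end) tuples for back-to-back slots."""
    rows, current = [], start
    for label, minutes in slots:
        end = current + timedelta(minutes=minutes)
        rows.append((label, current, end))
        current = end
    return rows


# 8:00 on June 25th, 2020; naive datetime, read as PDT (UTC-7)
start = datetime(2020, 6, 25, 8, 0)
rows = build_schedule(start, SLOTS)
total_minutes = sum(minutes for _, minutes in SLOTS)

for label, slot_start, slot_end in rows:
    print(f"{slot_start:%H:%M}-{slot_end:%H:%M}  {label}")
print(f"Total: {total_minutes} min")
```

Nine 25-minute talk slots, one 45-minute poster block, and a 30-minute panel add up to the stated 300 minutes, with the last slot ending at 13:00 (1 PM).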

Poster sessions

Due to the large number of submissions this year, we’ve split the presentations into three poster sessions to allow time for discussion and questions.

The Poster Session IDs correspond to Zoom rooms after logging into the virtual conference hub.

YouTube playlists

Individual submissions

All accepted abstracts & YouTube video presentations are listed below:

Poster Session	Title	Authors	Links
1	Automatic Extraction of Risk Factors from COVID-19 Literature	Francis Wolinski	abs vid
1	Semi-Automated Information Extraction to Improve Scientific Knowledge Discovery in Environmental Health Science Literature	Vickie R. Walker, Andrew A. Rooney, Nicole C. Kleinstreuer, Mary S. Wolfe, Charles P. Schmitt	abs vid
1	Softcite: Automatic Extraction of Software Mentions in Research Literature	Caifan Du, James Howison, Patrice Lopez	abs vid
1	Biomedical Synonym Normalization for Knowledge Graph Insight Generation	Eric Tanalski, Richard Wendell	abs vid
1	Normalization of Predominant and Long-tail Bacterial Entities with a Hybrid CNN-LSTM and Knowledge-Driven Model	William Hogan, Raghav Mehta, Yoshiki Vazquez-Baeza, Yannis Katsis, Ho-Cheol Kim, Chun-Nan Hsu	abs vid
1	Extracting a knowledge base of mechanisms from COVID-19 papers	Aida Amini,* Tom Hope,* David Wadden, Roy Schwartz, Hannaneh Hajishirzi	abs vid
1	DARE: Data Augmented Relation Extraction with GPT-2	Yannis Papanikolaou, Andrea Pierleoni	abs vid
1	Extraction of causal structure from procedural text for discourse representations	Michael Regan, James Pustejovsky, William Croft	abs vid
1	Semi-Open Relation Extraction from Scientific Texts	Ruben Kruiper, Julian F.V. Vincent, Jessica Chen-Burger, Marc P.Y. Desmulliez, Ioannis Konstas	abs vid
2	Incorporating Knowledge Bases into SciBERT and BioBERT pre-trained language models	Abdullah Kiwan, Sven Giesselbach, Stefan Rüping	abs vid
2	Construction and Applications of TeKnowbase	Prajna Upadhyay, Maya Ramanath	abs vid
2	AAN: Developing Educational Tools for Work Force Training	Alexander R. Fabbri, Irene Li, Swapnil Hingmire, Dragomir Radev	abs vid
2	Estimation of Research Communities in the Multilingual Academic Data	Ilya Rahkovsky, Jennifer Melot	abs vid
2	Identifying the Development and Application of Artificial Intelligence in Scientific Text	James Dunham, Jennifer Melot, Dewey Murdick	abs vid
2	Researcher-in-the-loop for systematic reviewing of text databases	Rens van de Schoot, Jonathan de Bruin	abs vid
2	Building an AI-powered Literature Review for COVID-19	Jan Bremer, Maikel Boot, Lucas Buyon, Paul Mooney, Tayab Waseem	abs vid
2	Using Text Data Mining to Analyze the NIH’s Response to the Coronavirus Pandemic	Megan Donegan,* Nancy Praskievicz,* Jacob Scholl, Kirk Baker, Judy Riggie, Sheryl Brining	abs vid
2	CoronaWhy: Building a Distributed, Credible and Scalable Research and Data Infrastructure for Open Science	Vyacheslav Tykhonov, Anton Polishko, Artur Kiulian, Maksym Komar	abs vid
3	COVIDExplorer: Exploring the Universe of COVID-19 Knowledge	Heer Ambavi, Kavita Vaishnaw, Udit Vyas, Abhisht Tiwari, Mayank Singh	abs vid
3	COVIDScholar: AI-powered rapid data gathering, analysis, and dissemination	John Dagdelen, Amalie Trewartha, Haoyan Huo, Tanjin He, Kevin Cruse, Zheren Wang, Yuxing Fei, Akshay Subramanian, Kristin Persson, Gerbrand Ceder	abs vid
3	Building a biomedical literature knowledge graph and automatic screening of biomedical abstracts using knowledge graph embeddings	Iqra Muhammad	abs vid
3	Interactive Extractive Search over Biomedical Corpora	Hillel Taub-Tabib, Micah Shlain, Shoval Sadde, Dan Lahav, Matan Eyal, Yaara Cohen, Yoav Goldberg	abs vid
3	Combining Neural and Pattern-Based Similarity Search	Shauli Ravfogel, Hillel Taub-Tabib, Yoav Goldberg	abs vid
3	TopicForest: A Prototype Discovery Engine	Soheil Danesh	abs vid
3	CORD-19 visualization using dynamic evidence gap maps	Aravind Mohanoor	abs vid
3	Orion: An interactive information retrieval system for scientific knowledge discovery	Kostas Stathoulopoulos, Zac Ioannidis, Lilia Villafuerte	abs vid

Panel discussion: The role of scientific NLP during an epidemic

In light of the activity from the computing community to help with the current virus epidemic, we felt it important to hold a panel discussion on the role of NLP and text mining over scientific text (in particular biomedical literature).

What are useful short/long-term endeavors?
What are we doing that isn’t helpful?
How can we best connect and collaborate with domain experts?
What are challenges that prevent us from being helpful?

We’ve invited Asma Ben Abacha, Hoifung Poon, and Byron Wallace to share their thoughts and answer questions.

We will be curating questions from the audience beforehand (as well as live during the discussion).

Invited talks

Asma Ben Abacha

Insights from the Organization of International Challenges on Artificial Intelligence in Medical Question Answering

Artificial intelligence (AI) is playing an increasingly important role in our access to information. However, a one-size-fits-all approach is suboptimal, especially in the medical domain where health-related information is more sensitive due to its potential impact on public health, and where domain-specific aspects such as technical language and case or context-based interpretation have to be taken into account. Bridging the gap between several research areas such as AI, NLP, medical informatics, and computer vision is a promising way to achieve reliable and efficient access to medical information. In recent years, I organized several international challenges to promote research efforts in medical question answering. The organization of these competitions raised key questions in data design, evaluation metrics, and problem formulation. It also offered valuable insights on the critical subtasks that need to be solved, and on the most promising solutions in challenging problems such as restricted training data and multidisciplinary tasks. In this talk, I will share all these insights and the promising perspectives in the addressed tasks, including textual question answering and visual question answering.

Dr. Asma Ben Abacha is a staff scientist at the U.S. National Institutes of Health (NIH), National Library of Medicine (NLM), Lister Hill National Center for Biomedical Communications. Prior to joining the NLM in 2015, she was a researcher at the Luxembourg Institute of Science and Technology and lecturer at the University of Lorraine, France. Dr. Ben Abacha received a Ph.D. in computer science from Paris 11 University, France, a research master’s degree from Paris 13 University, and a software engineering degree from the National School of Computer Sciences (ENSI), Tunisia. She is currently working on medical question answering, visual question answering, and NLP-related projects in the medical domain.

Doug Downey

Mining the Citation Graph for Representation Learning and Concept Extraction

The exploding pace of scientific publication has led to a pressing need for tools that automatically make sense of the scientific literature. In this talk, I will describe two recent, simple methods for mining the citation graph to extract meaning from scientific documents. First, representation learning forms the foundation of today’s natural language processing systems, and large pretrained language models (LMs) like BERT learn powerful representations for short texts like words and sentences. But, naively applying the models to produce representations for entire scientific documents, which are necessary for many applications, is ineffective. I will introduce SPECTER, a method for producing scientific document representations using a pretrained LM that is able to achieve state-of-the-art performance by fine-tuning the LM on the citation graph as a signal of document relatedness. Second, I will describe a new concept extraction technique called ForeCite that uses the intuition that new concepts tend to be introduced or popularized by a single paper. By mining this signal from the citation graph, ForeCite achieves much higher precision than previous techniques.

Doug Downey is a research scientist at the Allen Institute for AI, where he works on the Semantic Scholar team, and also an associate professor of Computer Science at Northwestern University. His research interests involve information extraction, natural language processing, and machine learning, with a particular focus on automatically extracting knowledge from large corpora to power new search and browsing experiences. He has won a best paper award at IJCAI, along with an NSF CAREER award, election to the DARPA Computer Science Study Group, and a Microsoft New Faculty Fellowship.
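To make the idea in Doug Downey's abstract more concrete, here is a toy sketch of one common way to use citation links as a relatedness signal: a triplet margin loss that pulls a paper's embedding toward a paper it cites and pushes it away from one it does not. The abstract does not spell out the training objective, so the choice of loss here is an assumption, and this is not the SPECTER implementation. The 2-D vectors are hand-made stand-ins for LM document embeddings, and `l2_distance` and `triplet_margin_loss` are hypothetical helper names.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(query, positive, negative, margin=1.0):
    """max(0, d(query, positive) - d(query, negative) + margin).

    The loss is zero once the positive is closer to the query than the
    negative by at least `margin`.
    """
    gap = l2_distance(query, positive) - l2_distance(query, negative)
    return max(0.0, gap + margin)


# Toy embeddings (assumed values, not model outputs)
query = [0.0, 0.0]    # the paper being embedded
cited = [0.0, 1.0]    # a paper it cites: distance 1 from the query
uncited = [3.0, 4.0]  # an unrelated paper: distance 5 from the query

# Cited paper is already much closer than the uncited one: no penalty
loss_good = triplet_margin_loss(query, cited, uncited)

# Swapping the roles simulates a badly ordered embedding space: large penalty
loss_bad = triplet_margin_loss(query, uncited, cited)

print(loss_good, loss_bad)
```

In a real system the vectors would come from the pretrained LM and the loss would be minimized by gradient descent over many such triples; the sketch only shows how a citation link translates into a training signal.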

David Jurgens

Putting a Face on Science: Analyzing Author Mentions in Science Journalism Reveal Wide-Spread Ethnic Bias

Media outlets play a key role in spreading scientific knowledge to the general public and raising the profile of researchers among their peers. Yet, given social biases and attention constraints, not all scholars receive equal media coverage. In this talk, I will describe a large-scale study across hundreds of thousands of news stories that uncovers systematic ethnic bias in which authors journalists mention by name when covering science. Using NLP techniques to analyze these stories and controlling for confounds, I will show that this ethnic bias is consistent across multiple types of news media, with even larger disparities for long-form journalism focused on science.

David Jurgens is an Assistant Professor at the University of Michigan in the School of Information and by courtesy in the Department of Computer Science. He received his PhD in Computer Science from UCLA. His research in computational social science combines new methods from natural language processing and data science to discover, explain, and predict human behavior in large social systems.

Ryan McDonald

End-to-end Neural Models for Evidence Retrieval from Biomedical Literature

In this talk I will highlight some ongoing efforts at Google on improving discovery from biomedical literature. Most of the talk will focus on document and evidence retrieval, which is the most common entry point for literature tools. I will discuss three specific technical contributions: 1) synthetic question generation to train biomedical-targeted first-stage retrieval models; 2) retrieval models that encode sparse and dense representations in intuitive and flexible ways, which is critical for the domain; and 3) a joint model for document and evidence retrieval that significantly improves the system’s ability to select relevant pieces of evidence from returned documents. All topics in the presentation are key technologies in https://covid19-research-explorer.appspot.com/, http://cslab241.cs.aueb.gr:5000/, and Google’s or AUEB’s submissions to the annual BioASQ challenge. Joint work with many colleagues at Google Research and Athens University of Economics and Business.

Ryan McDonald has been a research scientist at Google since 2006 and an associate researcher at the Athens University of Economics and Business since 2017. In that time he has been involved in various research efforts that have had user impact on a number of Google products, including Search, Assistant, Translate and Cloud. Prior to Google he completed a PhD at the University of Pennsylvania, which focused on new models for multilingual dependency parsing. This work continued at Google and culminated in the creation of the Universal Dependencies project, co-founded by his team at Google and numerous external collaborators. He currently works on discovery from biomedical literature and other productivity-driven NLP challenges.

Vivi Nastase

Looking for the dark matter within knowledge graphs

Knowledge graphs contain much useful information directly available, but also hidden information that could be leveraged in a variety of ways. Some of this dark matter includes negative instances, missing links, missing nodes and obfuscated patterns. Uncovering and using this hidden information can lead to bigger and more complete graphs, and also to a better understanding of the interaction between structured knowledge in knowledge graphs and unstructured knowledge in text collections. In this talk I will show our exploration of this dark matter in some of the most commonly used KGs – Freebase, NELL and WordNet – and discuss how the different nature of each of these graphs influenced our search and what we found.

Vivi Nastase is a research associate in the Institute for Natural Language Processing at the University of Stuttgart. She obtained a PhD from the University of Ottawa, Canada on the topic of semantic relations. She works mainly on lexical semantics, semantic relations, knowledge acquisition and language evolution, and has published about 100 articles on these topics, including a book on “Semantic Relations between Nominals” in the series Synthesis Lectures on Human Language Technologies.

Hoifung Poon

Machine Reading for Precision Medicine

The advent of big data promises to revolutionize medicine by making it more personalized and effective, but big data also presents a grand challenge of information overload. For example, tumor sequencing has become routine in cancer treatment, yet interpreting the genomic data requires painstakingly curating knowledge from a vast biomedical literature, which grows by thousands of papers every day. Electronic medical records contain valuable information to speed up clinical trial recruitment and drug development, but curating such real-world evidence from clinical notes can take hours for a single patient. Natural language processing (NLP) can play a key role in interpreting big data for precision medicine. In particular, machine reading can help unlock knowledge from text by substantially improving curation efficiency. However, standard supervised methods require labeled examples, which are expensive and time-consuming to produce at scale. In this talk, I’ll present Project Hanover, where we overcome the annotation bottleneck by combining deep learning with probabilistic logic, and by exploiting self-supervision from readily available resources such as ontologies and databases. This enables us to extract knowledge from millions of publications, reason efficiently with the resulting knowledge graph by learning neural embeddings of biomedical entities and relations, and apply the extracted knowledge and learned embeddings to supporting precision oncology.

Hoifung Poon is the Senior Director of Biomedical NLP at Microsoft Research and an affiliated professor at the University of Washington Medical School. He leads Project Hanover, with the overarching goal of structuring medical data for precision medicine. He has given tutorials on this topic at top conferences such as the Association for Computational Linguistics (ACL) and the Association for the Advancement of Artificial Intelligence (AAAI). His research spans a wide range of problems in machine learning and natural language processing (NLP), and his prior work has been recognized with Best Paper Awards from premier venues such as the North American Chapter of the Association for Computational Linguistics (NAACL), Empirical Methods in Natural Language Processing (EMNLP), and Uncertainty in AI (UAI). He received his PhD in Computer Science and Engineering from University of Washington, specializing in machine learning and NLP.

Iris Shen

Enriching a Web-scale Scientific Taxonomy by Combining Textual and Structural Information

Scientific knowledge is evolving at an unprecedented rate, with new concepts and relationships constantly being discovered from the millions of academic articles being published every month. The Microsoft Academic Graph (MAG) provides a comprehensive, cross-domain scientific taxonomy covering more than 550k concepts. This fast-growing volume of scientific literature accentuates a pressing need for automated capture of emerging knowledge with an updated web-scale taxonomy. In this talk, we introduce two major efforts currently underway to enable MAG to achieve this automated capture with minimal supervision. First, we leverage a BERT-based pre-trained language model (LM) and a web search API to identify candidate concept phrases from textual information in the latest publications. Second, we apply a self-supervised position-enhanced graph neural network (GNN) that encodes local structural information to expand our taxonomy with newly discovered concepts. These two approaches achieve highly accurate concept identification results, and indicate significant improvement of our taxonomy expansion compared with previous approaches. We also discuss the challenges and lessons learned while integrating these state-of-the-art LM and GNN models into the MAG system.

Iris Shen is a principal data scientist at Microsoft Research and holds a Ph.D. in Operations Research from the University of Southern California. She is the data science manager for the Microsoft Academic project, which uses state-of-the-art AI research to assist humans in scientific exploration. Her current research interests are leveraging techniques in data mining, natural language processing, and recommender systems to explore and understand large-scale document corpora with associated networked systems.

Robert Stojnic

An Introduction to Papers with Code

This talk is an introduction to Papers with Code - a free resource for researchers and practitioners to find and follow the latest state-of-the-art ML papers and code. I will go deeper into the open dataset underlying Papers with Code - the collection of ML papers, code, tasks and results, with links between them. I will talk about the challenges of augmenting this resource and keeping it up to date using NLP techniques.

Robert is the co-creator of Papers with Code and a software engineer at Facebook AI. Robert started his career as one of the early developers of Wikipedia, where he built the internal search engine. He went on to do a PhD in Applied ML in Computational Biology at the University of Cambridge. He co-founded a couple of start-ups, and co-created Papers with Code with Ross Taylor. Currently he is at Facebook AI in London, where he is working on Papers with Code, and is passionate about open science and open access.

Byron Wallace

What does the evidence say? Models to help make sense of the biomedical literature

How do we know if a particular medical intervention actually works better than the alternatives for a given condition and outcome? Ideally one would consult all available evidence from relevant trials that have been conducted to answer this question. Unfortunately, such results are primarily disseminated in natural language articles that describe the conduct and results of clinical trials. This imposes a substantial burden on physicians and other domain experts trying to make sense of the evidence. In this talk I will discuss work on designing tasks, corpora, and models that aim to realize natural language technologies that can extract key attributes of clinical trials from articles describing them, and infer the reported findings regarding these. The hope is to use such methods to help domain experts (such as physicians) better access and make sense of unstructured biomedical evidence.

Byron Wallace is an assistant professor in the Khoury College of Computer Sciences at Northeastern University. He holds a PhD in Computer Science from Tufts University, where he was advised by Carla Brodley. He has previously held faculty positions at the University of Texas at Austin and at Brown University. His research is in machine learning and natural language processing, with an emphasis on their application in health informatics.

[CLOSED] Call for abstracts

We welcome submissions of short abstracts (1 page max) related to the research areas listed in the About section. Submissions may include previously published results, late-breaking results, and work in progress. Relevant submissions will be accepted for video presentation in the virtual poster session. The workshop is non-archival, so participants are free to also submit their work for publication elsewhere.

To submit an abstract, please send an email to scinlp@googlegroups.com with the subject line “SCINLP submission: [TITLE]”. Please include:

Abstract in PDF format (1-page max) as an attachment. Abstracts will be lightly reviewed to ensure that the topic is within the scope of the workshop.
Indicate which of the authors will be presenting the work. We require each accepted work to provide a ~2 minute video recording of the presentation. We will reach out with video uploading instructions after acceptance notification. Video length requirements may change depending on the number of submissions.
Indicate the scientific domain(s) and source(s) of scientific text relevant to the submission; for example, scientific domain could be “Genomics” while source of text could be “Peer-reviewed papers from PubMed”.

Writing guidelines: These abstracts can be longer than the typical abstracts for a full research paper. Figures and tables are allowed, but will count toward the length limit. References will not count toward the length limit. Abstracts do not have to be about a single paper; we allow abstracts that summarize a collection of works under a unified theme (e.g., a series of closely-related papers that build on each other or tackle a common problem). For writing examples, see the accepted abstracts from last year’s SLKB workshop at AKBC 2019.

All accepted abstracts (and their videos) will be made available online prior to the workshop and remain accessible afterwards.

If you have a disability and require accommodation in order to fully participate in the workshop, please let us know and we’ll be in touch to discuss how we can best address your needs.

Can’t make it? Check out these other upcoming workshops related to NLP and text mining over scientific text:

SDP at EMNLP 2020
BioNLP at ACL 2020

About

The primary goal of this half-day workshop is to bring together researchers from diverse fields who are interested in extracting and representing knowledge from scientific text, and/or applications or methods for improving access to and understanding of such knowledge. Such research includes, but is not limited to:

Methods in natural language processing and data mining for extracting and representing knowledge from scientific text (e.g. information extraction, entity normalization, discourse analysis, parsing, summarization, text generation, question answering, knowledge base construction, weak/distant supervision, crowdsourcing),
Applications of these methods to improving scientific knowledge discovery and/or understanding (e.g. automated literature review, search and recommender systems, techniques for data exploration and visualization),
Accessibility (e.g. augmented or assistive paper reading, concept simplification, scientific education and literacy),
Science of science studies (in particular, studies that shed light on phenomena that can motivate future research in above-mentioned areas), and
Datasets, Resources (e.g. treebanks, knowledge bases), and Tools (e.g. software libraries, annotation interfaces) for conducting research in such areas.

We welcome research relevant to processing text in any domain of science (e.g. Biology, Medicine, Computer Science, Physics, Economics, Sociology, etc.) that can come from a variety of text sources (e.g. scholarly papers, surveys and technical reports, patents, tweets by scholars, blogs/tutorials, etc.)

Registration

Registration for SciNLP will be through the AKBC 2020 conference. There is a fee of $30 for students and $50 for non-students. Registering gives you access to the full conference & workshops. There is no special registration for workshop only.
