SciNLP

Links

Login to Live Virtual Hub
Workshop schedule
Poster presentations
Panel Q&A: SciNLP during an epidemic
Register

Invited speakers!

Ryan McDonald, Google Research
Vivi Nastase, University of Stuttgart
David Jurgens, University of Michigan
Robert Stojnic, PapersWithCode
Iris Shen, Microsoft Research
Doug Downey, Allen Institute for AI
Hoifung Poon, Microsoft Research
Asma Ben Abacha, National Library of Medicine
Byron Wallace, Northeastern University

SciNLP: Natural Language Processing and Data Mining for Scientific Text

Talk recordings are now live on YouTube!

Transcripts of talk & panel Q&A are available here.

The 1st SciNLP workshop at AKBC 2020 has concluded. Thanks to all who participated (invited speakers, poster presenters, attendees)!

Please take some time to fill out our Feedback Form. We’ll use this to improve our processes & gauge interest in future SciNLP events.

Contact

Feel free to contact us at scinlp@googlegroups.com or on Twitter via #SciNLP!

Join the mailing list to receive announcements.

Organizers:

Schedule

Date / Time: June 25th from 8AM - 1PM PDT (UTC-7).

Total estimated 300 minutes (5 hrs). All times are in PDT (UTC-7):

Each invited talk is roughly 20 min (plus 5 min after for Q&A and buffer time for transitions).

Poster sessions

Due to the large number of submissions this year, we’ve split the presentations into three poster sessions to allow time for discussion and questions.

The Poster Session IDs correspond to Zoom rooms after logging into the virtual conference hub.

YouTube playlists

Individual submissions

All accepted abstracts & YouTube video presentations are listed below:

Poster Session | Title | Authors | Links
--- | --- | --- | ---
1 | Automatic Extraction of Risk Factors from COVID-19 Literature | Francis Wolinski | abs vid
1 | Semi-Automated Information Extraction to Improve Scientific Knowledge Discovery in Environmental Health Science Literature | Vickie R. Walker, Andrew A. Rooney, Nicole C. Kleinstreuer, Mary S. Wolfe, Charles P. Schmitt | abs vid
1 | Softcite: Automatic Extraction of Software Mentions in Research Literature | Caifan Du, James Howison, Patrice Lopez | abs vid
1 | Biomedical Synonym Normalization for Knowledge Graph Insight Generation | Eric Tanalski, Richard Wendell | abs vid
1 | Normalization of Predominant and Long-tail Bacterial Entities with a Hybrid CNN-LSTM and Knowledge-Driven Model | William Hogan, Raghav Mehta, Yoshiki Vazquez-Baeza, Yannis Katsis, Ho-Cheol Kim, Chun-Nan Hsu | abs vid
1 | Extracting a knowledge base of mechanisms from COVID-19 papers | Aida Amini,* Tom Hope,* David Wadden, Roy Schwartz, Hannaneh Hajishirzi | abs vid
1 | DARE: Data Augmented Relation Extraction with GPT-2 | Yannis Papanikolaou, Andrea Pierleoni | abs vid
1 | Extraction of causal structure from procedural text for discourse representations | Michael Regan, James Pustejovsky, William Croft | abs vid
1 | Semi-Open Relation Extraction from Scientific Texts | Ruben Kruiper, Julian F.V. Vincent, Jessica Chen-Burger, Marc P.Y. Desmulliez, Ioannis Konstas | abs vid
2 | Incorporating Knowledge Bases into SciBERT and BioBERT pre-trained language models | Abdullah Kiwan, Sven Giesselbach, Stefan Rüping | abs vid
2 | Construction and Applications of TeKnowbase | Prajna Upadhyay, Maya Ramanath | abs vid
2 | AAN: Developing Educational Tools for Work Force Training | Alexander R. Fabbri, Irene Li, Swapnil Hingmire, Dragomir Radev | abs vid
2 | Estimation of Research Communities in the Multilingual Academic Data | Ilya Rahkovsky, Jennifer Melot | abs vid
2 | Identifying the Development and Application of Artificial Intelligence in Scientific Text | James Dunham, Jennifer Melot, Dewey Murdick | abs vid
2 | Researcher-in-the-loop for systematic reviewing of text databases | Rens van de Schoot, Jonathan de Bruin | abs vid
2 | Building an AI-powered Literature Review for COVID-19 | Jan Bremer, Maikel Boot, Lucas Buyon, Paul Mooney, Tayab Waseem | abs vid
2 | Using Text Data Mining to Analyze the NIH’s Response to the Coronavirus Pandemic | Megan Donegan,* Nancy Praskievicz,* Jacob Scholl, Kirk Baker, Judy Riggie, Sheryl Brining | abs vid
2 | CoronaWhy: Building a Distributed, Credible and Scalable Research and Data Infrastructure for Open Science | Vyacheslav Tykhonov, Anton Polishko, Artur Kiulian, Maksym Komar | abs vid
3 | COVIDExplorer: Exploring the Universe of COVID-19 Knowledge | Heer Ambavi, Kavita Vaishnaw, Udit Vyas, Abhisht Tiwari, Mayank Singh | abs vid
3 | COVIDScholar: AI-powered rapid data gathering, analysis, and dissemination | John Dagdelen, Amalie Trewartha, Haoyan Huo, Tanjin He, Kevin Cruse, Zheren Wang, Yuxing Fei, Akshay Subramanian, Kristin Persson, Gerbrand Ceder | abs vid
3 | Building a biomedical literature knowledge graph and automatic screening of biomedical abstracts using knowledge graph embeddings | Iqra Muhammad | abs vid
3 | Interactive Extractive Search over Biomedical Corpora | Hillel Taub-Tabib, Micah Shlain, Shoval Sadde, Dan Lahav, Matan Eyal, Yaara Cohen, Yoav Goldberg | abs vid
3 | Combining Neural and Pattern-Based Similarity Search | Shauli Ravfogel, Hillel Taub-Tabib, Yoav Goldberg | abs vid
3 | TopicForest: A Prototype Discovery Engine | Soheil Danesh | abs vid
3 | CORD-19 visualization using dynamic evidence gap maps | Aravind Mohanoor | abs vid
3 | Orion: An interactive information retrieval system for scientific knowledge discovery | Kostas Stathoulopoulos, Zac Ioannidis, Lilia Villafuerte | abs vid

Panel discussion: The role of scientific NLP during an epidemic

In light of the activity from the computing community to help with the current virus epidemic, we felt it important to hold a panel discussion on the role of NLP and text mining over scientific text (in particular biomedical literature).

We’ve invited Asma Ben Abacha, Hoifung Poon, and Byron Wallace to share their thoughts and answer questions.

We will be curating questions from the audience beforehand (as well as live during the discussion).

Invited talks

Asma Ben Abacha

Insights from the Organization of International Challenges on Artificial Intelligence in Medical Question Answering

Artificial intelligence (AI) is playing an increasingly important role in our access to information. However, a one-size-fits-all approach is suboptimal, especially in the medical domain where health-related information is more sensitive due to its potential impact on public health, and where domain-specific aspects such as technical language and case or context-based interpretation have to be taken into account. Bridging the gap between several research areas such as AI, NLP, medical informatics, and computer vision is a promising way to achieve reliable and efficient access to medical information. In recent years, I have organized several international challenges to promote research efforts in medical question answering. The organization of these competitions raised key questions in data design, evaluation metrics, and problem formulation. It also offered valuable insights on the critical subtasks that need to be solved, and on the most promising solutions in challenging problems such as restricted training data and multidisciplinary tasks. In this talk, I will share all these insights and the promising perspectives in the addressed tasks, including textual question answering and visual question answering.

Dr. Asma Ben Abacha is a staff scientist at the U.S. National Institutes of Health (NIH), National Library of Medicine (NLM), Lister Hill National Center for Biomedical Communications. Prior to joining the NLM in 2015, she was a researcher at the Luxembourg Institute of Science and Technology and lecturer at the University of Lorraine, France. Dr. Ben Abacha received a Ph.D. in computer science from Paris 11 University, France, a research master’s degree from Paris 13 University, and a software engineering degree from the National School of Computer Sciences (ENSI), Tunisia. She is currently working on medical question answering, visual question answering, and NLP-related projects in the medical domain.

Doug Downey

Mining the Citation Graph for Representation Learning and Concept Extraction

The exploding pace of scientific publication has led to a pressing need for tools that automatically make sense of the scientific literature. In this talk, I will describe two recent, simple methods for mining the citation graph to extract meaning from scientific documents. First, representation learning forms the foundation of today’s natural language processing systems, and large pretrained language models (LMs) like BERT learn powerful representations for short texts like words and sentences. But, naively applying the models to produce representations for entire scientific documents, which are necessary for many applications, is ineffective. I will introduce SPECTER, a method for producing scientific document representations using a pretrained LM that is able to achieve state-of-the-art performance by fine-tuning the LM on the citation graph as a signal of document relatedness. Second, I will describe a new concept extraction technique called ForeCite that uses the intuition that new concepts tend to be introduced or popularized by a single paper. By mining this signal from the citation graph, ForeCite achieves much higher precision than previous techniques.

Doug Downey is a research scientist at the Allen Institute for AI, where he works on the Semantic Scholar team, and also an associate professor of Computer Science at Northwestern University. His research interests involve information extraction, natural language processing, and machine learning, with a particular focus on automatically extracting knowledge from large corpora to power new search and browsing experiences. He has won a best paper award at IJCAI, along with an NSF CAREER award, election to the DARPA Computer Science Study Group, and a Microsoft New Faculty Fellowship.

David Jurgens

Putting a Face on Science: Analyzing Author Mentions in Science Journalism Reveal Wide-Spread Ethnic Bias

Media outlets play a key role in spreading scientific knowledge to the general public and raising the profile of researchers among their peers. Yet, given social biases and attention constraints, not all scholars receive equal media coverage. In this talk, I will describe a large-scale study across hundreds of thousands of news stories that uncovers systematic ethnic bias in which authors journalists mention by name when covering science. Using NLP techniques to analyze these stories and controlling for confounds, I will show that this ethnic bias is consistent across multiple types of news media, with even larger disparities for long-form journalism focused on science.

David Jurgens is an Assistant Professor at the University of Michigan in the School of Information and by courtesy in the Department of Computer Science. He received his PhD in Computer Science from UCLA. His research in computational social science combines new methods from natural language processing and data science to discover, explain, and predict human behavior in large social systems.

Ryan McDonald

End-to-end Neural Models for Evidence Retrieval from Biomedical Literature

In this talk I will highlight some ongoing efforts at Google on improving discovery from biomedical literature. Most of the talk will focus on document and evidence retrieval, which is the most common entry point for literature tools. I will discuss three specific technical contributions: 1) synthetic question generation to train biomedical-targeted first-stage retrieval models; 2) retrieval models that encode sparse and dense representations in intuitive and flexible ways, which is critical for the domain; and 3) a joint model for document and evidence retrieval that significantly improves the system’s ability to select relevant pieces of evidence from returned documents. All topics in the presentation are key technologies in https://covid19-research-explorer.appspot.com/ and http://cslab241.cs.aueb.gr:5000/, as well as in Google’s or AUEB’s submissions to the annual BioASQ challenge. This is joint work with many colleagues at Google Research and the Athens University of Economics and Business.

Ryan McDonald has been a research scientist at Google since 2006 and an associate researcher at the Athens University of Economics and Business since 2017. In that time he has been involved in various research efforts that have had user impact on a number of Google products, including Search, Assistant, Translate and Cloud. Prior to Google he completed a PhD at the University of Pennsylvania, which focused on new models for multilingual dependency parsing. This work continued at Google and culminated in the creation of the Universal Dependencies project, co-founded by his team at Google and numerous external collaborators. He currently works on discovery from biomedical literature and other productivity-driven NLP challenges.

Vivi Nastase

Looking for the dark matter within knowledge graphs

Knowledge graphs contain much useful information directly available, but also hidden information that could be leveraged in a variety of ways. Some of this dark matter includes negative instances, missing links, missing nodes and obfuscated patterns. Uncovering and using this hidden information can lead to bigger and more complete graphs, and also to a better understanding of the interaction between structured knowledge in knowledge graphs and unstructured knowledge in text collections. In this talk I will show our exploration of this dark matter in some of the most commonly used KGs – Freebase, NELL and WordNet – and discuss how the different nature of each of these graphs influenced our search and what we found.

Vivi Nastase is a research associate in the Institute for Natural Language Processing at the University of Stuttgart. She obtained a PhD from the University of Ottawa, Canada on the topic of semantic relations. She works mainly on lexical semantics, semantic relations, knowledge acquisition and language evolution, and has published about 100 articles on these topics, including a book on “Semantic Relations between Nominals” in the series Synthesis Lectures on Human Language Technologies.

Hoifung Poon

Machine Reading for Precision Medicine

The advent of big data promises to revolutionize medicine by making it more personalized and effective, but big data also presents a grand challenge of information overload. For example, tumor sequencing has become routine in cancer treatment, yet interpreting the genomic data requires painstakingly curating knowledge from a vast biomedical literature, which grows by thousands of papers every day. Electronic medical records contain valuable information to speed up clinical trial recruitment and drug development, but curating such real-world evidence from clinical notes can take hours for a single patient. Natural language processing (NLP) can play a key role in interpreting big data for precision medicine. In particular, machine reading can help unlock knowledge from text by substantially improving curation efficiency. However, standard supervised methods require labeled examples, which are expensive and time-consuming to produce at scale. In this talk, I’ll present Project Hanover, where we overcome the annotation bottleneck by combining deep learning with probabilistic logic, and by exploiting self-supervision from readily available resources such as ontologies and databases. This enables us to extract knowledge from millions of publications, reason efficiently with the resulting knowledge graph by learning neural embeddings of biomedical entities and relations, and apply the extracted knowledge and learned embeddings to supporting precision oncology.

Hoifung Poon is the Senior Director of Biomedical NLP at Microsoft Research and an affiliated professor at the University of Washington Medical School. He leads Project Hanover, with the overarching goal of structuring medical data for precision medicine. He has given tutorials on this topic at top conferences such as the Association for Computational Linguistics (ACL) and the Association for the Advancement of Artificial Intelligence (AAAI). His research spans a wide range of problems in machine learning and natural language processing (NLP), and his prior work has been recognized with Best Paper Awards from premier venues such as the North American Chapter of the Association for Computational Linguistics (NAACL), Empirical Methods in Natural Language Processing (EMNLP), and Uncertainty in AI (UAI). He received his PhD in Computer Science and Engineering from the University of Washington, specializing in machine learning and NLP.

Iris Shen

Enriching a Web-scale Scientific Taxonomy by Combining Textual and Structural Information

Scientific knowledge is evolving at an unprecedented rate, with new concepts and relationships constantly being discovered from the millions of academic articles being published every month. The Microsoft Academic Graph (MAG) provides a comprehensive, cross-domain scientific taxonomy covering more than 550k concepts. This fast-growing volume of scientific literature accentuates a pressing need for automated capture of emerging knowledge with an updated web-scale taxonomy. In this talk, we introduce two major efforts currently underway to enable MAG to achieve this automated capture with minimal supervision. First, we leverage a BERT-based pre-trained language model (LM) and a web search API to identify candidate concept phrases from textual information in the latest publications. Second, we apply a self-supervised position-enhanced graph neural network (GNN) that encodes local structural information to expand our taxonomy with newly discovered concepts. These two approaches achieve highly accurate concept identification results, and indicate significant improvement of our taxonomy expansion compared with previous approaches. We also discuss the challenges and lessons learned while integrating these state-of-the-art LM and GNN models into the MAG system.

Iris Shen is a principal data scientist at Microsoft Research and holds a Ph.D. in Operations Research from the University of Southern California. She is the data science manager for the Microsoft Academic project, which uses state-of-the-art AI research to assist humans in scientific exploration. Her current research interests are leveraging techniques in data mining, natural language processing, and recommender systems to explore and understand large-scale document corpora with associated networked systems.

Robert Stojnic

An Introduction to Papers with Code

This talk is an introduction to Papers with Code - a free resource for researchers and practitioners to find and follow the latest state-of-the-art ML papers and code. I will go deeper into the open dataset underlying Papers with Code - the collection of ML papers, code, tasks and results, with links between them. I will talk about the challenges of augmenting this resource and keeping it up to date using NLP techniques.

Robert is the co-creator of Papers with Code and a software engineer at Facebook AI. Robert started his career as one of the early developers of Wikipedia, where he built the internal search engine. He went on to do a PhD in Applied ML in Computational Biology at the University of Cambridge. He co-founded a couple of start-ups, and co-created Papers with Code with Ross Taylor. Currently he is at Facebook AI in London, where he is working on Papers with Code, and is passionate about open science and open access.

Byron Wallace

What does the evidence say? Models to help make sense of the biomedical literature

How do we know if a particular medical intervention actually works better than the alternatives for a given condition and outcome? Ideally one would consult all available evidence from relevant trials that have been conducted to answer this question. Unfortunately, such results are primarily disseminated in natural language articles that describe the conduct and results of clinical trials. This imposes a substantial burden on physicians and other domain experts trying to make sense of the evidence. In this talk I will discuss work on designing tasks, corpora, and models that aim to realize natural language technologies that can extract key attributes of clinical trials from articles describing them, and infer the reported findings regarding these. The hope is to use such methods to help domain experts (such as physicians) better access and make sense of unstructured biomedical evidence.

Byron Wallace is an assistant professor in the Khoury College of Computer Sciences at Northeastern University. He holds a PhD in Computer Science from Tufts University, where he was advised by Carla Brodley. He has previously held faculty positions at the University of Texas at Austin and at Brown University. His research is in machine learning and natural language processing, with an emphasis on their application in health informatics.

[CLOSED] Call for abstracts

We welcome submissions of short abstracts (1 page max) related to the above research areas. Submissions may include previously published results, late-breaking results, and work in progress. Relevant submissions will be accepted for video presentation in the virtual poster session. The workshop is non-archival, so participants are free to also submit their work for publication elsewhere.

To submit an abstract, please send an email to scinlp@googlegroups.com with the subject line “SCINLP submission: [TITLE]”. Please include:

Writing guidelines: These abstracts can be longer than the typical abstracts for a full research paper. Figures and tables are allowed, but will count toward the length limit. References will not count toward the length limit. Abstracts do not have to be about a single paper; we allow abstracts that summarize a collection of works under a unified theme (e.g., a series of closely-related papers that build on each other or tackle a common problem). For writing examples, see the accepted abstracts from last year’s SLKB workshop at AKBC 2019.

All accepted abstracts (and their videos) will be made available online prior to the workshop and remain accessible afterwards.

If you have a disability and require accommodation in order to fully participate in the workshop, please let us know and we’ll be in touch to discuss how we can best address your needs.

Can’t make it? Check out these other upcoming workshops related to NLP and text mining over scientific text:

About

The primary goal of this half-day workshop is to bring together researchers from diverse fields who are interested in extracting and representing knowledge from scientific text, and/or applications or methods for improving access to and understanding of such knowledge. Such research includes, but is not limited to:

We welcome research relevant to processing text in any domain of science (e.g. Biology, Medicine, Computer Science, Physics, Economics, Sociology, etc.) that can come from a variety of text sources (e.g. scholarly papers, surveys and technical reports, patents, tweets by scholars, blogs/tutorials, etc.).

Registration

Registration for SciNLP will be through the AKBC 2020 conference. There is a fee of $30 for students and $50 for non-students. Registering gives you access to the full conference & workshops. There is no separate workshop-only registration.
