Abstract

A large amount of materials science knowledge is generated and stored as text published in peer-reviewed scientific literature. While recent developments in natural language processing, such as Bidirectional Encoder Representations from Transformers (BERT) models, provide promising information extraction tools, these models may yield suboptimal results when applied to the materials domain, since they are not trained on materials science-specific notation and jargon. Here, we present a materials-aware language model, namely MatSciBERT, trained on a large corpus of peer-reviewed materials science publications. We show that MatSciBERT outperforms SciBERT, a language model trained on a scientific corpus, and establishes state-of-the-art results on three downstream tasks: named entity recognition, relation classification, and abstract classification. We make the pre-trained weights of MatSciBERT publicly accessible for accelerated materials discovery and information extraction from materials science texts.

Details

Title
MatSciBERT: A materials domain language model for text mining and information extraction
Author
Gupta Tanishq 1; Zaki Mohd 2; Anoop, Krishnan N M 3; Mausam 4

1 Indian Institute of Technology Delhi, Hauz Khas, Department of Mathematics, New Delhi, India (GRID:grid.417967.a) (ISNI:0000 0004 0558 8755)
2 Indian Institute of Technology Delhi, Hauz Khas, Department of Civil Engineering, New Delhi, India (GRID:grid.417967.a) (ISNI:0000 0004 0558 8755)
3 Indian Institute of Technology Delhi, Hauz Khas, Department of Civil Engineering, New Delhi, India (GRID:grid.417967.a) (ISNI:0000 0004 0558 8755); Indian Institute of Technology Delhi, Hauz Khas, School of Artificial Intelligence, New Delhi, India (GRID:grid.417967.a) (ISNI:0000 0004 0558 8755)
4 Indian Institute of Technology Delhi, Hauz Khas, School of Artificial Intelligence, New Delhi, India (GRID:grid.417967.a) (ISNI:0000 0004 0558 8755); Indian Institute of Technology Delhi, Hauz Khas, Department of Computer Science and Engineering, New Delhi, India (GRID:grid.417967.a) (ISNI:0000 0004 0558 8755)
Publication year
2022
Publication date
2022
Publisher
Nature Publishing Group
e-ISSN
2057-3960
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
2658986989
Copyright
© The Author(s) 2022. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.