The discovery and characterization of novel materials are crucial for the development of new technology. Finding suitable materials for specific applications, however, is challenging because the requirements on their properties are diverse and sometimes conflicting. The decreasing cost of computing material properties and the recent development of data infrastructures have drastically increased the amount of available materials data. Because these data were computed for various purposes, they rely on different physical approximations and numerical parameters. This heterogeneity poses significant challenges in integrating and comparing data from different sources.
In this thesis, we use descriptors and metrics to quantitatively evaluate the similarity between different materials, each represented by an individual calculation. To this end, we developed a computational framework that allows users to compose and manage datasets, specify and compute different descriptors and metrics, compute similarity matrices, and apply methods of unsupervised machine learning. We furthermore present a spectral fingerprint, i.e., a novel descriptor that encodes spectra as binary-valued raster images, allowing us to quantify the similarity of spectral quantities such as the electronic density of states or optical absorption spectra.
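To illustrate the idea of encoding a spectrum as a binary-valued raster image, the following is a minimal sketch, not the implementation used in the thesis. The grid layout, the function names `spectral_fingerprint` and `tanimoto`, all parameters, and the choice of the Tanimoto coefficient as similarity measure are assumptions made for this example; the abstract does not specify them.

```python
import numpy as np


def spectral_fingerprint(energies, values, e_min, e_max,
                         n_cols=8, n_rows=4, v_max=1.0):
    """Rasterize a spectrum into a flattened binary image.

    Hypothetical layout: the energy window [e_min, e_max) is split into
    n_cols columns; in each column, cells are filled from the bottom up
    in proportion to the mean spectral value (clipped at v_max).
    """
    energies = np.asarray(energies, dtype=float)
    values = np.asarray(values, dtype=float)
    edges = np.linspace(e_min, e_max, n_cols + 1)
    grid = np.zeros((n_rows, n_cols), dtype=bool)
    for col in range(n_cols):
        mask = (energies >= edges[col]) & (energies < edges[col + 1])
        if not mask.any():
            continue  # no sample points in this column: leave it empty
        mean_val = values[mask].mean()
        filled = int(np.clip(np.floor(mean_val / v_max * n_rows), 0, n_rows))
        grid[:filled, col] = True
    return grid.ravel()


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two binary vectors (assumed metric)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.count_nonzero(a | b)
    if union == 0:
        return 1.0  # two empty fingerprints are treated as identical
    return np.count_nonzero(a & b) / union


# Usage: compare a flat spectrum with itself and with an empty one.
energies = np.linspace(0.0, 1.0, 100, endpoint=False)
fp_full = spectral_fingerprint(energies, np.ones(100), 0.0, 1.0)
fp_empty = spectral_fingerprint(energies, np.zeros(100), 0.0, 1.0)
print(tanimoto(fp_full, fp_full), tanimoto(fp_full, fp_empty))
```

Because the descriptor is binary, comparing two spectra reduces to counting shared and total filled cells, which keeps large pairwise comparisons cheap. The resolution (`n_cols`, `n_rows`) and the cutoff `v_max` control how much detail of the spectrum survives the encoding.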
We apply our methodology to assess the quality of materials data and to explore large data spaces. We demonstrate with various examples that the spectral fingerprint can be used to quantitatively describe the differences between theoretical results obtained with different physical approximations or numerical parameters, or between results stemming from independent experiments. By applying our methods to larger datasets, we identify and visualize the correlations between the precision of computational results and the relevant numerical parameters. This also allows us to find calculations based on different parameters that yield very similar results. To explore large data spaces, we conduct similarity searches on materials data, which reveal unexpected similarities between materials with different compositions. Furthermore, we use a clustering algorithm to find sets of materials with similar electronic structure. We identify and rationalize the main mechanisms leading to these similarities. Importantly, we find outliers that cannot be explained by simple rules. Finally, we compare the results of clustering with different similarity measures, showcasing correlations between them.
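The similarity matrices and similarity searches mentioned above can be sketched as follows. This is an illustrative example under stated assumptions, not the framework developed in the thesis: the function names `tanimoto_matrix` and `most_similar` are hypothetical, the Tanimoto coefficient is an assumed choice of metric, and the four toy fingerprints are made up for demonstration only.

```python
import numpy as np


def tanimoto_matrix(fingerprints):
    """Pairwise Tanimoto similarity for a set of binary fingerprints.

    `fingerprints` is an (n_materials, n_bits) array-like of 0/1 values.
    """
    X = np.asarray(fingerprints, dtype=bool).astype(np.int64)
    inter = X @ X.T                      # shared set bits per pair
    counts = X.sum(axis=1)               # set bits per fingerprint
    union = counts[:, None] + counts[None, :] - inter
    sim = np.ones(inter.shape, dtype=float)  # empty-vs-empty pairs stay 1.0
    np.divide(inter, union, out=sim, where=union > 0)
    return sim


def most_similar(sim, query, k=3):
    """Return the k entries most similar to `query` as (index, similarity)."""
    order = np.argsort(-sim[query], kind="stable")
    order = order[order != query]        # exclude the query itself
    return [(int(i), float(sim[query, i])) for i in order[:k]]


# Toy data (hypothetical): four 4-bit fingerprints.
fps = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]
sim = tanimoto_matrix(fps)
print(most_similar(sim, query=0, k=2))
```

A precomputed similarity matrix of this kind can also serve as input to a clustering algorithm, for example after converting it to a distance matrix via `1 - sim`; which algorithm the thesis uses is not stated in this abstract.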
Details
Digital Object Identifier;
Big Data;
Machine learning;
Software;
Interoperability;
Application programming interface;
Materials science;
Electrons;
Computer peripherals;
Symmetry;
Eigenvalues;
Energy;
Similarity measures;
Clustering;
Artificial intelligence;
Atomic physics;
Computer science;
Information technology