Despite the importance of the sense of smell, the structure-odor relationship is not well understood [1,2]. Machine learning methods offer an opportunity to better understand how a molecule's structure influences its odor. However, the relevant datasets are typically too small to serve as a training basis for such models, so it is desirable to combine them [3]. This is a challenging task because odor perception is subjective, which leads to varying verbal descriptions of the olfactory properties of one and the same substance [4].
In this work, we investigate the disparity between verbal olfactory descriptions across different data sources. Two odor datasets are combined, and the annotations of the molecules that appear in both are analyzed. Using a pretrained natural language processing (NLP) model, we transform the annotations into an embedding space. We then examine the similarity of these embeddings across the two datasets and correlate the similarities with the corresponding molecular descriptors.
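The comparison of annotations for overlapping molecules can be illustrated with a minimal sketch. It is not the method used in this work: a simple bag-of-words vector stands in for the pretrained NLP embedding, and the molecule identifiers, annotations, and function names (`tokenize`, `embed`, `cosine_similarity`) are hypothetical placeholders chosen for illustration only.

```python
import math
from collections import Counter


def tokenize(annotation):
    """Split a free-text odor annotation into lowercase word tokens."""
    return annotation.lower().replace(",", " ").split()


def embed(annotation, vocabulary):
    """Stand-in embedding (assumption): a bag-of-words count vector.

    A pretrained language model would return a dense vector here instead.
    """
    counts = Counter(tokenize(annotation))
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical annotations keyed by a placeholder molecule identifier.
dataset_a = {"mol_1": "fruity sweet apple", "mol_2": "woody earthy"}
dataset_b = {"mol_1": "sweet fruity pear", "mol_2": "green floral", "mol_3": "minty"}

# Only molecules annotated in both datasets can be compared.
overlap = sorted(set(dataset_a) & set(dataset_b))

# Shared vocabulary so that both datasets map into the same vector space.
vocabulary = sorted(
    {
        word
        for dataset in (dataset_a, dataset_b)
        for text in dataset.values()
        for word in tokenize(text)
    }
)

# One similarity score per overlapping molecule.
similarities = {
    mol: cosine_similarity(
        embed(dataset_a[mol], vocabulary),
        embed(dataset_b[mol], vocabulary),
    )
    for mol in overlap
}

for mol, score in similarities.items():
    print(f"{mol}: {score:.3f}")
```

A bag-of-words vector scores "apple" and "pear" as entirely unrelated, whereas a pretrained embedding would be expected to place such related descriptors close together. The resulting per-molecule similarity scores are the kind of quantity that could then be correlated with molecular descriptors.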
 Thomas Gorges