Polycyclic aromatic systems (PASs) are among the most prevalent and impactful classes of compounds in the natural and man-made worlds. Though aromatic systems have captured the fascination of chemists for almost two centuries, a general conceptual framework for understanding and predicting the structure-property relationships of polycyclic systems remains elusive. We address this gap using a combination of computational chemistry and data science tools. We established the COMPAS Project (a COMputational database of Polycyclic Aromatic Systems), which already contains over 500k molecules in three datasets: cata-condensed polybenzenoid hydrocarbons (COMPAS-1),1 cata-condensed hetero-PASs (COMPAS-2),2 and peri-condensed polybenzenoid hydrocarbons (COMPAS-3).3 With COMPAS in hand, we demonstrate the first cases of interpretable learning models in the chemical space of PASs. To this end, we developed two types of molecular representation: (a) a text-based representation4 and (b) a graph-based representation,5 which not only achieve higher predictive ability with less data, but are also amenable to interpretation, thus allowing the extraction of chemical insight from the model.

Using the COMPAS database and our dedicated representations, we implemented the first guided diffusion-based model for inverse design of PASs: GaUDI.6 Our model generates new PASs with defined target properties. In addition to its flexible target function and high validity scores, GaUDI can also design molecules with properties beyond the distribution of the training data.
 Renana Gershoni-Poranne