Non-covalent interactions are ubiquitous and important in large organic complexes, and their long-range nature makes them challenging for quantum-mechanical (QM) methods as well as machine learning approaches. To facilitate the understanding and modelling of non-covalent interactions in organic systems, we have developed the DIM42 dataset, a high-fidelity dataset containing 42 chemically diverse molecular dimers, each composed of a large drug-like molecule and a small aromatic monomer positioned at one of the available binding sites. The molecular dimers contain up to 65 atoms (with chemical composition including C, N, O, H, Cl, F, P, S), and their structures were optimised using the PBE0 hybrid functional supplemented with a treatment of many-body dispersion (MBD) interactions. Among the several QM properties stored in DIM42, we provide, in particular, the binding energies of these dimers at different levels of theory, including but not limited to CCSD(T), Diffusion Monte Carlo, Density Functional Theory, and Density Functional Tight Binding. Furthermore, a subset of the molecular dimers was subsequently chosen for binding curve calculations at the corresponding levels of theory. The results elucidate the challenges faced by various QM methods in accurately capturing non-covalent interactions in large molecular complexes, as compared to high-fidelity benchmarks. We found that the binding energies of the equilibrium dimers span the range from 0.2 to 1.1 eV, independently of the level of theory used in the calculation. We expect that the DIM42 dataset will pave the way to developing and improving physical and machine learning models for accurately investigating organic systems such as biomolecules and protein-ligand systems.
 Mirela Puleva