PLDock

Machine learning is increasingly used to accelerate drug design and discovery. Recently, machine learning models have played a key role in structure-based virtual screening, where they help analyze protein-ligand interactions and predict binding affinities faster and more accurately, a task known as protein-ligand docking. Existing datasets, such as PDBbind, support the training and evaluation of classical protein-ligand docking approaches, but they lack what machine learning models need: large amounts of data, rich features, and structured data splits for evaluating generalization.
In this work, we introduce PLDock, a comprehensive dataset for neural protein-ligand docking, together with a tool for training and evaluating machine learning-based protein-ligand docking models. We designed protein-ligand docking tasks, splits, baselines, and metrics based on real-world scenarios. Currently, PLDock provides more than 70,000 protein-ligand complex structures, more than 150,000 protein-ligand affinity data points, 3 typical tasks, 5 types of structured data splits, and 9 evaluation metrics. For ease of use, all data and scripts are provided in commonly used formats and are freely available online.
PLDock was developed to address the following concerns about machine learning protein-ligand docking models:
To make PLDock suitable for training and evaluating machine learning docking models, we collected and exposed many important features for each sample. These features were carefully selected based on our research experience and biochemical domain knowledge of protein-ligand docking. They were obtained either by downloading them from databases or by computing them with software. Table 1 summarizes the types, sources, and brief descriptions of these features.

In practice, models are trained on old data to make predictions on new data, for example training docking models on samples of existing targets to screen drugs for newly discovered targets. This can be viewed as predicting on data that is out-of-distribution (OOD) with respect to the training set. To mimic such real-world application scenarios, we use various structured data splits, such as a temporal split, to assess the model’s generalization capabilities.
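As an illustration of the temporal split idea, here is a minimal sketch in Python. It is not PLDock's actual implementation: the `Complex` record, its `pdb_id` and `release_year` fields, the `temporal_split` function, the cutoff year, and the sample entries are all hypothetical names and placeholder values chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Complex:
    """Hypothetical record for one protein-ligand complex (not PLDock's schema)."""
    pdb_id: str        # placeholder identifier
    release_year: int  # assumed to be available as per-sample metadata


def temporal_split(samples, cutoff_year):
    """Put complexes released before cutoff_year in train, the rest in test.

    The test set then contains only complexes that are newer than
    everything the model saw during training.
    """
    train = [s for s in samples if s.release_year < cutoff_year]
    test = [s for s in samples if s.release_year >= cutoff_year]
    return train, test


# Placeholder samples, invented for illustration only.
samples = [
    Complex("cplx1", 2005),
    Complex("cplx2", 2018),
    Complex("cplx3", 2019),
    Complex("cplx4", 2021),
]

train, test = temporal_split(samples, cutoff_year=2019)
print([s.pdb_id for s in train])
print([s.pdb_id for s in test])
```

Because the split depends only on a date field, no complex can land in both sets, and evaluation always runs on data released after the training data.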
To assess the utility of the PLDock dataset, we evaluated GNINA, one of the latest machine learning-based protein-ligand docking methods, on the three PLDock tasks.



Code: github
Data: GoogleDrive