为啥需要Scaffold Split ?(附Scaffold Split代码)
在生信小分子中,通常有scaffold split和random split两种划分数据集的方式,特别是在MoleculeNet中,而基于scaffold的任务要比random split的任务更难,而且更有意义。因为:
论文 Analyzing Learned Molecular Representations for Property Prediction 中提到:
-
a scaffold-based split of the training and testing data is a good approximation of the temporal split commonly used in industry in terms of the relevant metrics. By contrast, a purely random split is a poor approximation to a temporal split。基于scaffold split的训练和测试数据与常用的时间拆分在相关指标方面是一个很好的近似。相比之下,纯粹的random split是时间拆分的糟糕近似 a meaningful evaluation of property prediction models needs to account explicitly for scaffold overlap between train and test data in light of generalization requirements。根据泛化要求,有意义的性能预测模型评估需要明确考虑训练和测试数据之间的支架重叠。
scaffold split代码:
import os import csv import math import numpy as np import torch import torch.nn.functional as F import rdkit from rdkit import Chem from rdkit.Chem import AllChem from rdkit.Chem.Scaffolds.MurckoScaffold import MurckoScaffoldSmiles def _generate_scaffold(smiles, include_chirality=False): mol = Chem.MolFromSmiles(smiles) scaffold = MurckoScaffoldSmiles(mol=mol, includeChirality=include_chirality) return scaffold def generate_scaffolds(dataset, log_every_n=1000): scaffolds = {} data_len = len(dataset) print(data_len) print("About to generate scaffolds") for ind, smiles in enumerate(dataset.smiles_data): if ind % log_every_n == 0: print("Generating scaffold %d/%d" % (ind, data_len)) scaffold = _generate_scaffold(smiles) if scaffold not in scaffolds: scaffolds[scaffold] = [ind] else: scaffolds[scaffold].append(ind) # Sort from largest to smallest scaffold sets scaffolds = {key: sorted(value) for key, value in scaffolds.items()} scaffold_sets = [ scaffold_set for (scaffold, scaffold_set) in sorted( scaffolds.items(), key=lambda x: (len(x[1]), x[1][0]), reverse=True) ] return scaffold_sets def scaffold_split(dataset, valid_size, test_size, seed=None, log_every_n=1000): train_size = 1.0 - valid_size - test_size scaffold_sets = generate_scaffolds(dataset) train_cutoff = train_size * len(dataset) valid_cutoff = (train_size + valid_size) * len(dataset) train_inds: List[int] = [] valid_inds: List[int] = [] test_inds: List[int] = [] print("About to sort in scaffold sets") for scaffold_set in scaffold_sets: if len(train_inds) + len(scaffold_set) > train_cutoff: if len(train_inds) + len(valid_inds) + len(scaffold_set) > valid_cutoff: test_inds += scaffold_set else: valid_inds += scaffold_set else: train_inds += scaffold_set return train_inds, valid_inds, test_inds
参考文献:
Yang, Kevin, et al. "Analyzing learned molecular representations for property prediction." Journal of chemical information and modeling 59.8 (2019): 3370-3388.