Hm, wouldn’t splitting the training data by gene be even fairer? Train on 90% of the genes and evaluate on the 10% held out? The promise of mutation prediction seems to be that we’d generalize the information captured in saturation mutagenesis models beyond the set of genes that have been studied. It would also be very interesting (but absolutely horrendously expensive, unless there are ESM snapshots from partway through pretraining) to plot base training data volume vs. R2; I think R2 might be related to “volume seen in base training”.
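For concreteness, here is a minimal sketch of what I mean by a gene-level split, in plain Python. The record layout (`gene`, `variant`, `score` keys), the toy data, and the `split_by_gene` helper are all placeholders of mine, not anything from an existing pipeline:

```python
import random


def split_by_gene(records, held_out_fraction=0.1, seed=0):
    """Split variant records so that no gene appears in both train and test.

    `records` is assumed to be a list of dicts with a "gene" key
    (hypothetical layout, chosen only for illustration).
    """
    # Shuffle the unique gene names, not the individual variants.
    genes = sorted({r["gene"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(genes)

    # Hold out a fraction of genes (at least one).
    n_test = max(1, round(len(genes) * held_out_fraction))
    test_genes = set(genes[:n_test])

    train = [r for r in records if r["gene"] not in test_genes]
    test = [r for r in records if r["gene"] in test_genes]
    return train, test


# Toy data: 20 made-up genes with 5 variants each; scores are placeholders.
records = [
    {"gene": f"GENE{g}", "variant": f"A{p}G", "score": 0.0}
    for g in range(20)
    for p in range(1, 6)
]

train, test = split_by_gene(records)
```

The point is just that the shuffle happens over genes rather than over variants, so every variant of a held-out gene is unseen during training.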
DMS data is collected one protein at a time, so you can't meaningfully split it at the protein level.