Discussion about this post

Alex Medvedev

We did the same for predicting antifungal resistance arising from mutations, using ESM-2 embeddings and DMS data. When we split the data by position, performance also dropped significantly but stayed above random guessing. I didn't check the ratio of low- to high-variance sites in our data, though. It's a great idea, and I'll try it.

Michel Nivard

Hm, wouldn’t splitting the training data by gene be even fairer: train on 90% of the genes and evaluate on the held-out 10%? The promise of mutation prediction seems to be that we’d generalize the information captured in saturation mutagenesis models beyond the set of genes that have been studied. It would also be very interesting (but absolutely horrendously expensive, unless there are ESM snapshots from partway through pretraining) to plot base training data volume vs. R². I think performance might be related to “volume seen in base training”.
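For concreteness, a gene-level hold-out could look roughly like the sketch below. It is only an illustration under stated assumptions: the record layout `(gene, position, mutant, score)`, the helper name `group_split`, and the toy data are all hypothetical and not taken from the post.

```python
import random

def group_split(records, group_key, test_fraction=0.1, seed=0):
    """Split records so that no group lands in both train and test.

    `group_key` maps a record to its group (e.g. the gene name).
    Hypothetical helper for illustration only.
    """
    # Sort first so the shuffle is reproducible for a given seed.
    groups = sorted({group_key(r) for r in records})
    rng = random.Random(seed)
    rng.shuffle(groups)

    # Hold out a fraction of the groups, at least one.
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])

    train = [r for r in records if group_key(r) not in test_groups]
    test = [r for r in records if group_key(r) in test_groups]
    return train, test

# Toy records (assumed layout): (gene, position, mutant amino acid, score)
records = [
    (f"gene{g}", pos, aa, 0.0)
    for g in range(10)
    for pos in range(1, 6)
    for aa in "AV"
]

# Split by gene: every variant of a held-out gene goes to the test set.
train, test = group_split(records, group_key=lambda r: r[0])
train_genes = {r[0] for r in train}
test_genes = {r[0] for r in test}
```

Passing `group_key=lambda r: (r[0], r[1])` instead would give a split by position within each gene, which is the weaker form of hold-out discussed above.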
