If your random seed is 42 I will come to your office and set your computer on fire🔥

Figuratively. More likely you'll get a stern talking to.

Oct 22, 2025

When you’re as old as I am, old enough to remember that there was a time before the internet, when you had to go to the library to read a book or drop coins into a metal box to make a phone call, you have absorbed a lot of geek lore. So, when you read some tutorial about machine learning or data analysis and you see random.seed(42) you go “haha, that’s funny” and you move on. Until you talk to your much younger students and you realize they all think this is an important line of code that ensures their programs run correctly. They set random seeds to 42 everywhere. They have read the documentation, they know about the random seed option, and they dutifully follow the best practices as laid out everywhere on the internet. The random seed is 42.

I cannot emphasize how bad of a choice this is. 42 was a joke guys. Don’t use 42. Ever. If your random seed is 42 I will come to your office and set your computer on fire.1

Famous slide by Jenny Bryan. I probably don’t have to set your computer on fire because she beat me to it. See also her blog post about project-oriented workflow.

But is this actually a problem? Do people really use 42 that commonly as their random seed? Yes, absolutely. Google “random seed” or “random_state” and the number 42 will pop up among your top search hits. And people may explain where the number 42 comes from (we’ll get to this below), but then they don’t talk much about whether or not this choice is a good idea. In fact, frequently you see statements along the lines of “the random seed is arbitrary, you can use any number you want, so 42 is a fine choice.” This sentence is 100% correct, assuming you’re the only one who uses 42 and you also use it only once in your entire life. Obviously this assumption is not valid. But I rarely see people point this out.

For example, the documentation for the Python machine-learning framework scikit-learn contains a lot of material about random states and various options of controlling them. Everything the documentation says is technically correct, and yet it never discourages you from using 42 as the seed. In fact, the glossary contains this gem:

Popular integer random seeds are 0 and 42.

(I hope I won’t have to explain why 0 is just as bad a choice as 42.)

The number 42 also shows up throughout the documentation, such as in code examples for the train_test_split() function. Using a fixed random seed when splitting data into training and test sets is uniquely bad, as you’re always going to be sampling the same split when you’re re-training your classifier.

Example use of the `train_test_split()` function from scikit-learn, prominently setting `random_state=42`. Taken from the official documentation for version 1.7.2, the latest stable release as of this writing.

But it gets worse. LLMs have learned about 42 and will happily put it into the code they generate. Here is some zero-shot data-analysis code I recently generated. And there it is, random_state=42 (“for reproducibility”).

Screenshot of code snippet featuring `random_state=42`. Taken from this zero-shot output generated by Claude Sonnet 4.5.

It is not surprising that attentive students, who read the documentation, read the blog posts, read the code generated by LLMs, conclude that a random seed of 42 is a good choice, and maybe even a choice that is superior to other options.

To dig deeper into why 42 is not a good choice, and is in fact uniquely bad, we need to look into how random numbers are generated in a computer.

Thanks for reading Genes, Minds, Machines! This post is public so feel free to share it.

What is a random seed?

In modern computing, we need randomness everywhere. If you’re doing machine learning and you need to subdivide your data into training and test sets, that requires a source of randomness. If you’re writing a computer game and you want your NPCs2 to show somewhat interesting and unpredictable behavior, you need randomness. If you’re simulating a physical system, you need randomness. The problem is that it’s rather complicated to generate true random numbers. The sources of true randomness we have available (for example from thermal fluctuations in specific electronics components) are nowhere fast or cheap enough to generate random numbers at the scale needed in modern computing environments.

The solution computer scientists have come up with is the pseudo-random number generator (PRNG). A PRNG is a mathematical algorithm that produces sequences of numbers statistically indistinguishable from random. Importantly, a PRNG will always produce the exact same sequence of numbers when run from the same starting point. The numbers aren’t random at all! But they look random.

So what is the random seed? It is a number that defines the initial state of the PRNG. How exactly we get from the seed to the initial state can be complicated, but the details don’t matter here. What matters is the same seed will always give you the exact same sequence of random numbers.

A second important concept to be aware of is the period length of a PRNG. The period length is the number of random values a PRNG can generate before it starts repeating. All PRNGs repeat eventually. Therefore, it is critical that your PRNG has a period length large enough that it never causes you any trouble. Let’s say you write a large numerical simulation (maybe you’re simulating the weather, or the early universe, or all the atoms in a cell) where you need trillions or more of random numbers. You wouldn’t want the random numbers to repeat during any of your simulation runs. So you need a PRNG with a period length well in excess of the maximum number of random values you may ever need.

One of the most widely used PRNGs is the Mersenne twister. It has a period length of over 10⁶⁰⁰⁰. (The exact value is 2¹⁹⁹³⁷ − 1.) This is an unimaginably large number. To give you a sense of how large it is, for comparison, there are approximately 10⁸⁰ atoms in the universe. This is tiny compared to 10⁶⁰⁰⁰. The period of the Mersenne twister has space for entire universes for every single atom in the universe, and then some. In fact, you could create an entire universe for every atom, and then create another entire universe for every atom in every of the universes you have created, and keep nesting 75 times, and still you wouldn’t run out of room in the period of the Mersenne twister. If you used the Mersenne twister to create nested universes 75 times deep, all these universes inside universes inside other universes would be different from each other.

What is a good choice for your random seed?

As I wrote above: The random seed is arbitrary. You can pick any seed you want. There are no better or worse seeds. (Unless you have a bad PRNG, but let’s ignore this complication.) In principle we could stop here. But in practice it’s a little more complicated.

While the seed is arbitrary, you don’t ever want to reuse a seed. The point of a PRNG is that its output is statistically indistinguishable from random. That’s going to be the case if you’re using a different seed every time. But if you’re reusing seeds, suddenly you have hidden correlation structures. And you may not even be aware of them.

The consequences of reusing random seeds could be benign or disastrous. It depends on the specifics of the situation. Let’s say you’re doing machine learning, and you’re using the train-test splitting code I quoted above, with a fixed random seed. In this case, you’re always splitting the data in exactly the same way. If you’re running this code five times, you’re not actually getting five independent splits, you’re getting the same split five times. At a minimum, that’s going to be wildly underestimating the variance in the performance of your fitted model. And worse consequences are possible if you’re unlucky.

If you’re following so far, and you’re starting to see that reusing random seeds can be bad, you may also realize that double-digit random seeds are bad. There are only 90 different options. If you’re in any way regularly working with random processes, you’ll easily need 90 different options in just a few days.

So, let’s go wild. Let’s use an 8-digit random seed.3 Surely that’ll give us sufficiently many different possibilities for a lifetime. Well, once you ponder it a bit, you’ll see that this doesn’t even give us a separate random sequence for every person in the United States. (The US population is approximately 340 million.) If they were all doing data science, splitting data into training and test, many of them would be using the exact same “random” splits.

The space you can possibly explore with even quite a long random seed is tiny compared to the total number of sequences you would want to have available to you, and which a PRNG such as the Mersenne Twister would certainly support.4 And the random seed 42 is uniquely bad precisely because everybody is using it. The Mersenne Twister has a state space large enough for universes within universes, but every data scientist in the entire world is using the same 10,000 “random” numbers that you get when starting with seed 42.

So we’re all on the same page, here are the first 50 of the “official” random numbers. They have been used millions of times. I encourage you to verify you get the same numbers on your computer.

>>> import random
>>> random.seed(42)
>>> [random.random() for i in range(50)]
[0.6394267984578837, 0.025010755222666936, 0.27502931836911926, 0.22321073814882275, 0.7364712141640124, 0.6766994874229113, 0.8921795677048454, 0.08693883262941615, 0.4219218196852704, 0.029797219438070344, 0.21863797480360336, 0.5053552881033624, 0.026535969683863625, 0.1988376506866485, 0.6498844377795232, 0.5449414806032167, 0.2204406220406967, 0.5892656838759087, 0.8094304566778266, 0.006498759678061017, 0.8058192518328079, 0.6981393949882269, 0.3402505165179919, 0.15547949981178155, 0.9572130722067812, 0.33659454511262676, 0.09274584338014791, 0.09671637683346401, 0.8474943663474598, 0.6037260313668911, 0.8071282732743802, 0.7297317866938179, 0.5362280914547007, 0.9731157639793706, 0.3785343772083535, 0.552040631273227, 0.8294046642529949, 0.6185197523642461, 0.8617069003107772, 0.577352145256762, 0.7045718362149235, 0.045824383655662215, 0.22789827565154686, 0.28938796360210717, 0.0797919769236275, 0.23279088636103018, 0.10100142940972912, 0.2779736031100921, 0.6356844442644002, 0.36483217897008424]

Should you explicitly set your random seed?

Why choose a specific random seed at all? Is this actually a good idea? In general, I think the answer is no. In my opinion, you’re typically better off using a random5 random seed and sampling a broader space of possibilities than picking your own random seed and risking that you’re drawing invalid conclusions from your analysis. However, there are of course specific situations in which picking a random seed is appropriate, or even required. Let’s discuss those.

Most importantly, in simulation studies, it can be helpful to start all simulations with a different but defined random seed, so that every single simulation run can be reproduced if necessary. In these types of situations, a good strategy is to take some arbitrary large integer number, say 9427385,6 and then add to it the number of the replicate you’re running. So, simulation i would have seed 9427385 + i. This requires a bit of work to get right, as you should never reuse any random seeds among simulations, but it can be useful for tracking down weird behaviors that may occur only occasionally.

Related to this point, when you’re coding a complex stochastic simulation, you may encounter bugs that happen only very rarely. Your simulation may work just fine most of the time, but every few thousand runs or so it crashes. This type of bug can be difficult to investigate. The first step is usually to identify a random seeds that reliably triggers the bug. If you can find a seed that triggers it early in the simulation run that’s even better.

Another scenario in which you might want to use defined random seeds is when you have random choices that you explicitly want to reuse multiple times. For example, if you’re doing machine learning, and you’re comparing two different models, you may want to fit them to the exact same collection of training/test splits. In this situation, you could, for example, pick ten random seeds, use each to generate one training/test split, and fit each model to each of the ten splits.

Finally, when making visualizations that contain random scatter or other elements that are randomly chosen, it can be helpful to play around with the random seed until the scatter looks pleasing. When I wrote my book on data visualization I used this technique quite frequently, for example in this chapter.

Where does 42 come from anyways?

Now that you know everything there is to know about random seeds, let’s go back to the number 42. Where does it come from? Why do people use this number in particular? The culprit is Douglas Adams’ The Hitchhiker’s Guide to the Galaxy, a humorous, quirky science fiction novel. The book was very popular in the 1980s and 1990s, was adapted several times for radio and TV, and reached broad audiences around the world. If you’ve never read it or seen any of the adaptations, I would encourage you to check it out. Just be prepared for rather strange humor.7

In the book, we learn about some hyper-intelligent, pan-dimensional beings that built a massive computer called Deep Thought, with the specific purpose to discover the answer to “the ultimate question of life, the universe, and everything.” After several million years of computing, Deep Thought reveals that the answer is 42. When the beings don’t understand what to do with this answer, Deep Thought tells them that to understand the answer they have to figure out what exactly the question is.8 You can see a film adaptation of this scene here.

The number 42 has since turned into a meme. When somebody asks an extremely broad question, or a question that goes after deep philosophical topics such as the meaning of life, people like to respond with 42. This meme is so popular it has its own Wikipedia article. This is all very funny,9 but none of this makes 42 a good random seed.

Updates

This post generated a number of good responses. I will collect the most helpful or interesting ones here.

First, Nick Bailey pointed out a good solution to the randomness/reproducibility conundrum (where reusing seeds ruins randomness but using true random starting points ruins reproducibility): Generate a genuinely random random seed when you start up your code, write it into a log file, and then seed your random number generator. This gives you the best of both worlds.

A somewhat similar idea: Use the current date as random seed. This at a minimum gives you a different seed each day, while also avoiding the problem of appearing to have fished for the seed that gives you desired results. The devil is in the details though of how exactly you convert the date into a seed. See this post by Stephen Turner and the subsequent replies:

Thanks to Colin Gillespie over on BlueSky, I have learned how to do code searches on GitHub. So now I can report that, as of this writing, there are 496k cases of random_state=42 on GitHub.

Finally, an issue to be aware of if you’re using Matlab: It uses a fixed random seed every time, so the generated random numbers are always the same in a fresh Matlab session:

You can read more about this on the Matlab blog.

I am channeling Jenny Bryan here, who made similar statements about some widely popular, bad practices in R coding.

NPC = Non-player character. NPCs are all the elements of a game that do something autonomously, not directed by a human playing the game. For example, the monsters are usually NPCs.

I.e., any integer between 10,000,000 and 99,999,999. There 90 million possible choices in this range of numbers.

Most programming languages limit you to 32-bit integers as seed values. That’s equivalent to 4.3 billion different options, not even enough to give every living person on the planet their own sequence of random numbers.

For most programming languages, if you don’t specify a random seed, the language uses true randomness to set the initial state. The random initial state will be derived from a hardware random number generator (if available) or from the current time otherwise. This initial state will be different every time you start up your programming environment.

Do I have to say it? Don’t pick 9427385. There’s nothing special about it. I just pressed some number keys and this is what came out.

I have always preferred Adams’ novels Dirk Gently’s Holistic Detective Agency and The Long Dark Tea-Time of the Soul over the Hitchhiker’s Guide series. But the Hitchhiker’s Guide series is good, in particular the first two books. The strange humor is present in all of Adams’ writing, though.

This then leads to them building another, even bigger computer, which turns out to be all of Earth, and that’s an important component of the storyline in the book. But that’s not relevant to our discussion here, which is about random seeds.

Or not. Again, Douglas Adams’ humor was a bit weird, and always trending towards rather silly.

Mario Pasquato

Oct 22

Come on, it is well known that you have to change your random seed until your results become significant, so obviously it can’t always be 42. /s

Expand full comment

2 replies by Claus Wilke and others

Eurydice

Please let me know if you ever write a programming for biologists textbook; I learned more from this blog post than I have in multiple days of vibe coding.

3 replies by Claus Wilke and others

7 more comments...

Genes, Minds, Machines

Discussion about this post

Ready for more?