I was working on a machine learning project that has a very large file of training data. I wanted to shuffle the file into a random order. Shuffling a small file is not a problem: you read the entire file into a List collection in memory, shuffle the list of lines, then write the shuffled lines to a destination file.
But if the source file is too large to fit into memory, you have to use a different approach. There are dozens of ways to shuffle a very large file. One approach is:
break large file into several small files
foreach small file
read file into memory as list
shuffle list of lines
write lines to small shuffled file
end-foreach
open all small shuffled files for reading
open destination file for writing
loop until all lines used
select a small shuffled file at random
read a line from selected file
write the line to destination file
end-loop
The idea is simple, but there are a lot of details to take care of, and there are a lot of low-level design alternatives.
I coded up a demo. The source file has the 50 U.S. states in alphabetical order. I broke the source into 4 small files, shuffled them, then merged the small shuffled files.
Shuffling a large text file is an example of, “Oh, that’s just an engineering problem.” When I work on machine learning and AI projects with colleagues and we’re discussing techncial issues, when we run into a problem such as, “OK, at this point we should shuffle the huge training data file.” Young and inexperienced colleagues will often start to discuss the details of the implementation. But experienced members of the team will often say something like, “That’s just an engineering problem so let’s move on.”
The idea is that there are some problems, like shuffling a huge text file, that are conceptually simple and so you know such problems can be solved. You don’t want to necessarily spend a lot of time thinking about such problems in a team meeting. Of course, such just-engineering problems can be very difficult to implement.

Making clothes from Legos is just engineering. But there are many implementation and design choices.
Code below.
# file_shuffle.py
import numpy as np
def main():
print("\nBegin file shuffle demo ")
np.random.seed(1)
num_subfiles = 4
src = "us_states.txt"
src_root= "us_states"
dest = "us_states_shuffled.txt"
# 1. break src file into several small files
# open all small subfiles for writing
f_write_handles = np.empty(num_subfiles, dtype=np.object)
for i in range(num_subfiles):
fh = open(src_root + "_" + str(i) + ".txt", "w")
f_write_handles[i] = fh
# traverse source, write to small subfiles
f_in = open(src, "r")
idx = 0
for linestr in f_in:
f_write_handles[idx].write(linestr)
idx += 1
if idx == num_subfiles: idx = 0
f_in.close()
for i in range(num_subfiles):
f_write_handles[i].close()
# 2. read small files one at a time, shuffle, write to disk
for i in range(num_subfiles):
# 2a. read a small file into memory, shuffle
lines = []
f_in = open(src_root + "_" + str(i) + ".txt", "r")
for linestr in f_in:
lines.append(linestr)
f_in.close()
np.random.shuffle(lines)
# 2b. write to small shuffled file
f_out = open(src_root + "_shuffled_" + str(i) + ".txt", "w")
for i in range(len(lines)):
f_out.write(lines[i])
f_out.close()
# 3. merge shuffled subfiles
# open all shuffled small files
f_read_handles = np.empty(num_subfiles, dtype=np.object)
for i in range(num_subfiles): # open all small shuffled files
fh = open(src_root + "_shuffled_" + str(i) + ".txt", "r")
f_read_handles[i] = fh
f_out = open(dest, "w")
more_to_read = True # priming
while more_to_read == True:
more_to_read = False # assume all used
for i in range(num_subfiles):
if f_read_handles[i].closed == False:
more_to_read = True
ix = np.random.randint(0, num_subfiles) # random small file
if f_read_handles[ix].closed == True: # try another file
continue
else:
linestr = f_read_handles[ix].readline() # read small file
if linestr == "": # hit eof
f_read_handles[ix].close()
else:
f_out.write(linestr)
f_out.close()
for i in range(num_subfiles): # check all closed
if f_read_handles[i].closed == False:
f_read_handles.close()
# manually or programmatically delete small subfiles
print("\nEnd demo ")
if __name__ == "__main__":
main()


.NET Test Automation Recipes
Software Testing
SciPy Programming Succinctly
Keras Succinctly
R Programming
2026 Visual Studio Live
2025 Summer MLADS Conference
2026 DevIntersection Conference
2025 Machine Learning Week
2025 Ai4 Conference
2026 G2E Conference
2026 iSC West Conference
You must be logged in to post a comment.