Shuffling a Very Large Text File into Random Order

I was working on a machine learning project that had a very large file of training data. I wanted to shuffle the file into random order. Shuffling a small file is not a problem: you read the entire file into a list in memory, shuffle the list of lines, then write the shuffled lines to a destination file.
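The small-file case can be sketched in a few lines of Python. This is a minimal sketch; the function name and file names are mine, not part of the demo below:

```python
import random

def shuffle_small_file(src, dest):
  # read the entire file into a list in memory
  with open(src, "r") as f:
    lines = f.readlines()
  random.shuffle(lines)  # in-place Fisher-Yates shuffle
  # write the shuffled lines to the destination file
  with open(dest, "w") as f:
    f.writelines(lines)
```

This only works when the whole file fits in memory, which is exactly the limitation the rest of this post works around.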

But if the source file is too large to fit into memory, you have to use a different approach. There are dozens of ways to shuffle a very large file. One approach is:

  break large file into several small files

  foreach small file
    read file into memory as list
    shuffle list of lines
    write lines to small shuffled file
  end-foreach

  open all small shuffled files for reading
  open destination file for writing
  loop until all lines used
    select a small shuffled file at random
    read a line from selected file
    write the line to destination file
  end-loop

The idea is simple, but there are a lot of details to take care of, and there are a lot of low-level design alternatives.

I coded up a demo. The source file has the 50 U.S. states in alphabetical order. I broke the source into 4 small files, shuffled them, then merged the small shuffled files.

Shuffling a large text file is an example of, “Oh, that’s just an engineering problem.” When I work on machine learning and AI projects with colleagues and we’re discussing technical issues, we sometimes run into a problem such as, “OK, at this point we should shuffle the huge training data file.” Young and inexperienced colleagues will often start to discuss the details of the implementation, but experienced members of the team will often say something like, “That’s just an engineering problem, so let’s move on.”

The idea is that there are some problems, like shuffling a huge text file, that are conceptually simple, so you know they can be solved. You don’t necessarily want to spend a lot of time discussing such problems in a team meeting. Of course, such just-engineering problems can still be very difficult to implement.



Making clothes from Legos is just engineering. But there are many implementation and design choices.


Code below.

# file_shuffle.py

import numpy as np

def main():
  print("\nBegin file shuffle demo ")
  np.random.seed(1)

  num_subfiles = 4
  src = "us_states.txt"
  src_root = "us_states"
  dest = "us_states_shuffled.txt"

  # 1. break src file into several small files
  # open all small subfiles for writing
  f_write_handles = []  # np.object dtype was removed in NumPy 1.24
  for i in range(num_subfiles):
    fh = open(src_root + "_" + str(i) + ".txt", "w")
    f_write_handles.append(fh)

  # traverse source, write to small subfiles
  f_in = open(src, "r")
  idx = 0
  for linestr in f_in:
    f_write_handles[idx].write(linestr)
    idx += 1
    if idx == num_subfiles: idx = 0
  f_in.close() 

  for i in range(num_subfiles):
    f_write_handles[i].close()

  # 2. read small files one at a time, shuffle, write to disk
  for i in range(num_subfiles):
    # 2a. read a small file into memory, shuffle
    lines = []
    f_in = open(src_root + "_" + str(i) + ".txt", "r")
    for linestr in f_in:
      lines.append(linestr)
    f_in.close()
    np.random.shuffle(lines)
    # 2b. write to small shuffled file
    f_out = open(src_root + "_shuffled_" + str(i) + ".txt", "w")
    for j in range(len(lines)):  # j, to avoid shadowing outer i
      f_out.write(lines[j])
    f_out.close()

  # 3. merge shuffled subfiles 
  # open all shuffled small files
  f_read_handles = []  # np.object dtype was removed in NumPy 1.24
  for i in range(num_subfiles):  # open all small shuffled files
    fh = open(src_root + "_shuffled_" + str(i) + ".txt", "r")
    f_read_handles.append(fh)

  f_out = open(dest, "w")  

  more_to_read = True                        # priming
  while more_to_read:
    more_to_read = False                     # assume all used
    for i in range(num_subfiles):
      if not f_read_handles[i].closed:
        more_to_read = True

    ix = np.random.randint(0, num_subfiles)  # random small file
    if f_read_handles[ix].closed:            # try another file
      continue
    linestr = f_read_handles[ix].readline()  # read from small file
    if linestr == "":                        # hit eof
      f_read_handles[ix].close()
    else:
      f_out.write(linestr)

  f_out.close()
  for i in range(num_subfiles):   # check all closed
    if not f_read_handles[i].closed:
      f_read_handles[i].close()
      
  # manually or programmatically delete small subfiles

  print("\nEnd demo ")

if __name__ == "__main__":
  main()
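The comment near the end of the demo notes that the small subfiles can be deleted manually or programmatically. A minimal sketch of the programmatic route, assuming the same file-naming scheme as the demo (the helper name delete_subfiles is mine):

```python
import os

def delete_subfiles(src_root, num_subfiles):
  # remove the plain and shuffled temporary subfiles,
  # e.g. us_states_0.txt and us_states_shuffled_0.txt
  for i in range(num_subfiles):
    for name in (src_root + "_" + str(i) + ".txt",
                 src_root + "_shuffled_" + str(i) + ".txt"):
      if os.path.exists(name):
        os.remove(name)
```

Calling delete_subfiles("us_states", 4) after the merge would clean up all eight temporary files created by the demo.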