A common task when working with machine learning is to split a source data file into a test file (typically 20% of the lines) and a training file (the remaining 80%). There are many ways to do this. When no extra processing is involved (such as normalizing the data), a file-only approach is useful. By this I mean the source file is never read into memory as a whole; it is streamed one line at a time, once to count the lines and once to write them out.
In pseudo-code:
  determine number of source lines
  determine number of train, test items
  generate a random ordering of lines
  create dictionaries that indicate if
    a source line is train or test
  loop each line of source
    if line belongs to train:
      write line to train file
    else:
      write line to test file
  end-loop
As always, the devil is in the details. And there are dozens of design and implementation options. When working with ML, getting data ready is never fun. Never.
# make_train_test.py
# does not read source into memory
# useful when no processing needed

import numpy as np

def file_len(fname):
  # count lines by streaming through the file
  n = 0
  with open(fname, "r") as f:
    for line in f:
      n += 1
  return n  # 0 for an empty file

def main():
  source_file = ".\\source_file.txt"
  train_file = ".\\train_file.txt"
  test_file = ".\\test_file.txt"

  N = file_len(source_file)
  num_train = int(0.80 * N)
  num_test = N - num_train

  np.random.seed(1)  # fixed seed so the split is reproducible
  indices = np.arange(N)  # array [0, 1, . . N-1]
  np.random.shuffle(indices)

  # keys are source line numbers; the values are not used
  train_dict = {}
  test_dict = {}
  for i in range(0, num_train):
    k = indices[i]; v = i
    train_dict[k] = v
  for i in range(num_train, N):
    k = indices[i]; v = i
    test_dict[k] = v

  f_source = open(source_file, "r")
  f_train = open(train_file, "w")
  f_test = open(test_file, "w")

  line_num = 0
  for line in f_source:
    if line_num in train_dict:  # checks for key
      f_train.write(line)
    else:
      f_test.write(line)
    line_num += 1

  f_source.close()
  f_train.close()
  f_test.close()

if __name__ == "__main__":
  main()
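One of the many design alternatives is sketched below. It is not the script above. It is a variant that assumes a single set of training line numbers is enough, because any line not in the set goes to the test file. The names count_lines and split_file and the parameters train_frac and seed are hypothetical. The sketch uses the standard library random module instead of NumPy, so the same seed will not give the same split as the NumPy version.

```python
import random

def count_lines(path):
    # Stream through the file; it is never held in memory as a whole.
    with open(path, "r") as f:
        return sum(1 for _ in f)

def split_file(source_file, train_file, test_file,
               train_frac=0.80, seed=1):
    # Hypothetical helper: writes each source line to exactly one output file.
    n = count_lines(source_file)
    num_train = int(train_frac * n)

    # Pick num_train distinct line numbers for the training file.
    rng = random.Random(seed)
    train_lines = set(rng.sample(range(n), num_train))

    # The with-block closes all three files even if a write fails.
    with open(source_file, "r") as f_src, \
         open(train_file, "w") as f_train, \
         open(test_file, "w") as f_test:
        for line_num, line in enumerate(f_src):
            if line_num in train_lines:
                f_train.write(line)
            else:
                f_test.write(line)

    return num_train, n - num_train
```

A call such as split_file("source_file.txt", "train_file.txt", "test_file.txt") returns the number of lines written to each file. Like the dictionary version, this keeps one integer per training line in memory, so it trades a small amount of memory for not holding the data itself.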