Tokenizing Text So That Spaces, Punctuation, and Contractions Are Retained

I was working on a natural language processing (NLP) project. For reasons that would take too long to explain, I needed to tokenize text so that spaces, punctuation, and contractions are retained. For example, suppose some source text is:

Howdy. How are   you doing #45?? Isn't life great!

Then the desired tokenization is:

['Howdy', '.', ' ', 'How', ' ', 'are', '   ', 'you', ' ',
 'doing', ' ', '#', '45', '??', ' ', "Isn't", ' ', 'life',
 ' ', 'great', '!']

This is a tricky problem. I experimented with three techniques: 1.) use repeated applications of Python’s string split() method, 2.) use a regular expression, 3.) use the NLTK (Natural Language Toolkit) word_tokenize() function. I ended up going with a regular expression.

Using split() requires tons of code. Using word_tokenize() adds a hideously complex dependency. I am not a fan of regular expressions but for my problem scenario it seemed like the best of the three options.
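
To illustrate why split() alone falls short, here is a minimal sketch using only built-in string methods. It is not the split()-based code I experimented with, just a quick look at what a single call does to the sample text. The variable names are made up for the illustration.

```python
text = "Howdy. How are   you doing #45?? Isn't life great!"

# With no argument, split() treats any run of whitespace as one separator
# and throws the whitespace away. Punctuation stays glued to its word.
no_args = text.split()

# With an explicit " " separator, repeated spaces survive only as
# empty strings, not as the original whitespace run.
on_space = "are   you".split(" ")

print(no_args)
print(on_space)
```

Recovering the exact spaces and peeling punctuation off each word would take several more passes over these results, which is where the extra code comes from.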

After a lot of futzing about, I used:

pattern = r"\b\w+(?:'\w+)?\b|[^\w\s]+|\s+"
tokens = re.findall(pattern, text)

The regex r"\b\w+(?:'\w+)?\b|[^\w\s]+|\s+" breaks down as follows:

\b\w+(?:'\w+)?\b: Captures words and contractions.
  \b: Matches a word boundary so only whole words or contractions are matched.
  \w+: Matches one or more word characters (alphanumeric and underscore). This covers regular words.
  (?:'\w+)?: This is a non-capturing group (?:…) that is made optional by the trailing ?.
    ': Matches a literal apostrophe.
    \w+: Matches one or more word characters immediately following the apostrophe (e.g., the t in Isn't or the ll in we'll).
  \b: Matches another word boundary, ensuring the end of the word or contraction.

|: The OR operator.

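[^\w\s]+: This part captures runs of punctuation.
  [^…]: Matches any single character NOT in the specified set.
  \w: Matches word characters.
  \s: Matches whitespace characters.
  +: Matches one or more such characters in a row.

So, [^\w\s]+ matches one or more consecutive characters that are neither word characters nor whitespace. A lone period or comma becomes its own token, while adjacent marks such as ?? are kept together as a single token.

|: The OR operator.

\s+: This part captures runs of whitespace, so a single space and a run of three spaces each come back as one token.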
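
To see which alternative produced each token, one option is to wrap each branch in a named group and read the match object's lastgroup attribute. This is a sketch for inspection only: the group names (word, punct, space) and the classify() helper are names I made up, and the pattern itself is unchanged apart from the added groups.

```python
import re

# Same three alternatives as the original pattern, each wrapped in a
# named group so the matching branch can be identified.
TOKEN_RE = re.compile(
    r"(?P<word>\b\w+(?:'\w+)?\b)"
    r"|(?P<punct>[^\w\s]+)"
    r"|(?P<space>\s+)"
)

def classify(text):
    """Return (branch_name, token) pairs in order of appearance."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

labeled = classify("Isn't #45??  ok")
for kind, tok in labeled:
    print(f"{kind:5s} {tok!r}")
```

Note that re.finditer() is used here instead of re.findall(), because findall() returns a tuple of groups per match once the pattern contains capturing groups.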

An alternative to the [^\w\s] class is an explicit list such as [.,!?;:#], which matches only the listed punctuation marks.
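
One thing to watch with the explicit list: re.findall() silently skips any character that no alternative matches, so punctuation missing from the list disappears from the output. The sketch below compares the two variants on a made-up input. The constant names BROAD and EXPLICIT are just labels for this illustration.

```python
import re

BROAD = r"\b\w+(?:'\w+)?\b|[^\w\s]+|\s+"        # any non-word, non-space run
EXPLICIT = r"\b\w+(?:'\w+)?\b|[.,!?;:#]+|\s+"   # only the listed marks

text = "Hi @you!"

broad_tokens = re.findall(BROAD, text)
explicit_tokens = re.findall(EXPLICIT, text)

print(broad_tokens)     # the @ is kept as its own token
print(explicit_tokens)  # the @ is not in the list, so it is dropped
```

If every character must be retained, the broad class is the safer choice. The explicit list makes sense when unlisted symbols should be filtered out on purpose.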

Whew! Natural language processing can be very tricky.

Natural language processing projects are interesting. Old natural language newspaper headlines can be interesting too.

Demo program:

# tokenize_demo.py

import re  # regular expressions

text = "Howdy. How are   you doing #45?? Isn't life great!"
print("\nSource text: ")
print(text)

pattern = r"\b\w+(?:'\w+)?\b|[^\w\s]+|\s+"
tokens = re.findall(pattern, text)

print("\nTokenized: ")
print(tokens)
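
A quick way to confirm that nothing was lost is to join the tokens back together and compare with the source. The following is a small sketch along those lines. The tokenize() and is_lossless() helpers are names I made up for the check, not part of the demo above.

```python
import re

PATTERN = r"\b\w+(?:'\w+)?\b|[^\w\s]+|\s+"

def tokenize(text):
    """Split text into word, punctuation, and whitespace tokens."""
    return re.findall(PATTERN, text)

def is_lossless(text):
    """True if concatenating the tokens reproduces the text exactly."""
    return "".join(tokenize(text)) == text

sample = "Howdy. How are   you doing #45?? Isn't life great!"
print(tokenize(sample))
print(is_lossless(sample))
```

Because every character is either a word character, whitespace, or something else, the three alternatives together should cover any input, so the round trip is expected to hold in general.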