I was working on a natural language processing (NLP) project. For reasons that would take too long to explain, I needed to tokenize text so that spaces, punctuation, and contractions were retained. For example, suppose some source text is:
Howdy. How are you doing #45?? Isn't life great!
Then the desired tokenization is:
['Howdy', '.', ' ', 'How', ' ', 'are', ' ', 'you', ' ', 'doing', ' ', '#', '45', '??', ' ', "Isn't", ' ', 'life', ' ', 'great', '!']
This is a tricky problem. I experimented with three techniques: (1) repeated applications of Python's string split() function, (2) a regular expression, and (3) the NLTK (Natural Language Toolkit) word_tokenize() function. I ended up going with a regular expression.
Using split() requires tons of code. Using word_tokenize() adds a hideously complex dependency. I am not a fan of regular expressions, but for my problem scenario a regular expression seemed like the best of the three options.
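A minimal sketch of where a single call to split() falls short, using the sample text from above. The variable name naive_tokens is just for illustration.

```python
text = "Howdy. How are you doing #45?? Isn't life great!"

# split() with no arguments splits on runs of whitespace and discards them.
naive_tokens = text.split()
print(naive_tokens)

# The spaces are no longer present as tokens, and punctuation is still
# attached to neighboring words ('Howdy.', '#45??', 'great!').
print("".join(naive_tokens) == text)
```

Getting from this result to the desired tokenization would take further passes to peel off punctuation and re-insert the spaces, which is where the extra code comes from.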
After a lot of futzing about, I used:
pattern = r"\b\w+(?:'\w+)?\b|[^\w\s]+|\s+"
tokens = re.findall(pattern, text)
The regex r"\b\w+(?:'\w+)?\b|[^\w\s]+|\s+" has three alternatives:
\b\w+(?:'\w+)?\b: captures words and contractions.
  \b: Matches a word boundary so only whole words or contractions are matched.
  \w+: Matches one or more word characters (alphanumeric and underscore). This covers regular words.
  (?:'\w+)?: This is a non-capturing group (?:...) that is made optional by the trailing ?.
    ': Matches a literal apostrophe.
    \w+: Matches one or more word characters immediately following the apostrophe (e.g., 't', 'll').
  \b: Matches another word boundary, ensuring the end of the word or contraction.
|: The OR operator.
[^\w\s]+: captures runs of punctuation marks.
  [^...]: Matches any single character NOT in the specified set.
  \w: Matches word characters.
  \s: Matches whitespace characters.
  +: Matches one or more such characters in a row, so ?? comes out as a single token.
|: The OR operator.
\s+: captures runs of one or more whitespace characters, so spaces are retained as tokens.
So, [^\w\s] matches any character that is not a word character and not a whitespace character, effectively capturing punctuation like periods, commas, etc. With the trailing +, a run of adjacent punctuation such as ?? becomes one token rather than several.
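The breakdown can be checked with a short sketch. It writes the same three alternatives using re.VERBOSE so each branch carries its own comment, then compares the result with the desired tokenization. The re.VERBOSE layout and the name desired are choices made here for illustration.

```python
import re

# Same three alternatives as the one-line pattern, spread out with
# re.VERBOSE so whitespace in the pattern is ignored and comments are allowed.
pattern = re.compile(r"""
    \b\w+(?:'\w+)?\b   # a word, optionally followed by an apostrophe suffix
  | [^\w\s]+           # a run of punctuation or other symbols
  | \s+                # a run of whitespace
""", re.VERBOSE)

text = "Howdy. How are you doing #45?? Isn't life great!"
tokens = pattern.findall(text)

desired = ['Howdy', '.', ' ', 'How', ' ', 'are', ' ', 'you', ' ',
           'doing', ' ', '#', '45', '??', ' ', "Isn't", ' ', 'life',
           ' ', 'great', '!']

print(tokens == desired)

# Every character falls into exactly one token, so joining restores the text.
print("".join(tokens) == text)
```

Because the pattern has no capturing groups, findall() returns the full text of each match, which is what makes the join check work.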
An alternative is [.,!?;:#] to match specific punctuation.
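One trade-off of the explicit set is worth seeing in a sketch. The sample string and the variable names below are made up for illustration, and the + after the explicit set is an assumption added to mirror the main pattern. Any character that is outside the set and is not a word or whitespace character matches none of the alternatives, so findall() silently skips it.

```python
import re

text = "Cost: $5 (approx)"

# General punctuation branch versus an explicit punctuation set.
general = r"\b\w+(?:'\w+)?\b|[^\w\s]+|\s+"
explicit = r"\b\w+(?:'\w+)?\b|[.,!?;:#]+|\s+"

general_tokens = re.findall(general, text)
explicit_tokens = re.findall(explicit, text)

print(general_tokens)
print(explicit_tokens)

# '$', '(' and ')' are not in the explicit set, so they are dropped
# and the tokens no longer reassemble into the original text.
print("".join(general_tokens) == text)
print("".join(explicit_tokens) == text)
```

The explicit set is a reasonable choice when unlisted symbols should be discarded on purpose. The general form is the safer one when every character must be retained.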
Whew! Natural language processing can be very tricky.

Natural language processing projects are interesting. Old natural language newspaper headlines can be interesting too.
Demo program:
# tokenize_demo.py
import re # regular expressions
text = "Howdy. How are you doing #45?? Isn't life great!"
print("\nSource text: ")
print(text)
pattern = r"\b\w+(?:'\w+)?\b|[^\w\s]+|\s+"
tokens = re.findall(pattern, text)
print("\nTokenized: ")
print(tokens)
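One caveat, sketched under the assumption that some input text uses a typographic apostrophe (U+2019) instead of the straight one. The pattern only looks for the straight apostrophe, so such a contraction is split into three tokens. The either_quote variant below is a hypothetical extension, not part of the demo.

```python
import re

text = "Isn\u2019t life great!"  # U+2019 is the curly apostrophe

straight_only = r"\b\w+(?:'\w+)?\b|[^\w\s]+|\s+"
# Hypothetical variant: accept either apostrophe inside a contraction.
either_quote = r"\b\w+(?:['’]\w+)?\b|[^\w\s]+|\s+"

split_tokens = re.findall(straight_only, text)
joined_tokens = re.findall(either_quote, text)

print(split_tokens)   # the contraction is broken at the curly apostrophe
print(joined_tokens)  # the contraction stays in one token
```

Both versions still retain every character. They differ only in whether the contraction stays together.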
