Python Programming Tips

Eliminating All Whitespace

One problem that frequently arises in practice is the need to take a string and return a copy of it with all whitespace (leading, trailing, and internal) removed.

Most languages support stripping whitespace from the ends out-of-the-box; for example, Python’s str.strip() method. But fewer languages directly support the complete whitespace removal required here.
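To make the gap concrete, here is a minimal sketch of what str.strip() does and does not do. The sample string and the variable names are made up for illustration.

```python
s = '  hello   world \n'

# strip() trims whitespace from both ends only.
stripped = s.strip()
print(repr(stripped))  # 'hello   world'

# The internal run of spaces is still there, so this is not yet
# the result we are after.
wanted = 'helloworld'
print(stripped == wanted)  # False
```

The rest of this article is about closing that gap.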

According to The Zen of Python, “There should be one-- and preferably only one --obvious way to do it”. But that was written many years ago, and in modern Python it is easy to solve this problem in at least nine different ways.

A Non-Pythonic Approach (1 of 9)

This version shows how someone coming to Python from a procedural language might tackle the problem.

def nows_indexing(s):
    t = ''
    for i in range(len(s)):
        if not s[i].isspace():
            t += s[i]
    return t

Notice that for every non-space character, two index lookups into the string are needed. Also, this approach uses string concatenation (+=). In older Python versions this was very slow, but CPython has long optimized in-place concatenation for the common case where nothing else refers to the string being extended. Even so, in performance terms this weighs in at around 248 units of time on our test machine. This is about 10x slower than the best function we’ll see.

A Python Learner’s Approach (2 of 9)

Python newcomers usually learn how to iterate over collections and strings, so they might create a function like this one.

def nows_iadd(s):
    t = ''
    for c in s:
        if not c.isspace():
            t += c
    return t

In modern Pythons this isn’t quite as bad as some of the other functions, typically taking around 177 units of time.

An Old-Style Functional Approach (3 of 9)

Here’s an approach that might be considered by real old-timers when they first learnt about functional-style programming:

def nows_filter(s):
    return ''.join(filter(lambda c: not c.isspace(), s))

The best that can be said about this is that it is short. Performance-wise it typically takes about 267 units of time and is the slowest of all the examples shown. The best we’ll see is more than 10x faster!

The Classic Approach (4 of 9)

Many textbooks recommend doing this kind of thing by adding each wanted character to a list and then joining the list at the end.

def nows_list_append_join(s):
    t = []
    for c in s:
        if not c.isspace():
            t.append(c)
    return ''.join(t)

Many Python programmers would expect this to be faster than the string concatenation (+=) used in the first two approaches, but this isn’t reliably the case in practice! This one usually takes about 190 units of time, which beats nows_indexing() (248) but is slower than nows_iadd() (177).

The Generator Approach (5 of 9)

Modern Python programmers know how useful generators are. For example:

def nows_generator_join(s):
    return ''.join(c for c in s if not c.isspace())

But it is easy to forget that generators perform best if the processing involved with each iteration costs a lot more than the (tiny) generator overhead. And this isn’t one of those best cases, with a dismal typical performance of about 181 units of time.

The List Comprehension Approach (6 of 9)

List comprehensions build lists in memory. This can be expensive compared with using generators, at least for large lists or where the creation of each element is expensive. But for small lists, list comprehensions can provide good performance.

def nows_list_comp_join(s):
    return ''.join([c for c in s if not c.isspace()])

This typically takes around 154 units of time, comprehensively beating the generator approach. Keep in mind that this is not a generalisable result, so for any given situation it is best to use the timeit module or similar to compare.
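As a rough illustration of such a comparison, the sketch below times the generator and list comprehension versions with timeit. The SAMPLE string, the best_time() helper, and the repeat/number settings are assumptions made for this example, not the test data or harness behind the figures quoted in this article.

```python
import timeit


def nows_generator_join(s):
    return ''.join(c for c in s if not c.isspace())


def nows_list_comp_join(s):
    return ''.join([c for c in s if not c.isspace()])


# Hypothetical test data; substitute strings typical of your own workload.
SAMPLE = '  some text\twith  assorted \n whitespace  ' * 100


def best_time(func, repeat=5, number=200):
    """Return the fastest total time in seconds for `number` calls of func."""
    return min(timeit.repeat(lambda: func(SAMPLE), repeat=repeat, number=number))


if __name__ == '__main__':
    for func in (nows_generator_join, nows_list_comp_join):
        print(f'{func.__name__}: {best_time(func):.4f}s')
```

Taking the minimum of several repeats is the usual way to reduce noise from other activity on the machine. Absolute numbers will differ between machines and Python versions, so only the relative ordering is worth reading.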

Using a Regular Expression (7 of 9)

A simple regular expression can be used to match any run of whitespace. The \s class already matches newlines, so no flags are needed (re.MULTILINE only changes how ^ and $ behave).

import re

NOWS_RX = re.compile(r'\s+')

def nows_re_sub_ws(s):
    return NOWS_RX.sub('', s)

This function is very simple (assuming we understand regular expressions), but has disappointing performance of around 110 units of time.

Using String Translate (8 of 9)

Python’s str class provides a static method called maketrans which can create a “translation” table mapping characters to characters. It is also possible to map characters to None which has the effect of deleting them. Once a translation table has been created it can be used with the str.translate method.

NOWS_TABLE = str.maketrans({' ': None, '\n': None, '\t': None})

def nows_str_translate(s):
    return s.translate(NOWS_TABLE)

Of course in this example we haven’t actually deleted every possible whitespace character, just a few to show how it is done. As for performance, it takes a respectable 80 units of time. (Incidentally, this shoots up to over 130 units if we replace with '' (the empty string) rather than None.)
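If full coverage is wanted, one possible way to build a complete table is to scan every code point once and keep those that str.isspace() reports as whitespace. This is a sketch under that assumption; the names ALL_WS_TABLE and nows_str_translate_all are hypothetical, and the timing quoted above does not apply to it.

```python
import sys

# Built once at import time: map every code point that str.isspace()
# treats as whitespace to None, which tells translate() to delete it.
ALL_WS_TABLE = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if chr(cp).isspace()
}


def nows_str_translate_all(s):
    # Hypothetical variant of nows_str_translate() using the full table.
    return s.translate(ALL_WS_TABLE)


print(nows_str_translate_all(' a\tb\n\u00a0c\u2003'))  # abc
```

The scan covers more than a million code points, so it adds a small one-off cost when the module is loaded. The table itself is small, since only a few dozen code points qualify.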

Using C Without Using C (9 of 9)

The standard Python interpreter is written in C which is amongst the fastest languages available. So it shouldn’t be a surprise that by offloading all the work to Python functions that are implemented in C we get good performance.

def nows_split_join(s):
    return ''.join(s.split())

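This is probably the simplest code of all the examples. Called with no arguments, the str.split() method splits on runs of whitespace and returns a list of the non-whitespace substrings in between, with no empty strings for leading or trailing whitespace. (It is also possible to split on specified characters, but whitespace is the default.) Joining those substrings with the empty string gives the result we want. This typically takes a mere 24 units of time, less than a tenth of that taken by a couple of the earlier functions, and far faster than any of the others.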
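A short sketch of what str.split() returns may help here. The sample strings and variable names are made up for illustration.

```python
s = '  two  words\tand\nmore  '

# No argument: split on runs of any whitespace, dropping empty pieces.
parts = s.split()
print(parts)  # ['two', 'words', 'and', 'more']

# Joining the pieces with '' removes every whitespace character.
print(''.join(parts))  # twowordsandmore

# An explicit separator behaves differently: each single occurrence
# splits, so adjacent separators produce empty strings.
explicit = 'a  b'.split(' ')
print(explicit)  # ['a', '', 'b']
```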

Conclusion

On our test machine using our test data the nows_split_join() function comfortably outperformed every other method we tried for removing all whitespace from strings. Of course, in our real code we just call it nows().

def nows(s):
    return ''.join(s.split())
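Before swapping one implementation for another, it can be worth checking that they agree. The sketch below compares three of the approaches on a handful of strings. The SAMPLES list and the all_agree() helper are assumptions made for this example, and the regular expression version is written inline rather than with a precompiled pattern.

```python
import re


def nows_generator_join(s):
    return ''.join(c for c in s if not c.isspace())


def nows_re_sub_ws(s):
    return re.sub(r'\s+', '', s)


def nows_split_join(s):
    return ''.join(s.split())


# Hypothetical samples: empty, whitespace only, no whitespace,
# mixed ASCII whitespace, and a non-ASCII space (U+2003 EM SPACE).
SAMPLES = ['', '   ', 'nospace', '  a b\tc\nd\r\n', 'em\u2003space']

FUNCS = (nows_generator_join, nows_re_sub_ws, nows_split_join)


def all_agree(sample):
    """Return True if every function gives the same result for sample."""
    return len({func(sample) for func in FUNCS}) == 1


for sample in SAMPLES:
    print(repr(sample), all_agree(sample))
```

A check like this only covers the strings it is given, so it is a sanity test rather than a proof of equivalence.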

For more see Python Programming Tips
