Python Programing Tips

Sorting Heterogeneous Sequences

Sorting Heterogeneous Sequences

It sometimes occurs that we want to sort a sequence (a list, a dict's keys), but where the elements aren’t all of the same type. Normally this would cause Python to raise a TypeError. Here we will review two simple approaches to solving this problem (and a trivial approach that doesn’t really work).

A Trivial Approach that Works Poorly

A trivially simple approach to sort a heterogeneous sequence is like this:

seq = [9, 4, -3, 'Twelve', 17,
       datetime.date(2021, 5, 22),
       '20', '8', 150, '-14',
       datetime.date(1921, 11, 7), 'one', 0]
result = sorted(seq, key=str)
print(result)
# prints (as one line):
#   ['-14', -3, 0, 150, 17,
#    datetime.date(2021, 5, 22), '20',
#    datetime.date(1921, 11, 7), 4, '8', 9,
#    'Twelve', 'one']

Clearly, this produces an unsatisfactory result. Nor could we solve the problem by passing key=int since that will raise a ValueError when 'one' or 'Twelve' or one of the dates is encountered.

A Generic Approach

This first proper approach is completely generic and allows us to sort any mixture of elements of any types providing each element supports < in relation to another element of the same type.

seq = [9, 4, -3, 'Twelve', 17,
       datetime.date(2021, 5, 22),
       '20', '8', 150, '-14',
       datetime.date(1921, 11, 7), 'one', 0]
result = sorted(seq,
                key=lambda x: (x.__class__.__name__, x))
print(result)
# prints (as one line):
#   [datetime.date(1921, 11, 7),
#    datetime.date(2021, 5, 22),
#    -3, 0, 4, 9, 17, 150, '-14',
#    '20', '8', 'Twelve', 'one']

This produces a sane ordering and won’t raise a TypeError because elements are only compared if they are of the same type. So it is a safe approach.

The reason this works is because we return (typename, value) 2-tuples. When Python compares tuples, it starts with the first (0-th value): if these are different, their difference is the comparison’s result; and if they are the same, Python compares the second, and so on, until a difference is found or there’re no more values to compare.

So, because the first value in our 2-tuples is always a type name, when the types differ, that’s sufficient for Python and the < is applied to these names. Only when the types are the same are the values compared (hence this is a safe technique), using their natural <.

Unfortunately, this approach doesn’t provide much control. For example, if we want case-insensitive string comparisons or to sort strings before dates and so on.

A Type-Specific Approach

For this second proper approach we show how to sort elements whose types are of a know set of types: this allows us to sort elements of one particular type before elements of another particular type.

def by_custom(value):
    if isinstance(value, str):
        try: # sort as int if poss.
            return 3, int(value)
        except ValueError:
            return 1, value.lower()
    if isinstance(value, datetime.date):
        return 2, value
    if isinstance(value, int):
        return 3, value
    raise TypeError('can only sort: str int date')

seq = [9, 4, -3, 'Twelve', 17,
       datetime.date(2021, 5, 22),
       '20', '8', 150, '-14',
       datetime.date(1921, 11, 7), 'one', 0]
result = sorted(seq, key=by_custom)
print(result)
# prints (as one line):
#   ['one', 'Twelve', datetime.date(1921, 11, 7),
#    datetime.date(2021, 5, 22),
#    '-14', -3, 0, 4, '8', 9, 17, '20', 150]

Here we have chosen to sort strings before dates before ints (str < date < int). And in addition, if a string’s text is an int we sort it as an int, otherwise we sort the string case-insensitively.

Rather than using type names we’ve used simple ints, 1 for strings, 2 for dates, and so on. This makes it easy to add support for extra types and to get the ordering we want.

Clearly we can easily add additional types and logic to get any ordering of our heterogeneous data that we like.

For more see Python Programming Tips

Top