Python has become one of the most popular programming languages. Its concise and straightforward syntax, as well as the easy-to-use type system, ensure a flat learning curve. Besides C++, Python has become the programming language of choice in the high-energy physics community. The small entry barrier has the inadvertent disadvantage that more low-quality code is contributed to shared codebases.

This article displays a few hand-picked code examples that I’ve encountered in real-live while browsing through other people’s code. The intention is not to show that the authors had little time to work on the code, but to show these pieces and provide a better solution for the problem at hand, hoping that the same mistakes won’t be repeated.

1. Ensure directory

In the first example, the intention was to create the directory if it does not exist already. The solution I’ve seen was

os.system("mkdir -p " + name)  # Never do this!

The above code works in most of the cases but has a couple of fatal flaws.

  • The code can only run on unix-like machines. It will fail on Windows machines.
  • If name contains white spaces, it will create one directory for each space-delimited token. For example, if name = "Hello world" the above line will ensure the existence of the directory Hello and the directory world.
  • This line of Python code opens a huge security hole. Assume the variable is defined with name = "hello_world; touch x". In this case, the os.system call will execute two commands: mkdir and touch. This means we might execute arbitrary and potentially malicious code instead of ensuring the existence of a directory. Please, never do this.

Better solution

In most cases, there is no need to call os.system. Almost all common operations can be called from specialized methods without the inherent security issues of os.system.

from pathlib import Path

Path(name).mkdir(parents=True, exist_ok=True)

For more alternatives, see How can I safely create a nested directory?.

2. Write content of variable to file

In this example, the task was to write the content of a Python variable to a file. The low-quality code to solve this was

os.system("echo " + content + " > some_file")  # Never do this!

The above code works in most of the cases but has a couple of fatal flaws. The points are mostly repetitions from the previous example.

  • The code can only run on unix-like machines. It will fail on Windows machines.
  • Any line breaks in content will be converted to spaces in the output. The reason for this is that the arguments passed to echo are white-space-delimited, i.e., each line break will start a new argument. The arguments are echoed with a simple spaces as separator.
  • This line of Python code opens a huge security hole. Assume the variable is defined with content = "something; touch x; echo otherthing". In this case, the os.system call will execute three commands: echo, touch, and echo. This means we might execute arbitrary and potentially malicious code instead of simply writing to a file. Please, never do this.

Better solution

Python has first-class support of file IO with the reserved keyword open. Again, there is no need for os.system.

with open("some_file", "w") as some_file:
    some_file.write(content)

3. Join list of strings by token

The goal, in this example, is to join a list of strings, e.g, ["a", "b", "hello"] by an other string, e.g., "-" to form the composite string "a-b-hello". The piece of code I’ve come across was

items = ["a", "b", "hello"]
result = items[0]
for i in range(1, len(items)): result += "-"+items[i]

Again, there are several issues with this code.

  • The list of items is usually generated dynamically. If the list is empty, these lines will raise an IndexError instead of producing an empty string.
  • The code is rather lengthy, and it’s not obvious what the code does.
  • The loop iterates over list indices instead of the list items itself which obfuscates the actual algorithm and creates the danger of messing up the indices.
  • The variable result is being modified many times. In general, it is easier to work with constant expressions. Strings in Python are immutable, which in this case means that for every item, a new, intermediate string is created. This can be a performance issue if the list of items is long.

A bit better

The following lines ensure that empty lists are handled correctly, and the loop iterates over the items itself.

items = ["a", "b", "hello"]

result = ""
if items: 
    result = items[0]
    for item in items[1:]:
        result += "-" + item

Best solution

Python provides a method specifically for the task at hand. This method handles empty lists correctly and is very concise and easy to understand. The string "-" should join the items of the list into a composite string.

items = ["a", "b", "hello"]

result = "-".join(items)

4. Delayed imports

Python is frequently used to create command-line interfaces. The usual approach is to define the CLI with argparse after an if __name__ == "__main__": condition. The anti-pattern is as follows.

# Filename: script.py

def some_methods():
    return math.sqrt(2)

# ...

if __name__ == "__main__":
    import re
    import math
    import argparse

    parser = argparse.ArgumentParser()
    # ...

The issue here is that the imports are only executed after the __name__ == "__main__" condition. The whole point of wrapping the CLI in that condition is to allow for imports of the module by other modules. This way, another script can use, for example, some_methods() without interacting with the actual CLI. This can be useful when writing a wrapper script or when automating tasks and workflows.

However, if the imports are wrapped by the condition on __name__, they will not be executed if another module imports them.

import script
x = script.some_methods()  # Fails

Calling methods from the CLI script will fail if they depend on the imports. Here in this example, some_methods cannot find math.

Better solution

# Filename: script.py

import re
import math
import argparse

def some_methods():
    return math.sqrt(2)

# ...

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # ...

5. Lack of unit tests

During my Bachelor studies, I’ve been indoctrinated by the principles of test-driven design. Every time I write a dedicated method, I write test cases before attempting the actual implementation.

While reviewing other people’s code, I found the number of test cases too low.

Better solution

Write unit tests for every unit of code. This makes the output of the program more trustworthy and more reliable.

6. Prefer docstrings over comments

I’ve seen a couple of cases where the intent of a method was documented using Python comments.

def hello():
    # Print "Hello" to the terminal
    print("Hello")

Better solution

Use docstrings. With docstrings, Python’s integrated documentation tools, like help() or pydoc can provide information about the methods without the need for manually digging into the source code.

def hello():
    """Print "Hello" to the terminal"""
    print("Hello")

7. No Python 3

A lot of code in HEP community was written in Python 2 and there are large legacy systems running on Python 2. However, the support for Python 2 ended almost one year ago, and there is still a lot of new code written exclusively for Python 2.

Better solution

Write new software in Python 3. Potentially migrate existing codebase to Python 3.