Frank Sauerburger

Character encoding issues

2024-10-09T00:00:00+02:00

If you’re parsing text or processing old non-English websites, you’ve probably encountered strings like Ã¤. These strings are symptoms of character encoding issues. This article summarizes common encoding errors and what they probably meant. Use the document as a reference to quickly identify the encoding issue you’re facing. Use CTRL+F to search for the symptom you’re seeing.

The selection of issues is highly subjective. Send me an e-mail if you think something is missing.

Symptom	Original	Original encoding	Wrongly decoded as
Ã„	Ä	UTF-8	ISO-8859-1
Ã–	Ö	UTF-8	ISO-8859-1
Ãœ	Ü	UTF-8	ISO-8859-1
Ã‰	É	UTF-8	ISO-8859-1
Ã€	À	UTF-8	ISO-8859-1
Ãˆ	È	UTF-8	ISO-8859-1
Ã™	Ù	UTF-8	ISO-8859-1
Ã‡	Ç	UTF-8	ISO-8859-1
Ã‚	Â	UTF-8	ISO-8859-1
ÃŠ	Ê	UTF-8	ISO-8859-1
ÃŽ	Î	UTF-8	ISO-8859-1
Ã”	Ô	UTF-8	ISO-8859-1
Ã›	Û	UTF-8	ISO-8859-1
Ã‹	Ë	UTF-8	ISO-8859-1
Ã	Ï	UTF-8	ISO-8859-1
Ã	Á	UTF-8	ISO-8859-1
Ã	Í	UTF-8	ISO-8859-1
Ã‘	Ñ	UTF-8	ISO-8859-1
Ã“	Ó	UTF-8	ISO-8859-1
Ãš	Ú	UTF-8	ISO-8859-1
Ã¤	ä	UTF-8	ISO-8859-1
Ã¶	ö	UTF-8	ISO-8859-1
Ã¼	ü	UTF-8	ISO-8859-1
ÃŸ	ß	UTF-8	ISO-8859-1
Ã©	é	UTF-8	ISO-8859-1
Ã	à	UTF-8	ISO-8859-1
Ã¨	è	UTF-8	ISO-8859-1
Ã¹	ù	UTF-8	ISO-8859-1
Ã§	ç	UTF-8	ISO-8859-1
Ã¢	â	UTF-8	ISO-8859-1
Ãª	ê	UTF-8	ISO-8859-1
Ã®	î	UTF-8	ISO-8859-1
Ã´	ô	UTF-8	ISO-8859-1
Ã»	û	UTF-8	ISO-8859-1
Ã«	ë	UTF-8	ISO-8859-1
Ã¯	ï	UTF-8	ISO-8859-1
Ã¡	á	UTF-8	ISO-8859-1
Ã	í	UTF-8	ISO-8859-1
Ã±	ñ	UTF-8	ISO-8859-1
Ã³	ó	UTF-8	ISO-8859-1
Ãº	ú	UTF-8	ISO-8859-1
Â¡	¡	UTF-8	ISO-8859-1
Â¿	¿	UTF-8	ISO-8859-1
â€™	’	UTF-8	ISO-8859-1
â€“	–	UTF-8	ISO-8859-1
â€”	—	UTF-8	ISO-8859-1
ÿþÄ	Ä	UTF-16	ISO-8859-1
ÿþÖ	Ö	UTF-16	ISO-8859-1
ÿþÜ	Ü	UTF-16	ISO-8859-1
ÿþÉ	É	UTF-16	ISO-8859-1
ÿþÀ	À	UTF-16	ISO-8859-1
ÿþÈ	È	UTF-16	ISO-8859-1
ÿþÙ	Ù	UTF-16	ISO-8859-1
ÿþÇ	Ç	UTF-16	ISO-8859-1
ÿþÂ	Â	UTF-16	ISO-8859-1
ÿþÊ	Ê	UTF-16	ISO-8859-1
ÿþÎ	Î	UTF-16	ISO-8859-1
ÿþÔ	Ô	UTF-16	ISO-8859-1
ÿþÛ	Û	UTF-16	ISO-8859-1
ÿþË	Ë	UTF-16	ISO-8859-1
ÿþÏ	Ï	UTF-16	ISO-8859-1
ÿþÁ	Á	UTF-16	ISO-8859-1
ÿþÍ	Í	UTF-16	ISO-8859-1
ÿþÑ	Ñ	UTF-16	ISO-8859-1
ÿþÓ	Ó	UTF-16	ISO-8859-1
ÿþÚ	Ú	UTF-16	ISO-8859-1
ÿþä	ä	UTF-16	ISO-8859-1
ÿþö	ö	UTF-16	ISO-8859-1
ÿþü	ü	UTF-16	ISO-8859-1
ÿþß	ß	UTF-16	ISO-8859-1
ÿþé	é	UTF-16	ISO-8859-1
ÿþà	à	UTF-16	ISO-8859-1
ÿþè	è	UTF-16	ISO-8859-1
ÿþù	ù	UTF-16	ISO-8859-1
ÿþç	ç	UTF-16	ISO-8859-1
ÿþâ	â	UTF-16	ISO-8859-1
ÿþê	ê	UTF-16	ISO-8859-1
ÿþî	î	UTF-16	ISO-8859-1
ÿþô	ô	UTF-16	ISO-8859-1
ÿþû	û	UTF-16	ISO-8859-1
ÿþë	ë	UTF-16	ISO-8859-1
ÿþï	ï	UTF-16	ISO-8859-1
ÿþá	á	UTF-16	ISO-8859-1
ÿþí	í	UTF-16	ISO-8859-1
ÿþñ	ñ	UTF-16	ISO-8859-1
ÿþó	ó	UTF-16	ISO-8859-1
ÿþú	ú	UTF-16	ISO-8859-1
ÿþ¡	¡	UTF-16	ISO-8859-1
ÿþ¿	¿	UTF-16	ISO-8859-1
ÿþ	’	UTF-16	ISO-8859-1
ÿþ	–	UTF-16	ISO-8859-1
ÿþ	—	UTF-16	ISO-8859-1
蓃	Ä	UTF-8	UTF-16
雃	Ö	UTF-8	UTF-16
鳃	Ü	UTF-8	UTF-16
觃	É	UTF-8	UTF-16
胃	À	UTF-8	UTF-16
裃	È	UTF-8	UTF-16
駃	Ù	UTF-8	UTF-16
蟃	Ç	UTF-8	UTF-16
苃	Â	UTF-8	UTF-16
諃	Ê	UTF-8	UTF-16
軃	Î	UTF-8	UTF-16
铃	Ô	UTF-8	UTF-16
鯃	Û	UTF-8	UTF-16
诃	Ë	UTF-8	UTF-16
迃	Ï	UTF-8	UTF-16
臃	Á	UTF-8	UTF-16
跃	Í	UTF-8	UTF-16
釃	Ñ	UTF-8	UTF-16
鏃	Ó	UTF-8	UTF-16
髃	Ú	UTF-8	UTF-16
꓃	ä	UTF-8	UTF-16
뛃	ö	UTF-8	UTF-16
볃	ü	UTF-8	UTF-16
鿃	ß	UTF-8	UTF-16
꧃	é	UTF-8	UTF-16
ꃃ	à	UTF-8	UTF-16
ꣃ	è	UTF-8	UTF-16
맃	ù	UTF-8	UTF-16
ꟃ	ç	UTF-8	UTF-16
ꋃ	â	UTF-8	UTF-16
꫃	ê	UTF-8	UTF-16
껃	î	UTF-8	UTF-16
듃	ô	UTF-8	UTF-16
믃	û	UTF-8	UTF-16
ꯃ	ë	UTF-8	UTF-16
꿃	ï	UTF-8	UTF-16
ꇃ	á	UTF-8	UTF-16
귃	í	UTF-8	UTF-16
뇃	ñ	UTF-8	UTF-16
돃	ó	UTF-8	UTF-16
뫃	ú	UTF-8	UTF-16
ꇂ	¡	UTF-8	UTF-16
뿂	¿	UTF-8	UTF-16
Ă„	Ä	UTF-8	Windows 1250
Ă–	Ö	UTF-8	Windows 1250
Ăś	Ü	UTF-8	Windows 1250
Ă‰	É	UTF-8	Windows 1250
Ă€	À	UTF-8	Windows 1250
Ă™	Ù	UTF-8	Windows 1250
Ă‡	Ç	UTF-8	Windows 1250
Ă‚	Â	UTF-8	Windows 1250
ĂŠ	Ê	UTF-8	Windows 1250
ĂŽ	Î	UTF-8	Windows 1250
Ă”	Ô	UTF-8	Windows 1250
Ă›	Û	UTF-8	Windows 1250
Ă‹	Ë	UTF-8	Windows 1250
ĂŹ	Ï	UTF-8	Windows 1250
ĂŤ	Í	UTF-8	Windows 1250
Ă‘	Ñ	UTF-8	Windows 1250
Ă“	Ó	UTF-8	Windows 1250
Ăš	Ú	UTF-8	Windows 1250
Ă¤	ä	UTF-8	Windows 1250
Ă¶	ö	UTF-8	Windows 1250
ĂĽ	ü	UTF-8	Windows 1250
Ăź	ß	UTF-8	Windows 1250
Ă©	é	UTF-8	Windows 1250
Ă	à	UTF-8	Windows 1250
Ă¨	è	UTF-8	Windows 1250
Ăą	ù	UTF-8	Windows 1250
Ă§	ç	UTF-8	Windows 1250
Ă˘	â	UTF-8	Windows 1250
ĂŞ	ê	UTF-8	Windows 1250
Ă®	î	UTF-8	Windows 1250
Ă´	ô	UTF-8	Windows 1250
Ă»	û	UTF-8	Windows 1250
Ă«	ë	UTF-8	Windows 1250
ĂŻ	ï	UTF-8	Windows 1250
Ăˇ	á	UTF-8	Windows 1250
Ă	í	UTF-8	Windows 1250
Ă±	ñ	UTF-8	Windows 1250
Ăł	ó	UTF-8	Windows 1250
Ăş	ú	UTF-8	Windows 1250
Âˇ	¡	UTF-8	Windows 1250
Âż	¿	UTF-8	Windows 1250
â€™	’	UTF-8	Windows 1250
â€“	–	UTF-8	Windows 1250
â€”	—	UTF-8	Windows 1250
√Ñ	Ä	UTF-8	Mac Roman
√ñ	Ö	UTF-8	Mac Roman
√ú	Ü	UTF-8	Mac Roman
√â	É	UTF-8	Mac Roman
√Ä	À	UTF-8	Mac Roman
√à	È	UTF-8	Mac Roman
√ô	Ù	UTF-8	Mac Roman
√á	Ç	UTF-8	Mac Roman
√Ç	Â	UTF-8	Mac Roman
√ä	Ê	UTF-8	Mac Roman
√é	Î	UTF-8	Mac Roman
√î	Ô	UTF-8	Mac Roman
√õ	Û	UTF-8	Mac Roman
√ã	Ë	UTF-8	Mac Roman
√è	Ï	UTF-8	Mac Roman
√Å	Á	UTF-8	Mac Roman
√ç	Í	UTF-8	Mac Roman
√ë	Ñ	UTF-8	Mac Roman
√ì	Ó	UTF-8	Mac Roman
√ö	Ú	UTF-8	Mac Roman
√§	ä	UTF-8	Mac Roman
√∂	ö	UTF-8	Mac Roman
√º	ü	UTF-8	Mac Roman
√ü	ß	UTF-8	Mac Roman
√©	é	UTF-8	Mac Roman
√†	à	UTF-8	Mac Roman
√®	è	UTF-8	Mac Roman
√π	ù	UTF-8	Mac Roman
√ß	ç	UTF-8	Mac Roman
√¢	â	UTF-8	Mac Roman
√™	ê	UTF-8	Mac Roman
√Æ	î	UTF-8	Mac Roman
√¥	ô	UTF-8	Mac Roman
√ª	û	UTF-8	Mac Roman
√´	ë	UTF-8	Mac Roman
√Ø	ï	UTF-8	Mac Roman
√°	á	UTF-8	Mac Roman
√≠	í	UTF-8	Mac Roman
√±	ñ	UTF-8	Mac Roman
√≥	ó	UTF-8	Mac Roman
√∫	ú	UTF-8	Mac Roman
¬°	¡	UTF-8	Mac Roman
¬ø	¿	UTF-8	Mac Roman
‚Äô	’	UTF-8	Mac Roman
‚Äì	–	UTF-8	Mac Roman
‚Äî	—	UTF-8	Mac Roman

I’ve used the following Python script to create the table.

pairs = [
    ("UTF-8", "ISO-8859-1"),
    ("UTF-16", "ISO-8859-1"),
    ("UTF-8", "UTF-16"),
    ("UTF-8", "Windows 1250"),
    ("UTF-8", "Mac Roman"),
]
chars = "ÄÖÜÉÀÈÙÇÂÊÎÔÛËÏÁÍÑÓÚäöüßéàèùçâêîôûëïáíñóú¡¿’–—"

for orig, wrong in pairs:
    for c in chars:
        try:
            enc = c.encode(orig).decode(wrong)
        except UnicodeDecodeError:
            continue
        print("| " + "".join(f"&#{ord(d):d};" for d in enc) + f" | {c} | {orig} | {wrong} |")
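The table can also be read in reverse. Once a row tells you which pair of encodings is involved, re-encoding the garbled string with the wrongly assumed encoding and decoding it with the original one often restores the text. The following is a minimal sketch; the helper name `repair` is my own, and the round trip only works if no bytes were dropped or replaced along the way.

```python
def repair(symptom: str, wrong: str, orig: str) -> str:
    """Undo a wrong decode: recover the raw bytes, then decode them properly."""
    return symptom.encode(wrong).decode(orig)


print(repair("Ã¤", "iso-8859-1", "utf-8"))  # ä
print(repair("√§", "mac_roman", "utf-8"))   # ä
```

If the round trip raises a `UnicodeEncodeError` or `UnicodeDecodeError`, the assumed pair of encodings is probably not the right one.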

Deutsche Bahn delay

2024-09-11T00:00:00+02:00

Deutsche Bahn is not known for its punctuality. Since the Euro 2024, it has been known for its delays. I’m a frequent train commuter and happened to ride on an ICE on September 2, 2024 that was scheduled to arrive at 7:47. When the train left my home station earlier that morning, it was delayed due to technical difficulties on the train. The actual arrival was 7:54. While disembarking the train, I took the following photo.

Note that the screen shows the time 00:37 and announces that the train arrives in 2457017 minutes. That’s more than four years from now! What a bummer!
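As a quick sanity check of that figure, the number of minutes can be converted to days and years. The year length of 365.25 days is an approximation I'm using for illustration.

```python
from datetime import timedelta

delay = timedelta(minutes=2457017)
years = delay / timedelta(days=365.25)  # average year length, an approximation

print(delay)            # 1706 days, 6:17:00
print(round(years, 2))  # 4.67
```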

Ok, let’s compute backward.

The train arrived at 7:54 on September 2, 2024, Berlin daylight saving time, or 5:54 UTC.

When was 2457017 minutes earlier? Let’s find out with Python.

In [1]: from datetime import datetime, timedelta

In [2]: from zoneinfo import ZoneInfo

In [3]: berlin = ZoneInfo("Europe/Berlin")

In [4]: utc = ZoneInfo("UTC")

In [5]: arrival = datetime(2024, 9, 2, 7, 54, tzinfo=berlin).astimezone(utc)

In [6]: arrival.isoformat()
Out[6]: '2024-09-02T05:54:00+00:00'

In [7]: arrival_in = timedelta(minutes=2457017)

In [8]: train_time = arrival - arrival_in

In [9]: train_time.astimezone(berlin).isoformat()
Out[9]: '2020-01-01T00:37:00+01:00'

The train clock was set to 00:37 on January 1, 2020.

It’s nice to see that this matches the time displayed on the screen, but it is from more than four years and eight months ago.

Not so fun fact: If you assumed everything will be fine as long as you stick to timezone-aware objects, you’ll have a bad time: timedelta arithmetic across a daylight saving time transition yields unexpected results.

In [1]: from datetime import datetime, timedelta

In [2]: from zoneinfo import ZoneInfo

In [3]: berlin = ZoneInfo("Europe/Berlin")

In [4]: sat = datetime(2024, 10, 26, 21, 0, 0, tzinfo=berlin)

In [5]: sat.isoformat()
Out[5]: '2024-10-26T21:00:00+02:00'

In [6]: twelve_hours = timedelta(hours=12)

In [7]: (sat + twelve_hours).isoformat()
Out[7]: '2024-10-27T09:00:00+01:00'

The initial Saturday date is in the +02:00 timezone, i.e., daylight saving time (DST), while the 12-hour shifted date is in +01:00, i.e., standard time. Python adds the timedelta to the wall-clock time and ignores the change in the UTC offset. However, a person starting a stopwatch at 9 pm on Saturday (DST) and stopping it at 9 am on Sunday (standard time) would observe that 13 hours elapsed.
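One way to get the stopwatch behavior is to do the arithmetic in UTC and convert back to local time only for display. The sketch below reuses the dates from the example; the variable names are mine.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

berlin = ZoneInfo("Europe/Berlin")
sat = datetime(2024, 10, 26, 21, 0, tzinfo=berlin)
sun = datetime(2024, 10, 27, 9, 0, tzinfo=berlin)

# Both operands share the same tzinfo object, so Python subtracts wall-clock values.
wall_clock = sun - sat

# Converting to UTC first yields the time that actually elapsed.
elapsed = sun.astimezone(timezone.utc) - sat.astimezone(timezone.utc)

# Adding 12 real hours: add in UTC, then convert back to Berlin time.
later = (sat.astimezone(timezone.utc) + timedelta(hours=12)).astimezone(berlin)

print(wall_clock)         # 12:00:00
print(elapsed)            # 13:00:00
print(later.isoformat())  # 2024-10-27T08:00:00+01:00
```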

AI infrastructure at reasonable scale

2024-08-14T00:00:00+02:00

From data analysis in Jupyter Notebooks to production applications: in this blog post, I’d like to introduce ideas for an AI infrastructure at a reasonable scale, to bridge the gap between doing data analysis with artificial intelligence (AI) and machine learning (ML) in a Jupyter Notebook and building production applications.

So, let’s start with a quote by Michael Dell:

We are unleashing this super genius power. Everyone is going to have access to this technology […]

I think the key question here is how we define “having access” to the technology and what it entails. Will it be access to a cloud API with a pay-as-you-go subscription model or will it be possible for us, as developers, to build our own applications with our own models, fine-tuned models, and even on our own hardware? At the moment, most resources you find on the internet about building an AI application rely on a cloud API.

I did a highly subjective and biased Google search. Don’t quote me on the numbers as the results are influenced by my search history. I used various combinations of the keywords: tutorial, Python, web app, FastAPI, AI, and LangChain.

60 % of the results relied on a cloud API or even closed-weight cloud models,
40 % didn’t build an application but rather illustrated a general data science workflow, with exploratory data analysis, while only
10 % built a complete application, mostly still limited to a single Python module.

The percentages don’t add up to 100 %, since I’m not dealing with exclusive categories. However, with these numbers, you see a bias towards cloud API models. Most importantly, two things are generally not covered, namely,

How to use models locally on-premise, and
How to use your own models, trained specifically for your domain.

I’m not arguing that this information is not available, but it’s almost drowned out by the wealth of information for cloud API models. The purpose of this article is to help counteract this tendency.

Cloud or on-premise

Different people have answered the question of whether to build an application in the cloud or on-premise differently over the years. To be clear, there is no one true answer. It depends on the context, the requirements, and the constraints. Both sides have pros and cons. I will list a few benefits for each solution given an AI, ML, and data science context.

Doing it in the cloud

Advantages of building an AI application in the cloud include:

The cloud offers managed services, which, if used, imply less maintenance work and cost.
Building an AI application in the cloud means you get access to closed-weight models that are not accessible otherwise, like GPT-4o.
In the cloud you pay as you go, so during times when you don’t use services or only to a small extent in terms of volume and time, you pay less.
The initial cost to get started is small.
Modern cloud providers allow you and encourage you to define the entire infrastructure as code which has the advantage of being archivable and versionable.
If developed appropriately, applications in the cloud have unprecedented ability to scale on demand.

Doing it on-premise

Building applications on-premise doesn’t come with a vendor lock-in. You can switch software and hardware anytime.
There is no data lock-in. If data is stored locally on hard disks, it’s easy to replicate, copy to other media, or transport the data. In contrast, transferring your own data out of the cloud can become very costly.
For private projects, in academia, or in a company, you could reuse existing hardware resources.
Initially, maintenance is more expensive; however, there are economies of scale. You need one system administrator to maintain one server, but you don’t need 100 system administrators to maintain 100 servers.
By storing data on-premise you can enforce very strict data governance and data privacy regulations.

Both approaches have their benefits. In this post, I will focus on on-premise setups; however, it is possible to build the same infrastructure using cloud resources and services.

The goal of the infrastructure

The target domain of the infrastructure is

Personal projects,
Projects in academia, and
Products in businesses.

Therefore, the infrastructure is applicable in a very wide range of domains. The requirements change depending on the scale and domain, and the ideas presented in this post don’t necessarily apply to all domains equally. For personal projects, it might be sufficient to select a subset of components and features from the infrastructure, while for large commercial applications, scaling becomes an important aspect and more pieces of the infrastructure need to be in place.

The infrastructure is evaluated against four metrics, namely,

Reproducibility: How easy is it for other people to rerun inference requests or a study and obtain the same results?
Rapid development: Can applications be developed quickly by relying on the infrastructure?
Reasonably scalable: Does the setup scale to a reasonable scale? What I mean by that will be defined later.
Efficient use of resources: Can we efficiently use the available and pre-existing hardware resources?

After having set out these metrics, let’s see what the actual architecture of the infrastructure looks like.

Layered architecture

The key insight is that we need a layered architecture. If you want a single takeaway from this article, that’s it. To be honest, I didn’t invent layered architectures. Layered architectures make sense for classic software projects and I argue that they make a lot of sense for AI infrastructure. The components of the infrastructure are split into three distinct layers: the models, a gateway, and the applications. Let’s start with models.

Model layer

This layer comprises the raw AI models. When I speak about models, I mean to include a wide range of models. For example, you might be working with

Classifiers or regressors, let it be for text, images, sound, or application-specific measurement points,
Embedding models, again, for text, images, sound, graph nodes, or more complex application-specific structures,
Large language models ranging from smallish models with 100s of millions of weights to gigantic models with 100s of billions of weights–whatever is possible with your hardware,
Vector databases, as they are trained, in a sense, when points are added to the collection and we can run inference by doing approximate nearest neighbor lookups, and
Anything else that you might fancy. Nobody is stopping you from adding an adaptor to cloud API models at this stage and building a hybrid architecture.

The models at this stage are the raw model weights packaged together with inference servers like MLServer or Hugging Face’s text inference server. The inference servers usually have a generic API schema, which can be used for a very wide range of models but, on the other hand, can be difficult to use: With a generic schema, there is no built-in documentation of what inputs your model expects. You might be sending the inputs in the wrong format. And, similarly, you might parse and use the return values incorrectly.

To achieve scalability, including scaling on demand, it is beneficial to deploy the models as containers. By following general DevOps practices at this stage, we also ensure that the models are versioned and deployed properly and consistently.

Most of the software that’s needed to build the model layer is off-the-shelf.

Gateway

The gateway is a central point in the architecture that every request to the AI models needs to traverse. This doesn’t imply it is a single point of failure as the gateway API can be set up in high-availability mode. The gateway acts as a load balancer to the on-demand scaled models. Because of its location in the architecture, the gateway can impose a schema. Requests to each model need to comply with the input schema. Clients of the gateway can use common auto-documentation tools to learn the input and output schema, e.g., OpenAPI and Swagger, gRPC service definitions, etc.

The gateway is also a good place to add authentication and authorization. Maybe not every user or every application should have access to all models. Similarly, we can add quotas. One user or one application should not be able to consume all available resources and cause a denial of service for everyone else. This is more important for business domains with a multi-user environment. Additionally, the gateway is a good place to add monitoring to see how models are used and if someone might abuse the service. This becomes much more important as soon as there are large language models in the model layer.
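To make the two ideas of an imposed schema and per-user quotas more concrete, here is a minimal sketch using only the standard library. The class names `EmbeddingRequest` and `Quota`, the field names, and the quota unit (number of input texts) are assumptions for illustration; a real gateway would typically express the schema in its API framework and reset quotas periodically.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EmbeddingRequest:
    """Input schema the gateway enforces for a hypothetical embedding model."""
    user: str
    texts: list[str]

    def __post_init__(self) -> None:
        # Reject malformed requests before they ever reach a model.
        if not self.texts:
            raise ValueError("texts must not be empty")
        if not all(isinstance(text, str) for text in self.texts):
            raise ValueError("texts must be a list of strings")


@dataclass
class Quota:
    """Per-user budget of input texts (hypothetical unit of accounting)."""
    limit: int
    used: dict[str, int] = field(default_factory=dict)

    def charge(self, request: EmbeddingRequest) -> None:
        total = self.used.get(request.user, 0) + len(request.texts)
        if total > self.limit:
            raise PermissionError(f"quota exceeded for {request.user}")
        self.used[request.user] = total


quota = Quota(limit=3)
quota.charge(EmbeddingRequest("alice", ["first text", "second text"]))
print(quota.used)  # {'alice': 2}
```

The `used` counter doubles as a very basic usage monitor: it records how much each user has consumed.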

The gateway is custom code. There are projects and initiatives to have a generic open-source solution for the gateway, but in my view, a custom solution offers more benefits.

Finally, in my experience, it pays off to invest time when developing the gateway, to build a small Python library that acts as an adaptor or proxy for the models. Our goal should be to make using models as easy as it gets. Popular cloud API providers, like OpenAI, show how it can and should be done. For example, using a model should be as easy as

from x import y
y("my input text")

where x is our custom library to access the gateway and y is one of our models. We will later see how this affects the development time of applications. It also helps to focus on the business part while building applications by abstracting the complicated AI models into a simple function call.
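What such a library could look like internally is sketched below. Everything here is hypothetical: the URL path, the JSON field names `inputs` and `outputs`, and the helper `make_model` are placeholders, and the network is replaced by a fake transport function so the example runs on its own.

```python
import json
from typing import Callable

# A transport takes a request path and body and returns the response body.
Transport = Callable[[str, bytes], bytes]


def make_model(name: str, transport: Transport) -> Callable[[str], list[float]]:
    """Return a plain function that hides the gateway call for one model."""
    def model(text: str) -> list[float]:
        body = json.dumps({"inputs": [text]}).encode("utf-8")
        response = transport(f"/models/{name}/infer", body)
        return json.loads(response)["outputs"][0]
    return model


def fake_transport(path: str, body: bytes) -> bytes:
    """Stand-in for the network: 'embeds' a text as [length, word count]."""
    texts = json.loads(body)["inputs"]
    outputs = [[float(len(t)), float(len(t.split()))] for t in texts]
    return json.dumps({"outputs": outputs}).encode("utf-8")


y = make_model("mymodel", fake_transport)
print(y("my input text"))  # [13.0, 3.0]
```

Because the transport is passed in, the same wrapper could sit on top of HTTP or gRPC without changes to the application code.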

Applications

Lastly, let’s look at the applications layer. By applications here, I mean applications in a very broad sense. This can be

One-off analysis in a Jupyter Notebook,
Showcase examples in a Jupyter Notebook,
A FastAPI API that’s released as a service or the backend of a web frontend,
Large-scale batch jobs to process bulk data and compile an analysis report, and
Anything else that needs AI model inferences to work.

The application layer is the correct place to implement any application-specific business or analysis logic. It’s also the right place to join inference results with additional data sources like databases or files. Any CPU- and IO-intensive data transformation, data filtering, data augmentation, or data aggregation should take place in the application layer and not in the model layer.

The application layer is also the ideal place to do A/B testing. A/B testing sounds like it’s only relevant in the business domain where we collect feedback from actual user traffic to test one approach or model (A) against another model (B). However, this is in principle the same as a benchmark in an academic domain: Which model is more sensitive to the quantity that I want to extract and measure from a dataset?
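For the business case, one common way to split traffic between two models is a deterministic assignment based on a hash of the user ID, so that the same user always sees the same variant. This is a sketch under my own assumptions (the function name, the use of SHA-256, and a percentage-based split are not from the infrastructure described here).

```python
import hashlib


def assign_variant(user_id: str, percent_b: int = 50) -> str:
    """Deterministically map a user to model "A" or model "B"."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable value in 0..99
    return "B" if bucket < percent_b else "A"


# The assignment is stable across calls and processes.
print(assign_variant("user-42") == assign_variant("user-42"))  # True
```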

If necessary, the application layer is also the place to implement caching. Caching can imply storing inference responses or the result after applying additional business and analysis logic. It depends on what is the bottleneck. If it is pure inference time, then caching model results might work. If the business logic and data transformations are the bottleneck, then maybe caching the final result is the answer.

Evaluation

After having introduced and explained the infrastructure, it’s now time to evaluate how the infrastructure fosters reproducibility, rapid development, reasonable scalability, and efficient use of resources. Let’s start with reproducibility.

Reproducibility

What do I mean by reproducibility in this context? The infrastructure should give people the opportunity to repeat a study, an analysis, or any inference request and obtain the same result. Managing the input data, however, is beyond the scope of the AI infrastructure. Joel Grus gave a famous talk at JupyterCon 2018 titled I don’t like notebooks. If you don’t know the talk, go search for the recording; it’s very entertaining. One of the messages from the talk could be summarised by

If your output is science, you need reproducibility through best software engineering practices.

How do best software engineering practices help us in this case?

The model layer promotes the use of containerized models that are deployed with deployment pipelines. If we take the step of viewing models as testable product releases, we gain a lot in terms of reproducibility, and the architecture of the infrastructure lends itself to this principle. By treating models as testable product releases, we get a few things for free that help us to ensure reproducibility:

Models are automatically versioned,
Models go through a proper deployment process.

With these two benefits, we eliminate a whole range of problems, e.g., the equivalent of “works on my machine” in the AI world: the model doesn’t run on your machine anymore. Deployments are tracked and in the best case can be easily reverted. This also eliminates the need to copy model weights manually from one machine to another with scp. We also ensure that any inference code that’s wrapped around the model, including any preprocessing or post-processing, is applied consistently. This includes, for example, tokenization of input text or shifting and scaling input variables. Additionally, we ensure that the environment and the dependencies of our model are consistent and available at runtime.

Rapid development

Consider the following code example of a FastAPI endpoint that accepts an input string, computes a semantic embedding, and searches for similar documents in a vector database with the qdrant client. You can find similar snippets all over the internet.

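from fastapi import FastAPI
from langchain_huggingface import HuggingFaceEmbeddings  # assumed import path
from qdrant_client import QdrantClient

app = FastAPI()
embeddings = HuggingFaceEmbeddings(model_name="mymodel")
qdrant_client = QdrantClient("localhost:6333")

@app.post("/search")
def search(query: str) -> list[str]:
    # Embed the query, then look up the nearest documents in the "mdpi" collection.
    vector = embeddings.embed_query(query)
    points = qdrant_client.search("mdpi", ("mymodel", vector))
    return [point.payload["title"] for point in points]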

This might look like a quick and sensible solution; however, if your coding style is anything like mine, you change the file, save it, and retest the code many, many times. In my view, this iteration loop is vital during development. This is where it gets problematic when working with AI models. Every time we save the file and reload the API, we need to reload the embedding model into memory. This might be fine for small models with millions of weights; however, even a few seconds of delay can be annoying and slow down your workflow. As models become larger, it takes minutes or up to an hour to load them into memory. At this point, any iterative development workflow should be considered broken. So how does the infrastructure help us here?

The model and the application code are decoupled. We have our custom Python library x that we can use to send inference requests to the gateway. The library has a minimal footprint and reloading this library should not entail a noticeable delay. Furthermore, our library to access the gateway has minimal dependencies, so we require only a minimal setup and boilerplate code to use a model (from x import y). Using the model becomes as easy as it gets.

You might argue that treating models as product releases slows down the development cycle. I agree. It might be tempting to only release models once they have reached a certain level of “maturity” by passing some checks and benchmarks. However, speaking with reproducibility in mind, if benchmarks are run without properly releasing the model, are you sure that you can reproduce the exact same benchmark with the same model? If the answer is no, then the benchmark is meaningless. In my view, the additional investment of treating every model as a testable release pays off.

Reasonably scalable

What is reasonably scalable? To give you a number: I mean working with datasets that are terabytes in size. This could be a few terabytes or 10s of terabytes. As a comparison in the domain of NLP: the whole compressed English Wikipedia dump as of 2024 is around 30 GB–that’s multiple orders of magnitude smaller. I argue that terabytes are actually sufficient for most applications.

How does the infrastructure help us in that regard? The model layer promotes dockerized models. With a container scheduler (this can be Kubernetes or something simpler) and the load balancer in the gateway, we already achieve a great degree of scalability and can even scale on demand.

Efficient use of resources

When I was working at CERN, I was lucky to have access to the Worldwide LHC Computing Grid–a large-scale collaboration of data centers to bulk process 100s of petabytes of data. I think it is very common, especially in academia, that hardware resources already exist: compute power, data storage, and GPUs. You might have a GPU from the previous research project in the group. At least during my time in academia, I have seen requests for GPU resources, but I have not seen a grant application that mentioned funds for computing power from a private cloud provider.

If we pool resources from multiple projects together, we can smooth load peaks. It’s less likely that all our applications and models experience a spike in the number of requests at the same time. So if only one application or model is in demand at one point in time, that application or model can scale to the entire available hardware. So, with the infrastructure we can incorporate heterogeneous resources and make use of dedicated GPU resources. We can even share GPUs across multiple projects.

Due to the separation of CPU- and IO-intensive tasks in the application layer from the GPU-intensive inference of the models, we make sure that each type of resource is used in the best way. It doesn’t make sense to run CPU-intensive tasks on an expensive server with a GPU such that inference speed and throughput degrade due to the CPU load.

What’s more

Networking

The key idea for the AI infrastructure is a layered approach with minimal coupling between the application layer and the model layer. This all depends on fast network communication. The choice of communication protocol has a large impact. Two very common choices are HTTP+JSON (often incorrectly referred to as “REST”) and gRPC (https://grpc.io/). Both have their individual strengths and advantages.

JSON is a verbose format. It’s manually editable and easily readable. There is almost universal support for the serialization and deserialization of JSON. Practically every programming language offers libraries to send HTTP requests. However, the verbose nature of the format becomes a disadvantage for the AI infrastructure. Requests to and responses from the ML models often consist of a high-dimensional vector of floating point numbers (embedding vectors, classification results). In JSON, this needs to be converted to strings and parsed again into floats. This conversion introduces a CPU overhead. The string representation of a 4-byte floating point number requires about 10 bytes, and therefore requires more network bandwidth.
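The size difference is easy to reproduce with the standard library. The sketch below compares a JSON encoding of a vector with a packed 4-byte-per-value binary layout; the vector length of 768 and its values are arbitrary choices of mine, and `struct.pack` only stands in for a binary wire format, it is not gRPC itself.

```python
import json
import struct

# Hypothetical 768-dimensional embedding; the values are arbitrary.
vector = [i / 768 for i in range(768)]

# Binary: 4 bytes per value, packed as little-endian float32.
binary = struct.pack(f"<{len(vector)}f", *vector)

# Text: every value needs digits, a decimal point, and a separator.
# Values are rounded to about 7 significant digits, roughly float32 precision.
text = json.dumps([float(f"{v:.7g}") for v in vector]).encode("utf-8")

print(len(binary))              # 3072
print(len(text) > len(binary))  # True
print(round(len(text) / len(vector), 1), "bytes per value in JSON")
```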

gRPC is much more concise due to its binary nature. Development with gRPC requires additional tooling and frameworks; however, the support for gRPC is generally good. Is the additional development cost justified by a noticeable improvement in inference speed? Well, let’s measure.

The following benchmark computes text embeddings using a BERT-like model. The parameter on the $x$-axis is the number of parallel text inputs processed as a single batch by the model. The $y$-axis illustrates the average processing time required per text input. Using larger batches usually improves the per-item speed as any overhead associated with the request as a whole occurs only once per request and is shared by all items. Note that both axes are on a logarithmic scale. The benchmark is repeated four times under different conditions: (REST vs. gRPC) × (CPU vs. GPU)
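The batching argument can be written as a simple cost model. This is my own simplification, not something fitted to the benchmark: $t_\text{request}$ stands for the fixed overhead per request, $t_\text{item}$ for the processing time per text input, and $n$ for the batch size.

```latex
\bar{t}(n) \approx \frac{t_\text{request}}{n} + t_\text{item}
```

For small $n$, the overhead term dominates and the per-item time drops as the batch grows. For large $n$, the curve approaches a plateau at $t_\text{item}$, which is set by whatever resource is the bottleneck.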

Let’s focus on the CPU case (in orange) first. We see an increase in speed until around a batch size of 50 for both communication protocols. After that, it flattens out with a processing time of 20 ms for gRPC and 30 ms for REST. This can be understood as hitting the limit of the CPU for AI inference: the CPU cannot compute more embeddings in parallel. Additional reductions in the overhead are negligible compared to the processing time required by the CPU. In other words, the CPU becomes the bottleneck above a batch size of 50. gRPC is around 30 % faster in that regime.

Now, let’s switch to the case where inference is done on a GPU (in blue). One thing to notice is that inference times on the GPU are much faster than on the CPU, as expected. For the REST case, we seem to approach another plateau after a batch size of a few hundred. For gRPC, inference times improve even beyond the batch size of 1000 and gRPC ends up being 10x faster. This performance difference can be understood as a JSON bottleneck. The GPU could run more parallel inference, but we cannot encode, decode, and send data fast enough.

The benchmark nicely demonstrates the power of gRPC in this context and in my view justifies the additional development time to use gRPC as a communication protocol between the application, the gateway, and the model layer for CPU-based and GPU-based inference. If the gRPC client is embedded in the custom library for the gateway, the communication protocol is hidden from the application.

Bigger picture — MLOps

AI and ML projects are difficult. In contrast to classical software projects, we need to track changes in our

Code,
Data, and
Model (weights).

The established practice to manage these changes and iterate based on feedback is called MLOps, in my view, the ML equivalent of DevOps. A good description called Continuous Delivery for Machine Learning can be found on Martin Fowler’s blog. The same practices are now transferred to even more domains and referred to as AIOps and PromptOps. How does the AI infrastructure fit into the picture?

The infrastructure described in this article is concerned with the second half of the MLOps life cycle, namely, the productionization of the model: handling application code, releasing the model to production, and monitoring it there. Other very important aspects, like handling the training data set or training the model, are beyond the scope of this production-centric infrastructure.

Conclusion

What’s next? The key idea is to build the AI infrastructure in layers to reduce coupling between applications and models from an operational point of view. The idea is not new or unique for AI, but brings enormous benefits, such as,

Versioned model releases,
Minimal setup to use models,
Scalable to terabytes of data, and
Use of specialized GPU resources.

The infrastructure is used at MDPI for all its internal and external-facing AI products.

This article is based on a talk at the EuroSciPy 2024 conference in Szczecin, to be presented on August 28, 2024. Stay tuned.

Semantic versioning and its contradictions

2024-05-22T00:00:00+02:00

Semantic versioning, or SemVer, is great. There are alternative ideas like Jacob Tomlinson’s EffVer. In this article, I’ll showcase two internal contradictions of SemVer. However, despite all the criticism, I still think it’s the best versioning method we have. The two examples are probably not relevant in practice, and the fact that people fail to use SemVer consistently is not an argument against it.

So what is SemVer? In short, it’s a way to version software packages, especially libraries. The version is a dot-delimited triple of the major version, minor version, and patch version, e.g., 1.15.2.

When comparing two versions of the library that differ only in the patch version, it means that a bug has been fixed in a backward-compatible way. Software depending on the library is highly encouraged to upgrade to the latest patch. Why wouldn’t you want a bug-fixed dependency?
When comparing two versions of the library that differ only in their minor version, it means that new, backward-compatible features have been added. In Python, this could be adding new functions, new methods, new arguments, etc. Any Python function, class, etc. that existed before is still there and can be used. In SemVer, it’s considered safe to upgrade dependencies to their latest minor version.
When comparing two versions of the library that differ in their major version, it means that the more recent version contains backward-incompatible changes. Upgrades of a dependency to a higher major version potentially break existing code.
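The three rules above can be sketched in a few lines of Python. This is a minimal illustration that assumes plain major.minor.patch strings without pre-release or build suffixes; the names Version, parse, and classify_upgrade are made up for this example.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse(text: str) -> Version:
    """Parse a plain 'major.minor.patch' string such as '1.15.2'."""
    major, minor, patch = (int(part) for part in text.split("."))
    return Version(major, minor, patch)


def classify_upgrade(old: str, new: str) -> str:
    """Name the most significant component that differs between two versions."""
    a, b = parse(old), parse(new)
    if a.major != b.major:
        return "major"  # may contain backward-incompatible changes
    if a.minor != b.minor:
        return "minor"  # new, backward-compatible features
    if a.patch != b.patch:
        return "patch"  # backward-compatible bug fixes
    return "same"


print(classify_upgrade("1.15.2", "1.15.3"))  # patch
print(classify_upgrade("1.15.2", "1.16.0"))  # minor
print(classify_upgrade("1.15.2", "2.0.0"))   # major
```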

Patches are breaking changes

It’s tempting to assume that upgrading a dependency to the latest version within the same minor version is safe and will not break existing code. After all, according to Semantic Versioning (SemVer), as long as the minor version remains the same, any changes introduced in patch releases should be backward compatible. However, in practice, this assumption can be risky.

Most patch releases are intended to fix bugs or security vulnerabilities, which often involves changing the behavior of the library. If the behavior wasn’t altered, the bug couldn’t be resolved. This introduces a subtle but critical problem: while the patch might correct a flaw, it could simultaneously disrupt existing code that relied, perhaps inadvertently, on the buggy behavior. For example, consider a function that previously returned null in certain edge cases due to a bug. If this behavior is corrected to return a more appropriate value or raise an exception, any code that expected null could fail after the patch is applied.
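A small Python sketch of this scenario, with None playing the role of null. Both first_word functions are hypothetical stand-ins for two patch releases of an imaginary library, and caller represents application code written against the buggy release.

```python
def first_word_v1_0_0(text):
    """Hypothetical library function with a bug: returns None for empty input."""
    words = text.split()
    return words[0] if words else None


def first_word_v1_0_1(text):
    """The 'fixed' patch release: empty input now raises ValueError."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return words[0]


def caller(first_word, text):
    """Application code that came to rely on the None return value."""
    word = first_word(text)
    return "<empty>" if word is None else word.upper()


print(caller(first_word_v1_0_0, "   "))  # <empty>
try:
    caller(first_word_v1_0_1, "   ")
except ValueError as error:
    print("broken by the patch:", error)
```

The caller never changed, yet it fails after upgrading from the hypothetical 1.0.0 to 1.0.1.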

This issue becomes even more complex when considering what constitutes “buggy behavior.” Is it a bug if the documentation was unclear or silent on how the library should handle certain edge cases? In such cases, what one developer might see as a bug, another might have come to rely on as a feature. This ambiguity can lead to unintended breaking changes in what are supposed to be non-breaking releases. This could be attributed to bad documentation, but I argue any documentation is ambiguous to some extent (the only thing not ambiguous is the code).

Moreover, performance optimizations, such as reducing the runtime complexity of an algorithm from O(n^2) to O(n log n), are typically considered safe improvements and often categorized under patch releases. However, even these changes can have unintended consequences. For instance, a performance optimization might alter the timing of certain operations, leading to subtle race conditions or other timing-dependent issues such as triggering API rate limits in the software that relies on the library.

The first inconsistency or subtlety in SemVer is that every patch release is a breaking change.

Relying on the absence of features

According to SemVer, adding backward-compatible new features warrants a minor version release (e.g., from 1.2.0 to 1.3.0). The idea is that such changes should not break existing code that relies on the previous version.

However, this assumption of backward compatibility is tricky. Which changes are truly backward compatible? In certain scenarios, adding a new feature can inadvertently change the behavior of existing code, even if no breaking changes were introduced to a library. This is particularly problematic in dynamic languages like Python, which rely heavily on duck typing, i.e., a programming concept where the suitability of an object for a specific operation is determined by the presence of certain methods and properties, rather than the object’s type.

Consider the following Python code snippet:

import numpy

def func(obj):
    """Compute sqrt of object."""
    if hasattr(obj, "sqrt"):
        return obj.sqrt()
    return numpy.sqrt(obj)

In this example, func checks if the argument obj has the method sqrt. If it does, func calls obj.sqrt(). If not, it defaults to using numpy.sqrt(obj), i.e., a well-established function from the numpy library. This approach works smoothly under the assumption that obj either has an appropriate sqrt method or it doesn’t have a sqrt method.

Now, let’s imagine that you rely on a library (here library) that provides the class A. At the time you wrote the code, the class A did not provide a sqrt method. Consequently, func would consistently use numpy.sqrt() when passed an instance of A. However, if library is updated to a new minor version, and this update introduces a sqrt method to the class A, the behavior of func changes—potentially in unexpected and undesired ways. We cannot predict how sqrt will be implemented in the future. It might require additional arguments or provide additional metadata as return values.
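To make this concrete, here is a runnable sketch. The two classes are hypothetical stand-ins for class A before and after the minor release, and math.sqrt replaces numpy.sqrt only to keep the example free of third-party dependencies.

```python
import math


def func(obj):
    """Same logic as above, with math.sqrt standing in for numpy.sqrt."""
    if hasattr(obj, "sqrt"):
        return obj.sqrt()
    return math.sqrt(obj)


class A_v1_2:
    """Hypothetical class A as shipped in library 1.2.0: no sqrt method."""

    def __init__(self, value):
        self.value = value

    def __float__(self):
        return float(self.value)


class A_v1_3(A_v1_2):
    """Hypothetical class A in library 1.3.0: a sqrt method was added."""

    def sqrt(self):
        # The new method also returns metadata that callers of func never expected.
        return math.sqrt(self.value), "exact"


print(func(A_v1_2(9)))  # 3.0
print(func(A_v1_3(9)))  # (3.0, 'exact')
```

The code of func is identical in both calls. Only the dependency changed, and the return type changed with it.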

The second inconsistency or subtlety in SemVer is that adding features is never backward-compatible.

Sticky bits

2024-04-24T00:00:00+02:00

Native Linux file permissions are often denoted as “r” (read), “w” (write), and “x” (executable, which in the case of a directory means that a user can change into the directory). Treating them as binary flags with the values r = 4, w = 2, and x = 1, we can express any combination in a single digit. This scheme is repeated three times to denote independent permissions for the file owner, any user in the group of the file, and anyone else. For example, a file with permissions 0644 or rw-r--r-- is readable and writable by the owner (rw-). Users in the group of the file and anybody else have read-only access (r--). However, what does the leading 0 mean? There is more to it. I introduce to you: the sticky bit and its friends.

The leading zero indicates that we have yet another set of three binary flags that control the file’s permissions. Their meaning is sometimes very subtle.
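The digit arithmetic can be sketched in Python as follows. The function name describe is made up for this illustration. It splits a mode into the leading digit and the three rwx triples.

```python
def describe(mode: int) -> dict:
    """Split a mode such as 0o0644 into its four octal digits."""

    def rwx(digit: int) -> str:
        # r = 4, w = 2, x = 1
        return "".join(
            letter if digit & bit else "-"
            for letter, bit in (("r", 4), ("w", 2), ("x", 1))
        )

    return {
        "special": (mode >> 9) & 0o7,  # the leading digit
        "owner": rwx((mode >> 6) & 0o7),
        "group": rwx((mode >> 3) & 0o7),
        "other": rwx(mode & 0o7),
    }


print(describe(0o0644))
# {'special': 0, 'owner': 'rw-', 'group': 'r--', 'other': 'r--'}
print(describe(0o1777))
# {'special': 1, 'owner': 'rwx', 'group': 'rwx', 'other': 'rwx'}
```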

What the bits mean

Let’s first check what each bit means. This depends on the type of the file object, i.e., whether it is a file or a directory. The behaviour of these three bits is also not consistent across all Unix-like systems.

setuid (`4xxx`)

Binary executable: Executable runs as the file-owner user, independent of who launches the process

setgid (`2xxx`)

Binary executable: Executable runs with the group of the file, independent of who launches the process
Directory: New files and directories inherit the group of directory

sticky (`1xxx`)

Executable: Retain file in swap after exec (obsolete)
Directories: Only the file owner, the directory owner, and root can delete or rename files. (Otherwise, anyone with write permission on the directory can delete or move files)

How they are displayed

Command-line tools like ls encode the information in their usual output. The usual x to indicate an executable is dropped and replaced with s or t. If the file is not executable, S or T is used.

Executable files:

0xxx = ..x ..x ..x
1xxx = ..x ..x ..t
2xxx = ..x ..s ..x
4xxx = ..s ..x ..x

Non-executable files:

0xxx = ..- ..- ..-
1xxx = ..- ..- ..T
2xxx = ..- ..S ..-
4xxx = ..S ..- ..-
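The display rules can be checked without touching any files. Python’s standard library offers stat.filemode, which renders a numeric mode the same way ls does. The modes below are arbitrary examples, with 755 standing for an executable file and 644 for a non-executable one.

```python
import stat

# stat.S_IFREG marks the mode as belonging to a regular file.
for mode in (0o0755, 0o1755, 0o2755, 0o4755, 0o0644, 0o1644, 0o2644, 0o4644):
    print(f"{mode:04o} = {stat.filemode(stat.S_IFREG | mode)}")

# 0755 = -rwxr-xr-x
# 1755 = -rwxr-xr-t
# 2755 = -rwxr-sr-x
# 4755 = -rwsr-xr-x
# 0644 = -rw-r--r--
# 1644 = -rw-r--r-T
# 2644 = -rw-r-Sr--
# 4644 = -rwSr--r--
```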

Nvidia library/driver version mismatch

2024-03-27T00:00:00+01:00

On systems with an NVIDIA GPU, a simple apt upgrade could leave you in a dreaded situation where the GPU still works, but the NVIDIA tooling like nvidia-smi doesn’t work anymore. The tools just print an error message like

$ nvidia-smi
NVML: Driver/library version mismatch

This happens when apt upgrades the version of your tooling but the nvidia kernel modules were already in memory before, so they’re still running with the previous version. The usual approach is to reboot your machine. Sometimes this is not acceptable. In these cases, you can instead remove the now outdated kernel modules and load the updated version.

The solution

In most cases, running this sequence of commands is sufficient:

sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia_uvm
sudo rmmod nvidia
nvidia-smi

The call to nvidia-smi reloads the required kernel modules automatically.

Troubleshooting

A kernel module cannot be removed while another module depends on it. We first need to remove all dependent modules. That’s why we remove nvidia_drm, nvidia_modeset, and nvidia_uvm first. If a module has additional dependents not considered in this article, its removal will fail. For example, if we tried to remove nvidia first, rmmod would print an error message. To find additional dependent modules of nvidia, run

lsmod | grep nvidia
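The ordering logic can also be sketched in Python. The sketch assumes the /proc/modules format, where the fourth column lists the modules that depend on the given module (or - if there are none). The function names and the sample text are made up for this illustration and are not output from a real machine.

```python
def parse_modules(text: str) -> dict:
    """Map each module name to the set of modules that depend on it."""
    users = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        name, used_by = fields[0], fields[3]
        users[name] = {m for m in used_by.split(",") if m and m != "-"}
    return users


def removal_order(users: dict, target: str) -> list:
    """List the target and everything depending on it, dependents first."""
    order = []

    def visit(name):
        if name in order:
            return
        for dependent in sorted(users.get(name, ())):
            visit(dependent)
        order.append(name)

    visit(target)
    return order


# Illustrative sample in /proc/modules format.
SAMPLE = """\
nvidia_uvm 1540096 0 - Live 0x0000000000000000
nvidia_drm 77824 0 - Live 0x0000000000000000
nvidia_modeset 1314816 1 nvidia_drm, Live 0x0000000000000000
nvidia 56528896 2 nvidia_uvm,nvidia_modeset, Live 0x0000000000000000
"""

print(removal_order(parse_modules(SAMPLE), "nvidia"))
# ['nvidia_drm', 'nvidia_modeset', 'nvidia_uvm', 'nvidia']
```

On a real machine, the text could be read from /proc/modules instead of the sample string.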

Removing a module can also fail if a process is still using a device. In these cases, use lsof to get a list of these processes. For example, if the NVIDIA device plugin for Kubernetes prevents you from removing nvidia_uvm, you’ll find this out with

sudo lsof /dev/nvidia-uvm

httpx as a basic gRPC client

2024-02-28T00:00:00+01:00

gRPC is a concise, fast, and powerful remote procedure call protocol. It leverages Google’s ProtoBuf as the wire format and HTTP/2 as transport. Besides simple RPC invocations, gRPC supports additional metadata and client and/or server-side streaming of messages (providing support for server push messages). Getting started can be daunting. Understanding the internals is even more challenging. Assuming prior knowledge of HTTP/2, can we use HTTP/2 and ProtoBuf by hand to create a gRPC client? This article implements a rudimentary client for educational purposes based on the Python library httpx.

Protocol Buffer

HTTP/2 is used as the transport protocol of gRPC. However, before we can send our first gRPC message with a normal HTTP/2 client, we first need to understand its wire format: Protocol Buffers. This article uses the Hello World from the gRPC guides. The greeter service is deployed at grpc-helloworld.sauerburger.com. The full definition of the Greeter service is hosted on GitHub. The service is defined in ProtoBuf’s own syntax. Important for this article are the message structs. The service expects a HelloRequest message as input and returns a HelloReply. Let’s focus on the request. The message is defined as follows.

message HelloRequest {
  string name = 1;
}

Each request message has a single field, a name field of type string. The number assignment (=1) is ProtoBuf’s way of enumerating the field. The field id will be important in the following. We could, and in fact, you should always, use the Protocol Buffer compiler protoc to generate serialization and parsing code for your programming language of choice and use that code to build request payloads. However, for this article, we will dissect the structure of the message by hand.

Let’s assume we want to send a request to the Greeter service with name = "World". We expect to get a “Hello World” greeting in response. How to encode this as a HelloRequest message?

One fundamental concept in ProtoBuf is the variable length integer or varint. How many bytes do you need to encode a 32-bit unsigned integer? Well, 4 bytes. If we know that most of the time we’ll be dealing with smaller numbers, is there a more efficient encoding? Yes: variable length integers. To encode an integer in that way, we need to chop it into 7-bit chunks (with padding to the left). Each chunk is prefixed by a single bit: 1 indicating that the next byte is also part of the same integer encoding, and 0 indicating that it’s the last byte. For example, the integer 150 encoded as a varint is \x96\x01 in Python byte string notation. Protocol Buffer makes heavy use of this technique. However, for our simple examples in this article, we don’t really need it.
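A minimal sketch of the varint scheme in Python, limited to non-negative integers. The function names are made up for this illustration. Note that the 7-bit chunks are written with the least significant chunk first.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a ProtoBuf varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        chunk = value & 0x7F          # lowest 7 bits
        value >>= 7
        if value:
            out.append(chunk | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(chunk)         # last byte: continuation bit cleared
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode a varint that spans the whole of data."""
    result = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * index)
    return result


print(encode_varint(150))          # b'\x96\x01'
print(decode_varint(b"\x96\x01"))  # 150
print(encode_varint(5))            # b'\x05'
```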

With varints out of the way, let’s encode our name = "World" message. The first step requires encoding our string using UTF-8. Since our name consists solely of ASCII characters, the binary version does not need further explanation.

Each field in a ProtoBuf message is encoded using the “tag-length-value” scheme. This means that it first specifies the field identifier, in our case 1, and its wire type, here a variable length string, then the string’s length as a varint, and finally the UTF-8-encoded string itself. The combination of field identifier and wire type is also encoded as a varint. However, in our example, the field information and the length are less than 128, so it looks like a normal 1-byte integer.

The HelloRequest message specifies only a single, mandatory, and non-repeatable field. The resulting bytes \x0a\x05World are exactly what ProtoBuf’s SerializeToString() function would return if we used protoc to generate the serializer code. You can verify this yourself by following this end-to-end example.
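The tag-length-value encoding of this one field can be reproduced by hand. The helper below is a sketch with a made-up name. It only covers the case from this article where both the tag and the length fit into a single byte.

```python
def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode one string field; valid only while tag and length fit in one byte."""
    payload = text.encode("utf-8")
    tag = (field_number << 3) | 2  # wire type 2: length-delimited
    if tag >= 128 or len(payload) >= 128:
        raise ValueError("multi-byte varints are not handled in this sketch")
    return bytes([tag, len(payload)]) + payload


print(encode_string_field(1, "World"))  # b'\n\x05World'
```

Python prints the byte \x0a as \n, so the output is the same byte string as \x0a\x05World.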

gRPC

The wire format of gRPC consists of more than just plain ProtoBuf messages. A gRPC body can contain a stream of ProtoBuf messages. Bidirectional message streaming is one of the strengths of gRPC. Each ProtoBuf message is prefixed by five bytes. The first byte is a flag that indicates whether the message is compressed. We’ll stick to \x00 to disable compression. The next four bytes hold the length of the message encoded as a big-endian 32-bit unsigned integer. This allows clients and servers to break the stream into individual ProtoBuf messages, but it also limits the size of each individual message. For our trivial service, we’ll not use message streams and content ourselves with just a single message.

The encoding procedure is illustrated in the following schema.

This was the fast introduction to Protocol Buffer’s wire format as we need it for a simple gRPC service. Feel free to dive into all the nitty gritty details.

The final message that we want to send as HTTP/2 body is \x00\x00\x00\x00\x07\x0a\x05World in Python byte string notation.
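The five-byte prefix can be produced with Python’s struct module. This is a sketch with a made-up function name, assuming a single, uncompressed message.

```python
import struct


def frame(message: bytes, compressed: bool = False) -> bytes:
    """Prefix a serialized message with the 1-byte flag and 4-byte big-endian length."""
    return struct.pack(">BI", int(compressed), len(message)) + message


body = frame(b"\x0a\x05World")
print(body)  # b'\x00\x00\x00\x00\x07\n\x05World'
```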

The last missing puzzle piece is the appropriate HTTP headers.

The client

It’s time to build our rudimentary client and send the message that we assembled in the previous section. The two key ingredients are

force HTTP/2 in the httpx client, and
the appropriate HTTP headers.

On a superficial level, gRPC is just HTTP with Protocol Buffer bodies. The first header that we need to take care of is the Content-Type. Set it to application/grpc. gRPC uses trailing headers, an HTTP feature that is currently not supported by httpx. Usually in HTTP, the header key-value pairs are sent before the message body. With trailing headers, some header fields might be sent after the HTTP body. gRPC uses this, for example, to set the status code. A request might fail after an initial part of the body was already sent. The trailing status code header can indicate failure in these cases. The last header, grpc-accept-encoding: identity, instructs the server not to use any compression. Compression would complicate the decoding for us, and it probably wouldn’t make our message any smaller anyway. So, let’s put this together.

import httpx

# gRPC requires HTTP/2, so disable HTTP/1.1 entirely.
client = httpx.Client(http1=False, http2=True)
response = client.post(
  "https://grpc-helloworld.sauerburger.io/helloworld.Greeter/SayHello",
  headers={
    "content-type": "application/grpc",
    "te": "trailers",
    "grpc-accept-encoding": "identity",
  },
  # Raw bytes are passed via content=; data= is meant for form data.
  content=b"\x00\x00\x00\x00\x07\x0a\x05World",
)
print(response.content)

The script prints b'\x00\x00\x00\x00\r\n\x0bHello World'.

The response

To decode the response, we can follow the same steps as above, just in reverse.

00 (hex), 1 byte, no compression,
00 00 00 0d (hex), 32-bit integer, message length: 13 bytes,
0000 1010 (bin), varint, field identifier 1 with wire type 2, i.e., string or bytes,
0000 1011 (bin), varint, string length: 11 bytes,
Hello World (utf-8), string value of the message field.

In short, the server replied with “Hello World”.
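The decoding steps can be sketched in Python. The function name is made up, and the sketch only handles what this article needs: one uncompressed frame with a single short string field.

```python
import struct


def decode_reply(body: bytes) -> str:
    """Decode a single uncompressed gRPC frame holding one short string field."""
    compressed, length = struct.unpack(">BI", body[:5])
    if compressed:
        raise ValueError("compressed messages are not handled in this sketch")
    message = body[5:5 + length]
    tag, size = message[0], message[1]  # both are one-byte varints here
    field_number, wire_type = tag >> 3, tag & 0x07
    if (field_number, wire_type) != (1, 2):
        raise ValueError("expected field 1 with wire type 2")
    return message[2:2 + size].decode("utf-8")


print(decode_reply(b"\x00\x00\x00\x00\x0d\x0a\x0bHello World"))  # Hello World
```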

Summary

Is it instructive to reverse engineer the gRPC protocol with Protocol Buffer and httpx? Definitely.

Should you do this anywhere else? No, this is clearly only for educational purposes. For everything else, use the auto-generated gRPC client and server.

Retag docker images

2024-01-31T00:00:00+01:00

It is common practice to build Docker containers in CI pipelines using tools like Kaniko. It is also common practice to version Docker images with tags like 1.0.1, 1.0, 1, latest. At one point in time, all tags probably pointed to the same image. Is it possible to write a CI/CD job that retags images without downloading the full image first?

TL;DR: docker buildx imagetools create --tag NEW EXISTING

Yes. Adding a new tag doesn’t require access to a Docker daemon. It’s a matter of sending the correct API requests to the Docker registry. docker buildx has all the features we need.

Assuming the original image is tagged as $CI_REGISTRY_IMAGE:$CI_COMMIT_TAG using GitLab CI/CD variables, the following lines retag an image with its major and minor version, e.g. 1.0.
```
export newtag=$(echo $CI_COMMIT_TAG | cut -d. -f 1-2)
docker buildx imagetools create --tag $CI_REGISTRY_IMAGE:$newtag $CI_REGISTRY_IMAGE:$CI_COMMIT_TAG 
```

To retag the image just with the major version, e.g. 1, use the following lines.

export newtag=$(echo $CI_COMMIT_TAG | cut -d. -f 1)
docker buildx imagetools create --tag $CI_REGISTRY_IMAGE:$newtag $CI_REGISTRY_IMAGE:$CI_COMMIT_TAG 

To retag the image just with latest use the following lines.

docker buildx imagetools create --tag $CI_REGISTRY_IMAGE:latest $CI_REGISTRY_IMAGE:$CI_COMMIT_TAG 
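The tag derivation done with cut above can also be expressed in Python, for example to check the logic locally before putting it into a pipeline. The function name is made up, and the sketch assumes tags of the form major.minor.patch.

```python
def derived_tags(version: str) -> list:
    """Return the minor, major, and latest tags for a 'major.minor.patch' tag."""
    parts = version.split(".")
    return [".".join(parts[:2]), parts[0], "latest"]


print(derived_tags("1.0.1"))  # ['1.0', '1', 'latest']
```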

The full GitLab pipeline could look like this.

stages:
- build
- tag

build:
  stage: build
  allow_failure: false
  image:
    name: gcr.io/kaniko-project/executor:v1.20.0-debug
    entrypoint: [""]
  rules:
    - if: $CI_COMMIT_TAG
  script:
    - echo "{\"auths\":{\"$CI_REGISTRY\":{\"auth\":\"$(printf "%s:%s" "${CI_REGISTRY_USER}" "${CI_REGISTRY_PASSWORD}" | base64 | tr -d '\n')\"}}}" > /kaniko/.docker/config.json
    - /kaniko/executor
      --context "${CI_PROJECT_DIR}"
      --dockerfile "${CI_PROJECT_DIR}/Dockerfile"
      --destination "${CI_REGISTRY_IMAGE}:${CI_COMMIT_TAG}"


.tag-template:
  stage: tag
  image: docker:24.0.7
  rules:
    - if: $CI_COMMIT_TAG
      when: manual
  before_script:
  - mkdir -p $HOME/.docker
  - echo "{\"auths\":{\"$CI_REGISTRY\":{\"auth\":\"$(printf "%s:%s" "${CI_REGISTRY_USER}" "${CI_REGISTRY_PASSWORD}" | base64 | tr -d '\n')\"}}}" > $HOME/.docker/config.json

tag-latest:
  extends: .tag-template
  script:
  - docker buildx imagetools create --tag $CI_REGISTRY_IMAGE:latest $CI_REGISTRY_IMAGE:$CI_COMMIT_TAG 

tag-major:
  extends: .tag-template
  script:
  - export newtag=$(echo $CI_COMMIT_TAG | cut -d. -f 1)
  - docker buildx imagetools create --tag $CI_REGISTRY_IMAGE:$newtag $CI_REGISTRY_IMAGE:$CI_COMMIT_TAG 

tag-minor:
  extends: .tag-template
  script:
  - export newtag=$(echo $CI_COMMIT_TAG | cut -d. -f 1-2)
  - docker buildx imagetools create --tag $CI_REGISTRY_IMAGE:$newtag $CI_REGISTRY_IMAGE:$CI_COMMIT_TAG 

Have a look at the end-to-end example.

Kubernetes tricks

2024-01-03T00:00:00+01:00

This article is a collection of techniques that have proven valuable when interacting with a Kubernetes cluster, especially when developing or debugging applications deployed to the cluster.

Entering a Pod/Container to “poke around”

If a container or application doesn’t behave as expected, one thing to do during debugging is to exec into the misbehaving container to inspect the environment and the file system (mounted or in the image).

The following command launches a shell inside the first container of a Pod

$ kubectl exec -it your-pod-name -- sh

Depending on the image, other shells might be available, like bash or zsh. During development, it is perfectly fine to install any required debugging tools inside the running Pod. You will find me installing vim in various containers during debugging. The key is: Containers are ephemeral. Once the root cause of the issue is found, it’s easy to recreate the Pod, and the debugging mess is gone. That’s even more comfortable than debugging in a local clone of a repository.

Once inside, the following actions have proven useful:

Check the version of the code (maybe it’s not the correct Docker image)
Check the environment variables: env | sort
Run part of the application by hand to see directly what it’s doing
Check if mounted volumes are available: mount
Check file and directory permissions, especially for mounted volumes
Inspect running processes:
- If available ps or top or htop
- Otherwise, have a look at /proc. There is a lot of information in /proc, for example running commands /proc/1/cmdline, their working directories /proc/1/cwd.
Connect to dependent services
- With curl URL for HTTP connections.
- With nc -zv IP PORT for plain TCP (and UDP) connections.
- With any application-specific tool, e.g., psql to debug PostgreSQL-related problems.
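If ps is missing but the image happens to ship Python, the /proc entries mentioned above can be read with a few lines of code. This is a sketch with made-up function names. It relies on the arguments in /proc/PID/cmdline being separated by NUL bytes, and it works on Linux only.

```python
import os
from pathlib import Path


def parse_cmdline(raw: bytes) -> list:
    """Split the NUL-separated content of /proc/<pid>/cmdline into arguments."""
    # Empty chunks (such as the one after the trailing NUL) are dropped.
    return [arg.decode("utf-8", errors="replace") for arg in raw.split(b"\0") if arg]


def process_info(pid: int = 1) -> dict:
    """Read command line and working directory of a process from /proc."""
    proc = Path("/proc") / str(pid)
    return {
        "cmdline": parse_cmdline((proc / "cmdline").read_bytes()),
        "cwd": os.readlink(proc / "cwd"),
    }


print(parse_cmdline(b"python\0myapp.py\0"))  # ['python', 'myapp.py']
```

Inside a container, process_info(1) would describe the main process, provided the current user is allowed to read its /proc entries.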

View content of a PVC

An easy solution to view the content of a PersistentVolumeClaim (PVC) is to mount the PVC in a dedicated container and exec into the container. This solution works for virtually every storage provider.

apiVersion: v1
kind: Pod
metadata:
  name: pvc-inspector
spec:
  containers:
  - image: busybox
    name: pvc-inspector
    command: ["tail"]
    args: ["-f", "/dev/null"]
    volumeMounts:
    - mountPath: /pvc
      name: pvc-mount
  volumes:
  - name: pvc-mount
    persistentVolumeClaim:
      claimName: YOUR_CLAIM_NAME_HERE

More info on StackOverflow or using my templating service.

Entering a constantly failing Pod/Container

If a Pod is constantly failing at startup and ends up in the dreaded CrashLoopBackOff state, you cannot use kubectl exec -it to go into the container. Entering the container requires it to be running. If the pod is part of a managed resource, e.g., a Deployment, there is a trick to force the container to start.

Consider the following modification.

 apiVersion: apps/v1
 kind: Deployment
 metadata:
   labels:
     app: test
   name: test
 spec:
   selector:
     matchLabels:
       app: test
   template:
     metadata:
       labels:
         app: test
     spec:
       containers:
       - image: "myapplication:0.1.0"
-        command: ["python", "myapp.py"]
+        command: ["tail", "-f", "/dev/null"]
         name: test

Using tail -f /dev/null is a common trick to block the execution of a container indefinitely. tail -f reads from the given file and waits until content is available. However, the special file /dev/null will always be empty.

With this modification, the Pod will enter the Running state, and entering the container is possible again. Start debugging by launching the original application, here, python myapp.py.

Edit the application before launching it

Let’s assume you need additional debugging output from an application. However, the application reads the logging level upon start, so once you enter the container, it’s already too late to change the log verbosity. The situation is similar when working in an interpreted language. You might want to add additional debug output, but once the application starts, it’s too late to change the script.

To tackle this situation, replace the default Pod command by tail -f /dev/null to prevent it from starting the application.

       containers:
       - image: "myapplication:0.1.0"
-        command: ["python", "myapp.py"]
+        command: ["tail", "-f", "/dev/null"]

If you enter the container now, there is plenty of time to do any modifications to the environment variables, config files, or scripts before launching the application by hand. Install the text editor of your choice.

Probing ports and services

Many applications in Kubernetes use network communication between individual services of the application or with external services and clients. During debugging, it is essential to see if a connection is possible. To test an external connection, e.g., via port forwarding, or an internal one after entering a container, use

$ nc -vz IP_ADDRESS_OR_HOSTNAME PORT

to check if the TCP handshake succeeds. The command can be used with IP addresses or hostnames to test DNS resolution at the same time. Test connections to Pod IP addresses and Service IP addresses (ClusterIP or NodePort).
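Some images ship without nc. If Python is available in the container, the same TCP check can be sketched with the standard library. The function name is made up for this illustration.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake with host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and failed DNS lookups.
        return False


# Hypothetical usage with a Service name and port:
# can_connect("my-database.default.svc.cluster.local", 5432)
```

Like nc -vz, the function resolves hostnames first, so a failing DNS lookup also shows up as a failed check.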

Prepare a configuration file

When using pre-built Docker images and applications, it can be challenging at times to dynamically create a configuration file based on environment variables, config maps, or secrets. Usually, this is achieved with container entry points. However, entry points are usually already in use by pre-built images, and templating tools to create the config might not be available in these images.

In these cases, I suggest creating an emptyDir volume, adding an init container to the Pod, and mounting the volume in both the init container and the main container. The init container can be used to create the dynamic config in the volume. The main container will be able to read the config from the volume.

Force Deployment rollout if config maps change in Helm

When a config map changes in a Helm-based deployment, Pods consuming the config map are not necessarily restarted automatically. This will likely lead to unexpected behavior. To save you headaches and a debugging session, you can add the config map’s checksum as an annotation. Deployments will automatically restart their Pods if the config map, and therefore the annotation, changes.

kind: Deployment
spec:
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}

See the Helm documentation for more details.

Auto-generate secrets in Helm

Almost all Helm-based applications require secrets, for example, to set up internal databases. It is convenient to create a secret upon installation that is then shared between the database and the application consuming the database. This can be achieved using a switch based on .Release.IsInstall. When the chart is first installed, a random secret is created. Every subsequent upgrade will read the existing secret from the cluster, yielding an identical manifest.

apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: {{ print .Release.Name | trunc -63 }}
data:
  {{- if .Release.IsInstall }}
  SECRET_KEY: {{ randAlphaNum 20 | b64enc }}
  {{- else }}
  SECRET_KEY: {{ (lookup "v1" "Secret" .Release.Namespace (print .Release.Name | trunc -63)).data.SECRET_KEY }}
  {{- end }}

More?

What’s your favorite technique? Drop me an e-mail to extend the list.

Let’s Encrypt certificates on internal machines

2023-12-06T00:00:00+01:00

A lot of system administrators are faced with the challenge of obtaining TLS certificates for internal machines that are not exposed to the public. In these scenarios, a domain is mapped to a private IP address by a link-local DNS server. For example, when a user on the internal network accesses https://example.com, their browser connects to 172.16.0.1:443. HTTP-based challenges don’t work with internal machines because Let’s Encrypt cannot connect to the internal servers. However, there are a few solutions to obtain TLS certificates from a CA.

Use an internal CA to sign certificates or self-signed certificates. The disadvantage is that each user on every device needs to trust the custom certificate.
Use Let’s Encrypt (or any other CA) with a challenge that doesn’t require HTTP access to that machine. For example, the internal server could interact with a DNS service to pass a DNS-based challenge and obtain certificates.
It is also possible to use an ad-hoc solution with a public-facing server that handles the public requests to https://example.com. A system administrator performs the HTTP-based challenge on the public counterpart and copies the certificates to the internal machine. Although this is a very simple solution, it requires manual actions every two to three months when the certificates approach their expiry date.

There is another lesser-known solution that leverages Let’s Encrypt accounts.

Concept

The key insight is that once an account proves eligibility for a certificate, e.g., via the HTTP challenge, Let’s Encrypt will mint multiple certificates for the same domain. The following sketch illustrates the idea.

The public surrogate server, reachable via https://example.com, requests a certificate for example.com and proves eligibility via the HTTP challenge (1). Certificate renewal can be automated such that there is always a valid certificate on the server. This could be used to host a public-facing website in place of the internal application. The surrogate server uses a user account, here xyz, to obtain the certificate.

Let’s Encrypt keeps track of which account passed which challenge. The internal server, hosting an internal application, uses the same credentials and therefore also the same account in the communication with Let’s Encrypt. The internal server can now request another certificate for example.com. The eligibility has already been established and Let’s Encrypt mints the certificate right away.

What sounds like a bug at first is covered by the ACME specification: RFC 8555.

The “authorizations” array of the order SHOULD reflect all authorizations that the CA takes into account in deciding to issue, even if some authorizations were fulfilled in earlier orders or in pre-authorization transactions. For example, if a CA allows multiple orders to be fulfilled based on a single authorization transaction, then it SHOULD reflect that authorization in all of the orders.

The specification does not mandate this behavior, but Let’s Encrypt supports it.

Implementation steps

On the public-facing surrogate server

Set up certbot in combination with a web server.
Request a certificate for the domain in question.
Automate certificate renewal, e.g., using certbot renew as a cron job.

On the internal server

Set up certbot.
Copy /etc/letsencrypt/accounts from the surrogate server to the internal server.
Request a certificate for the domain in question.
Automate certificate renewal, e.g., using certbot renew as a cron job.

Done.