Thoughts about GDPR

Privacy and the protection of personal data are very important topics in my opinion. The new European General Data Protection Regulation (GDPR) is probably a step in the right direction. However, some of the rules sound rather vague and bureaucratic, and therefore there has been a lot of confusion about how they will be applied in practice. It is up to the courts whether GDPR will mainly bother owners of small blogs and club websites, or whether it will force big data companies to implement and apply effective privacy protection. In this post, I’d like to discuss a few thoughts about GDPR: what it applies to and which types of services might no longer be possible.

I am not a lawyer, so what I am discussing here is not legal advice and might be completely wrong from a legal point of view. These are just my thoughts from the point of view of a software engineer.

GDPR is not new. There was a transition period of two years before it became effective, so everybody had two years to figure out what they needed to do to make their service comply with GDPR. To be honest, I didn’t use those two years to prepare, and therefore I still have a lot of open questions.

Logging IP addresses

The IP addresses of users are personal information according to current court rulings. This means that a web server logging IP addresses constitutes processing of personal information. The security of a web service is a legitimate purpose for logging these IP addresses for a limited time. The user must be informed about this in the website’s “privacy policy”. So far so good.
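To make “for a limited time” a bit more concrete, here is a minimal sketch of a retention rule that drops log entries once they are older than a fixed window. The seven-day window, the tuple layout of an entry and the function name purge_old_entries are assumptions for illustration only. They are neither a legal recommendation nor the behaviour of any particular web server.

```python
from datetime import datetime, timedelta, timezone

# Assumed retention window; chosen for illustration, not a legal recommendation.
RETENTION = timedelta(days=7)


def purge_old_entries(entries, now):
    """Keep only (timestamp, ip, request) entries that are younger than RETENTION."""
    cutoff = now - RETENTION
    return [entry for entry in entries if entry[0] >= cutoff]


# Hypothetical log entries using documentation-only IP addresses.
now = datetime(2018, 6, 1, tzinfo=timezone.utc)
log = [
    (now - timedelta(days=30), "192.0.2.1", "GET /"),
    (now - timedelta(days=1), "198.51.100.7", "GET /login"),
]
recent = purge_old_entries(log, now)
```

In practice this job is usually left to the log rotation of the operating system. The sketch only shows the rule itself.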

Development or private applications

There are various reasons for hosting a service on a publicly available server even though the service is not meant to be public. You, and maybe a very small hand-picked, trusted circle of people, are the only ones who use the service. You would like to access the service anywhere anytime, which means it should be hosted on a publicly accessible server.

It might also be the case that you are still in the development phase of a web application and your primary focus is (probably) not the details of the privacy policy in the fine print. Usually, development happens offline in a private network. However, if you want to use an encrypted connection, it is probably easiest to make the service public and get certificates from Let’s Encrypt. (Of course, you can also sign your own certificates and distribute them during the development phase in a private network, but I find this rather cumbersome.)

To prevent random people from using a service which is still under development or not meant to be public, I usually configure HTTP Basic Authentication or client certificate verification in HTTPS, depending on the project. In the former case, the server responds with a 401 if a client does not supply the correct user credentials. In the latter case, the browser asks the user to select a certificate.
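As a rough illustration of the Basic Authentication case, the following sketch shows the decision a server makes for each request: without valid credentials it answers with a 401 and a WWW-Authenticate challenge. The function name check_basic_auth, the realm and the credentials are hypothetical. A real setup would normally rely on the web server’s built-in authentication rather than hand-written code.

```python
import base64
import hmac

# Hypothetical credentials for illustration only.
EXPECTED_USER = "dev"
EXPECTED_PASSWORD = "s3cret"


def check_basic_auth(headers):
    """Return (status, response_headers) for a request with the given headers."""
    challenge = {"WWW-Authenticate": 'Basic realm="private"'}
    value = headers.get("Authorization", "")
    scheme, _, encoded = value.partition(" ")
    if scheme.lower() != "basic":
        return 401, challenge
    try:
        decoded = base64.b64decode(encoded, validate=True).decode("utf-8")
    except ValueError:
        # Covers malformed base64 as well as bytes that are not valid UTF-8.
        return 401, challenge
    user, sep, password = decoded.partition(":")
    # Constant-time comparison avoids leaking information through timing.
    user_ok = hmac.compare_digest(user.encode(), EXPECTED_USER.encode())
    password_ok = hmac.compare_digest(password.encode(), EXPECTED_PASSWORD.encode())
    if sep and user_ok and password_ok:
        return 200, {}
    return 401, challenge
```

Note that the 401 path is reached before any page content is served, so a visitor who is rejected never sees a privacy policy, while the request itself may already have been logged.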

Because of the rather user-unfriendly password protection dialog, it is obvious that the site is not meant for random visitors. However, the web server creates log files and stores the IP addresses of all visitors. This is especially important for detecting brute force attacks against the password protection. I suppose this constitutes a legitimate purpose; however, the user is not informed about it. Does this then violate GDPR?

You might think that this example is far-fetched. But I think you can extend this even further. Besides password-protected websites, there are also a lot of websites out there that simply respond with a 403 or 404. This happens easily if you add a wildcard DNS record. Every subdomain will point to the same server, which is unlikely to respond in a sensible way for every subdomain. Take GitLab Pages as an example. Go to does-not-exist.sauerburger.io. There is no privacy policy published since the site is not meant to exist; however, the IP addresses of visitors are potentially stored. Is this a violation of GDPR?

Non-HTTP services

GDPR is mostly discussed in the context of web applications and websites published via HTTP(S). However, this is just one protocol. What about IMAP and SMTP? Mail servers create log files. This is important for the security of mail servers and therefore constitutes a legitimate purpose. However, how can you inform users about this? Where should you put the privacy policy? If the mail server is rented to clients, it is easy to inform all clients about the storage of IP addresses. However, I don’t see a practical way to inform random clients that try to connect to the server. These clients are most likely spam bots, but how can I know?

This line of thought is not limited to mail servers. Think about SSH. Connection attempts to SSH daemons are logged. In most cases, the SSH service is not meant to be publicly used. Most (failed) login attempts are probably from bots. Do you need to inform brute force attackers about your privacy policy? In the case of SSH, some people configure their server to store IP addresses for a very long time. Think about fail2ban, which might add firewall rules to block offenders for a long time.

Emails

If personal information is transmitted, it has to be encrypted. A lot of websites switched to HTTPS shortly before the deadline of GDPR. This means any personal information exchanged via HTTPS between the client and the server is encrypted.

Emails are usually transmitted and forwarded from server to server via SMTP on port 25. This port does not require encryption. Most mail servers support STARTTLS by now, but there is no guarantee that it is actually used when you send an email to someone. Even if you encrypt the content of the message, it will be very difficult to hide the names and email addresses of the sender and recipient. Is it allowed for someone to send me an email without knowing whether the transport channel is encrypted at all times?

What about storing emails locally? I wouldn’t entrust my phone with sensitive personal information; however, I store emails and email addresses from over ten years ago. Do I have to delete them after a certain time? On the device only or also on the IMAP server?

Data removal

GDPR gives people the right to have their personal information deleted. Furthermore, people even have a broader right of data removal under the “Right to be forgotten”. This right easily conflicts with core principles of some web services.

Git

I have loved Git’s data model since I was first introduced to it. Git in combination with a powerful web interface changed my way of coding and made collaborating with other people a breeze. Assume you host a publicly available Git-based web interface, and a client wants his personal data removed from the site. Removing a user’s account should be fairly easy. However, does this mean that all issues and comments have to be removed as well? This would certainly have the potential to break the development process of a whole project. Even worse: is it necessary to remove all the commits of that user because commits contain the user’s name and email address? Removing or modifying commits in a repository is nearly impossible. Violently ripping commits out of a repository changes all subsequent commit hashes. Any local clone or public fork would be out of sync, which leads to conflicts, and they would still contain the personal information. If the original repository contained signed commits, you would lose all the signatures.
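The reason a changed name or email address ripples through the whole history can be sketched in a few lines. In a SHA-1 based repository, a commit ID is the hash of the commit object, which contains the author line and the ID of the parent commit. The helper names commit_hash and history, the example identities and the fixed timestamp are assumptions for illustration, and the tree is simply Git’s well-known empty tree.

```python
import hashlib

# The well-known ID of Git's empty tree, used here as a placeholder tree.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def commit_hash(tree, parent, author, message):
    """Hash a commit object the way Git does: SHA-1 over a header plus the body."""
    lines = ["tree " + tree]
    if parent:
        lines.append("parent " + parent)
    # Fixed timestamp and timezone so the example is reproducible.
    lines.append("author " + author + " 1527811200 +0000")
    lines.append("committer " + author + " 1527811200 +0000")
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode()
    header = b"commit " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def history(first_author):
    """Build a two-commit history and return both commit IDs."""
    first = commit_hash(EMPTY_TREE, None, first_author, "Initial commit")
    second = commit_hash(EMPTY_TREE, first, "Bob <bob@example.org>", "Second commit")
    return first, second


original = history("Alice <alice@example.org>")
rewritten = history("Anonymous <anonymous@example.org>")

# Only the first author changed, yet both commit IDs differ.
print(original[0] != rewritten[0], original[1] != rewritten[1])
```

Bob’s commit is untouched in content, but its ID changes anyway because it embeds the ID of its parent. The same effect repeats for every later commit.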

You see that removing personal data from a repository is a nightmare. It is not clear to me whether the fact that the user entered this information voluntarily changes the situation, because under GDPR the processing of private data must in most cases have the person’s consent anyway.

Let’s look at the privacy policy of famous providers of Git-repository-based hosting services. GitHub’s privacy statement does not mention the above dilemma as far as I can tell. GitLab considers this topic in their privacy policy.

Please note that due to the open source nature of our products, services, and community, we may retain limited personally-identifiable information indefinitely. For example, if you provide your information in connection with a blog post or comment, we may display that information even if you have deleted your account as we do not automatically delete community posts. Also, as described in our Terms of Use, if you contribute to a GitLab project and provide your personal information in connection with that contribution, that information (including your name) will be embedded and publicly displayed with your contribution and we will not be able to delete or erase it because doing so would break the project code.

The topic is also addressed in their terms of use.

As part of your voluntary contribution to any GitLab project, by agreeing to these terms, you are acknowledging and agreeing that your name and email address will become embedded and part of the repository, which may be publicly available. You understand the removal of this information would be impermissibly destructive to the project and the interests of all those who contribute, utilize, and benefit from it. Therefore, in consideration of your participation in any project, you understand that retaining your name and email address, as described above, does not require your consent and that the right of erasure, as spelled out in the GDRP Article 17 (1) b does not apply. The legal basis for our lawful processing of this personal data is Article 6 (1) f (“processing is necessary for the purposes of the legitimate interests pursued by the controller”).

In short, they say that the right of erasure does not apply and that they retain the information for legitimate purposes. As I said, I am not a lawyer, but I don’t know if it is that easy to curtail someone’s rights.

There are two interesting issues on GitLab about this topic:

Keyservers

Another hot topic is key servers. When personal data is transmitted, it must be encrypted. One possible encryption method is OpenPGP and its GnuPG implementation. Ideally, OpenPGP requires the prior distribution and validation of public keys before you communicate secretly. Key servers are an important tool here. They store public keys. Many key servers exist around the world and synchronize their databases. Anybody can enter keys. It is not possible to delete keys (and it should not be possible).

There are two issues now. If you add your public key including your email address to a key server, you cannot delete it anymore. It will be synchronized to other key servers around the world really quickly. This means that the information is transmitted to servers outside the EU which do not comply with the Privacy Shield framework.

It gets even worse. Anybody can submit anyone’s public key and email address to a key server. The information is published and shared without the user’s consent.

I don’t see how this model can comply with GDPR. However, the operation of key servers is essential for OpenPGP which plays an important role in terms of privacy protection.

There is a discussion about this on “all the cool pgp [mailing] lists”.

Block Chains

“Block chain” has certainly become a buzzword. The issue with block chains is similar to the previous one. Many chains require a proof of work. Roughly, this means that a new block is added to the chain only if the author can prove that he (his CPU) had to work for quite some time to create the block. This mechanism is used to prevent the retroactive alteration of previous blocks. And here we are back to the previous problem: it is impossible (or very, very CPU expensive) to delete or modify previous blocks. Previous blocks might contain personal information. Consider the Bitcoin block chain as an example. I think Bitcoin addresses, like IP addresses, constitute personal data. Bitcoin blocks clearly contain addresses, which makes me question how Bitcoin is considered under GDPR.
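A toy version of the mechanism shows why editing an old block is so expensive. In this sketch, a block is accepted only if its hash starts with a number of zeros, and every block includes the hash of its predecessor. The block layout, the function names and the very low difficulty are assumptions for illustration. This is not the actual Bitcoin block format.

```python
import hashlib

# Number of leading zero hex digits required; tiny on purpose so it runs quickly.
DIFFICULTY = 3
GENESIS_PREV = "0" * 64


def block_hash(prev_hash, data, nonce):
    """Hash the previous block's hash together with this block's data and nonce."""
    return hashlib.sha256(f"{prev_hash}|{data}|{nonce}".encode()).hexdigest()


def mine(prev_hash, data):
    """Try nonces until the block hash meets the difficulty target."""
    nonce = 0
    while not block_hash(prev_hash, data, nonce).startswith("0" * DIFFICULTY):
        nonce += 1
    digest = block_hash(prev_hash, data, nonce)
    return {"prev": prev_hash, "data": data, "nonce": nonce, "hash": digest}


def build_chain(payloads):
    """Mine one block per payload, each linked to the block before it."""
    chain, prev = [], GENESIS_PREV
    for data in payloads:
        block = mine(prev, data)
        chain.append(block)
        prev = block["hash"]
    return chain


def is_valid(chain):
    """Check the links between blocks and the proof of work of every block."""
    prev = GENESIS_PREV
    for block in chain:
        expected = block_hash(block["prev"], block["data"], block["nonce"])
        if block["prev"] != prev or block["hash"] != expected:
            return False
        if not expected.startswith("0" * DIFFICULTY):
            return False
        prev = block["hash"]
    return True


chain = build_chain(["alice pays bob", "bob pays carol", "carol pays dave"])
```

Changing the data of the first block invalidates its stored hash. Repairing it means mining that block again, which changes its hash and therefore forces every later block to be mined again as well.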

The problem is not new. There is already illegal content in the Bitcoin block chain, which might make its storage (not even its distribution!) illegal in Germany.

Summary

In summary, I think there are two main issues:

Necessary logging of IP addresses by any kind of service without the user’s knowledge.
Removal of personal information in systems that were designed to aggregate information without the possibility of removal.

I don’t think that either of these issues is a prime target of GDPR; however, courts will decide about this.

Update 2021-06-27

The SKS keyserver pool website announced on 2021-06-21:

Due to even more GDPR takedown requests, the DNS records for the pool will no longer be provided at all.

It seems like my predictions or worries about data privacy with key servers were on point.