Exactly one year ago, Jeremy Howard published a proposal to make the web more accessible to AI and, in particular, to LLMs. How many of the top one million websites adopt this approach?

The proposed standard suggests creating a file at the root of a website, e.g., /llms.txt, intended to be consumed by LLMs and AI tools, loosely taking inspiration from /robots.txt. The /llms.txt serves as an entry point or site map of the website, potentially linking to other pages. As the source code of a webpage is often very verbose and its content is mingled with style sheets, JavaScript, and HTML markup, parsing the source with an LLM might exceed the LLM’s context window or consume too many tokens. Therefore, the idea is to use Markdown for the /llms.txt entry point and to link to Markdown versions of each page.
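For illustration, a minimal /llms.txt in the proposed format might look like the following sketch (the project name, summary, and URLs are made up):

```markdown
# Example Project

> A one-sentence summary of what the site is about.

## Docs

- [Quick start](https://example.com/docs/quickstart.md): how to get started
- [API reference](https://example.com/docs/api.md): detailed function docs

## Optional

- [Changelog](https://example.com/changelog.md)
```

The file is an H1 title, a blockquote summary, and sections of Markdown link lists pointing to Markdown versions of the pages.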

How many websites adopted this approach?

Let’s measure.

Starting with a dataset of the 10 million highest-ranked domains from Open Page Rank, we can send a GET request to /llms.txt for each of them and see how many web servers respond with an HTTP success code. It turns out that many web servers respond with a 200 status code but actually send a page informing the client that the requested page doesn’t exist. For example, the top-ranked domain, facebook.com, behaves this way: a GET request to https://facebook.com/llms.txt returns a 200 status code, but the page says the content is not available.

Unusual behavior of facebook.com

A typical feature of these fake-success responses is that the response body is an HTML document. However, the Content-Type header field alone is sometimes not a reliable discriminator for detecting fake-success pages. A good way to distinguish HTML content from Markdown is to compare the number of occurrences of the left angle bracket (<) with that of the left square bracket ([): the former is ubiquitous in HTML, while the latter is common in Markdown. I came up with the following somewhat arbitrary rules. Only if a response satisfies all of them do I count it as a valid /llms.txt response.

  • HTTP response code must be less than 400
  • Content-Type header must start with text/plain
  • The web server must accept the connection within 3 seconds
  • The response must arrive within 10 seconds
  • The site must support HTTPS
  • The response content must be longer than 500 chars
  • There must be more left square brackets than left angle brackets

To speed up the analysis, I limited it to the top one million domains. I ran the analysis on September 2, 2025.

Results

Domains ranked high in the domain ranking might be faster to adopt new technological ideas. To test this, I counted the fraction of domains with /llms.txt among the top n domains. The result is shown in the following chart.
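The cumulative fraction for each cutoff n can be computed in a single pass over the ranked results. A sketch (the function name and input representation are mine, not taken from the published analysis code):

```python
def adoption_by_rank(has_llms_txt: list[bool], cutoffs: list[int]) -> dict[int, float]:
    """Fraction of domains with a valid /llms.txt among the top n domains.

    has_llms_txt[i] holds the check result for the domain ranked i + 1;
    cutoffs lists the values of n to evaluate.
    """
    fractions: dict[int, float] = {}
    hits = 0
    i = 0
    for n in sorted(cutoffs):
        # Extend the running hit count up to rank n, then record the fraction.
        while i < n and i < len(has_llms_txt):
            hits += has_llms_txt[i]
            i += 1
        fractions[n] = hits / n
    return fractions
```

Sorting the cutoffs lets each prefix count build on the previous one, so the whole curve costs one pass over the million results.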

Bar chart of the fraction of domains with llms.txt among the top n domains

Based on these results, the adoption rate is largest among the top 300 domains, at about 4 %. The fraction decreases continuously further down the ranking: across the top one million domains, the overall adoption rate drops to around 1.2 %, which corresponds to 12,174 domains in total.

The crawler, the analysis code, and the result dataset are available in a Git repository.