- robots.txt directives
- The set of instructions in a robots.txt file: User-agent, Allow, Disallow, Crawl-delay, and Sitemap. Each directive is a name and a value separated by a colon. (A complete example file appears after this glossary.)
- User-agent
- An identifier string declared in robots.txt that specifies which crawler the rules that follow apply to. '*' matches all crawlers; a crawler that finds a group naming it specifically (Googlebot, Bingbot) follows that group instead of the wildcard group.
- Allow
- A directive that explicitly permits a URL pattern to be crawled, typically used to carve an exception out of a broader Disallow rule. Google and RFC 9309 honor Allow; some older crawlers do not.
- Disallow
- A directive that tells the matching user-agent not to fetch URLs matching the pattern. An empty value (Disallow:) disallows nothing and is effectively a no-op.
- Sitemap declaration
- A line of the form 'Sitemap: https://example.com/sitemap.xml' that tells crawlers where to find the XML sitemap. Independent of User-agent groups.
- Wildcard (*)
- A pattern character that matches any sequence of characters in a URL path. Google supports * in both Allow and Disallow values.
- End-of-URL ($)
- A pattern character that anchors the match to the end of the URL. 'Disallow: /*.pdf$' blocks URLs ending in .pdf but not /file.pdf?query=1, because the query string means the URL no longer ends in .pdf.
- Crawl-delay
- A directive specifying the minimum number of seconds between successive crawler requests. Honored by Bing and Yandex but ignored by Google, which adjusts its crawl rate automatically; it is not part of RFC 9309.
- RFC 9309
- The IETF standard published in September 2022 that formalized the robots.txt protocol after nearly three decades of de facto implementation. It defines parsing rules, longest-match precedence, and parsing limits; a sketch of the precedence rule follows the glossary.
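
As an illustration of how these directives fit together, here is a hypothetical robots.txt; the host, paths, and delay value are made up for this example and are not recommendations.

```
# Group that applies to every crawler not matched by a more specific group
User-agent: *
Disallow: /tmp/
Disallow: /private/
# Longer (more specific) Allow carves an exception out of the Disallow above
Allow: /private/press-kit/
# Block URLs that end in .pdf
Disallow: /*.pdf$
# Ignored by Google; honored by some other crawlers
Crawl-delay: 10

# Googlebot matches this group and follows only these rules
User-agent: Googlebot
Disallow: /no-google/

# Sitemap lines are independent of the groups above
Sitemap: https://example.com/sitemap.xml
```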
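
Precedence between Allow and Disallow follows a longest-match rule under RFC 9309: among all rules whose pattern matches the URL, the one with the longest pattern wins, and on a tie the less restrictive Allow wins. Below is a minimal Python sketch of that rule for a single, already-selected User-agent group; the function names and sample rules are illustrative assumptions, and production parsers handle many more edge cases (percent-encoding, file-size limits, group selection).

```python
import re

def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex:
    '*' matches any character sequence, a trailing '$' anchors the
    match to the end of the URL, everything else is literal."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(url_path: str, rules: list[tuple[str, str]]) -> bool:
    """Longest-match precedence over one User-agent group.
    `rules` is a list of (directive, pattern) pairs; the matching rule
    with the longest pattern wins, Allow wins ties, and a URL that
    matches no rule is allowed."""
    best_len, best_directive = -1, "allow"
    for directive, pattern in rules:
        if not pattern:  # empty Disallow: matches nothing
            continue
        if rule_to_regex(pattern).match(url_path):
            length = len(pattern)
            if length > best_len or (length == best_len and directive == "allow"):
                best_len, best_directive = length, directive
    return best_directive == "allow"

# Hypothetical group mirroring the example file above.
group = [
    ("disallow", "/private/"),
    ("allow", "/private/press-kit/"),
    ("disallow", "/*.pdf$"),
]
print(is_allowed("/private/press-kit/logo.png", group))  # True: longer Allow wins
print(is_allowed("/private/report.pdf", group))          # False: Disallow patterns match
print(is_allowed("/file.pdf?query=1", group))            # True: query string defeats the '$' anchor
```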