TL;DR: A robots.txt file at the site root tells crawlers which URLs they may or may not crawl. Syntax: User-agent, Allow, Disallow, Sitemap. Common mistakes: blocking CSS/JS, misusing wildcards.
Basic Syntax
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /admin/public/
Sitemap: https://example.com/sitemap.xml
Common Directives
- User-agent: the bot the rules apply to; * matches all bots.
- Disallow: path prefixes the bot should not crawl.
- Allow: exceptions within a disallowed path.
- Sitemap: absolute URL of the sitemap; can appear anywhere in the file.
- Crawl-delay: seconds to wait between requests (Googlebot ignores it; Bingbot respects it).
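The Allow/Disallow interplay above can be checked with Python's stdlib parser. Note that `urllib.robotparser` applies first-match semantics (unlike Google's longest-match rule), so the Allow exception is listed before the broader Disallow here; the rules and URLs are illustrative.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: /admin/ is blocked except for /admin/public/.
# Allow comes first because Python's parser uses first-match, not
# longest-match precedence.
rules = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/admin/secret"))       # blocked by Disallow
print(rp.can_fetch("*", "https://example.com/admin/public/page"))  # Allow exception wins
print(rp.can_fetch("*", "https://example.com/blog/post"))          # no rule matches -> allowed
```

Because real crawlers differ in precedence rules, treat this as a quick sanity check rather than a guarantee of how a given bot will behave.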
Specific Bots
- Googlebot
- Bingbot
- Slurp (Yahoo)
- DuckDuckBot
- Baiduspider
- GPTBot (OpenAI)
- ClaudeBot (Anthropic)
- CCBot (Common Crawl)
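Per-bot groups let you target these crawlers individually. For example, a site opting out of AI-training crawlers while leaving search bots unaffected might use something like the sketch below (whether a given bot honors the rules is entirely up to that bot):

```text
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow:
```

An empty Disallow in the final group explicitly allows everything for all other bots.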
Common Use Cases
Block admin areas
Disallow: /wp-admin/
Block search results
Disallow: /?s=
Allow only Googlebot to crawl
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
Critical Mistakes
- Blocking /wp-content/: breaks images, CSS, and JS that pages depend on.
- Blocking CSS/JS: Googlebot can't render the page, which hurts indexing.
- Wildcard misuse: only * and $ are supported, and not by every crawler.
- Case sensitivity: URL paths are case-sensitive, so /Admin/ and /admin/ are different rules.
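The "allow only Googlebot" pattern above can be sanity-checked with Python's stdlib parser, which matches a group to the requesting user-agent and falls back to the * group otherwise (the URL is illustrative):

```python
from urllib.robotparser import RobotFileParser

# Block everyone by default, then open the site for Googlebot only.
rules = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot matches its own group; any other bot falls into the * group.
print(rp.can_fetch("Googlebot", "https://example.com/page"))  # allowed
print(rp.can_fetch("Bingbot", "https://example.com/page"))    # blocked
```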
Testing
In Google Search Console: Settings → robots.txt report. Verify the file is fetched successfully and check how individual URLs are treated.