Robots.txt Generator: Control Search Engine Crawlers with Custom Rules

Your robots.txt file is the first thing most well-behaved web crawlers request when they visit your site. It determines which paths they may crawl, can ask some crawlers to slow down, and tells them where your sitemap lives. Getting it wrong means either blocking important content or letting crawlers into pages you would rather they skip.

Our generator creates valid robots.txt files for common use cases (standard websites, e-commerce stores, blogs, or AI bot blocking) with presets and common path shortcuts.

Formula

robots.txt format:
User-agent: [crawler name or *]
Allow: [path to allow]
Disallow: [path to block]
Crawl-delay: [seconds]
Sitemap: [absolute URL to sitemap.xml]
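To make the structure concrete, here is a minimal Python sketch that renders this format from a list of rule groups. It is not the generator's actual code; the `Group` class, its field names, and `render_robots_txt` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Group:
    """One User-agent group (hypothetical structure, for illustration only)."""
    user_agent: str = "*"
    allow: list[str] = field(default_factory=list)
    disallow: list[str] = field(default_factory=list)
    crawl_delay: Optional[int] = None


def render_robots_txt(groups: list[Group], sitemaps: list[str]) -> str:
    """Render groups and sitemap URLs as robots.txt text."""
    blocks = []
    for group in groups:
        lines = [f"User-agent: {group.user_agent}"]
        lines += [f"Allow: {path}" for path in group.allow]
        lines += [f"Disallow: {path}" for path in group.disallow]
        if group.crawl_delay is not None:
            lines.append(f"Crawl-delay: {group.crawl_delay}")
        blocks.append("\n".join(lines))
    if sitemaps:
        # Sitemap lines are not tied to a group, so they go in their own block.
        blocks.append("\n".join(f"Sitemap: {url}" for url in sitemaps))
    # Groups are separated by a blank line; the file ends with a newline.
    return "\n\n".join(blocks) + "\n"


example = render_robots_txt(
    [Group(disallow=["/admin/", "/private/"])],
    ["https://yoursite.com/sitemap.xml"],
)
print(example)
```

Keeping each group as its own block separated by a blank line mirrors how most parsers read the file: one or more User-agent lines followed by the rules that apply to them.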

Robots.txt Best Practices

Always include:
- User-agent: * as a catch-all
- Sitemap URL for search engines
- Disallow for admin/private paths

Avoid:
- Blocking CSS/JS files (Google needs them to render pages)
- Blocking all of /api/ if some endpoints return public data
- Using robots.txt as a security measure (it is public)
- Relying on it to prevent indexing (use noindex instead)

Blocking AI Training Crawlers

Add separate blocks for AI training bots:
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: CCBot
Disallow: /

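This asks those crawlers not to collect your content for AI training. It only works for bots that honor robots.txt, and it does not remove content that has already been collected. Use the User-agent dropdown to configure rules per bot, then combine the sections manually.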
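One way to check a combined file before deploying it is Python's standard `urllib.robotparser`. The sketch below assumes a file with two AI bot blocks plus a catch-all group; the `is_allowed` helper and the sample paths are illustrative. Note that Python's parser does simple prefix matching, which is enough for rules like these.

```python
from urllib.robotparser import RobotFileParser

# Sample file: two AI bot blocks plus a catch-all group (illustrative content).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(user_agent: str, path: str) -> bool:
    """Return True if the parsed rules let user_agent fetch path."""
    return parser.can_fetch(user_agent, "https://yoursite.com" + path)


for bot in ("GPTBot", "CCBot", "Googlebot"):
    print(bot, is_allowed(bot, "/blog/post"))
```

With this file, the AI bots are refused everywhere, while crawlers without their own block fall back to the catch-all group and are only kept out of /admin/.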

Practical Examples

Standard website

1. Preset: Standard Website
2. Disallow: /admin/, /private/, /api/
3. Sitemap: your sitemap.xml URL
4. Deploy: to yoursite.com/robots.txt

E-commerce store

1. Preset: E-commerce Store
2. Allow: /products/, /categories/
3. Disallow: /checkout/, /cart/, /account/
4. Crawl-delay: 2 (reduce server load)

Frequently Asked Questions

What is a robots.txt file?

A robots.txt file is a plain text file placed at the root of your website (https://yoursite.com/robots.txt) that tells web crawlers which pages or directories they may or may not access. It follows the Robots Exclusion Protocol (REP), standardized as RFC 9309. Note: it is a request, not an enforcement mechanism. Malicious bots ignore it.

Does robots.txt prevent pages from being indexed?

Disallowing a URL in robots.txt prevents compliant crawlers from fetching it, but does not guarantee the page won't be indexed. Search engines can still index the URL if other sites link to it, even without crawling it. To reliably prevent indexing, use the noindex meta tag or the X-Robots-Tag response header on the page itself. The page must then stay crawlable: if robots.txt blocks it, the crawler never fetches the page and never sees the noindex instruction.
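As a rough illustration of the header approach, the sketch below uses Python's standard `http.server` to attach `X-Robots-Tag: noindex` to selected paths. The `NOINDEX_PREFIXES` list, the `robots_headers` helper, and the placeholder page body are assumptions for the example; a real site would normally set this header in its web server or framework configuration.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical list of path prefixes that should stay out of search indexes.
NOINDEX_PREFIXES = ("/account/", "/search/")


def robots_headers(path: str) -> dict[str, str]:
    """Return extra response headers for a request path."""
    if path.startswith(NOINDEX_PREFIXES):
        return {"X-Robots-Tag": "noindex"}
    return {}


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<!doctype html><title>demo</title>"  # placeholder page
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        # Add noindex only where needed; these paths must remain crawlable.
        for name, value in robots_headers(self.path).items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, pass `Handler` to `http.server.HTTPServer(("127.0.0.1", 8000), Handler)` and call `serve_forever()`, then inspect the response headers for a path such as /account/orders.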

What is User-agent: * and when should I use it?

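User-agent: * is the catch-all group: its rules apply to every crawler that does not have a more specific group of its own. Use it for general rules. You can also create specific blocks for individual crawlers (e.g., User-agent: Googlebot) to give different search engines different rules. A crawler that matches a specific block follows only that block and ignores the * group, so repeat any shared rules in each specific block. Multiple blocks are supported: just add separate sections in the file.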

What is Crawl-delay and should I use it?

Crawl-delay asks crawlers to wait a given number of seconds between requests so they do not overwhelm your server. It is a non-standard directive (not part of RFC 9309), so support varies: Googlebot ignores it and adjusts its crawl rate automatically based on how your server responds, while Bingbot and some other crawlers may respect it. Use it if your server is experiencing high load from bot traffic.
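From the crawler's side, honoring the directive might look like the sketch below, which reads the delay with `RobotFileParser.crawl_delay` from Python's standard library. The sample file, the `wait_seconds` helper, and the one-second fallback are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample file with a per-bot delay and a catch-all delay (illustrative values).
ROBOTS_TXT = """\
User-agent: bingbot
Crawl-delay: 5

User-agent: *
Crawl-delay: 2
Disallow: /checkout/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def wait_seconds(user_agent: str, fallback: float = 1.0) -> float:
    """Seconds a polite crawler would pause between requests."""
    delay = parser.crawl_delay(user_agent)  # None when no delay applies
    return float(delay) if delay is not None else fallback


print(wait_seconds("bingbot"))
print(wait_seconds("SomeOtherBot"))
```

A crawler that supports the directive would sleep for this many seconds between requests; one that ignores it, as Googlebot does, simply never consults the value.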

How do I block AI training bots like GPTBot?

Add a separate User-agent block for the specific AI bot: User-agent: GPTBot followed by Disallow: /. Known AI training crawlers include GPTBot (OpenAI), Google-Extended (Google AI), Claude-Web (Anthropic), and CCBot (Common Crawl). Crawler names change over time, so check each vendor's documentation for its current user-agent token. You can add multiple user-agent blocks in the same robots.txt file.

What paths should I always disallow?

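No path has to be blocked on every site, but common candidates are: /admin/ (admin panels), /wp-admin/ and /wp-login.php (WordPress), /checkout/ and /cart/ (e-commerce), /account/ and /login/ (user accounts), /api/ (API endpoints, unless some return public data), /search/ (infinite crawling risk), /tmp/ and /private/ (sensitive dirs), and file type patterns such as /*.pdf$. In a pattern, * matches any sequence of characters and $ anchors the end of the URL. Be careful with broad patterns: /*.xml$ would also block your sitemap.xml.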
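To see what a file type pattern actually covers, the following sketch translates a robots.txt path pattern into a regular expression, treating `*` as "any characters" and a trailing `$` as "end of URL". The function names are hypothetical, and this is a simplified model of the matching rules rather than any search engine's implementation.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Convert a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text and turn each * into "match anything".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Without a trailing $, a rule is a prefix match.
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    """Return True if path falls under the given rule pattern."""
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/*.pdf$", "/files/report.pdf"))
print(matches("/*.xml$", "/sitemap.xml"))
print(matches("/admin/", "/administrator"))
```

Running a few of your own URLs through a check like this is a quick way to catch a pattern that is broader than intended.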

Does the order of Allow and Disallow rules matter?

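For Google and other crawlers that follow RFC 9309, order does not matter: the most specific rule (the one with the longest matching path) wins, and if an Allow and a Disallow rule are equally specific, Allow wins. Some older or simpler crawlers instead apply the first matching rule. Best practice: list Allow rules before Disallow rules, and be specific. For example, Allow: /admin/public/ before Disallow: /admin/ ensures that both kinds of crawler can access the public subfolder.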
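The difference between the two strategies can be sketched in a few lines of Python. Both functions below are simplified illustrations that use plain prefix matching with no wildcards; the function names and the rule representation as (kind, prefix) tuples are assumptions for the example.

```python
Rule = tuple[str, str]  # ("allow" or "disallow", path prefix)


def longest_match_allowed(rules: list[Rule], path: str) -> bool:
    """Most specific rule wins; Allow wins a tie; no match means allowed."""
    best_len = -1
    allowed = True
    for kind, prefix in rules:
        if prefix and path.startswith(prefix):
            is_allow = kind == "allow"
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len = len(prefix)
                allowed = is_allow
    return allowed


def first_match_allowed(rules: list[Rule], path: str) -> bool:
    """First rule that matches wins, in file order; no match means allowed."""
    for kind, prefix in rules:
        if prefix and path.startswith(prefix):
            return kind == "allow"
    return True


# Disallow listed first, the order the best practice warns against.
rules = [("disallow", "/admin/"), ("allow", "/admin/public/")]
print(longest_match_allowed(rules, "/admin/public/help"))
print(first_match_allowed(rules, "/admin/public/help"))
```

With Disallow listed first, the longest-match crawler still reaches /admin/public/help while the first-match crawler is blocked, which is why putting the Allow rule first is the safer ordering.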

Where do I put the robots.txt file?

The file must be at the exact root of your host: https://yoursite.com/robots.txt. It only controls crawling for that host (the same protocol, hostname, and port) and the paths under it, so a subdomain such as shop.yoursite.com needs its own file. A file at https://yoursite.com/blog/robots.txt has no effect. Most hosting platforms, CMSs (WordPress, Shopify), and web frameworks support placing it in the document root.
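A small sketch of how a crawler might locate the file for any page: keep the scheme and host, replace everything else with /robots.txt. The `robots_url` helper name is illustrative; the parsing uses Python's standard `urllib.parse`.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); drop path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://yoursite.com/blog/post?id=1"))
print(robots_url("https://shop.yoursite.com/cart"))
```

The two calls return different URLs, which shows why each subdomain needs its own file.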
