- User-agent: the robot the following rule applies to
- Disallow: the URL you want to block
Each section in the robots.txt file is separate and does not build upon previous sections. For example:
Code: Select all
User-agent: *
Disallow: /folder1/
User-agent: Googlebot
Disallow: /folder2/
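As a rough illustration of this point, Python's standard-library urllib.robotparser can evaluate the example above; it shows that each crawler follows only the section that applies to it (the URLs are hypothetical):
Code: Select all
from urllib.robotparser import RobotFileParser

# The example robots.txt above, as a list of lines.
rules = [
    "User-agent: *",
    "Disallow: /folder1/",
    "",
    "User-agent: Googlebot",
    "Disallow: /folder2/",
]

rp = RobotFileParser()
rp.parse(rules)

# Googlebot obeys only its own section: /folder1/ is allowed, /folder2/ is not.
print(rp.can_fetch("Googlebot", "http://www.example.com/folder1/page.html"))  # True
print(rp.can_fetch("Googlebot", "http://www.example.com/folder2/page.html"))  # False

# Any other bot falls back to the User-agent: * section.
print(rp.can_fetch("SomeOtherBot", "http://www.example.com/folder1/page.html"))  # False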
User-agents and bots
A user-agent is a specific search engine robot. The Web Robots Database lists many common bots. You can set an entry to apply to a specific bot (by listing the name) or you can set it to apply to all bots (by listing an asterisk). An entry that applies to all bots looks like this:
Code: Select all
User-agent: *
The Disallow line lists the pages you want to block. You can list a specific URL or a pattern. The entry should begin with a forward slash (/).
- To block the entire site, use a forward slash.
Code: Select all
Disallow: /
- To block a directory and everything in it, follow the directory name with a forward slash.
Code: Select all
Disallow: /junk-directory/
- To block a page, list the page.
Code: Select all
Disallow: /private_file.html
- To remove a specific image from Google Images, add the following:
Code: Select all
User-agent: Googlebot-Image
Disallow: /images/dogs.jpg
- To remove all images on your site from Google Images:
Code: Select all
User-agent: Googlebot-Image
Disallow: /
- To block files of a specific file type (for example, .gif), use the following:
Code: Select all
User-agent: Googlebot
Disallow: /*.gif$
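If you want to sanity-check entries like these before publishing them, urllib.robotparser can evaluate the plain prefix rules; note that it does not understand Google's * and $ extensions, so this sketch only covers the literal entries (the URLs are hypothetical):
Code: Select all
from urllib.robotparser import RobotFileParser

# A robots.txt combining the literal (non-wildcard) rules above.
rules = [
    "User-agent: *",
    "Disallow: /junk-directory/",
    "Disallow: /private_file.html",
]

rp = RobotFileParser()
rp.parse(rules)

base = "http://www.example.com"
print(rp.can_fetch("*", base + "/junk-directory/old.html"))  # False: the whole directory is blocked
print(rp.can_fetch("*", base + "/private_file.html"))        # False: the single page is blocked
print(rp.can_fetch("*", base + "/index.html"))               # True: everything else is allowed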
Sitemap files can also be referenced from robots.txt. Major search engines such as Google support this feature.
Code: Select all
Sitemap: http://www.example.com/sitemap-host1.xml
Sitemap: http://www.example.com/sitemap-host2.xml
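As a small sketch, these Sitemap lines can also be read back with urllib.robotparser, which exposes them via site_maps() in Python 3.8 and later (the robots.txt URL is hypothetical):
Code: Select all
from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt.
rp = RobotFileParser("http://www.example.com/robots.txt")
rp.read()

# site_maps() returns the URLs from any Sitemap: lines, or None if there are none.
print(rp.site_maps())
# e.g. ['http://www.example.com/sitemap-host1.xml',
#       'http://www.example.com/sitemap-host2.xml']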
Not all search engines support pattern matching, but Google does.
- To match a sequence of characters, use an asterisk (*). For instance, to block access to all subdirectories that begin with private:
Code: Select all
User-agent: Googlebot
Disallow: /private*/
- To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):
Code: Select all
User-agent: Googlebot
Disallow: /*?
- To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:
Code: Select all
User-agent: Googlebot
Disallow: /*.xls$
You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain one to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:
Code: Select all
User-agent: *
Allow: /*?$
Disallow: /*?
The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).
The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
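To make the * and $ semantics concrete, here is a small Python sketch that translates such patterns into regular expressions; it is only an illustration of the matching rules described above, not Google's actual implementation:
Code: Select all
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex.

    '*' matches any run of characters; '$' anchors the end of the URL.
    Patterns without '$' are ordinary prefix matches.
    """
    regex = ""
    for ch in pattern:
        if ch == "*":
            regex += ".*"
        elif ch == "$":
            regex += "$"
        else:
            regex += re.escape(ch)
    return re.compile(regex)

# Disallow: /*?   blocks any URL path containing a question mark.
assert pattern_to_regex("/*?").match("/page.php?sessionid=123")
# Allow: /*?$     matches only paths that end in a question mark.
assert pattern_to_regex("/*?$").match("/page.php?")
assert not pattern_to_regex("/*?$").match("/page.php?sessionid=123")
# Disallow: /*.xls$  blocks paths that end in .xls.
assert pattern_to_regex("/*.xls$").match("/reports/q1.xls")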