Simple robots.txt tutorial
Posted: Sat Mar 13, 2010 7:07 am
The simplest robots.txt file uses two rules:
- User-agent: the robot the following rule applies to
- Disallow: the URL you want to block
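For instance, a minimal robots.txt built from just these two rules might look like this (the directory name is only a placeholder):
Code: Select all
# Keep Googlebot out of one placeholder directory
User-agent: Googlebot
Disallow: /example-directory/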
Each section in the robots.txt file is separate and does not build upon previous sections. For example:
Code: Select all
User-agent: *
Disallow: /folder1/

User-Agent: Googlebot
Disallow: /folder2/

In this example only the URLs matching /folder2/ would be disallowed for Googlebot.
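If you wanted Googlebot to obey both restrictions, one option (a sketch, not part of the original example) is to repeat the general rule inside the Googlebot section:
Code: Select all
# Googlebot reads only its own section, so list both folders here
User-agent: Googlebot
Disallow: /folder1/
Disallow: /folder2/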
User-agents and bots
A user-agent is a specific search engine robot. The Web Robots Database lists many common bots. You can set an entry to apply to a specific bot (by listing the name) or you can set it to apply to all bots (by listing an asterisk). An entry that applies to all bots looks like this:
Code: Select all
User-agent: *
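By contrast, an entry that applies only to one named bot lists that bot instead of the asterisk. A sketch using Bingbot as the named bot and a placeholder path:
Code: Select all
User-agent: Bingbot
Disallow: /example-directory/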
Blocking user-agents
The Disallow line lists the pages you want to block. You can list a specific URL or a pattern. The entry should begin with a forward slash (/).
- To block the entire site, use a forward slash.
Code: Select all
Disallow: /
- To block a directory and everything in it, follow the directory name with a forward slash.
Code: Select all
Disallow: /junk-directory/
- To block a page, list the page.
Code: Select all
Disallow: /private_file.html
- To remove a specific image from Google Images, add the following:
Code: Select all
User-agent: Googlebot-Image
Disallow: /images/dogs.jpg
- To remove all images on your site from Google Images:
Code: Select all
User-agent: Googlebot-Image
Disallow: /
- To block files of a specific file type (for example, .gif), use the following:
Code: Select all
User-agent: Googlebot
Disallow: /*.gif$
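Several Disallow lines can also sit under a single User-agent line, so one group can combine rules like the ones above; a sketch with placeholder paths:
Code: Select all
User-agent: *
# One group, several blocked paths
Disallow: /junk-directory/
Disallow: /private_file.html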
Sitemap files can also be specified in robots.txt. Major search engines such as Google support this feature.
Code: Select all
Sitemap: http://www.example.com/sitemap-host1.xml
Sitemap: http://www.example.com/sitemap-host2.xml
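The Sitemap line does not belong to any particular User-agent section, so it can sit alongside your blocking rules; a sketch with placeholder paths and URLs:
Code: Select all
User-agent: *
Disallow: /junk-directory/

Sitemap: http://www.example.com/sitemap-host1.xml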
Pattern matching
Not all search engines support pattern matching (again, Google does support this feature).
- To match a sequence of characters, use an asterisk (*). For instance, to block access to all subdirectories that begin with private:
Code: Select all
User-agent: Googlebot
Disallow: /private*/
- To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):
Code: Select all
User-agent: Googlebot
Disallow: /*?
- To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:
Code: Select all
User-agent: Googlebot
Disallow: /*.xls$
You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:
Code: Select all
User-agent: *
Allow: /*?$
Disallow: /*?
The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).
The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
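To see how the two directives interact, here is a set of hypothetical URLs and how they would be treated under the rules above (comments only, for illustration):
Code: Select all
# http://www.example.com/page?         -> allowed: the URL ends with ?, so Allow: /*?$ applies
# http://www.example.com/page?sid=1234 -> blocked: the URL contains ? but does not end with it, so Disallow: /*? applies
# http://www.example.com/page          -> allowed: neither rule matches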