Thursday, February 3, 2011

Quick Robots.txt question

Will the following robots.txt syntax correctly block all pages on the site that end in "_.php"? I don't want to accidentally block other pages.

User-Agent: *    
Disallow: /*_.php

Also, am I allowed to have both "Allow: /" and "Disallow:" commands in the same robots.txt file? Thanks!

  • If you want certain files (but not others) excluded, the usual approach is to group them into a directory and disallow that directory, e.g.:

    User-agent: *
    Disallow: /cgi-bin/

    Per robotstxt.org, asterisks are not supported in the "Disallow" field:

    Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".

    Additionally, there is no such thing as an "Allow" field. Everything is allowed by default, and specific items are disallowed by exception.

    References:

    http://www.robotstxt.org/

    bccarlso : I'm developing some pages that I don't want crawled yet, and they exist on the server in the form of filename_.php instead of filename.php. How would you propose I block those files from being crawled? A meta tag in each one's <head>? (See the sketch after this thread.)
    Miles Erickson : Standard practice would be to build separate development and test environments. In a full-on professional operation, nothing is moved to an internet-facing production server until it is ready for prime time. Options include installing XAMPP on your development workstation and using it as a development environment (http://localhost), or creating a subdomain on your hosted site (http://test.yourdomain.com/), or perhaps even just uploading a copy of the site to a subfolder (http://www.yourdomain.com/test/). You can use .htpasswd to enable minimal security for an internet-facing test site.
    Miles Erickson : Another thought: it's also important to note that robots.txt is a voluntary system to "suggest" that certain pages should not be crawled. It may stop good crawlers, but it certainly will not stop evil ones.
    bccarlso : Thanks Miles. Yeah, I had heard that robots.txt is not a hard rule, and also know I should have developed in an offline or protected online environment. It was one of those "was just going to be one page but turned into the whole site" redesigns/redevelopments, and it was my fault for not moving it offline. Thanks for the help.
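
    Since the pages here are individual files, the meta tag bccarlso asks about would be a per-page noindex directive in each page's <head>. A minimal sketch (the comment's filename pattern is just a placeholder):

    <!-- placed in the <head> of each unfinished filename_.php page -->
    <meta name="robots" content="noindex, nofollow">

    Like robots.txt, this is only honored by well-behaved crawlers. Miles' .htpasswd suggestion, by contrast, actually enforces access control; on Apache it amounts to HTTP Basic Auth in an .htaccess file, roughly like this (the path and realm name are placeholders):

    AuthType Basic
    AuthName "Test site"
    AuthUserFile /full/path/to/.htpasswd
    Require valid-user
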
  • Miles' answer covers the standard. The most famous crawler, Googlebot, extends it and does understand Allow as well as (limited) pattern matching, as sketched below.

    I find Google's webmaster tools quite helpful. They have a tool devoted just to helping you build a correct robots.txt. You do need to have the pages (or at least stub test pages) uploaded before you can run a "robots.txt test", though.
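
    For example, Google's documented extensions add a "*" wildcard and a "$" end-of-URL anchor, so the goal in the original question could be written as follows (a sketch; only crawlers that support these extensions will honor it):

    User-agent: *
    Allow: /
    Disallow: /*_.php$

    The "$" keeps /filename.php from matching while /filename_.php is blocked, and Googlebot resolves the Allow/Disallow overlap in favor of the more specific rule. A crawler that implements only the original standard treats "Disallow: /*_.php$" as a literal path prefix, which matches nothing. Note also that a URL with a query string appended will not match the $-anchored pattern.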
