Thursday, February 3, 2011

Quick Robots.txt question

Will the following robots.txt syntax correctly block all pages on the site that end in "_.php"? I don't want to accidentally block other pages.

User-Agent: *    
Disallow: /*_.php

Also, am I allowed to have both "Allow: /" and "Disallow:" commands in the same robots.txt file? Thanks!

  • If you want certain files (but not others) excluded, the usual approach is to group them into a directory and disallow that directory, e.g.:

    User-agent: *
    Disallow: /cgi-bin/

    Per robotstxt.org, asterisks are not supported in the "Disallow" field:

    Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".

    Additionally, there is no such thing as an "Allow" field. Everything is allowed by default, and specific items are disallowed by exception.

    References:

    http://www.robotstxt.org/

    bccarlso : I'm developing some pages that I don't want crawled yet, and they exist on the server in the form of filename_.php instead of filename.php. How would you propose I block those files from being crawled? A meta tag in each one's <head>? (See the sketch after this thread.)
    Miles Erickson : Standard practice would be to build separate development and test environments. In a full-on professional operation, nothing is moved to an internet-facing production server until it is ready for prime time. Options include installing XAMPP on your development workstation and using it as a development environment (http://localhost), or creating a subdomain on your hosted site (http://test.yourdomain.com/), or perhaps even just uploading a copy of the site to a subfolder (http://www.yourdomain.com/test/). You can use .htpasswd to enable minimal security for an internet-facing test site.
    Miles Erickson : Another thought: it's also important to note that robots.txt is a voluntary system to "suggest" that certain pages should not be crawled. It may stop good crawlers, but it certainly will not stop evil ones.
    bccarlso : Thanks Miles. Yeah, I had heard that robots.txt is not a hard rule, and also know I should have developed in an offline or protected online environment. It was one of those "was just going to be one page but turned into the whole site" redesigns/redevelopments, and it was my fault for not moving it offline. Thanks for the help.
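
    Since the pages here are individual files, the meta tag bccarlso asks about would be a per-page noindex directive in each page's <head>. A minimal sketch (the comment's filename pattern is just a placeholder):

    <!-- placed in the <head> of each unfinished filename_.php page -->
    <meta name="robots" content="noindex, nofollow">

    Like robots.txt, this is only honored by well-behaved crawlers. Miles' .htpasswd suggestion, by contrast, actually enforces access control; on Apache it amounts to HTTP Basic Auth in an .htaccess file, roughly like this (the path and realm name are placeholders):

    AuthType Basic
    AuthName "Test site"
    AuthUserFile /full/path/to/.htpasswd
    Require valid-user
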
  • Miles' answer covers the standard. The most famous crawler, Googlebot, extends it and does understand Allow as well as (limited) pattern matching, as sketched below.

    I find Google's webmaster tools quite helpful. They have a tool devoted just to helping you build a correct robots.txt. You do need to have the pages (or at least stub test pages) uploaded before you can run a "robots.txt test", though.
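
    For example, Google's documented extensions add a "*" wildcard and a "$" end-of-URL anchor, so the goal in the original question could be written as follows (a sketch; only crawlers that support these extensions will honor it):

    User-agent: *
    Allow: /
    Disallow: /*_.php$

    The "$" keeps /filename.php from matching while /filename_.php is blocked, and Googlebot resolves the Allow/Disallow overlap in favor of the more specific rule. A crawler that implements only the original standard treats "Disallow: /*_.php$" as a literal path prefix, which matches nothing. Note also that a URL with a query string appended will not match the $-anchored pattern.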
