Will the following robots.txt syntax correctly block all pages on the site that end in "_.php"? I don't want to accidentally block other pages.
User-Agent: *
Disallow: /*_.php
Also, am I allowed to have both "Allow: /" and "Disallow:" commands in the same robots.txt file? Thanks!
-
If you want certain files (but not others) excluded, you must group them into directories, e.g.:
User-agent: *
Disallow: /cgi-bin/
Per robotstxt.org, asterisks are not supported in the "Disallow" field:
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: bot", "Disallow: /tmp/*" or "Disallow: *.gif".
Additionally, there is no such thing as an "Allow" field. Everything is allowed by default, and specific items are disallowed by exception.
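Python's standard-library urllib.robotparser follows this original specification (simple prefix matching, no wildcards), so it can be used to sanity-check how a spec-compliant crawler would read a rule set. A minimal sketch (example.com and the file names are placeholders):

```python
# Check standard robots.txt rules with Python's built-in parser,
# which implements the original prefix-matching specification.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /cgi-bin/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Paths under the disallowed directory are blocked...
print(rp.can_fetch("*", "http://example.com/cgi-bin/script.cgi"))  # False
# ...and everything else is allowed by default.
print(rp.can_fetch("*", "http://example.com/index.html"))          # True
```

Note that this parser will not tell you how Googlebot interprets wildcard patterns, since those are a nonstandard extension.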
bccarlso: I'm developing some pages that I don't want crawled yet, and they exist on the server in the form of filename_.php instead of filename.php. How would you propose I block those files from being crawled? A meta tag in each one's <head>?
Miles Erickson: Standard practice would be to build separate development and test environments. In a full-on professional operation, nothing is moved to an internet-facing production server until it is ready for prime time. Options include installing XAMPP on your development workstation and using it as a development environment (http://localhost), creating a subdomain on your hosted site (http://test.yourdomain.com/), or perhaps even just uploading a copy of the site to a subfolder (http://www.yourdomain.com/test/). You can use .htpasswd to enable minimal security for an internet-facing test site.
Miles Erickson: Another thought: it's also important to note that robots.txt is a voluntary system that "suggests" certain pages should not be crawled. It may stop well-behaved crawlers, but it certainly will not stop evil ones.
bccarlso: Thanks Miles. Yeah, I had heard that robots.txt is not a hard rule, and I also know I should have developed in an offline or protected online environment. It was one of those "was just going to be one page but turned into the whole site" redesigns/redevelopments, and it was my fault for not moving it offline. Thanks for the help.
From Miles Erickson
-
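For the .htpasswd approach mentioned above, a minimal Apache setup might look like the following (the file paths, realm name, and username are placeholders for your own values):

```
# .htaccess in the root of the test site
AuthType Basic
AuthName "Restricted test site"
AuthUserFile /home/youraccount/.htpasswd
Require valid-user
```

The password file itself is created once with Apache's htpasswd utility, e.g. `htpasswd -c /home/youraccount/.htpasswd yourusername`. Unlike robots.txt, this actually denies access rather than politely asking crawlers to stay away.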
Miles' answer covers the standards. The most famous crawler, Googlebot, extends the standards and does understand Allow as well as (limited) pattern matching.
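Under Google's extensions (and only for crawlers that support them), the questioner's goal could be expressed with the * wildcard and the $ end-of-URL anchor, e.g.:

```
User-agent: Googlebot
Disallow: /*_.php$
```

The $ matters here: without it, the pattern would also block URLs that merely contain "_.php" somewhere in the middle. Standard-compliant crawlers will treat these characters literally, so this should be viewed as a Googlebot-specific hint, not a general block.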
I find Google's webmaster tools quite helpful. They have a whole tool devoted to just helping you build a correct robots.txt. You do need to have the pages (or at least stub test pages) uploaded before you can run a "robots.txt test", though.
From Stephen Cleary