Web Hosting: Discussions on all aspects of web hosting, choosing a host, reviews and technical things.
Here is an example of Google's robots.txt, if that helps any:

    User-agent: *
    Allow: /searchhistory/
    Disallow: /search
    Disallow: /groups
    Disallow: /images
    Disallow: /catalogs
    Disallow: /catalogues

Use this javascript command to easily find the robots.txt file on any website:

    javascript:void(location.href='http://' + location.host + '/robots.txt')
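If you would rather do the same thing from a script, here is a rough Python sketch of what that javascript one-liner does. The function name robots_url and the example.com addresses are just made up for illustration, and it assumes you pass in a full URL including the scheme. Unlike the one-liner, it keeps whatever scheme the page uses instead of forcing http.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that page_url belongs to.

    Hypothetical helper: expects an absolute URL such as
    "http://www.example.com/some/page.html".
    """
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError("expected an absolute URL, got %r" % page_url)
    # Keep scheme and host (including any port); drop path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/some/page.html?x=1"))
# prints http://www.example.com/robots.txt
```

robots.txt always lives at the root of the host, so the page path and query string are thrown away.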
Hi,

robots.txt is just a text file that tells robots (the automated indexers used by search engines) which parts of a site they may visit.

If an indexing robot knows about a document, it may decide to parse it and insert it into its database. How this is done depends on the robot: some robots index the HTML titles or the first few paragraphs; others parse the entire HTML and index all words, with weightings depending on HTML constructs. Some parse the META tag or other special hidden tags.

STUFF TO TAKE INTO CONSIDERATION
=================================

► If you want to tell robots (the major search engines) whether to index a page and follow its links, include a robots meta tag in the <head> section of that page. To allow both:

    <meta name="robots" content="all" />

► If you do NOT want robots to index the page:

    <meta name="robots" content="noindex, nofollow" />

This tells the search engine NOT to index that page and NOT to follow the links on it.

► ROBOTS.TXT in the root of the site

    User-agent:
    Disallow:

User-agent is the name of the robot the rules apply to.
Disallow is the path that robot should not visit.
The * means ALL USER AGENTS (Google, MSN, Yahoo, ...).
A Disallow value of / means the entire server.

EXAMPLES
=================================

► To exclude all robots from the entire server

    User-agent: *
    Disallow: /

► To allow all robots complete access

    User-agent: *
    Disallow:

Or create an empty "/robots.txt" file.

► To exclude all robots from part of the server

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /private/

► To exclude a single robot

    User-agent: BadBot
    Disallow: /

► To allow a single robot

    User-agent: WebCrawler
    Disallow:

    User-agent: *
    Disallow: /

► To exclude all files except one

This is a bit awkward, as the original robots.txt standard has no "Allow" field (some crawlers, including Google's, accept Allow as an extension). The easy way is to put all the files to be disallowed into a separate directory, say "docs", and leave the one file in the level above this directory:

    User-agent: *
    Disallow: /~joe/docs/

Alternatively, you can explicitly disallow each page you want excluded:

    User-agent: *
    Disallow: /~joe/private.html
    Disallow: /~joe/foo.html
    Disallow: /~joe/bar.html

I hope I helped...
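If you want to check how a set of rules will be read before uploading them, one option is Python's standard urllib.robotparser module. This is only a quick sketch: the build_parser helper, the "SomeBot" robot name and the example.com URLs are placeholders of mine, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# "To exclude all robots from part of the server"
PARTIAL = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
"""

# "To allow a single robot"
SINGLE_ROBOT = """\
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /
"""

partial = build_parser(PARTIAL)
single = build_parser(SINGLE_ROBOT)

# "SomeBot" and example.com are placeholders, not real robots or sites.
print(partial.can_fetch("SomeBot", "http://example.com/tmp/cache.html"))  # False
print(partial.can_fetch("SomeBot", "http://example.com/index.html"))      # True
print(single.can_fetch("WebCrawler", "http://example.com/index.html"))    # True
print(single.can_fetch("SomeBot", "http://example.com/index.html"))       # False
```

Note the blank line between the two records in the second example: it is what separates the WebCrawler rules from the catch-all rules.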