What is a Robots.txt File?


Robots.txt File - Everything You Need to Know

The robots.txt file, placed in the root of a website, tells search engine crawlers which URLs they may access on your site. It controls crawling, not indexing: to keep a page out of Google's index, use a noindex tag (for example, a meta robots noindex directive in the page's head) rather than robots.txt.

If you want to hide or unhide one of your pages from search engines, look up the specific instructions for the CMS you are working with.

If your web page is blocked with a robots.txt file, its URL can still appear in search results, but without a description. Non-HTML files such as images, videos, and PDFs, however, are excluded entirely. If you see this kind of search result for your page and want to fix it, remove the robots.txt entry blocking the page.

If you want to hide the page from Search entirely, you can use the Google Search Console Removals tool, which removes a page hosted on your site from Google search results within about a day.

In a nutshell - Robots.txt file Definition

Website owners use the /robots.txt file to give instructions about their site to web robots.

It works like this: a robot wants to visit a website URL, say https://www.ZabiNiazi.com/SEO. Before it does so, it first checks for https://www.ZabiNiazi.com/robots.txt, and finds:

User-agent: *
Disallow: /

The “User-agent: *” means this section applies to all robots. The “Disallow: /” tells the robot that it should not visit any pages on the site.
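A quick way to check how a compliant crawler interprets these two lines is Python's standard-library urllib.robotparser module. This is just a small sketch; the crawler name "ExampleBot" is a placeholder, not a real robot.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Feed the rules directly instead of fetching them over HTTP.
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# With "Disallow: /", no robot may fetch any page on the site.
allowed = rp.can_fetch("ExampleBot", "https://www.ZabiNiazi.com/SEO")
print(allowed)  # False
```

The same parser can be pointed at a live file with set_url() and read() if you want to test a site you control.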

Robots.txt file Considerations

There are two important considerations when using /robots.txt:

  • Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email-address harvesters used by spammers, will pay no attention to it.
  • The /robots.txt file is publicly available. Anyone can see which sections of your server you don’t want robots to visit, so don’t use /robots.txt to hide information.

How to Create a /robots.txt File & Where to Put It

The short answer: in the top-level directory of your web server.

The longer answer:

When a robot looks for the “/robots.txt” file for a URL, it strips the path component from the URL (everything from the first single slash) and puts “/robots.txt” in its place.

For example, for “https://www.ZabiNiazi.com/shop/index.html”, it will remove “/shop/index.html”, replace it with “/robots.txt”, and end up with “https://www.ZabiNiazi.com/robots.txt”.

So, as a website owner, you need to put it in the right place on your web server for that resulting URL to work. Usually, that is the same place where you put your website’s main “index.html” welcome page. Where exactly that is, and how to put the file there, depends on your web server software.
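The stripping rule described above can be sketched in a few lines of Python using only the standard library (the helper name robots_url is hypothetical, chosen for illustration):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Strip the path, query, and fragment and substitute /robots.txt."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://www.ZabiNiazi.com/shop/index.html"))
# https://www.ZabiNiazi.com/robots.txt
```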

Remember to use all lowercase for the filename: “robots.txt”, not “Robots.TXT”.

What to Include in a Robots.txt File

The “/robots.txt” file is a text file with one or more records. It usually contains a single record looking like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~zabi/

In this example, three directories are excluded.

Note that you need a separate “Disallow” line for every URL prefix you want to exclude — you cannot say “Disallow: /cgi-bin/ /tmp/” on a single line. Also, you may not have blank lines in a record, as they are used to delimit multiple records.

Note also that globbing and regular expressions are not supported in either the User-agent or Disallow lines in the original standard (some major crawlers, such as Googlebot, have since added limited wildcard support). The ‘*’ in the User-agent field is a special value meaning “any robot”. Specifically, you cannot have lines like “User-agent: *bot*”, “Disallow: /tmp/*” or “Disallow: *.gif”.

What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here follow some examples:

To exclude all robots from the entire server
User-agent: *
Disallow: /
To allow all robots complete access
User-agent: *
Disallow:

(or just create an empty “/robots.txt” file, or don’t use one at all)

To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
To exclude a single robot
User-agent: BadBot
Disallow: /
To allow a single robot
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
To exclude all files except one

This is currently a bit awkward, as there is no “Allow” field in the original standard. The easy way is to put all files to be disallowed into a separate directory, say “stuff”, and leave the one file at the level above this directory:

User-agent: *
Disallow: /~zabi/stuff/

Alternatively, you can explicitly disallow every page you want excluded:

User-agent: *
Disallow: /~zabi/junk.html
Disallow: /~zabi/foo.html
Disallow: /~zabi/bar.html
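As a sanity check, this last record can be run through Python's standard-library urllib.robotparser (the crawler name "ExampleBot" is a placeholder): the listed pages are blocked, while everything else stays crawlable.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse the rules in place rather than fetching them from a server.
rp.parse([
    "User-agent: *",
    "Disallow: /~zabi/junk.html",
    "Disallow: /~zabi/foo.html",
    "Disallow: /~zabi/bar.html",
])

# Explicitly disallowed page: blocked.
print(rp.can_fetch("ExampleBot", "https://www.ZabiNiazi.com/~zabi/junk.html"))   # False
# Any page not listed: still allowed.
print(rp.can_fetch("ExampleBot", "https://www.ZabiNiazi.com/~zabi/index.html"))  # True
```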
