One of my previous tutorials covered the basics of understanding and configuring the .htaccess file in WordPress. The robots.txt file is a special file just like the .htaccess file. However, it serves a very different purpose. As you might have guessed from the name, the robots.txt file is meant for bots, such as the crawlers from search engines like Google and Bing.
This tutorial will help you understand the basics of the robots.txt file and how to configure it for WordPress. Let's get started.
As I mentioned earlier, the robots.txt file is meant for crawler bots. These are mainly search engine crawlers, but other bots read the file as well.
You might already know that search engines find all the pages and content on your website by crawling it—moving from one page to another through links either on the page itself or in the sitemap. This allows them to collect data from your website.
However, there could be some pages on a website that you don't want the bots to crawl. The robots.txt file gives you the option to specify which pages they are allowed to visit and which pages they shouldn't crawl.
Please note that the instructions you provide in the robots.txt file are not binding. This means that, although reputable bots like the Google search crawler will respect the limitations in robots.txt, some bots will probably ignore whatever you put in there and crawl your website anyway. Others might even use it to find links that you specifically don't want to be crawled and then crawl them.
Basically, it is not advisable to rely on this file to prevent malicious bots from scraping your website. It is more like a guide that good bots follow.
The robots.txt file is supposed to be in the root directory of your website. This is different from .htaccess files, which can be placed in different directories. The robots.txt file only works if it is in the root directory and is named exactly robots.txt.
You can create this file manually and place it inside your web root directory if it doesn't already exist.
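For example, if your website's address is https://your-website.com (the placeholder domain used later in this tutorial), crawlers will only look for the file at this exact location:

https://your-website.com/robots.txt

A robots.txt file placed anywhere else, such as inside a sub-directory, will simply be ignored.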
The robots.txt file will tell different bots what they should and should not crawl on your website. It uses a handful of commands to do that. Three commands that you will use very often are User-Agent, Allow and Disallow.
The User-Agent command identifies the bots to which you want to apply the current set of Allow and Disallow commands. You can set it to * to target all bots, or narrow down the list by specifying values like Googlebot and Bingbot. These are the main crawler bots for the Google and Bing search engines respectively. There are many others out there from different companies which you might want to target specifically.
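For example, the following snippet (with hypothetical directory names) applies one set of rules only to Googlebot and a different set to every other bot:

User-Agent: Googlebot
Disallow: /drafts/

User-Agent: *
Disallow: /drafts/
Disallow: /internal/

Each User-Agent line starts a new group of rules. For crawlers that follow the standard, such as Googlebot, only the most specific matching group applies and the others are ignored.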
The Allow command gives you the option to specify a webpage or directory on your website which the bots are free to access. Keep in mind that any values you specify need to be relative to the root directory.
The Disallow command, on the other hand, tells the bots that they shouldn't crawl the listed directory or webpage.
You are only allowed to provide one directory or webpage for each Allow or Disallow command. However, you can use multiple Allow and Disallow commands within the same set. Here is an example:
User-Agent: *
Disallow: /uploads/
Disallow: /includes/
Allow: /uploads/images/
Disallow: /login.php
In the above example, we told the bots that they shouldn't crawl the contents of the uploads directory. However, we used the Allow command to tell them to still crawl the images sub-directory found inside uploads.
Any bot will assume that it is allowed to crawl all pages that you have not explicitly disallowed. This means that there is no need for you to allow the crawling of directories one at a time.
You should also keep in mind that the values you provide are case-sensitive. The bots will treat uploads and UPLOADS as referring to different directories.
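For instance, the two rules below target two different paths as far as the bots are concerned, even though they look similar to a human reader:

Disallow: /uploads/
Disallow: /UPLOADS/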
The robots.txt file can also contain links to one or more sitemaps on your website. This makes it easier for bots to find all the posts and web pages on your website that you want them to crawl.
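Sitemaps are referenced with the Sitemap command, which takes the full URL of the sitemap rather than a path relative to the root. Using the same placeholder domain as the rest of this tutorial, listing two sitemaps would look like this (the second file name is just an example):

Sitemap: https://your-website.com/sitemap.xml
Sitemap: https://your-website.com/post-sitemap.xml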
It is important to be careful when you are creating a robots.txt file to go along with your WordPress website. Small mistakes or oversights can prevent search engines from crawling content on your website, and all the work you put into SEO will be in vain if the search engines can't even crawl your content.
A good rule of thumb is to disallow as little as possible. One approach is to just put the following in your robots.txt file. This basically tells all the bots that they are free to crawl all content on the website.
User-agent: *
Disallow:
Another option is to use the following version, which tells them to avoid crawling the wp-admin directory but still crawl all the other content on the website. We also provide a link to the website's sitemap in this example, but that is entirely optional.
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://your-website.com/sitemap.xml
It is important not to be too aggressive with the Disallow command and block access to CSS or JavaScript files that affect how your content is rendered on the front end. Nowadays, search engines also look at many other aspects of a webpage, such as its appearance and the user-friendliness of its layout, before they determine how the content should be ranked. Blocking them from accessing CSS or JavaScript files will result in issues sooner or later.
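In a WordPress installation, for example, a rule like the one below looks harmless but would block many of the scripts and styles that WordPress ships with, so it is best avoided:

Disallow: /wp-includes/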
As I have mentioned before, the robots.txt file does not enforce any rules. The rules you specify in the file only provide guidance to good, obedient bots. This basically means that you should not use this file to restrict access to content on your website. There are two common situations that you might face if you use a robots.txt file for this purpose.
Even though malicious bots won't follow the guidelines provided in robots.txt, they can still use the file to figure out exactly what you don't want them to crawl. This can do even more damage if you were relying on the file as a security measure.
This file isn't helpful in preventing your web pages from appearing in search results either. The webpage you are trying to hide will still show up in search results, but its description will simply say "No information is available for this page". This can happen when you block Google from reading a certain page with the robots.txt file, but that page is still being linked to from somewhere else.
If you want to block a page from appearing in search results, Google recommends using the noindex option in the HTTP response header or adding a noindex meta tag to the HTML file.
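For reference, the meta tag version is a single line placed inside the page's head element, and the header version uses the X-Robots-Tag response header:

<meta name="robots" content="noindex">

X-Robots-Tag: noindex

Both variants tell compliant crawlers to keep the page out of their index.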
There's an easy way to do this for the whole site if you are using WordPress. Just go to Settings > Reading in the WordPress admin dashboard and check the Discourage search engines from indexing this site box under the Search engine visibility option. Keep in mind that this setting applies to the entire website, not to individual pages.
Removing a webpage from search results requires you to take some other action, like removing the page itself from the website, password protecting it, or using the noindex option for bots.
Similar to the robots.txt file, only well-behaved and trustworthy bots will respect the noindex option, so if you want to secure sensitive information on your site, you'll need to do it another way. For example, you could password-protect that page or remove it from your website entirely.
Our aim with this post was to introduce you to the basics of the robots.txt file so that you can get an idea of what this file does. After that, we discussed the optimum configuration of robots.txt for WordPress. We also saw how to set the noindex option using the WordPress admin.
Finally, I would like to repeat one more time that you should not use robots.txt to block access to sensitive content on your website. With malicious bots, it will usually have the opposite effect!