Robots.txt
  • 19 Jan 2023

What is a robots.txt file?

A robots.txt file is a plain text file that tells web crawlers and other automated agents which pages of your knowledge base they should not crawl. It contains rules that specify which pages may be accessed by which crawlers.

For more information about robots.txt, see the Google documentation.


How to access robots.txt in Document360?

[Screenshot: Accessing robots.txt in Document360]

  1. Click Settings > Knowledge base site > Article settings & SEO > SEO
  2. In the Robots.txt section, click Edit
    [Screenshot: Updating robots.txt in Document360]
  3. Type in your desired rules
  4. Click Update

Use cases of robots.txt

A robots.txt file can block crawlers from a folder, a specific file (such as a PDF), or all files with a given extension.
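
The three kinds of block can be illustrated with a minimal Python sketch that renders the matching rules. The helper `build_robots_txt` and the paths are hypothetical and not part of Document360. The `*` and `$` wildcards in the extension rule are defined in RFC 9309 and honored by major crawlers such as Googlebot, but not every bot supports them.

```python
def build_robots_txt(user_agent, disallow_paths):
    """Render a robots.txt group: one User-agent line plus Disallow rules."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallow_paths]
    return "\n".join(lines) + "\n"


# Hypothetical paths, chosen only for illustration.
rules = build_robots_txt(
    "*",
    [
        "/private/",              # a folder and everything under it
        "/downloads/report.pdf",  # one specific file
        "/*.pdf$",                # every URL ending in .pdf (wildcard rule)
    ],
)
print(rules)
```

The generated text can be pasted into the Robots.txt editor described above.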

Pro-Tip
User-agent: *
Crawl-delay: 10

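You can also slow down bots by adding a Crawl-delay directive to your robots.txt file. The example above asks bots to wait 10 between requests, a value most crawlers interpret as seconds. This comes in handy when your site is struggling with high traffic. Not every crawler honors Crawl-delay; Googlebot, for example, ignores it.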
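
As a rough sketch of how a compliant bot might read this directive, Python's standard `urllib.robotparser` module can parse the rules and report the delay. The crawler name `MyCrawler` is an assumption made for illustration.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# "MyCrawler" is a hypothetical bot name; it falls under the "*" group.
delay = parser.crawl_delay("MyCrawler")
print(delay)
```

A polite crawler would then pause for that long between requests to the site.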


Sample files

User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml

User-agent: * - Specifies that the rules that follow apply to all bots.
Disallow: /admin/ - Prevents crawlers from accessing anything under the /admin/ path of the site.
Sitemap: https://example.com/sitemap.xml - Tells bots where the sitemap is located. This makes crawling much easier for the bots, as the sitemap lists all the URLs of the site.
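
To check how a parser reads the first sample file, here is a small sketch using Python's standard `urllib.robotparser`. The bot name `AnyBot` and the page URLs are assumptions for illustration, and `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Hypothetical bot name and URLs, used only to exercise the rules.
admin_allowed = parser.can_fetch("AnyBot", "https://example.com/admin/users")
docs_allowed = parser.can_fetch("AnyBot", "https://example.com/docs/intro")
sitemaps = parser.site_maps()

print(admin_allowed, docs_allowed, sitemaps)
```

The admin URL is rejected, any other URL is allowed, and the sitemap location is reported separately from the access rules.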


User-agent: Bingbot 
Disallow: /

The robots.txt file above blocks Bingbot from the entire site.

User-agent: Bingbot - Specifies that the rules that follow apply only to the crawler of the Bing search engine.
Disallow: / - Prevents Bingbot from crawling any page of the site.
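
The per-crawler behavior can be verified with a short sketch built on Python's standard `urllib.robotparser`. The page URL is hypothetical, and Googlebot stands in for any crawler that the rules do not name.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Bingbot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

url = "https://example.com/docs/intro"  # hypothetical page URL
bing_allowed = parser.can_fetch("Bingbot", url)
other_allowed = parser.can_fetch("Googlebot", url)

print(bing_allowed, other_allowed)
```

Because no rule group matches other crawlers, they remain free to crawl the site.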


Best Practices

  • Allow crawlers to reach your most important pages
  • Disallow pages that do not provide any value in search results
  • Add the sitemap location to the robots.txt file
  • A site can have only one robots.txt file. See the basic guidelines in the Google Search Central documentation for more information

What are web crawlers?

A web crawler, also known as a spider or spiderbot, is a program or script that automatically navigates the web and collects information about websites. Search engines such as Google, Bing, and Yandex use crawlers to copy a site's information to their servers.
Crawlers visit pages and follow their links, much as we do when we browse a website. As they go, they collect data and metadata from the site (such as the links on a page, broken links, sitemaps, and HTML code) and send it to the servers of their search engine. Search engines use this recorded information to index pages and return relevant search results.
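
The link-collecting step described above can be sketched with Python's standard `html.parser` module. This is a simplified illustration, not how any particular search engine works; the class name `LinkCollector` and the sample page are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from the <a href="..."> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))


# Hypothetical page content standing in for a downloaded document.
PAGE = (
    '<html><body><a href="/docs/intro">Intro</a> '
    '<a href="https://example.com/faq">FAQ</a></body></html>'
)

collector = LinkCollector("https://example.com/")
collector.feed(PAGE)
print(collector.links)
```

A real crawler would check each collected URL against robots.txt before fetching it.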

