What is a robots.txt file?
A robots.txt file is a plain-text file that tells web crawlers and other automated agents which pages of your knowledge base they may or may not crawl. It contains rules that specify which crawlers can access which pages.
For more information about robots.txt, see the Google documentation.
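To make the idea of "rules for crawlers" concrete, here is a minimal sketch that reads a small robots.txt with Python's standard `urllib.robotparser` module and asks whether a URL may be crawled. The rules, the `/private/` path, the `ExampleBot` name, and the example.com URLs are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; not taken from a real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a made-up crawler name.
public_ok = parser.can_fetch("ExampleBot", "https://example.com/docs/intro")
private_ok = parser.can_fetch("ExampleBot", "https://example.com/private/notes")

print(public_ok)   # True: no rule blocks /docs/
print(private_ok)  # False: /private/ is disallowed for every bot
```

A well-behaved crawler runs a check like this before it requests a page. The file is a request, not an access control: it only works for bots that choose to follow it.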
How to access robots.txt in Document360?
- Click Settings → Knowledge base site → Article settings & SEO → SEO
- In the Robots.txt section, click Edit
- Type in your desired rules
- Click Update
Use cases of robots.txt
A robots.txt file can block crawlers from a folder, a file (such as a PDF), or a file extension.
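The sketch below shows how such blocking patterns are commonly matched against a URL path. It assumes the wildcard syntax that major crawlers support, where `*` matches any run of characters and a trailing `$` anchors the end of the path. The `rule_matches` helper and all paths are hypothetical, and support for wildcards varies between crawlers.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a Disallow pattern matches a URL path.

    Illustrative helper (not part of any library): '*' matches any
    run of characters, and a trailing '$' anchors the end of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for each '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, so plain patterns act as prefixes.
    return re.match(regex, path) is not None


# Block a folder (hypothetical path).
print(rule_matches("/docs/internal/", "/docs/internal/roadmap"))  # True
# Block a single file.
print(rule_matches("/files/guide.pdf", "/files/guide.pdf"))       # True
# Block every URL that ends in .pdf.
print(rule_matches("/*.pdf$", "/files/guide.pdf"))                # True
print(rule_matches("/*.pdf$", "/files/guide.pdf?page=2"))         # False
```

In a robots.txt file, these patterns would appear as `Disallow: /docs/internal/`, `Disallow: /files/guide.pdf`, and `Disallow: /*.pdf$`.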
You can also slow down bots by adding a Crawl-delay directive to your robots.txt file. This comes in handy when your site is struggling with high traffic. Not every crawler honors this directive; Googlebot, for example, ignores it.
User-agent: *
Crawl-delay: 10
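As a rough illustration of how a crawler might read this setting, the sketch below parses the crawl-delay example with Python's standard `urllib.robotparser` module. The `ExampleBot` name is an assumption, and how a real crawler interprets the value is up to that crawler.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a made-up crawler name; it falls under the "*" group.
delay = parser.crawl_delay("ExampleBot")
print(delay)  # 10

# A polite crawler could pause between requests, for example with
# time.sleep(delay), whenever a delay is declared (delay is None otherwise).
```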
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
User-agent: * - Specifies that the rules that follow apply to every bot.
Disallow: /admin/ - Prevents crawlers from accessing the admin data of the site.
Sitemap: https://example.com/sitemap.xml - Tells bots where to find the sitemap. This makes crawling much easier because the sitemap lists all the URLs of the site.
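The three directives above can be checked with a short sketch based on Python's standard `urllib.robotparser` module. The `ExampleBot` name and the page URLs are assumptions for illustration, and `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" and the URLs below are made up for this example.
admin_ok = parser.can_fetch("ExampleBot", "https://example.com/admin/users")
docs_ok = parser.can_fetch("ExampleBot", "https://example.com/docs/")
sitemaps = parser.site_maps()

print(admin_ok)  # False: /admin/ is disallowed for every bot
print(docs_ok)   # True: nothing blocks /docs/
print(sitemaps)  # ['https://example.com/sitemap.xml']
```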
User-agent: Bingbot
Disallow: /
The robots.txt file above blocks Bingbot from the site.
User-agent: Bingbot - Targets the crawler of the Bing search engine.
Disallow: / - Prevents Bingbot from crawling any page of the site.
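A quick way to confirm that a rule targets only one crawler is to test it against two user agents. This sketch uses Python's standard `urllib.robotparser` module; the page URL is an assumption for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Bingbot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/docs/getting-started"  # hypothetical page

bing_ok = parser.can_fetch("Bingbot", url)
google_ok = parser.can_fetch("Googlebot", url)

print(bing_ok)    # False: Bingbot is blocked from the whole site
print(google_ok)  # True: no group applies to Googlebot
```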
Best practices
- Add links to your most important pages
- Block links to pages that do not provide any value
- Add the sitemap location to the robots.txt file
- A site can have only one robots.txt file. For more information, see the basic guidelines in the Google Search Central documentation
What are web crawlers?
A web crawler, also known as a spider or spiderbot, is a program or script that automatically navigates the web and collects information about websites. Search engines such as Google, Bing, and Yandex use crawlers to copy a site's information to their servers.
Crawlers request pages and follow the links they find, much as we do when we browse a website. Along the way, they collect data and metadata from the site (such as links on a page, broken links, sitemaps, and HTML code) and send it to the servers of their search engine. Search engines use this recorded information to index pages and return relevant search results.
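The link-collecting step described above can be sketched with Python's standard `html.parser` module. This is a simplified illustration, not how any particular search engine works; the `LinkCollector` class, the sample page, and the example.com URLs are all assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the <a href="..."> tags of one page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


# Hypothetical page content for illustration.
PAGE = """<html><body>
<a href="/docs/getting-started">Getting started</a>
<a href="https://example.com/docs/faq">FAQ</a>
</body></html>"""

collector = LinkCollector("https://example.com/")
collector.feed(PAGE)
print(collector.links)
```

A real crawler would add each collected link to a queue, check it against robots.txt, and then fetch it in turn.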