Step-by-Step Guide to Creating a Custom Robots.txt for Experience Cloud Sites

Big Idea or Enduring Question:

  • What is a robots.txt file, and how can you create a custom robots.txt for your Experience Cloud site?

Objectives:

After reading this blog, you’ll be able to:

  1. Understand the role of robots.txt in SEO.
  2. Create and customize a robots.txt file for Experience Cloud sites.
  3. Control search engine crawlers effectively.
  4. Improve SEO and indexing for your portal.
  5. And much more!

👉 Previously, I’ve shared several posts on effectively implementing key branding and SEO features for Experience Cloud sites. Why not check them out while you’re here?

  1. Step-by-Step Guide to Enabling User Self-Deactivation in Experience Cloud Sites
  2. Step-by-Step Guide to Write SEO-Friendly Titles & Descriptions for Experience Cloud Pages
  3. Step-by-Step Guide to Generate a Sitemap in Salesforce Experience Cloud

Pre-requisites

Before you start reading this article, make sure to read these blogs to gain a basic understanding of SEO, sitemaps, meta tags, and related concepts.

  1. Step-by-Step Guide to Generate a Sitemap in Salesforce Experience Cloud
  2. Step-by-Step Guide to Hiding Experience Cloud Public Pages from Search Engine Indexing

Business Use case

Olivia Bennett, a Junior Developer at Gurukul on Cloud (GoC), is part of a team working on building an Experience Cloud site for the company’s help portal. The portal is branded with the URL: https://help.gurukuloncloud.com/ 

She has already:

  1. Branded the portal URL as described in this post.
  2. Configured Google Analytics™ for Experience Cloud Sites.
  3. Learned and implemented Meta Titles and Meta Descriptions to improve SEO.

Additionally, Olivia has a requirement to block crawlers from indexing the following media files:

  1. Images (.jpg, .png, .gif)
  2. Audio files (.mp3, .wav)
  3. Videos (.mp4, .avi)
  4. PDF documents (.pdf)

These files should be blocked from search engines to prevent unnecessary indexing and reduce server load because:

  • They have no valuable SEO content.
  • Search engines should not send traffic to them.
  • Indexing them could create bloat, negatively affecting SEO rankings.

Understanding the Basics: Crawling vs. Indexing in SEO

Crawling and indexing are two crucial steps in how search engines process web pages.

  1. Crawling is when search engine bots (like Googlebot) scan websites, following links to discover new and updated content. It helps search engines understand a site’s structure and gather information. However, just because a page is crawled doesn’t mean it will appear in search results.
  2. Indexing happens after crawling, where search engines analyze and store the content in their database. Only indexed pages can rank and appear in search results. Factors like meta tags, page quality, and duplicate content determine whether a page gets indexed. For effective SEO, ensuring pages are both crawled and indexed is essential.

Crawling vs. Indexing: Key Differences

| Feature | Crawling | Indexing |
| --- | --- | --- |
| Definition | Process of discovering and scanning web pages. | Process of storing and organizing pages in the search engine's database. |
| Purpose | Helps search engines find and understand web pages. | Makes web pages searchable in Google or other search engines. |
| Controlled by | Robots.txt, sitemaps, internal links. | Meta tags (noindex), content quality, canonical tags. |
| Outcome | Search engines recognize the page but may not show it in search results. | Indexed pages can appear in search engine results. |
| Can it be blocked? | Yes, via robots.txt or nofollow attributes. | Yes, using a noindex meta tag, canonical tags, or poor content quality. |

Remember: If a page is not crawled, it can’t be indexed. If it’s crawled but not indexed, it won’t rank.
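
To make the distinction concrete, here is a minimal, hypothetical sketch: the robots.txt rule below blocks crawling of one directory, while indexing of an individual crawlable page is controlled separately with a noindex meta tag in that page's HTML head.

User-agent: *
# Crawling control: bots will not fetch anything under this (hypothetical) path
Disallow: /internal-search/

# Indexing control is handled per page, e.g. by placing
# <meta name="robots" content="noindex"> in the HTML head of a page
# that bots may crawl but should not index.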

What is a Robots.txt File?

A robots.txt file is like a set of simple rules for search engines. It tells them which pages or parts of your website they can visit and which ones they should ignore. This way, only the best pages show up when people search for your website.

While major search engines like Google, Bing, and Yahoo recognize and respect robots.txt directives, it’s important to note that this file is not a foolproof method for preventing web pages from appearing in search results.

For your Experience Cloud site, the robots.txt file is automatically generated and located at the root of your domain. Each domain has a single robots.txt file: sites that share a domain share the same file, while sites on different domains (e.g., *.force.com, *.my.site.com, custom domains) have separate ones. The file manages bot access with two directives: Allow (a directory or page, relative to the root domain, that the specified user agent may crawl) and Disallow (a directory or page, relative to the root domain, that you don’t want the user agent to crawl). Only relative URLs are valid in these rules.

Doordash Help Portal Robots.txt Generated by Salesforce
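
For illustration, here is a minimal sketch of the two directives (the /docs/ paths are hypothetical); note that rules use paths relative to the root domain, so a full URL such as https://help.gurukuloncloud.com/docs/ would not be valid here:

User-agent: *          # Applies to all crawlers
Disallow: /docs/       # Don't crawl anything under /docs/
Allow: /docs/public/   # ...except this subfolder, which may be crawled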

How Does a Robots.txt File Work?

The robots.txt file tells search engines where your sitemap is located, while the sitemap tells them which pages to crawl and index. When search engine bots visit a website, they systematically explore pages by following links. Before crawling any content, they check for a robots.txt file.

The structure of a robots.txt file is simple. It consists of:

  1. User-agent: Specifies the search engine bot (e.g., Googlebot, Bingbot).
  2. Directives: Rules defining which URLs the bot should crawl or avoid.

For example, a basic robots.txt file might look like this:

User-agent: *
Disallow: /admin
Disallow: /private
Allow: /public
Sitemap: https://example.com/sitemap.xml

By using robots.txt, website owners can control crawler behavior, protect sensitive areas, and optimize how their site appears in search results.

How to Locate a Robots.txt File

To find a website’s robots.txt file, simply enter the site’s homepage URL in your browser, followed by /robots.txt.

Example: https://help.gurukuloncloud.com/robots.txt

Why Is Robots.txt Important?

Most websites don’t need a robots.txt file because Google typically finds and indexes the important pages while ignoring duplicates or unimportant content. However, there are three key reasons to use one:

  1. Block Non-Public Pages: Prevent indexing of staging sites, login pages, or internal search results to keep them out of search results.
  2. Maximize Crawl Budget: Blocking unimportant pages lets search engines focus on your essential content, making the best use of your crawl budget.
  3. Control Resource Indexing: For resources like PDFs and images, robots.txt works better than meta tags to prevent unwanted indexing.
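
As a rough sketch, a robots.txt covering all three reasons might look like this (all paths are hypothetical examples, not part of Olivia's site):

User-agent: *

# 1. Keep non-public pages out of search results
Disallow: /staging/
Disallow: /internal-search

# 2. Preserve crawl budget by skipping low-value sections
Disallow: /tag/

# 3. Keep resource files such as PDFs out of the index
Disallow: /*.pdf$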

Robots.txt vs. Meta Tags: When to Use Each?

Both robots.txt and meta tags are used to control search engine indexing, but they serve different purposes.

| Use Case | robots.txt | Meta Tag |
| --- | --- | --- |
| Block entire directories | Yes | No |
| Prevent search engines from crawling a page | Yes | No |
| Ensure a page is NOT indexed | No | Yes |
| Prevent a specific page from appearing in search results | No | Yes |
| Control how bots follow links on a page | No | Yes |

For maximum control, use robots.txt and meta tags together. For example, block an entire directory in robots.txt, but also add a noindex meta tag to ensure a specific page within it is never indexed.
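
As a hedged illustration (the directory and page names are hypothetical), the two mechanisms combine like this:

User-agent: *
Disallow: /internal-docs/
# In addition, a page such as /internal-docs/legacy-faq would carry
# <meta name="robots" content="noindex"> in its HTML head so that it is
# kept out of the index even if a crawler reaches it by another route.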

How Many Robots.txt Files Does Experience Cloud Generate for Multiple Sites?

In Salesforce Experience Cloud, the number of robots.txt files generated depends on your site structure.

Sites with Different Custom Domains

If you have multiple Experience Cloud sites, each using a unique custom domain, Salesforce automatically generates a separate robots.txt file for each site. This allows you to define individual indexing rules tailored to each site’s needs.

Example:

  1. Help Portal: https://help.gurukuloncloud.com/robots.txt
  2. Partner Portal: https://partners.gurukuloncloud.com/robots.txt
  3. Customer Portal: https://customers.gurukuloncloud.com/robots.txt

Since each robots.txt file is domain-specific, you have full control over which pages search engines can crawl and index for each individual site.

Sites with the Same Domain (Subpath-Based)

If you have multiple Experience Cloud sites under the same domain but using different subpaths, they share a single robots.txt file. This means that the robots.txt rules apply collectively to all subpath-based sites, requiring careful configuration to ensure proper indexing control.

Example:

  1. Help Site: https://gurukuloncloud.com/help
  2. Partner Portal: https://gurukuloncloud.com/partners
  3. Customer Portal: https://gurukuloncloud.com/customers

Since these sites share the same domain, they do not have separate robots.txt files. Instead, a single shared robots.txt file applies to all subpath-based sites and is accessible at: https://gurukuloncloud.com/robots.txt

To prevent unintentional blocking of important pages, it’s essential to carefully define crawling rules, ensuring that each subpath is appropriately managed within the shared robots.txt file. This helps maintain optimal search engine visibility while protecting sensitive or restricted content.
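
As a sketch of what such a shared file might look like (the subpaths mirror the hypothetical GoC examples above, and the exact sitemap locations depend on your site configuration), each site's rules and sitemap sit side by side in the single robots.txt served at the domain root:

User-agent: *

# Rules for the Help site
Disallow: /help/internal-search

# Rules for the Partner portal
Disallow: /partners/staging/

# Rules for the Customer portal
Disallow: /customers/private/

# Sitemaps for all subpath-based sites
Sitemap: https://gurukuloncloud.com/help/s/sitemap.xml
Sitemap: https://gurukuloncloud.com/partners/s/sitemap.xml
Sitemap: https://gurukuloncloud.com/customers/s/sitemap.xml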

Automation Champion Approach (I-do):

One more thing: another important reason to create a custom robots.txt file is when you have a custom sitemap. In Salesforce Experience Cloud sites, the default sitemap is automatically generated, but if you have a custom site structure, restricted content, or dynamic pages, you may need to define your own sitemap.
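
For instance, a custom robots.txt can point crawlers to a custom sitemap instead of the default one (the sitemap file name below is hypothetical):

User-agent: *
Allow: /
Sitemap: https://help.gurukuloncloud.com/s/custom-sitemap.xml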

Now let’s come back to Olivia’s requirement to block crawlers from indexing the following media files:

  1. Images (.jpg, .png, .gif)
  2. Audio files (.mp3, .wav)
  3. Videos (.mp4, .avi)
  4. PDF documents (.pdf)

The answer is to create a custom robots.txt file.

Step 1: Create a Custom robots.txt File Using a Visualforce Page

  1. Click Setup.
  2. In the Quick Find box, type Visualforce Pages.
  3. Select Visualforce Pages, then click the New button.
  4. Select Available for Lightning Experience, Experience Builder sites, and the mobile app.
  5. Copy the following code into your Visualforce page, updating the Disallow rules and the Sitemap URL to match your own site.
    
    
    <apex:page contentType="text/plain">
    # Default robots.txt for GurukulOnCloud sites
    
    User-agent: *  # Applies to all robots
    
    Allow: /  # Allow all
    
    # Block specific file types
    Disallow: /*.jpg$
    Disallow: /*.png$
    Disallow: /*.gif$
    Disallow: /*.mp3$
    Disallow: /*.wav$
    Disallow: /*.mp4$
    Disallow: /*.avi$
    Disallow: /*.pdf$
    
    # Sitemap location
    Sitemap: https://help.gurukuloncloud.com/s/sitemap.xml
    </apex:page>

  6. Save the changes.

Step 2: Update Robots.txt on Experience Workspaces

In this step, you’ll update the robots.txt file in Experience Workspaces by linking it to the custom Visualforce page. This ensures search engines follow your defined rules while indexing your Experience Cloud site.

  1. Open Experience Workspaces.
  2. Navigate to Administration | Pages, then click Go to Force.com.
  3. On the Site Details page, click Edit.
  4. In the Site Robots.txt field, select the name of the Visualforce page that you created in step 1.
  5. Save the changes. 

If you have multiple Experience Cloud sites under the same subdomain, you typically need to update the robots.txt file for each site individually within Experience Workspaces. Make sure that the robots.txt file includes sitemaps for all Experience Cloud sites that share the same subdomain.
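
For example, assuming two sites share the same subdomain under different subpaths (the paths below are hypothetical), the body of the Visualforce robots.txt page could list both sitemaps:

User-agent: *
Allow: /

# Sitemaps for every site sharing this subdomain (hypothetical subpaths)
Sitemap: https://help.gurukuloncloud.com/support/s/sitemap.xml
Sitemap: https://help.gurukuloncloud.com/training/s/sitemap.xml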

Proof of Concept

To validate the new robots.txt, open this URL: https://help.gurukuloncloud.com/robots.txt. You should see the new robots.txt file we just implemented.

Formative Assessment:

I want to hear from you!

What is one thing you learned from this post? How do you envision applying this new knowledge in the real world? Feel free to share in the comments below.
