How does Perplexity follow robots.txt?

Perplexity respects robots.txt. Perplexity will not crawl the full or partial text content of a news publisher that has disallowed PerplexityBot via robots.txt. Some news web pages may still be indexed even when they are blocked via robots.txt. In that case, only the website domain, headline, and a factual summary of the page are added to our search index.
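As an illustration of what disallowing PerplexityBot looks like in practice, here is a minimal sketch that uses Python's standard-library robots.txt parser. The user agent token PerplexityBot comes from the answer above; the robots.txt contents, the example URL, and the is_allowed helper are hypothetical, and this is not a description of how Perplexity's own crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative publisher site.
# It blocks PerplexityBot everywhere and allows all other crawlers.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file as an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical article URL on the illustrative site.
    url = "https://news.example.com/articles/story.html"
    print("PerplexityBot allowed:", is_allowed("PerplexityBot", url))
    print("SomeOtherBot allowed:", is_allowed("SomeOtherBot", url))
```

A publisher can run a check like this against its own robots.txt to confirm that the rules say what was intended before relying on them.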

If I allow my content to show up as a source in Perplexity, will it be used for AI training?

Our crawler, PerplexityBot, only indexes pages, much as any other search engine does. Unlike other AI companies, Perplexity does not build foundation models, so PerplexityBot does not scrape content for pre-training of LLMs.

If Perplexity respects robots.txt, why did I read online that Perplexity’s crawlers don’t respect it?

Previously, Perplexity had a feature that let a user enter a specific URL in the answer engine and ask for a summary of that page. The feature was used very infrequently, but it was designed to give users a way to summarize a large volume of text without using our file upload feature.

When a user prompted a specific URL, the user deployed our AI agent to scrape the URL on the user’s behalf, even if that web page had a robots.txt file in place. It was effectively the same as if the user had gone to the page themselves, copied the text of the article, and pasted it into the system. The process had to be initiated by a user, URL by URL. We found that some users were abusing this feature in ways that violate our terms of service, so we have temporarily restricted it: the agent no longer scrapes a URL that is not in our search index, even if a user prompts that URL.

Separately, while PerplexityBot respects robots.txt, the third-party web crawlers that we use to help build our search index were not always following robots.txt files. We have since made adjustments with our providers to ensure that they follow robots.txt when crawling on Perplexity’s behalf and never access full text content from disallowed news publisher sites.