Why is a website not scanned by On-Page API? Where can I check the reason why it was not crawled?
In some cases, our crawler is unable to scan the website you specify in the
target field of the On-Page API Task POST array. This may happen for the following reasons:
- Target website is unreachable and thus can’t be crawled;
- First crawled page contains the meta robots="noindex" tag;
- Robots.txt forbids crawling the page (in such a case, you can solve the issue by setting
override in the Task POST array);
- HTTP header of the page contains “X-Robots-Tag: noindex”;
- First crawled page has more than 10 redirects.
Note that you pay only for the pages that were scanned successfully. If a page wasn’t crawled for one of the listed reasons, you get a refund for it.
However, please also note that 4xx and 5xx pages are scannable, and we spend resources to crawl them, so there are no refunds for crawled pages that returned 4xx or 5xx errors. The same applies when the crawler is blocked by the Cloudflare filter. In such instances, the API server returns the invalid_page_status_code (4xx) error, so you will be charged for scanning one page.
Where to check the reason your website wasn’t crawled
To find out why your website wasn’t crawled, you should call the Summary endpoint using the ID of an unsuccessful task.
The response from the API server will contain the extended_crawl_status field, which indicates why the crawler was unable to scan a particular website. The field can display the following values:
- no_errors – no crawling errors were detected;
- site_unreachable – our crawler could not reach the website and thus was not able to obtain a status code;
- invalid_page_status_code – the status code of the first crawled page is >= 400;
- forbidden_meta_tag – the first crawled page contains the meta robots="noindex" tag;
- forbidden_robots – robots.txt forbids crawling the page;
- forbidden_http_header – the HTTP header of the page contains “X-Robots-Tag: noindex”;
- too_many_redirects – the first crawled page has more than 10 redirects;
- unknown – the reason is unknown.
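As an illustration, the lookup described above could be sketched in Python as follows. This is a minimal sketch, not an official client: it assumes you have already parsed the Summary response into a dictionary, and because the exact nesting of the response is not described in this article, the hypothetical helper find_field simply searches the whole structure for the key. The "charged" flag only restates the billing notes above and is left as None where the article does not say.

```python
from typing import Any, Optional

# Descriptions mirror the value list above. The third element says whether the
# article states you are charged (True), refunded (False), or gives no answer (None).
CRAWL_STATUSES = {
    "no_errors": ("no crawling errors were detected", True),
    "site_unreachable": ("the crawler could not reach the website", False),
    "invalid_page_status_code": ("first crawled page returned a status code >= 400", True),
    "forbidden_meta_tag": ("first crawled page contains a meta robots noindex tag", False),
    "forbidden_robots": ("robots.txt forbids crawling the page", False),
    "forbidden_http_header": ("HTTP header contains X-Robots-Tag: noindex", False),
    "too_many_redirects": ("first crawled page has more than 10 redirects", False),
    "unknown": ("the reason is unknown", None),
}


def find_field(node: Any, key: str) -> Optional[Any]:
    """Depth-first search for the first occurrence of `key` in nested JSON data."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_field(child, key)
        if found is not None:
            return found
    return None


def explain_crawl(summary: dict) -> dict:
    """Summarize why a task was not crawled, based on a parsed Summary response."""
    status = find_field(summary, "extended_crawl_status")
    description, charged = CRAWL_STATUSES.get(status, ("unrecognized status", None))
    return {
        "status": status,
        "description": description,
        "charged": charged,
        # Reported as-is; useful when checking for a Cloudflare block.
        "server": find_field(summary, "server"),
    }


if __name__ == "__main__":
    # Hypothetical response fragment; the real nesting and values may differ.
    sample = {"tasks": [{"result": [{"domain_info": {
        "server": "nginx", "extended_crawl_status": "forbidden_robots"}}]}]}
    print(explain_crawl(sample))
```

Searching by key rather than by a fixed path keeps the sketch independent of the response layout; once you know where the field sits in your responses, a direct lookup is simpler and faster.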
When the bot is blocked by Cloudflare, the extended_crawl_status field displays the invalid_page_status_code value, just as with 4xx and 5xx errors. To confirm that your site wasn’t scanned because of Cloudflare limitations, find the
server field in the API response. It should display the