In some cases, our crawler is unable to scan websites you specify in the target
field of the On-Page API Task POST array. This may happen for the following reasons:
- Target website is unreachable and thus can’t be crawled;
- First crawled page contains the meta robots="noindex" tag;
- Robots.txt forbids crawling the page (in this case, you can solve the issue by setting robots_txt_merge_mode to override in the Task POST array, as shown in the sketch after this list);
- HTTP header of the page contains “X-Robots-Tag: noindex”;
- First crawled page has more than 10 redirects.
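For reference, here is a minimal Python sketch of a Task POST call that sets robots_txt_merge_mode to override. The login, password, and target values are placeholders, and max_crawl_pages is an optional illustrative field:

import requests

# Sketch: submitting an On-Page task that overrides robots.txt restrictions.
# "login", "password", and the target are placeholders; replace with your own.
post_data = [{
    "target": "example.com",               # website to crawl (placeholder)
    "max_crawl_pages": 10,                 # optional: limits the crawl size
    "robots_txt_merge_mode": "override"    # crawl even if robots.txt forbids it
}]

response = requests.post(
    "https://api.dataforseo.com/v3/on_page/task_post",
    auth=("login", "password"),            # HTTP Basic auth credentials
    json=post_data,
)
print(response.json())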
Note that you pay only for the pages that were scanned successfully. If a page wasn’t crawled for one of the listed reasons, you will get a refund for it.
However, please also note that 4xx and 5xx pages are scannable, and we spend resources to crawl them, so there are no refunds for crawled pages that returned 4xx or 5xx errors. The same applies when the crawler is blocked by the Cloudflare filter. In such instances, the API server returns the invalid_page_status_code (4xx) error, and you will be charged for scanning one page.
Where to check the reason your website wasn’t crawled
To find out why your website wasn’t crawled, you should call the Summary endpoint using the ID of an unsuccessful task.
GET https://api.dataforseo.com/v3/on_page/summary/$id
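As an illustration, here is a minimal Python sketch of this call. The login, password, and task ID are placeholders; replace them with your own values:

import requests

# Sketch: fetching the crawl summary for a task.
# "login", "password", and the task ID are placeholders.
task_id = "your-task-id"  # ID returned in the Task POST response

response = requests.get(
    f"https://api.dataforseo.com/v3/on_page/summary/{task_id}",
    auth=("login", "password"),  # HTTP Basic auth credentials
)
summary = response.json()
print(summary)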
The response from the API server will contain the extended_crawl_status field. It indicates why the crawler was unable to scan a particular website and can take the following values:
- no_errors – no crawling errors were detected;
- site_unreachable – our crawler could not reach the website and thus was not able to obtain a status code;
- invalid_page_status_code – the status code of the first crawled page is >= 400;
- forbidden_meta_tag – the first crawled page contains the meta robots="noindex" tag;
- forbidden_robots – robots.txt forbids crawling the page;
- forbidden_http_header – the HTTP header of the page contains “X-Robots-Tag: noindex”;
- too_many_redirects – the first crawled page has more than 10 redirects;
- unknown – the reason is unknown.
When the bot is blocked by Cloudflare, the extended_crawl_status field displays the invalid_page_status_code value, just as with 4xx and 5xx errors. To verify that your site wasn’t scanned due to Cloudflare limitations, find the server field in the API response: it should contain the cloudflare string. A sketch of this check follows below.
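Here is a hedged sketch of that check, continuing from the summary response fetched above. It assumes extended_crawl_status and server appear under result[0]["domain_info"]; inspect your own payload and adjust the path if it differs:

# Sketch: distinguishing a Cloudflare block from an ordinary 4xx/5xx result.
# Assumes "summary" from the previous snippet; the exact JSON path to
# domain_info is an assumption, so verify it against your own response.
domain_info = summary["tasks"][0]["result"][0]["domain_info"]

status = domain_info.get("extended_crawl_status")
server = (domain_info.get("server") or "").lower()

if status == "invalid_page_status_code" and "cloudflare" in server:
    print("The crawler was most likely blocked by the Cloudflare filter.")
elif status != "no_errors":
    print(f"Crawl failed: {status}")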