If you encounter difficulties with OnPage API, this article provides insights into several common issues along with troubleshooting tips.
DataForSEO crawler is blocked in robots.txt
A robots.txt file contains a set of rules that crawlers have to follow when accessing a website. In particular, this file specifies which pages crawlers are allowed to access and which pages are disallowed. If the website’s robots.txt restricts DataForSEO from crawling the content, OnPage API won’t be able to return results for this target. However, there are certain measures you can take to resolve this issue.
Check the robots.txt file
If you have access to the target website, check its robots.txt for any disallow commands blocking access to crawlers. To allow the DataForSEO On-Page API to crawl your site, add the following lines to your robots.txt file:
User-agent: Mozilla/5.0 (compatible; RSiteAuditor)
Disallow:
Leave the Disallow directive empty (nothing after the colon) so that no pages are blocked for this user agent.
Override robots.txt
In the Task POST array, set the robots_txt_merge_mode field to override. This mode ignores website crawling restrictions and other robots.txt settings. When you set this field to override, you also need to specify the custom_robots_txt field. There, you can create an alternative robots.txt file by setting a custom User-agent or disallowing crawling of certain directories.
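As an illustration, here is a minimal Python sketch of such a Task POST request. The endpoint path, credentials, target, and the content of the custom robots.txt are assumptions made for the example; only the robots_txt_merge_mode and custom_robots_txt fields come from the description above.

import requests

# Assumed v3 Task POST path and placeholder credentials (DataForSEO uses HTTP Basic auth).
API_URL = "https://api.dataforseo.com/v3/on_page/task_post"
AUTH = ("your_login", "your_password")

task = {
    "target": "example.com",          # placeholder target
    "max_crawl_pages": 10,            # small crawl, for illustration only
    # Ignore the restrictions in the site's own robots.txt...
    "robots_txt_merge_mode": "override",
    # ...and supply an alternative robots.txt instead (example content).
    "custom_robots_txt": "User-agent: *\nDisallow: /private/",
}

# The Task POST body is an array of task objects.
response = requests.post(API_URL, auth=AUTH, json=[task])
print(response.json())

The payload sketches later in this article plug into the same request pattern.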
Check if the first crawled page is closed from indexing
The <meta name="robots" content="noindex, nofollow"> tag on the starting URL of your website tells the DataForSEO crawler that it is not allowed to index the page or follow links on it. If a page contains the “noindex”, “nofollow”, or “none” directives, they will cause a crawling error. If you have access to the code of your target website, remove those tags to enable our bot to analyze it.
Change the first crawled page
If the first page to crawl is closed from indexing or has more than 10 redirects, you can change the first page to crawl by specifying its URL in the start_url field of your Task POST request.
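A sketch of the corresponding payload, reusing the request pattern from the earlier example (the target and the starting URL are placeholders):

task = {
    "target": "example.com",
    "max_crawl_pages": 10,
    # Start crawling from an indexable page instead of the default starting URL.
    "start_url": "https://example.com/blog/",
}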
Follow the sitemap
Set the crawl_sitemap_only field to true; this way, only the pages that should be accessible (those listed in the sitemap) will be crawled.
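A payload sketch for this mode, again assuming the request pattern shown earlier:

task = {
    "target": "example.com",
    "max_crawl_pages": 50,
    # Crawl only the pages listed in the target's sitemap.
    "crawl_sitemap_only": True,
}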
Crawler is blocked by IP or location
If you crawl multiple websites simultaneously, blocking may occur, manifested by the site_unreachable error. You can take the steps listed below to try to resolve this issue.
Switch the proxy pool
In the Task POST array, set the optional switch_pool field to true, and our system will use additional proxy pools to obtain the requested data.
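A payload sketch, assuming the same request pattern as before:

task = {
    "target": "example.com",
    "max_crawl_pages": 10,
    # Ask the system to use an additional proxy pool.
    "switch_pool": True,
}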
Check the language header
In the Task POST array, specify the optional accept_language parameter – the language header for accessing the website (supports all locale formats). If you do not indicate it, some websites may deny access.
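For example, a payload with an explicit language header (the locale value is just an illustration), using the same request pattern:

task = {
    "target": "example.com",
    "max_crawl_pages": 10,
    # Language header sent when accessing the website.
    "accept_language": "en-US",
}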
Whitelist the crawler’s IP addresses
DataForSEO crawler uses certain IP addresses to make requests. If these IPs are blocked, our crawler won’t be able to access a target website.
Below is the list of IPs that should be whitelisted:
94.130.93.30
168.119.141.170
168.119.99.190
168.119.99.191
168.119.99.192
168.119.99.193
168.119.99.194
68.183.60.34
134.209.42.109
68.183.60.80
68.183.54.131
68.183.49.222
68.183.149.30
68.183.157.22
68.183.149.129
Our bot uses the standard HTTP (80) and HTTPS (443) ports to connect.
JavaScript and popups interference
Most web scrapers check only static website content and, at best, provide a partial audit of dynamic elements. With DataForSEO, you can analyze the dynamic JS content on the target pages (charged additionally). You can also block cookie popups that may interfere with the crawler.
Load JavaScript
When setting a task, set the enable_javascript parameter to true to load the scripts available on a target website. You can learn more in this Help Center article.
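A payload sketch with JavaScript rendering enabled, assuming the request pattern from the first example:

task = {
    "target": "example.com",
    "max_crawl_pages": 10,
    # Load and execute the scripts available on the crawled pages (charged additionally).
    "enable_javascript": True,
}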
Disable cookie popups
To disable the popup requesting cookie consent from the user, set the disable_cookie_popup field to true.
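A payload sketch for suppressing cookie-consent popups, using the same request pattern:

task = {
    "target": "example.com",
    "max_crawl_pages": 10,
    # Suppress the cookie-consent popup that could interfere with crawling.
    "disable_cookie_popup": True,
}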
Domain could not be resolved by DNS
The Domain Name System (DNS) converts domain names into numbers called IP addresses. A DNS error happens when the DNS server contacted while loading a web page cannot find the site containing the requested page. If the target domain could not be resolved by DNS, it is probably offline.
Make sure that your nameservers are correct
A nameserver is a place where DNS records are stored. Web browsers look up and read those records to find out where websites are located. You can verify your nameserver using free online tools, such as who.is.
Check the domain validity
Ensure it is registered and not expired, e.g., using the Whois lookup tool.
Check if the A records point to the correct IP
The error may happen when you do not use nameservers or manually change the DNS records. The A record specifies the IP address for a given host; that is, it maps domain names to IPv4 addresses.
Add proper redirects
Users sometimes enter a root domain without realizing that there is no root domain version of their website and that they are supposed to enter the www version instead.
To avoid this, the website owner can add a redirect from the unsecured “example.com” to the secured “www.example.com” existing on the server. The other way round, this can happen if someone’s root domain is secured but their www version is not; then it would be necessary to redirect the www version to the root domain.
In the Task POST array, you can check if the requested domain implemented the www to non-www redirection by setting the enable_www_redirect_check field to true. You can find the result of this check in the test_www_redirect field of the Summary endpoint.
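A sketch of this check in Python; the Summary endpoint path, the location of the task id, and the exact nesting of the response are assumptions, so treat the access path in the last lines as illustrative only:

import requests

API_BASE = "https://api.dataforseo.com/v3/on_page"  # assumed base path
AUTH = ("your_login", "your_password")               # placeholder credentials

task = {
    "target": "example.com",
    "max_crawl_pages": 10,
    # Ask the crawler to test the www / non-www redirection.
    "enable_www_redirect_check": True,
}
post_result = requests.post(f"{API_BASE}/task_post", auth=AUTH, json=[task]).json()
task_id = post_result["tasks"][0]["id"]  # assumed location of the task id

# Once the crawl has finished, request the Summary for this task and look for
# the test_www_redirect check in the returned result object.
summary = requests.get(f"{API_BASE}/summary/{task_id}", auth=AUTH).json()
result = summary["tasks"][0]["result"][0]  # assumed response layout
print(result)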
Where do I find the reason my website wasn’t crawled?
To check the reason your website wasn’t analyzed, call the Errors endpoint of OnPage API.
The response will contain the extended_crawl_status field indicating why the crawler could not scan a certain page. The field can take the following values:
no_errors – no crawling errors were detected;
site_unreachable – our crawler could not reach the website and thus was not able to obtain a status code;
invalid_page_status_code – the status code of the first crawled page is >= 400;
forbidden_meta_tag – the first crawled page contains a restrictive robots meta tag (“noindex”, “nofollow”, or “none”);
forbidden_robots – robots.txt forbids crawling the page;
forbidden_http_header – the HTTP header of the page contains “X-Robots-Tag: noindex”;
too_many_redirects – the first crawled page has more than 10 redirects;
unknown – the reason is unknown.
When the DataForSeoBot is blocked by the Cloudflare service, the extended_crawl_status field displays the invalid_page_status_code value, just as with 4xx and 5xx errors. If your website wasn’t scanned due to Cloudflare limitations, the server field in the response will display the cloudflare string.
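A sketch of such a check in Python; the Errors endpoint path, the request parameters, and the response layout are assumptions made for the example, so adjust them to the actual API documentation:

import requests

ERRORS_URL = "https://api.dataforseo.com/v3/on_page/errors"  # assumed endpoint path
AUTH = ("your_login", "your_password")                        # placeholder credentials

# Request the most recent errors (the limit field is assumed here).
response = requests.post(ERRORS_URL, auth=AUTH, json=[{"limit": 10}]).json()

for task in response.get("tasks", []):
    for item in task.get("result") or []:
        # Each item is expected to report why a page could not be crawled,
        # including the extended_crawl_status value described above.
        print(item.get("extended_crawl_status"), item)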
Charging Notes
With DataForSEO, you pay only for the pages that were scanned successfully. If a page wasn’t crawled for one of the reasons listed above, you will get a refund. However, keep in mind that pages with 4xx and 5xx status codes are scannable and take our resources to crawl, so there will be no refunds for them. The same applies to pages blocked by the Cloudflare filter.
You can learn more about the cost of all additional On-Page API parameters in this Help Center article.