Can I ignore the sitemap when crawling the website? What if my website has more pages than the sitemap includes?
Yes, you can. In fact, our crawler ignores sitemaps by default.
If you want to follow the page order indicated in the primary sitemap of a website, add the respect_sitemap field to the On-Page API Task POST body and set it to true.
Example:
[
    {
        "target": "dataforseo.com",
        "max_crawl_pages": 10,
        "respect_sitemap": true
    }
]
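For reference, below is a minimal sketch of submitting this task in Python with the requests library. It assumes the standard v3 on_page/task_post endpoint and Basic Auth with your DataForSEO API credentials; the login and password values are placeholders.

import requests
from requests.auth import HTTPBasicAuth

# Placeholder credentials; replace with your DataForSEO API login and password.
LOGIN = "your_api_login"
PASSWORD = "your_api_password"

# Task payload from the example above: crawl up to 10 pages,
# following the order defined in the site's primary sitemap.
payload = [
    {
        "target": "dataforseo.com",
        "max_crawl_pages": 10,
        "respect_sitemap": True,
    }
]

response = requests.post(
    "https://api.dataforseo.com/v3/on_page/task_post",
    json=payload,
    auth=HTTPBasicAuth(LOGIN, PASSWORD),
    timeout=30,
)
response.raise_for_status()

# Keep the task id; you will need it to retrieve the crawl results later.
task_id = response.json()["tasks"][0]["id"]
print("Task created:", task_id)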
You can also follow a custom sitemap. To do this, add the custom_sitemap field to the Task POST body and specify your custom sitemap’s URL in it. Note that the respect_sitemap field must still be set to true.
Example:
[
    {
        "target": "dataforseo.com",
        "max_crawl_pages": 10,
        "respect_sitemap": true,
        "custom_sitemap": "https://dataforseo.com/customsitemap.xml"
    }
]
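If you build the Task POST payload programmatically, a small guard like the hypothetical helper below keeps custom_sitemap from being sent without respect_sitemap, since the custom sitemap is only followed when respect_sitemap is true.

def build_sitemap_task(target, max_crawl_pages, custom_sitemap=None):
    # Hypothetical helper, not part of any DataForSEO client library.
    # custom_sitemap only takes effect when respect_sitemap is true,
    # so respect_sitemap is always set here.
    task = {
        "target": target,
        "max_crawl_pages": max_crawl_pages,
        "respect_sitemap": True,
    }
    if custom_sitemap is not None:
        task["custom_sitemap"] = custom_sitemap
    return [task]

payload = build_sitemap_task(
    "dataforseo.com",
    10,
    custom_sitemap="https://dataforseo.com/customsitemap.xml",
)

The resulting payload can be submitted with the same requests.post call shown in the first sketch above.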
Also, note that when the respect_sitemap field is set to true, our crawler doesn’t analyze the click depth of scanned pages, so the click_depth value in the API response will equal 0.
If you require the click depth data, set the respect_sitemap field to false or don’t add it to the Task POST body at all.
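To check this in the crawl results, the sketch below retrieves the crawled pages for a finished task and prints each page’s click_depth, which should be 0 when respect_sitemap is true. It assumes the v3 on_page/pages endpoint, Basic Auth, and a placeholder task id; treat the exact response field names as an assumption and verify them against the On-Page API documentation.

import requests
from requests.auth import HTTPBasicAuth

LOGIN = "your_api_login"        # placeholder credentials
PASSWORD = "your_api_password"
TASK_ID = "your_task_id"        # id returned when the task was created

# Request the crawled pages for the finished task.
response = requests.post(
    "https://api.dataforseo.com/v3/on_page/pages",
    json=[{"id": TASK_ID, "limit": 10}],
    auth=HTTPBasicAuth(LOGIN, PASSWORD),
    timeout=30,
)
response.raise_for_status()

items = response.json()["tasks"][0]["result"][0]["items"]
for page in items:
    # With respect_sitemap set to true, click_depth is expected to be 0.
    print(page["url"], page["click_depth"])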
What if my website has more pages than the sitemap includes?
If you set the respect_sitemap field to true and your website has more pages than the sitemap specifies, our crawler will first scan the pages listed in the sitemap and then proceed to the remaining pages.