How to extract blog metadata with On-Page API?
In simple terms, metadata provides information about the content of a web page, such as the title and description of a blog post. However, there are many types of metadata, and not every commercial crawler can extract all of them. At the same time, blog metadata is a valuable addition to the content audit functionality of any on-page SEO tool.
Here at DataForSEO, we have designed our On-Page API as a customizable crawling engine for measuring website technical health and content performance.
Using Custom JavaScript (custom_js) in On-Page API, you can extract blog metadata and virtually any piece of information from any web page.
The Custom JavaScript field is a parameter that lets you tailor data extraction to your specific needs. Enter the necessary JavaScript code in this field, and upon crawl completion, you’ll find the requested data in the "custom_js_response" field.
Below, we’ll walk you through extracting the author of the article, publication date, and comment count from a webpage.
First off, remember to register and get your API key, which you will use for authentication.
Learn more about Authentication in our docs.
After that, to post a task to On-Page API, you should use the following URL:
https://api.dataforseo.com/v3/on_page/task_post
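To make the authentication step and the endpoint more concrete, here is a minimal sketch of how the request could be assembled in JavaScript. It assumes the API accepts HTTP Basic authentication built from your account credentials, so please check the Authentication docs before relying on it. The helper name `buildTaskPostRequest` and the credentials passed to it are hypothetical.

```javascript
// Endpoint taken from this article.
const TASK_POST_URL = 'https://api.dataforseo.com/v3/on_page/task_post';

// Hypothetical helper: assembles the URL and fetch options for posting tasks.
// Assumption: HTTP Basic auth built from your API credentials.
function buildTaskPostRequest(login, password, tasks) {
  // btoa is a global in browsers and in Node.js 16+.
  const token = btoa(`${login}:${password}`);
  return {
    url: TASK_POST_URL,
    options: {
      method: 'POST',
      headers: {
        Authorization: `Basic ${token}`,
        'Content-Type': 'application/json',
      },
      // The request body is an array of task objects.
      body: JSON.stringify(tasks),
    },
  };
}
```

Nothing is sent until you pass the result to `fetch`, for example `fetch(request.url, request.options)`.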
1 At square one, define the required fields:
"target" – target domain,
"max_crawl_pages" – the number of pages to crawl.
If you want to crawl a single page, specify its URL in the "start_url" field and set the "max_crawl_pages" parameter to 1. You can also specify up to 20 pages for a priority crawl queue: to scan a list of pages before the rest, specify the necessary URLs in the "priority_urls" array. Learn more here.
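The fields described in this step can be put together as in the following sketch. The helper `buildTask` and its camelCase argument names are hypothetical; only the resulting field names ("target", "start_url", "max_crawl_pages", "priority_urls") and the limit of 20 priority URLs come from this article.

```javascript
// Hypothetical helper that assembles one On-Page API task object.
function buildTask({ target, startUrl, maxCrawlPages = 1, priorityUrls = [] }) {
  if (!target) {
    throw new Error('"target" is required');
  }
  if (priorityUrls.length > 20) {
    throw new Error('"priority_urls" accepts up to 20 URLs');
  }
  const task = { target, max_crawl_pages: maxCrawlPages };
  // Optional fields are added only when they are provided.
  if (startUrl) task.start_url = startUrl;
  if (priorityUrls.length > 0) task.priority_urls = priorityUrls;
  return task;
}

// A single-page crawl: "start_url" plus "max_crawl_pages" set to 1.
const singlePageTask = buildTask({
  target: 'medium.com',
  startUrl: 'https://medium.com/swlh/stop-this-microservices-madness-8e4e0695805b',
  maxCrawlPages: 1,
});
```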
2 If the page you’re going to crawl relies on JavaScript, its scripts should be executed to crawl the content properly, so set “enable_javascript” to true.
Note that additional charges apply for the use of “enable_javascript” and “custom_js” parameters.
3 Use the “custom_js” parameter to input your JavaScript code for custom data extraction. Note that the script you enter here must finish executing within 700 ms.
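Before submitting a script, you may want a rough idea of how long it runs. The sketch below times a function locally, for example in the browser console on the target page. The helper `timeScript` is hypothetical, and a local measurement is only an approximation of what the crawler will observe.

```javascript
// Execution time limit for custom_js stated in this article.
const CUSTOM_JS_LIMIT_MS = 700;

// Hypothetical helper: runs a function once and reports how long it took.
function timeScript(fn) {
  const startedAt = performance.now();
  const result = fn();
  const elapsedMs = performance.now() - startedAt;
  return { result, elapsedMs, withinLimit: elapsedMs <= CUSTOM_JS_LIMIT_MS };
}
```

For instance, `timeScript(() => document.querySelectorAll('a').length)` run in the console returns the link count together with the elapsed time in milliseconds.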
We’re going to extract the author of the article, publication date, and comment count from the following URL:
https://medium.com/swlh/stop-this-microservices-madness-8e4e0695805b
— So, first, we’re defining what elements should be extracted:
let meta = {authorArticle:'',publicationDate:'', commentCount:''};
— Then, we’re specifying the path to each element we need to obtain, defining that we require the text content from within the element located by that path:
meta.authorArticle = document.querySelector('div.es.aj.s > div > div > span > div > span > a').textContent;
meta.publicationDate = document.querySelector('div.es.aj.s > span > span > div > a').textContent;
meta.commentCount = document.querySelector('div > div.s > div > div > div > div > div > div > div > div > button').textContent;
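Keep in mind that `document.querySelector` returns `null` when nothing matches, and reading `.textContent` from `null` throws an error, so a single changed class name can make the whole script fail. A more defensive variant might look like the sketch below. The helpers `textOf` and `extractMeta` are hypothetical, and trimming the text is an addition that the original statements do not perform.

```javascript
// Returns the trimmed text of the first element matching the selector,
// or null when nothing matches (instead of throwing a TypeError).
function textOf(doc, selector) {
  const element = doc.querySelector(selector);
  return element ? element.textContent.trim() : null;
}

// Extracts the same three fields; `doc` is the page's `document` object.
function extractMeta(doc) {
  return {
    authorArticle: textOf(doc, 'div.es.aj.s > div > div > span > div > span > a'),
    publicationDate: textOf(doc, 'div.es.aj.s > span > span > div > a'),
    commentCount: textOf(doc, 'div > div.s > div > div > div > div > div > div > div > div > button'),
  };
}
```

With this approach, a selector that no longer matches yields `null` for that one field while the other fields are still returned.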
Note that you need to enter the JavaScript code into the “custom_js” field in DataForSEO On-Page API as a string.
Here’s what the snippet looks like in our example case:
"custom_js": "let meta = {authorArticle:'',publicationDate:'', commentCount:''}; meta.authorArticle = document.querySelector('div.es.aj.s > div > div > span > div > span > a').textContent; meta.publicationDate = document.querySelector('div.es.aj.s > span > span > div > a').textContent;meta.commentCount = document.querySelector('div > div.s > div > div > div > div > div > div > div > div > button').textContent; meta"
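If you prefer to keep the script readable while writing it, one option is to collapse it into a single line programmatically and let `JSON.stringify` handle the escaping. The sketch below is one possible way to do that; the helper `toCustomJs` is hypothetical, and it assumes the script contains no `//` line comments and ends every statement with a semicolon, since joining lines would otherwise change its meaning.

```javascript
// Hypothetical helper: collapses a multi-line script into a single line.
// Assumption: no `//` comments, and every statement ends with a semicolon.
function toCustomJs(source) {
  return source
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .join(' ');
}

// A shortened version of the script from this article; the trailing `meta`
// expression is kept, as in the original string.
const script = `
  let meta = {authorArticle:''};
  meta.authorArticle = document.querySelector('div.es.aj.s > div > div > span > div > span > a').textContent;
  meta
`;

// JSON.stringify escapes quotes and backslashes inside the script string.
const taskJson = JSON.stringify([{ target: 'medium.com', custom_js: toCustomJs(script) }]);
```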
Below you can see a full example request.
POST: https://api.dataforseo.com/v3/on_page/task_post
[
{
"target": "medium.com",
"start_url": "https://medium.com/swlh/stop-this-microservices-madness-8e4e0695805b",
"max_crawl_pages": 1,
"enable_javascript": true,
"custom_js": "let meta = {authorArticle:'',publicationDate:'', commentCount:''}; meta.authorArticle = document.querySelector('div.es.aj.s > div > div > span > div > span > a').textContent; meta.publicationDate = document.querySelector('div.es.aj.s > span > span > div > a').textContent;meta.commentCount = document.querySelector('div > div.s > div > div > div > div > div > div > div > div > button').textContent; meta"
}
]
4 Once the crawl is completed, you can view the data requested by your JavaScript code in the “custom_js_response” array using the following endpoints:
Here’s the response we received with the Pages endpoint. The “custom_js_response” field appears near the end of the page item, right after “total_dom_size”.
{
"version": "0.1.20210917",
"status_code": 20000,
"status_message": "Ok.",
"time": "1.5488 sec.",
"cost": 0,
"tasks_count": 1,
"tasks_error": 0,
"tasks": [
{
"id": "09172009-1535-0216-0000-f64d2f5dc2b5",
"status_code": 20000,
"status_message": "Ok.",
"time": "1.4929 sec.",
"cost": 0,
"result_count": 1,
"path": [
"v3",
"on_page",
"pages"
],
"data": {
"api": "on_page",
"function": "pages",
"target": "medium.com",
"start_url": "https://medium.com/swlh/stop-this-microservices-madness-8e4e0695805b",
"max_crawl_pages": 1,
"enable_javascript": true,
"custom_js": "let meta = {authorArticle:'',publicationDate:'', commentCount:''}; meta.authorArticle = document.querySelector('div.es.aj.s > div > div > span > div > span > a').textContent; meta.publicationDate = document.querySelector('div.es.aj.s > span > span > div > a').textContent;meta.commentCount = document.querySelector('div > div.s > div > div > div > div > div > div > div > div > button').textContent; meta"
},
"result": [
{
"crawl_progress": "finished",
"crawl_status": {
"max_crawl_pages": 1,
"pages_in_queue": 0,
"pages_crawled": 1
},
"total_items_count": 1,
"items_count": 1,
"items": [
{
"resource_type": "html",
"status_code": 200,
"location": null,
"url": "https://medium.com/swlh/stop-this-microservices-madness-8e4e0695805b",
"meta": {
"title": "Stop this Microservices Madness. I’m a big fan of Microservices.\nDoes it… | by Federico Pugliese | The Startup | Medium",
"charset": 65001,
"follow": true,
"generator": null,
"htags": {
"h1": [
"Stop this Microservices Madness",
"Splitting is the way",
"Is it really new?",
"The Two Principles — The Good Tailor",
"When to split?",
"But why on Earth would I split my application into sub-processes?",
"Is it not your case? Go for a Monolith.",
"And remember, Monolith doesn’t mean Mess",
"Conclusions"
],
"h2": [
"1. Scalability for non-uniform traffic",
"2. Error resilience",
"3. Separate deployments",
"4. Complete isolation",
"5. Different requirements",
"Is Object-Oriented Programming not suitable for your application architecture? Or, better, is RESTful design not suitable? Microservices won’t be either!",
"The Startup",
"Sign up for Top 10 Stories",
"Federico Pugliese",
"The Startup",
"Federico Pugliese",
"The Startup",
"More From Medium",
"Should You Unit-Test in ASP.NET Core?",
"How to Write Bug-Free Code: Why Framing Matters",
"Manage Up: Take Charge of Your Own Growth",
"Python eval function — the right and wrong way",
"Supercharge your marketing with Slalom and Google Cloud",
"To Clone or Not to Clone?",
"Should You Unit Test?",
"To Keep Track of Reddit Discussions Around New York Times Content, We Built a Slack Bot",
"Learn more.",
"Make Medium yours.",
"Write a story on Medium."
],
"h3": [
"By The Startup"
]
},
"description": "Every now and then, a new paradigm or technology arises. It brings hope and hype. This will save us all!\nYeah, it’s true. Maybe it is ground-breaking. Extremely beneficial. However, like everything…",
"favicon": "https://miro.medium.com/1*m-R_BkNf1Qjr1YbyOIJY2w.png",
"meta_keywords": null,
"canonical": "https://medium.com/swlh/stop-this-microservices-madness-8e4e0695805b",
"internal_links_count": 39,
"external_links_count": 29,
"inbound_links_count": 1,
"images_count": 18,
"images_size": 0,
"scripts_count": 42,
"scripts_size": 0,
"stylesheets_count": 0,
"stylesheets_size": 0,
"title_length": 119,
"description_length": 198,
"render_blocking_scripts_count": 2,
"render_blocking_stylesheets_count": 15,
"cumulative_layout_shift": 0,
"content": {
"plain_text_size": 9979,
"plain_text_rate": 0.05322672697500013,
"plain_text_word_count": 1712,
"automated_readability_index": 6.02838785046729,
"coleman_liau_readability_index": null,
"dale_chall_readability_index": 7.042250373831776,
"flesch_kincaid_readability_index": 57.84075700934582,
"smog_readability_index": 15.129267139461017,
"description_to_content_consistency": 1,
"title_to_content_consistency": 0.8333333134651184,
"meta_keywords_to_content_consistency": null
},
"deprecated_tags": null,
"duplicate_meta_tags": null,
"spell": null,
"social_media_tags": {
"twitter:app:name:iphone": "Medium",
"twitter:app:id:iphone": "828256236",
"al:ios:app_name": "Medium",
"al:ios:app_store_id": "828256236",
"al:android:package": "com.medium.reader",
"fb:app_id": "542599432471018",
"og:site_name": "Medium",
"og:type": "article",
"article:published_time": "2021-02-23T12:15:50.835Z",
"og:title": "Stop this Microservices Madness",
"twitter:title": "Stop this Microservices Madness",
"twitter:site": "@startitup_",
"twitter:app:url:iphone": "medium://p/8e4e0695805b",
"al:android:url": "medium://p/8e4e0695805b",
"al:ios:url": "medium://p/8e4e0695805b",
"al:android:app_name": "Medium",
"og:description": "I’m a big fan of Microservices.\nDoes it sound weird, given the title? Well, it should not.",
"twitter:description": "I’m a big fan of Microservices.\nDoes it sound weird, given the title? Well, it should not.",
"og:url": "https://medium.com/swlh/stop-this-microservices-madness-8e4e0695805b",
"al:web:url": "https://medium.com/swlh/stop-this-microservices-madness-8e4e0695805b",
"og:image": "https://miro.medium.com/max/1200/0*vm-MCaL0fcjQgLZ1",
"twitter:image:src": "https://miro.medium.com/max/1200/0*vm-MCaL0fcjQgLZ1",
"twitter:card": "summary_large_image",
"article:author": "https://federicopugliese.medium.com",
"twitter:label1": "Reading time",
"twitter:data1": "6 min read"
}
},
"page_timing": {
"time_to_interactive": 0,
"dom_complete": 0,
"largest_contentful_paint": 0,
"first_input_delay": 0,
"connection_time": 0,
"time_to_secure_connection": 0,
"request_sent_time": 0,
"waiting_time": 0,
"download_time": 0,
"duration_time": 0,
"fetch_start": 0,
"fetch_end": 0
},
"onpage_score": 95.24,
"total_dom_size": 187666,
"custom_js_response": {
"authorArticle": "Federico Pugliese",
"publicationDate": "Jan 19",
"commentCount": "62 responses"
},
"resource_errors": {
"errors": null,
"warnings": [
{
"line": 0,
"message": "Has node with more than 60 childs.",
"status_code": 1
}
]
},
"broken_resources": false,
"broken_links": false,
"duplicate_title": false,
"duplicate_description": false,
"duplicate_content": false,
"click_depth": 0,
"size": 187666,
"encoded_size": 133258,
"total_transfer_size": 133258,
"fetch_time": "2021-09-17 20:09:32 +00:00",
"cache_control": {
"cachable": true,
"ttl": 0
},
"checks": {
"no_content_encoding": false,
"high_loading_time": false,
"is_redirect": false,
"is_4xx_code": false,
"is_5xx_code": false,
"is_broken": false,
"is_www": false,
"is_https": true,
"is_http": false,
"high_waiting_time": false,
"no_doctype": false,
"canonical": true,
"no_encoding_meta_tag": false,
"no_h1_tag": false,
"https_to_http_links": false,
"has_html_doctype": true,
"size_greater_than_3mb": false,
"meta_charset_consistency": false,
"has_meta_refresh_redirect": false,
"has_render_blocking_resources": true,
"redirect_chain": false,
"low_content_rate": true,
"high_content_rate": false,
"low_character_count": false,
"high_character_count": false,
"small_page_size": false,
"large_page_size": false,
"low_readability_rate": false,
"irrelevant_description": false,
"irrelevant_title": false,
"irrelevant_meta_keywords": false,
"title_too_long": true,
"title_too_short": false,
"deprecated_html_tags": false,
"duplicate_meta_tags": false,
"duplicate_title_tag": false,
"no_image_alt": true,
"no_image_title": true,
"no_description": false,
"no_title": false,
"no_favicon": false,
"seo_friendly_url": true,
"flash": false,
"frame": true,
"lorem_ipsum": false,
"seo_friendly_url_characters_check": true,
"seo_friendly_url_dynamic_check": true,
"seo_friendly_url_keywords_check": true,
"seo_friendly_url_relative_length_check": true,
"recursive_canonical": false,
"canonical_chain": false,
"canonical_to_redirect": false,
"canonical_to_broken": false,
"has_links_to_redirects": false,
"is_orphan_page": false,
"is_link_relation_conflict": true
},
"content_encoding": "gzip",
"media_type": "text/html",
"server": "cloudflare",
"is_resource": false
}
]
}
]
}
]
}
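Once you have the response as a parsed object, the extracted values can be collected with a few nested loops. The sketch below follows the tasks → result → items structure of the example response. The helper `collectCustomJsResponses` is hypothetical. It also reads "article:published_time" from "social_media_tags", since the example response shows that some metadata is already available in the standard fields.

```javascript
// Hypothetical helper: collects custom_js_response from every crawled page
// in a parsed Pages response.
function collectCustomJsResponses(response) {
  const pages = [];
  for (const task of response.tasks ?? []) {
    for (const result of task.result ?? []) {
      for (const item of result.items ?? []) {
        pages.push({
          url: item.url,
          // Whatever the custom script returned, or null if it is missing.
          customJs: item.custom_js_response ?? null,
          // Publication time from the page's social media tags, if present.
          publishedTime: item.meta?.social_media_tags?.['article:published_time'] ?? null,
        });
      }
    }
  }
  return pages;
}
```

Missing "tasks", "result", or "items" arrays are treated as empty, so a task that is still in progress simply produces no entries.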
To learn more about the use cases for metadata extraction and to review more instructions, check out our blog post.
For more information on the cost of tasks with “enable_javascript” and “custom_js” parameters, please refer to this help article.