How To Extract Blog Metadata Using JavaScript And On-Page API

Irene T.

5 years ago


Imagine you could push your crawl configuration beyond the set of ready-made parameters. With On-Page API, you actually can.

Using the custom JavaScript field, you can extract virtually any piece of information from any web page.

In this article, we’ll walk you through extracting author, publication date, and comment count for content audits.

We’ll also share an actionable example of JavaScript code that you can modify for your own crawls. Note, however, that the code has to be tailored to your target website.

Why Content Audits?

In our recent research into the on-page SEO tools niche, we found that content analysis functionality is a growing trend, so we decided to focus on it here.

For example, with On-Page API and custom JavaScript, you can add the following content performance tracking features to your on-page SEO tool.

First off, you can let users of your tool analyze the SEO performance of content by age, based on publication date. Similarly, you can implement a feature for quickly spotting outdated content based on the extracted “last modified” date.

By seeing the author of each article, your users can immediately identify who’s responsible for updates. If you want your content audit section to support teamwork, you can also add a feature for creating page segments by author, so that each author has access to the pages they should be keeping an eye on.

What’s more, by enriching the dashboard with comment count data, you’ll provide an instant view of articles that drive more engagement. Additionally, you can show social shares for the full picture.

If you also connect the flow of Google Analytics data here, you’ll be able to develop a tool similar to the one shown below.

https://dataforseo.com/wp-content/uploads/2021/02/content_audit_onpage_api.mp4

What Is Custom JS in DataForSEO On-Page API?

Here at DataForSEO, we perfectly understand that attaining your goals oftentimes requires very specific features and functions. That’s why we have designed our On-Page API as a customizable crawling engine for measuring website technical health and content performance.

Custom JavaScript (custom_js) in DataForSEO On-Page API is a parameter that lets you tailor data extraction to your specific needs. Just enter the necessary JavaScript code in this field, and once the crawl completes, you’ll find the requested data in the “custom_js_response” field. With this feature, you can scan any website or web page for specific information and retrieve it along with the rest of the crawl results.

On-Page API also returns spell-check errors and suggestions and scans page content to provide you with five different readability index values.

Learn more in our docs >>

How to Extract Blog Metadata with DataForSEO

Now we’ve come to the most interesting part: configuring our request to DataForSEO On-Page API with the custom JavaScript parameter. For this example, we’ll make the call through Postman.

Remember that to post a task, you should use the following URL:

https://api.dataforseo.com/v3/on_page/task_post

1 First, we’ll define the required fields:

“target” – target domain,
“max_crawl_pages” – the number of pages to crawl.

If you want to crawl a single page, specify its URL in the “start_url” field, and additionally set the “max_crawl_pages” parameter to 1.

2 Next, if the page you’re going to crawl relies on JavaScript to render its content, the scripts must be executed for the content to be crawled properly, so set “enable_javascript” to true.

3 Here comes the custom_js parameter. This is where you enter your JavaScript code for custom data extraction. Note that the script’s execution time must not exceed 700 ms.

We’re going to extract the author of the article, publication date, and comment count from the following URL:

https://medium.com/swlh/stop-this-microservices-madness-8e4e0695805b

First, we define which elements should be extracted:

let meta = {authorArticle:'',publicationDate:'', commentCount:0}

Then, we specify the path to each element we need and take the text content of the element located by that path:

meta.authorArticle = document.querySelector("#root > div > div.s > article > div > section > div > div > div > div > div > div.o.n > div.eu.aj.s > div > div > span > div > span > a").textContent
meta.publicationDate = document.querySelector("#root > div > div.s > article > div > section > div > div > div > div > div > div.o.n > div.eu.aj.s > span > span > div > a").textContent
meta.commentCount = document.querySelector("#root > div > div.s > div:nth-child(7) > div > div.n.p > div > div.mv.n.en.z > div.n.li > button > div > div > h4").textContent

You can check the paths to the necessary elements of the target website in Chrome DevTools: inspect an element, right-click its node in the Elements panel, and choose Copy > Copy selector or Copy JS path. You can also shorten the paths, which is what we’ll do in this example.

You can also check if your script returns the right data by entering it in the Chrome DevTools Console.
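One thing to watch for when testing: document.querySelector returns null when nothing matches, and reading .textContent from null throws an error. The sketch below is one possible defensive variant of the script above. The names SELECTORS, textOf, and extractMeta are our own illustrative helpers, not part of On-Page API, and the selectors are the shortened paths for the example page only.

```javascript
// Shortened selectors for the example Medium page; other sites need their own.
const SELECTORS = {
  authorArticle: 'div.eu.aj.s > div > div > span > div > span > a',
  publicationDate: 'div.eu.aj.s > span > span > div > a',
  commentCount: 'div.mv.n.en.z > div.n.li > button > div > div > h4'
};

// Hypothetical helper: trimmed text of the first match, or '' if nothing matches.
function textOf(doc, selector) {
  const el = doc.querySelector(selector);
  return el ? el.textContent.trim() : '';
}

// Builds the same meta object as the script above, without throwing on a miss.
function extractMeta(doc) {
  const meta = {};
  for (const [key, selector] of Object.entries(SELECTORS)) {
    meta[key] = textOf(doc, selector);
  }
  return meta;
}
```

In the DevTools Console, calling extractMeta(document) on the target page shows what the script would return; an empty string points to a selector that no longer matches.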

Note that you need to enter the JavaScript code into the “custom_js” field in DataForSEO On-Page API as a string.
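If you write the script as regular multi-line JavaScript first, one way to turn it into a valid JSON string is to let JSON.stringify handle the escaping. This is just a convenience sketch; the variable names are arbitrary.

```javascript
// The extraction script, written as ordinary multi-line JavaScript.
// The last line ("meta") is the value the script evaluates to.
const script = `let meta = {authorArticle:'', publicationDate:'', commentCount:''}
meta.authorArticle = document.querySelector('div.eu.aj.s > div > div > span > div > span > a').textContent
meta.publicationDate = document.querySelector('div.eu.aj.s > span > span > div > a').textContent
meta.commentCount = document.querySelector('div.mv.n.en.z > div.n.li > button > div > div > h4').textContent
meta`;

// JSON.stringify escapes line breaks as \n and would escape any double quotes as \".
const field = JSON.stringify({ custom_js: script });
console.log(field);
```

Using single quotes inside the script, as we do here, keeps the resulting string easy to read because nothing but the line breaks needs escaping.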

Here’s what the snippet looks like in our example case:

"custom_js": "let meta = {authorArticle:'',publicationDate:'', commentCount:''}\n meta.authorArticle = document.querySelector('div.eu.aj.s > div > div > span > div > span > a').textContent\n meta.publicationDate = document.querySelector('div.eu.aj.s > span > span > div > a').textContent\n meta.commentCount = document.querySelector('div.mv.n.en.z > div.n.li > button > div > div > h4').textContent\n meta"

4 At this point, we can hit the “send” button to start the crawl.

Request Sample
POST: https://api.dataforseo.com/v3/on_page/task_post

[
    {
        "target": "medium.com",
        "start_url": "https://medium.com/swlh/stop-this-microservices-madness-8e4e0695805b",
        "max_crawl_pages": 1,
        "enable_javascript": true,
        "custom_js": "let meta = {authorArticle:'',publicationDate:'', commentCount:''}\n meta.authorArticle = document.querySelector('div.eu.aj.s > div > div > span > div > span > a').textContent\n meta.publicationDate = document.querySelector('div.eu.aj.s > span > span > div > a').textContent\n meta.commentCount = document.querySelector('div.mv.n.en.z > div.n.li > button > div > div > h4').textContent\n meta"
    }
]

Response Sample
{
    "version": "0.1.20210129",
    "status_code": 20000,
    "status_message": "Ok.",
    "time": "0.1260 sec.",
    "cost": 0.001375,
    "tasks_count": 1,
    "tasks_error": 0,
    "tasks": [
        {
            "id": "02031338-1535-0216-0000-55445150c8ea",
            "status_code": 20100,
            "status_message": "Task Created.",
            "time": "0.0048 sec.",
            "cost": 0.001375,
            "result_count": 0,
            "path": [
                "v3",
                "on_page",
                "task_post"
            ],
            "data": {
                "api": "on_page",
                "function": "task_post",
                "target": "medium.com",
                "start_url": "https://medium.com/swlh/stop-this-microservices-madness-8e4e0695805b",
                "max_crawl_pages": 1,
                "enable_javascript": true,
                "custom_js": "let meta = {authorArticle:'',publicationDate:'', commentCount:''}\n meta.authorArticle = document.querySelector('div.eu.aj.s > div > div > span > div > span > a').textContent\n meta.publicationDate = document.querySelector('div.eu.aj.s > span > span > div > a').textContent\n meta.commentCount = document.querySelector('div.mv.n.en.z > div.n.li > button > div > div > h4').textContent\n meta"
            },
            "result": null
        }
    ]
}
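
If you’d rather send the request from code than from Postman, the call could look roughly like the sketch below. It assumes a runtime with global fetch and btoa (a modern browser or a recent Node.js version) and Basic authentication with your API login and password, which this article does not cover, so check the docs for your account. buildTaskRequest and postTask are hypothetical helper names.

```javascript
const TASK_POST_URL = 'https://api.dataforseo.com/v3/on_page/task_post';

// Assembles the URL and fetch options; kept separate so it can be inspected offline.
function buildTaskRequest(login, password, tasks) {
  return {
    url: TASK_POST_URL,
    options: {
      method: 'POST',
      headers: {
        // Assumption: Basic auth built from the API login and password.
        'Authorization': 'Basic ' + btoa(`${login}:${password}`),
        'Content-Type': 'application/json'
      },
      body: JSON.stringify(tasks)
    }
  };
}

// Sends the task and returns its id, which is needed to fetch results later.
async function postTask(login, password, tasks) {
  const { url, options } = buildTaskRequest(login, password, tasks);
  const response = await fetch(url, options);
  const data = await response.json();
  const task = data.tasks[0];
  // 20100 is the "Task Created." code seen in the response sample above.
  if (task.status_code !== 20100) {
    throw new Error(`Task was not created: ${task.status_message}`);
  }
  return task.id;
}
```

The tasks argument is the same array as in the request sample, for instance [{ target: "medium.com", max_crawl_pages: 1, ... }].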

5 Once the crawl is completed, you can view the data requested by your JavaScript code in the “custom_js_response” field. You can find it in the results of the following endpoint:

Request Sample
POST: https://api.dataforseo.com/v3/on_page/pages

[
  {
    "id": "02031338-1535-0216-0000-55445150c8ea"
  }
]

Response Sample
{
    "version": "0.1.20210129",
    "status_code": 20000,
    "status_message": "Ok.",
    "time": "0.9288 sec.",
    "cost": 0,
    "tasks_count": 1,
    "tasks_error": 0,
    "tasks": [
        {
            "id": "02031338-1535-0216-0000-55445150c8ea",
            "status_code": 20000,
            "status_message": "Ok.",
            "time": "0.8446 sec.",
            "cost": 0,
            "result_count": 1,
            "path": [
                "v3",
                "on_page",
                "pages"
            ],
            "data": {
                "api": "on_page",
                "function": "pages"
            },
            "result": [
                {
                    "crawl_progress": "finished",
                    "total_items_count": 1,
                    "items_count": 1,
                    "items": [
                        {
                            "resource_type": "html",
                            "status_code": 200,
                            "location": null,
                            "url": "https://medium.com/swlh/stop-this-microservices-madness-8e4e0695805b",
                            "meta": {
                                "title": "Stop this Microservices Madness. I’m a big fan of Microservices. Does it… | by Federico Pugliese | The Startup | Jan, 2021 | Medium",
                                "charset": 65001,
                                "follow": true,
                                "generator": null,
                                "htags": {
                                    "h1": [
                                        "Stop this Microservices Madness",
                                        "Splitting is the way",
                                        "Is it really new?",
                                        "The Two Principles — The Good Tailor",
                                        "When to split?",
                                        "But why on Earth would I split my application into sub-processes?",
                                        "Is it not your case? Go for a Monolith.",
                                        "And remember, Monolith doesn’t mean Mess",
                                        "Conclusions"
                                    ],
                                    "h2": [
                                        "1. Scalability for non-uniform traffic",
                                        "2. Error resilience",
                                        "3. Separate deployments",
                                        "4. Complete isolation",
                                        "5. Different requirements",
                                        "Is Object-Oriented Programming not suitable for your application? Or, better, is RESTful design not suitable? Microservices won’t be either!",
                                        "The Startup",
                                        "Federico Pugliese",
                                        "The Startup",
                                        "Federico Pugliese",
                                        "The Startup",
                                        "More From Medium",
                                        "Genuinely useful career resources for self-taught developers",
                                        "What network engineers need to know about software engineering",
                                        "Decoupling Dataflow with Cloud Tasks and Cloud Functions",
                                        "Top 3 Django Gotchas to Catch during Code Review",
                                        "A Top Car Trading Platform Chooses a Scale-out Database as a MySQL Alternative",
                                        "Personal VPN server on AWS",
                                        "The Serverless Revolution",
                                        "How to make an iOS on-demand build system with Jenkins  and Fastlane",
                                        "Learn more.",
                                        "Make Medium yours.",
                                        "Share your thinking."
                                    ],
                                    "h4": [
                                        "Medium's largest active publication, followed by +760K people. Follow to join our community.",
                                        "3.6K",
                                        "44",
                                        "3.6K claps",
                                        "3.6K claps",
                                        "44 responses",
                                        "Cloud, DevOps and ML Engineer. Aspirant Writer.",
                                        "Medium's largest active publication, followed by +760K people. Follow to join our community.",
                                        "Cloud, DevOps and ML Engineer. Aspirant Writer.",
                                        "Medium's largest active publication, followed by +760K people. Follow to join our community.",
                                        "Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more",
                                        "Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore",
                                        "If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium",
                                        "AboutHelpLegal",
                                        "About",
                                        "Help",
                                        "Legal",
                                        "Get the Medium app"
                                    ]
                                },
                                "description": "Every now and then, a new paradigm or technology arises. It brings hope and hype. This will save us all!\nYeah, it’s true. Maybe it is ground-breaking. Extremely beneficial. However, like everything…",
                                "favicon": "https://miro.medium.com/1*m-R_BkNf1Qjr1YbyOIJY2w.png",
                                "meta_keywords": null,
                                "canonical": "https://medium.com/swlh/stop-this-microservices-madness-8e4e0695805b",
                                "internal_links_count": 35,
                                "external_links_count": 22,
                                "inbound_links_count": 0,
                                "images_count": 19,
                                "images_size": 0,
                                "scripts_count": 36,
                                "scripts_size": 0,
                                "stylesheets_count": 0,
                                "stylesheets_size": 0,
                                "title_length": 131,
                                "description_length": 198,
                                "content": {
                                    "plain_text_size": 10723,
                                    "plain_text_rate": 0.05904313017239955,
                                    "plain_text_word_count": 1838,
                                    "automated_readability_index": 5.51671337823133,
                                    "coleman_liau_readability_index": null,
                                    "dale_chall_readability_index": 6.88438226744699,
                                    "flesch_kincaid_readability_index": 59.29576952033645,
                                    "smog_readability_index": 14.585430642216284,
                                    "description_to_content_consistency": 1,
                                    "title_to_content_consistency": 0.699999988079071,
                                    "meta_keywords_to_content_consistency": null
                                },
                                "deprecated_tags": null,
                                "duplicate_meta_tags": null,
                                "spell": null
                            },
                            "page_timing": {
                                "time_to_interactive": 604,
                                "dom_complete": 604,
                                "connection_time": 17,
                                "time_to_secure_connection": 11,
                                "request_sent_time": 0,
                                "waiting_time": 574,
                                "download_time": 2,
                                "duration_time": 604,
                                "fetch_start": 0,
                                "fetch_end": 604
                            },
                            "total_dom_size": 181764,
                            "custom_js_response": {
                                "authorArticle": "Federico Pugliese",
                                "publicationDate": "Jan 19",
                                "commentCount": "44 responses"
                            },
                            "broken_resources": false,
                            "broken_links": false,
                            "duplicate_title": false,
                            "duplicate_description": false,
                            "duplicate_content": false,
                            "click_depth": 0,
                            "size": 181764,
                            "encoded_size": 137644,
                            "total_transfer_size": 137644,
                            "fetch_time": "2021-02-03 13:38:48 +00:00",
                            "cache_control": {
                                "cachable": true,
                                "ttl": 0
                            },
                            "checks": {
                                "no_content_encoding": false,
                                "high_loading_time": false,
                                "is_redirect": false,
                                "is_4xx_code": false,
                                "is_5xx_code": false,
                                "is_broken": false,
                                "is_www": false,
                                "is_https": true,
                                "is_http": false,
                                "high_waiting_time": false,
                                "no_doctype": false,
                                "canonical": true,
                                "no_encoding_meta_tag": false,
                                "no_h1_tag": false,
                                "low_content_rate": true,
                                "high_content_rate": false,
                                "low_character_count": false,
                                "high_character_count": false,
                                "small_page_size": false,
                                "large_page_size": false,
                                "low_readability_rate": false,
                                "irrelevant_description": false,
                                "irrelevant_title": false,
                                "irrelevant_meta_keywords": false,
                                "title_too_long": true,
                                "title_too_short": false,
                                "deprecated_html_tags": false,
                                "duplicate_meta_tags": false,
                                "duplicate_title_tag": false,
                                "no_image_alt": true,
                                "no_image_title": true,
                                "no_description": false,
                                "no_title": false,
                                "no_favicon": false,
                                "seo_friendly_url": true,
                                "flash": false,
                                "frame": false,
                                "lorem_ipsum": false,
                                "seo_friendly_url_characters_check": true,
                                "seo_friendly_url_dynamic_check": true,
                                "seo_friendly_url_keywords_check": true,
                                "seo_friendly_url_relative_length_check": true,
                                "recursive_canonical": false
                            },
                            "content_encoding": "gzip",
                            "media_type": "text/html",
                            "server": "cloudflare",
                            "is_resource": false
                        }
                    ]
                }
            ]
        }
    ]
}

As you can see, the script we entered returned the following response:

"custom_js_response": { "authorArticle": "Federico Pugliese", "publicationDate": "Jan 19", "commentCount": "44 responses" },

By repeating these few simple steps, you can configure a custom extraction specific to your target website.
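
The values in “custom_js_response” come back as display strings (for example, "44 responses"), so you’ll probably want to normalize them before showing them in a dashboard. Here is a minimal sketch of how the pages response could be flattened into rows. collectCustomJs and parseCount are illustrative names of our own, and the number parsing is deliberately naive: it takes the first run of digits and would not expand an abbreviated value such as "3.6K".

```javascript
// Hypothetical helper: first integer found in a string, or null if there is none.
function parseCount(text) {
  const match = /\d+/.exec(String(text || ''));
  return match ? Number(match[0]) : null;
}

// Walks tasks -> result -> items and returns one row per page that has custom JS data.
function collectCustomJs(response) {
  const rows = [];
  for (const task of response.tasks || []) {
    // "result" is null until there is something to return, hence the fallback.
    for (const result of task.result || []) {
      if (result.crawl_progress !== 'finished') continue;
      for (const item of result.items || []) {
        const custom = item.custom_js_response;
        if (!custom) continue;
        rows.push({
          url: item.url,
          ...custom,
          commentCount: parseCount(custom.commentCount)
        });
      }
    }
  }
  return rows;
}
```

Publication dates such as "Jan 19" carry no year, so turning them into full dates needs site-specific logic that we leave out here.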

How Much Does It Cost?

In our example page crawl, we used Load JavaScript and Custom JavaScript in addition to the basic set of parameters. Accordingly, the total cost of the page crawl includes the price of basic features and the price of those we additionally enabled.

$0.000125 (basic) + $0.001125 (load JS) + $0.000125 (custom JS) = $0.001375

For example, if you top up your account with $50, you can scan approximately:

Features Enabled	Basic + Custom JavaScript	Basic + Custom JavaScript + Load Resources	Basic + Custom JavaScript + Load JavaScript	All features: Basic + Custom JavaScript + Load Resources + Load JavaScript
Pages Scanned	200,000	100,000	36,363	30,769

In the table below, you can review the pricing for one page scan based on all possible combinations of the On-Page API features. To estimate the cost for your case, just multiply the number from the last column by the number of pages you need to scan.

Basic	Load Resources	Load JavaScript	Custom JavaScript	Total cost
$0.000125	X	X	X	$0.000125
$0.000125	$0.00025	X	X	$0.000375
$0.000125	X	$0.001125	X	$0.00125
$0.000125	X	X	$0.000125	$0.00025
$0.000125	$0.00025	$0.001125	X	$0.0015
$0.000125	$0.00025	X	$0.000125	$0.0005
$0.000125	X	$0.001125	$0.000125	$0.001375
$0.000125	$0.00025	$0.001125	$0.000125	$0.001625
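
If you want to run these estimates in code, the arithmetic can be sketched as follows. The per-page prices are the ones listed in this article; to avoid floating-point rounding, the sketch stores them in millionths of a dollar. The names BASIC_MICRO_USD, OPTIONAL_MICRO_USD, costPerPageMicroUsd, and pagesForBudget are our own.

```javascript
// Per-page prices from the table above, in millionths of a US dollar.
const BASIC_MICRO_USD = 125; // $0.000125, always charged
const OPTIONAL_MICRO_USD = {
  loadResources: 250,     // $0.00025
  loadJavaScript: 1125,   // $0.001125
  customJavaScript: 125   // $0.000125
};

// Cost of one page scan: the basic price plus every enabled optional feature.
function costPerPageMicroUsd(features) {
  return features.reduce((sum, name) => {
    if (!Object.prototype.hasOwnProperty.call(OPTIONAL_MICRO_USD, name)) {
      throw new Error(`Unknown feature: ${name}`);
    }
    return sum + OPTIONAL_MICRO_USD[name];
  }, BASIC_MICRO_USD);
}

// Whole number of pages that fit into a budget given in dollars.
function pagesForBudget(budgetUsd, features) {
  return Math.floor((budgetUsd * 1e6) / costPerPageMicroUsd(features));
}

console.log(pagesForBudget(50, ['customJavaScript', 'loadJavaScript'])); // 36363
```

Passing all three optional features with a $50 budget gives 30769 pages, which matches the "All features" column of the $50 table.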

Check our pricing >>

Final Thoughts

As you can see, our On-Page API is very flexible: you can combine a variety of features and customize your crawls to fit your requirements.

Now that you know how to extract author, publication date, and comment count with the custom JavaScript parameter in our On-Page API, don’t hesitate to give it a try right away.