What is the difference between primary and secondary content in OnPage Content Parsing API?
The Content Parsing endpoint of OnPage API lets you retrieve structured content data from any page on the web.
If you have read the documentation for this endpoint, you might have noticed that the parsed content is divided into “primary” and “secondary.” Each page contains "main_topic" and "secondary_topic" sections, and the content of each section is in turn broken into "primary_content" and "secondary_content" arrays.
In this Help Center article, we will walk through the algorithm that helps DataForSEO determine whether a certain piece of content has “primary” or “secondary” meaning on the page. Additionally, we’ll explain how the content pieces are grouped under the main and secondary topics of the page.
How are primary and secondary content pieces determined?
First, every piece of content is assigned a score. Each word in the content is checked against two criteria:
- Length of the word. Does it contain more than one character?
- The composition of the word. Do letters constitute more than half of the content?
In a nutshell, the score of a piece of content is the number of its words that meet both criteria. The score for anchor text is calculated in the same way.
To decide whether a piece of content is “primary” or “secondary”, the score threshold is set to 3. Content with a score greater than 3 in which anchor text accounts for less than 30% of the total volume is deemed “primary”. Content that doesn’t meet both conditions is considered “secondary.”
Here’s an example of "primary_content":
{ "text": "Federal Bank creates an API banking system to better integrate with other organizations and ecosystems", "url": null }
As you can see, it doesn’t contain anchor text and consists of letters.
Now, let’s take a look at “secondary” content.
{ "text": "Read the data sheet (883 KB)", "url": "https://www.ibm.com/downloads/cas/WJX036AD" }
In this case, all content is the anchor text for the download link, so it clearly falls into the "secondary_content" category.
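The scoring rule and the two examples above can be sketched in Python. This is only an illustration of the rule as described in this article, not DataForSEO's actual implementation: the function names, the whitespace-based word splitting, and the way the anchor share is computed (anchor score divided by total score) are all assumptions.

```python
SCORE_THRESHOLD = 3      # from the article: score must be greater than 3
MAX_ANCHOR_SHARE = 0.30  # from the article: anchors must take less than 30%


def word_counts(word: str) -> bool:
    """True if the word has more than one character and letters make up more than half of it."""
    letters = sum(ch.isalpha() for ch in word)
    return len(word) > 1 and letters * 2 > len(word)


def content_score(text: str) -> int:
    """Number of whitespace-separated words that pass both criteria (splitting rule is assumed)."""
    return sum(word_counts(word) for word in text.split())


def classify(text: str, anchor_text: str = "") -> str:
    """Label a piece of content; anchor_text is the part of text that sits inside links."""
    score = content_score(text)
    anchor_score = content_score(anchor_text)
    # Assumption: "share of anchors" is the anchor score relative to the total score.
    anchor_share = anchor_score / score if score else 0.0
    if score > SCORE_THRESHOLD and anchor_share < MAX_ANCHOR_SHARE:
        return "primary"
    return "secondary"


headline = ("Federal Bank creates an API banking system to better integrate "
            "with other organizations and ecosystems")
download = "Read the data sheet (883 KB)"

print(classify(headline))            # no anchor text at all
print(classify(download, download))  # the whole text is anchor text
```

Under these assumptions, the headline is labeled primary because it scores well above 3 and contains no anchor text, while the download link is labeled secondary because its anchor share is 100%, even though its score of 5 is above the threshold.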
How are main and secondary topics determined?
The main_topic and secondary_topic arrays contain the structured content (primary, secondary, and table content) of the page’s body section. The difference is that the main_topic array represents the section of the page’s body containing the main text content and consists predominantly of content pieces with high content scores. Similarly, the secondary_topic array consists predominantly of content pieces with low content scores and represents the section of the page’s body containing secondary content.
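To make the nesting concrete, here is a minimal Python sketch that walks a trimmed-down, hypothetical page content object and lists every content piece together with the topic and content array it belongs to. The field names main_topic, secondary_topic, primary_content, secondary_content, text, and url come from this article; the exact nesting, the sample values, and the helper name flatten_content are assumptions, so check the endpoint documentation for the real response layout.

```python
# Hypothetical, heavily trimmed sample; a real response contains many more fields.
page_content = {
    "main_topic": [
        {
            "primary_content": [
                {"text": "Federal Bank creates an API banking system to better "
                         "integrate with other organizations and ecosystems",
                 "url": None},
            ],
            "secondary_content": [
                {"text": "Read the data sheet (883 KB)",
                 "url": "https://www.ibm.com/downloads/cas/WJX036AD"},
            ],
        }
    ],
    "secondary_topic": [
        {
            "primary_content": [],
            "secondary_content": [
                {"text": "Contact us", "url": "https://example.com/contact"},
            ],
        }
    ],
}


def flatten_content(content: dict) -> list[tuple[str, str, str]]:
    """Return (topic, content array, text) for every content piece found."""
    rows = []
    for topic_key in ("main_topic", "secondary_topic"):
        for section in content.get(topic_key) or []:
            for content_key in ("primary_content", "secondary_content"):
                for piece in section.get(content_key) or []:
                    rows.append((topic_key, content_key, piece["text"]))
    return rows


for topic, kind, text in flatten_content(page_content):
    print(f"{topic} / {kind}: {text}")
```

The `or []` guards are there because a section or array may be missing or null in a given response; that behavior is also an assumption of this sketch.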
To determine which parts of the page’s body content constitute the main and secondary topics, we use a heuristic evaluation that analyzes text segments:
1. The system divides the entire body content into distinct text segments;
2. Each segment receives a content score using the same criteria used to determine primary and secondary content (based on word length, composition, and anchor percentage);
3. Text segments are then grouped by their proximity and content scores;
4. Groups with predominantly high-scoring content become part of the main_topic array;
5. Groups with predominantly low-scoring content become part of the secondary_topic array.
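The five steps above can be illustrated with a small Python sketch. The article does not document how segments are grouped by proximity, so this sketch simply takes fixed-size runs of neighboring segments and uses a majority vote; the function name split_topics, the group_size parameter, and the tie-breaking rule are hypothetical. The anchor percentage from step 2 is left out for brevity.

```python
SCORE_THRESHOLD = 3  # same threshold as for primary/secondary content


def content_score(text: str) -> int:
    """Count words longer than one character in which letters are more than half."""
    return sum(
        len(word) > 1 and sum(ch.isalpha() for ch in word) * 2 > len(word)
        for word in text.split()
    )


def split_topics(segments: list[str], group_size: int = 3) -> dict[str, list[str]]:
    """Assign groups of neighboring segments to main_topic or secondary_topic."""
    topics: dict[str, list[str]] = {"main_topic": [], "secondary_topic": []}
    # Assumption: "proximity" is modeled as fixed-size runs of adjacent segments.
    for start in range(0, len(segments), group_size):
        group = segments[start:start + group_size]
        high_scoring = sum(content_score(seg) > SCORE_THRESHOLD for seg in group)
        # A group is "predominantly high-scoring" if more than half of its segments are;
        # ties fall to secondary_topic (an arbitrary choice for this sketch).
        key = "main_topic" if high_scoring * 2 > len(group) else "secondary_topic"
        topics[key].extend(group)
    return topics


segments = [
    "Home",
    "Log in",
    "Pricing",
    "Federal Bank creates an API banking system to better integrate with other organizations",
    "The new platform exposes payment and account services to partners",
    "Read more",
]
result = split_topics(segments)
print(result["main_topic"])
print(result["secondary_topic"])
```

Note that "Read more" ends up in main_topic even though it scores low on its own: it sits in a group that is predominantly high-scoring, which mirrors how a topic array can contain both primary and secondary content pieces.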
This approach differs from the primary and secondary content distinction in that it focuses on identifying larger, coherent sections of the page rather than individual content pieces, providing a more structural view of the page’s content organization.