We are getting close to creating 44 zettabytes of data by 2020. That is a tenfold increase from the 4.4 ZB we had seven years ago. This growing data tsunami is unstoppable. Data is changing the world. It influences all spheres of our lives: the way we communicate, the way we do business, and the way we market our products or services. For marketing in particular, data has become a significant decision-making factor.
Let’s take an example. For a website, every person who views a page is a potential customer. The more visits you get, the higher the chances of profit growth. Gathering and analyzing specific data is what helps you build a winning SEO strategy and make your website more visible in the search results. But how can you get the data SEO experts are longing for?
Sure enough, the SaaS market is trying to adjust to the growing appetite of the SEO industry. It offers a variety of tools that assist with data extraction. These solutions are as different as car types. To drive your business towards success, you need a reliable vehicle, so you should be particularly careful about choosing one.
Besides, it is worth considering specific points before the “car” purchase. We will help you examine APIs and proxy servers as data collecting solutions to make sure that your business will have a peaceful journey.
First off, let’s observe the road we are to take.
Know Where You Are Going
Simply put, web scraping is extracting data from web pages. In technical terms, it is the process of retrieving data from the underlying HTML code. Of course, SEO analysis requires some specific information, e.g., keyword data such as search volume and CPC (cost per click). Moreover, we want to have the data in a structured form. How can this be done?
Typically, web scraping involves several processes:
- Web crawling: sending a request to a website, downloading a response in HTML.
- Parsing and data extraction: the required data is retrieved from the downloaded content.
- Data conversion: the data is cleaned and transformed into a structured format.
- Serializing and storing data: rewriting it into CSV, JSON, XML, etc. or uploading it into a database.
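The stages above can be sketched with the Python standard library alone. This is a minimal illustration, not a production scraper: the crawling step is replaced by a hard-coded HTML snippet, and the table layout, the `CellCollector` class, and the field names are all assumptions made for the example.

```python
import csv
import io
import json
from html.parser import HTMLParser

# Hypothetical stand-in for a crawled response; a real crawler would download this.
SAMPLE_HTML = """
<table>
  <tr><th>keyword</th><th>search volume</th><th>cpc</th></tr>
  <tr><td>seo tools</td><td>12,100</td><td>$4.50</td></tr>
  <tr><td>web scraping</td><td>8,100</td><td>$2.10</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Parsing step: collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            if self._row:  # header rows contain only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def convert(rows):
    """Conversion step: clean raw strings into typed, structured records."""
    records = []
    for keyword, volume, cpc in rows:
        records.append({
            "keyword": keyword,
            "search_volume": int(volume.replace(",", "")),
            "cpc": float(cpc.lstrip("$")),
        })
    return records


def to_csv(records):
    """Serialization step: write the records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["keyword", "search_volume", "cpc"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


parser = CellCollector()
parser.feed(SAMPLE_HTML)
records = convert(parser.rows)
print(json.dumps(records, indent=2))
print(to_csv(records))
```

Each function maps to one stage, so any of them can be swapped out (for example, storing to a database instead of CSV) without touching the others.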
As you can see, this is quite a highway to take a trip on. What is more, apart from driving along the road, you have to bear in mind the established regulations. Otherwise, you may have to pay a couple of fines, have your license withdrawn, or worse, be sued and go to jail.
Mind the Traffic Rules
CENDI (Commerce, Energy, NASA, Defense Information Managers Group) has prepared the concept of a Frequently Asked Questions document (FAQ) about copyright, according to the provision 2.1.4 of which:
From this, we can logically conclude that scraping data does not violate any law. However, there are also the CFAA (Computer Fraud and Abuse Act), the DMCA (Digital Millennium Copyright Act), and website Terms of Service, which limit what activity is considered legal. So, before scraping information, remember that everything has to be done ethically.
As an alternative to getting through these troublesome stages to obtain data and worrying about the legal part of the process, you can turn to third-party data sources. Well-known companies offering solutions for SEO value their reputation. It is up to the data provider, not the client, to gather valid data quickly and to comply with the traffic rules. This would mean not only buying a car but also hiring a personal driver. For that reason, it is a safer way to go, but the choice is up to you.
Before switching on the ignition, let’s review how proxies and APIs operate and what role they play in building SEO analytics software.
Familiarizing With the Proxy Vehicle
A proxy, or proxy server, is like a car’s window tinting: your identity (IP address) is hidden behind the proxy’s IP. When you send a request, it first goes to the proxy server, which then repeats your request to the targeted website using the proxy’s IP.
In this way, proxies help with several tasks that matter when scraping data:
- Avoiding getting blocked when sending a series of requests to a site
- Sending simultaneous requests to one website from several IP addresses
- Making requests from different geographical locations (or even devices, if you need to review the content displayed on a mobile phone, for instance).
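Routing a request through a proxy, as described above, can be sketched with Python’s standard `urllib`. The proxy address below is a hypothetical placeholder from a documentation-only IP range; a real one would come from your proxy provider.

```python
import urllib.request

# Hypothetical proxy address (documentation-only IP range); replace with a real one.
PROXY_URL = "http://203.0.113.10:8080"


def build_proxied_opener(proxy_url):
    """Return an opener that sends both HTTP and HTTPS requests via the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = build_proxied_opener(PROXY_URL)
# Not executed here because the placeholder proxy does not exist:
# response = opener.open("https://example.com/", timeout=10)
```

The target website would then see the proxy’s IP address rather than yours. Picking a proxy located in another country is how the geo-specific requests from the list are made.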
SEO analysis requires gathering substantial volumes of data on keyword positions, search volume, CPC, SERP ranking, and so on. Obtaining that data means making numerous requests from different geographic locations, which is where proxies come in handy.
On the other hand, you are sure to hit bumps and potholes such as getting blocked and having to deal with CAPTCHAs. When an IP address is blocked, you have to put the proxy on a waiting list for a few hours before using it again.
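The “waiting list” idea can be sketched as a small rotation pool that skips blocked proxies until a cooldown has passed. This is an illustrative design only: the `ProxyPool` class, the three-hour default, and the proxy names are assumptions, not part of any particular library.

```python
import time


class ProxyPool:
    """Round-robin proxy rotation with a cooldown for blocked proxies."""

    def __init__(self, proxies, cooldown_seconds=3 * 60 * 60, clock=time.monotonic):
        self._proxies = list(proxies)
        self._cooldown = cooldown_seconds  # assumed "few hours" waiting period
        self._clock = clock                # injectable so the pool can be tested
        self._blocked_until = {}
        self._next = 0

    def mark_blocked(self, proxy):
        """Put a proxy on the waiting list until the cooldown expires."""
        self._blocked_until[proxy] = self._clock() + self._cooldown

    def get(self):
        """Return the next usable proxy, or None if all are cooling down."""
        now = self._clock()
        for _ in range(len(self._proxies)):
            proxy = self._proxies[self._next]
            self._next = (self._next + 1) % len(self._proxies)
            if self._blocked_until.get(proxy, float("-inf")) <= now:
                return proxy
        return None


# Usage with hypothetical proxy names:
pool = ProxyPool(["proxy-a", "proxy-b", "proxy-c"])
first = pool.get()        # "proxy-a"
pool.mark_blocked(first)  # proxy-a is skipped until the cooldown passes
second = pool.get()       # "proxy-b"
```

If `get()` returns `None`, every proxy is on the waiting list, and the scraper has to pause or acquire more IPs.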
Besides, before acquiring proxies, the following factors have to be thoroughly considered:
- The type of IPs: datacenter, residential, or mobile.
- The quality of IPs: public, shared, or private dedicated.
- The refinement of your proxy management system.
Datacenter IPs are the cheapest and most reliable ones. However, there is a chance of buying proxies that are already blacklisted. Another major problem is that this type of proxy is not associated with an Internet service provider, so anyone who investigates the IP will discover with little effort that you are using a proxy.
Residential IPs are much more costly and may raise legal issues, since a private network is used for scraping in this case. Mobile IPs are the most expensive and can cause even more complicated problems with the traffic police.
Public proxies, or “open proxies,” should never be used for web scraping at all. They are unreliable, and some data frequently goes missing during extraction. In general, when dealing with public proxies, you must pay particular attention to security configuration. Watch out for malware and for having your machine turned into a zombie computer. Passwords, browsing history, and other private information can also leak, since a proxy operator may intercept traffic, including TLS- and SSL-encrypted connections it manages to tamper with. It is better to go with dedicated proxies and paid solutions.
Trim Level, Fuel, and Maintenance
A “proxy vehicle” does not consist of proxies alone. To start the data gathering process, you also need a scraping library to run your script with.
On the other hand, if you want less dev work, you can spend some more money on a cruise control solution. In other words, you can use a web scraping bot: a digital “spider” that crawls through the targeted website in search of the “fly” (the data we requested), fetches it, and pulls it into a database or spreadsheet.
Note, however, that in some cases, it is difficult to tell whether a scraping bot is legitimate or malicious. Legitimate bots belong to specific organizations, identify themselves accordingly (e.g., in the HTTP header), and respect the robots.txt file. In contrast, malicious bots send a fake HTTP user agent and ignore the restrictions the website owner has put in place.
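Respecting robots.txt can be sketched with Python’s standard `urllib.robotparser`. The robots.txt content, the bot name, and the URLs below are made up for illustration; a real bot would fetch the file from the target site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical user agent that openly identifies the bot and its owner.
USER_AGENT = "ExampleSEOBot/1.0 (+https://example.com/bot)"


def is_allowed(robots_txt, user_agent, url):
    """Return True if the robots.txt rules let this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, USER_AGENT, "https://example.com/blog/"))
print(is_allowed(ROBOTS_TXT, USER_AGENT, "https://example.com/private/report"))
```

A well-behaved bot would run a check like this before every request and skip any URL that is disallowed.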
Along with this, some bots are difficult to maintain when the target’s structure changes. Google modified its algorithm 3,234 times in 2018, which is about nine times a day. While you are figuring out how to get past hurdles like CAPTCHAs and avoid a traffic jam when rotating your proxies, the actual ranking results may drift far away from the information you finally succeed in receiving.
As you can see, managing and troubleshooting the proxy pool is not as easy as turning on the windshield wipers. Here are the fundamental challenges: maintaining the proxy infrastructure, throttling and rotating proxies, building ban identification logic, and managing sessions. All of this requires a lot of time and resources. You can find proxy rotators to facilitate the process. However, rotators only help with changing IPs and locations. You would still have to hire staff to handle the rest of the issues.
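To give a feel for two of these challenges, here is a rough sketch of ban identification and throttling. The status codes and CAPTCHA markers are simple heuristics chosen for the example, and both function names are hypothetical; real websites signal blocks in many different ways.

```python
# Assumed text markers that often appear on block or CAPTCHA pages.
CAPTCHA_MARKERS = ("captcha", "unusual traffic")


def looks_banned(status_code, body):
    """Heuristic ban check: blocking status codes or a CAPTCHA page in the body."""
    if status_code in (403, 429):
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Throttling: exponential backoff in seconds, doubling per attempt up to a cap."""
    return min(cap, base * 2 ** attempt)


print(looks_banned(429, ""))                          # blocked by status code
print(looks_banned(200, "Please solve the CAPTCHA"))  # blocked by page content
print(looks_banned(200, "<html>results</html>"))      # looks fine
print([backoff_delay(n) for n in range(4)])
```

Even this toy version hints at the maintenance burden: the markers and thresholds have to be updated whenever a target site changes its behavior.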
Either way, fuelling and maintaining the “proxy vehicle” is unreasonable. It will cost you a lot of money and take months to get it jump-started. Consequently, the business itself may crash.
As an alternative, let’s review the solution DataForSEO has come up with.
API Vehicle. Your Roaring Success
People are striving to automate more and more aspects of life. For example, you must have heard about the “full self-driving” version of the Tesla Autopilot driver assistance system. Incorporating APIs to develop an SEO tool is much the same, except that it is not that expensive. Here, at DataForSEO, we offer a flexible pay-per-use pricing model.
Notably, Tesla’s “hands-off” function still requires you to be ready to take over the steering. Meanwhile, our REST APIs will allow you to forget about the steering wheel altogether.
An application programming interface (API) applied to SEO tools works like the “Navigate on Autopilot” feature of the vehicle:
- you put the destination in the navigator (specify the endpoint for a task: keywords for a domain, SERP results, etc.);
- the car gets you there (returns structured data in XML or JSON).
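The “destination in, structured data out” flow can be sketched as follows. Everything specific here is a placeholder: the endpoint URL, the Basic authentication scheme, and the JSON field names are assumptions made for illustration and are not taken from the DataForSEO documentation, which should be consulted for the real values.

```python
import base64
import json
import urllib.request

# Hypothetical endpoint; the real URL comes from the provider's documentation.
API_URL = "https://api.example.com/v1/serp/tasks"


def build_task_request(login, password, keyword, location, language):
    """Build (but do not send) a POST request describing one SERP task."""
    # Field names are assumed for this sketch.
    payload = json.dumps(
        [{"keyword": keyword, "location": location, "language": language}]
    ).encode("utf-8")
    # HTTP Basic authentication is an assumption here.
    token = base64.b64encode(f"{login}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        API_URL,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


def extract_results(response_body):
    """Pull the (assumed) "results" list out of a JSON response body."""
    return json.loads(response_body).get("results", [])


request = build_task_request("user", "pass", "seo tools", "United States", "en")
# Sending is left out because the endpoint is a placeholder:
# with urllib.request.urlopen(request, timeout=30) as response:
#     results = extract_results(response.read())
```

Compared with the proxy route, there is no crawling, parsing, or ban handling on the client side: the whole job is one request and one structured response.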
The received results can be combined into reports and categorized. Also, you can take advantage of our large-scale geo-targeting options and perform a bunch of requests at once. No super-knowledge or maintenance service needed to handle the car; you integrate our API into your tool and get the necessary information.
Frequently, people want to put their data into a spreadsheet. This visual form is more convenient to perceive, analyze, and process. You might think that making this possible requires some knowledge of coding, but it doesn’t. With proxies, you will need an experienced driver to help “park” that old pickup truck in your custom tool “garage.” An API, by contrast, gives you an easy-to-use auto-park function.
DataForSEO APIs can be connected with Google Sheets using an API connector. A lot of them can be found for free on the web. At the same time, we are currently developing a customized Google Sheets version to improve the experiences of our clients even further.
Overall, we are always open to customers’ suggestions and feature requests. In case you need some guidance on how to use or integrate our service, we provide comprehensive documentation and 24/7 live support.
DataForSEO, being your Autopilot and Navigator, completely takes care of the driving process. We mind the traffic police and watch out for all the bumps on the road of extracting data for you. DataForSEO’s refined infrastructure guarantees its clients accurate information. Importantly, the speed of receiving the results is impressive: e.g., it takes about 20 seconds on average to return the data on Google top-100 results.
Furthermore, we have introduced pingbacks and postbacks: a pingback is a notification of a completed task, while a postback delivers the data to the URL you specify. Postback is like Tesla’s “Summon” mode: the car drives up to you from the parking lot on its own. But there is more.
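On the receiving side, the difference between the two callbacks can be sketched with Python’s built-in HTTP server. This is a simplified illustration: the `id` query parameter, the JSON body format, and the use of GET for pingbacks and POST for postbacks are assumptions, not a description of the actual DataForSEO callback format.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

completed_task_ids = []  # filled by pingbacks
received_results = []    # filled by postbacks


def handle_pingback(path):
    """Pingback: the URL only says which task finished (hypothetical `id` parameter)."""
    query = parse_qs(urlparse(path).query)
    task_id = query.get("id", [None])[0]
    if task_id is not None:
        completed_task_ids.append(task_id)
    return task_id


def handle_postback(body):
    """Postback: the request body carries the data itself (assumed to be JSON)."""
    data = json.loads(body)
    received_results.append(data)
    return data


class CallbackHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        handle_pingback(self.path)
        self._ok()

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        handle_postback(self.rfile.read(length))
        self._ok()

    def _ok(self):
        self.send_response(200)
        self.end_headers()


def serve(port=8000):
    """Run the callback receiver until interrupted."""
    HTTPServer(("127.0.0.1", port), CallbackHandler).serve_forever()
```

Calling `serve()` starts the receiver. After a pingback, you would still have to request the finished results yourself, while a postback arrives with the results already in the body.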
Our SERP API will supply you with data for any keyword, country, and language. Moreover, with Extra SERP, you can get statistics on such elements as featured snippets, knowledge graphs, top stories, and the like. Keyword Data API will be your 700-horsepower engine, providing up to 700 keyword suggestions for the specified domain, product category, or term. These are only a few examples; we offer an extensive set of APIs and additional features for different SEO research purposes.
Long story short, the “API vehicle” proves much more time- and cost-efficient. Feel free to have a cup of coffee during the ride. Well, if you can drink your coffee in 20 seconds, that is, because that is how fast you will receive the data.
All Things Considered
Picking a powerful and sound data-collecting machine to move your business forward is a difficult choice to make. One has to size up a lot of issues. If you opt for the right vehicle, you can cut your time, resources, and expenses. We have shortlisted the main points of proxy and API solutions for you in the table below. Choose wisely.
| Proxy Vehicle | DataForSEO API |
| --- | --- |
| You are liable for any non-compliance with the established regulations | Your data provider is liable for any scraping abuse |
| Requires additional components for scraping and parsing | Completely loaded, hop-in-drive-off |
| Needs infrastructure and integration development | Ready to be integrated |
| A complicated and slow process of obtaining data | Takes about 20 seconds to provide the data |
| Requires a team of experts | Can be integrated by one person |
| Difficult maintenance | All maintenance is handled by us |
| Expenses are hard to calculate and turn out higher than expected | Clear and affordable pricing |
| Takes about a year of development | Reduces the work to several months |