
Improve Your Research With Google Dataset API

There are tens of millions of datasets on the web, with content ranging from university studies and government records to results of scientific experiments and business reports. The Google Dataset Search engine is a valuable source of datasets from all over the web. Datasets are mostly used for research projects, training machine learning algorithms, and data visualization, but how can you easily scrape vast volumes of data for such purposes?

In this article, we will show you a solution for automated dataset monitoring and pulling data from Google Dataset Search. We will also describe a few use cases for integrating API solutions into your research.

What is Google Dataset API?

The name Google Dataset Search speaks for itself: it is a search engine for datasets. It allows users to search for information in thousands of repositories across the Internet using simple keywords.

So how is all the data collected? In brief, Dataset Search relies on Google's web crawling to find pages that contain dataset metadata and to extract the corresponding triples. To share metadata, dataset providers add schema.org and similar open-standard markup to their web pages, and their number is increasing every day.
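To make the markup idea concrete, here is a minimal sketch of the kind of schema.org Dataset description a provider might publish. The helper name `dataset_jsonld` and all example values are hypothetical; only the `@context`, `@type`, and property names come from the schema.org vocabulary.

```python
import json


def dataset_jsonld(name, description, url, license_url):
    """Build a minimal schema.org Dataset description as a JSON-LD dict."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
    }


# Hypothetical values, used only for illustration.
markup = dataset_jsonld(
    name="Example river water quality readings",
    description="Hypothetical sensor readings collected at 15 minute intervals.",
    url="https://example.org/datasets/water-quality",
    license_url="https://creativecommons.org/licenses/by/4.0/",
)

# A provider would typically embed this JSON inside a
# <script type="application/ld+json"> tag on the dataset's landing page.
print(json.dumps(markup, indent=2))
```

A crawler that understands this vocabulary can read the name, description, and license of the dataset without parsing the visible page content.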

When dealing with substantial datasets, automating data extraction will save you an impressive amount of time and effort. One of the most convenient ways to scrape vast amounts of data is by using an API.

DataForSEO offers a powerful Google Dataset API that lets you access Google’s data, gain research insights, and build on top of it.

Now let’s talk about Google Dataset API endpoints and how to use them.

Get datasets by keyword for research

Using the Dataset Search endpoint, you get the top 20 results from the Google Dataset Search engine. A keyword is required to make a request. You can also refine the search by adding filters.

Here are some useful filters that you can add to your POST request: last_updated, file_formats, usage_rights, is_free, and topics.

You can find the possible values of the filters above and more about other parameters in our documentation.

Example request:

[
  {
    "keyword": "water quality",
    "last_updated": "1m",
    "file_formats": [
      "archive",
      "image"
    ],
    "usage_rights": "noncommercial",
    "is_free": true,
    "topics": [
      "natural_sciences",
      "geo"
    ]
  }
]
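
The request above could be assembled and prepared in Python roughly as follows. This is a sketch under assumptions: the host in `API_URL` and the use of HTTP Basic authentication are not stated in this article (only the path segments appear in the response), and the helper names `build_task`, `build_request`, and `dataset_titles` are hypothetical.

```python
import base64
import json
import urllib.request

# Assumption: the host is illustrative; the path mirrors the "path" array
# returned in the response (v3/serp/google/dataset_search/live/advanced).
API_URL = "https://api.dataforseo.com/v3/serp/google/dataset_search/live/advanced"


def build_task(keyword, **filters):
    """Return one task object; filters whose value is None are left out."""
    task = {"keyword": keyword}
    task.update({name: value for name, value in filters.items() if value is not None})
    return task


def build_request(tasks, login, password):
    """Prepare (but do not send) a POST request; Basic auth is an assumption."""
    token = base64.b64encode(f"{login}:{password}".encode()).decode()
    return urllib.request.Request(
        API_URL,
        data=json.dumps(tasks).encode("utf-8"),
        headers={"Authorization": f"Basic {token}", "Content-Type": "application/json"},
        method="POST",
    )


def dataset_titles(response):
    """Collect dataset titles from a decoded response (tasks > result > items)."""
    titles = []
    for task in response.get("tasks") or []:
        for result in task.get("result") or []:
            for item in result.get("items") or []:
                titles.append(item["title"])
    return titles


tasks = [
    build_task(
        "water quality",
        last_updated="1m",
        file_formats=["archive", "image"],
        usage_rights="noncommercial",
        is_free=True,
        topics=["natural_sciences", "geo"],
    )
]
request = build_request(tasks, "your_login", "your_password")

# Sending the request needs valid credentials, so it is left commented out:
# with urllib.request.urlopen(request) as resp:
#     print(dataset_titles(json.load(resp)))
```

Keeping the payload construction separate from the network call makes it easy to check the JSON body before spending API credits.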

Example response:

{
  "version": "0.1.20221214",
  "status_code": 20000,
  "status_message": "Ok.",
  "time": "5.0885 sec.",
  "cost": 0.002,
  "tasks_count": 1,
  "tasks_error": 0,
  "tasks": [
    {
      "id": "01161535-4426-0139-0000-eca053a01ae6",
      "status_code": 20000,
      "status_message": "Ok.",
      "time": "5.0310 sec.",
      "cost": 0.002,
      "result_count": 1,
      "path": [
        "v3",
        "serp",
        "google",
        "dataset_search",
        "live",
        "advanced"
      ],
      "data": {
        "api": "serp",
        "function": "live",
        "se": "google",
        "se_type": "dataset_search",
        "keyword": "water quality",
        "last_updated": "1m",
        "file_formats": [
          "archive",
          "image"
        ],
        "usage_rights": "noncommercial",
        "is_free": true,
        "topics": [
          "natural_sciences",
          "geo"
        ],
        "device": "desktop",
        "os": "windows"
      },
      "result": [
        {
          "keyword": "water quality",
          "se_domain": "datasetsearch.research.google.com",
          "language_code": "en",
          "check_url": "https://datasetsearch.research.google.com/search?query=water%20quality&hl=en&filters=WyJbXCJ1cGRhdGVkX2RhdGVcIixbXCIxbVwiXV0iLCJbXCJmaWxlX2Zvcm1hdF9jbGFzc1wiLFtcIjdcIixcIjVcIl1dIiwiW1wibGljZW5zZV9jbGFzc1wiLFtcIm5vbmNvbW1lcmNpYWxcIl1dIiwiW1wiaXNfYWNjZXNzaWJsZV9mb3JfZnJlZVwiLFtdXSIsIltcImZpZWxkX29mX3N0dWR5XCIsW1wibmF0dXJhbF9zY2llbmNlc1wiLFwiZ2VvXCJdXSJd",
          "datetime": "2023-01-16 13:35:35 +00:00",
          "spell": null,
          "item_types": [
            "dataset"
          ],
          "se_results_count": 11,
          "items_count": 11,
          "items": [
            {
              "type": "dataset",
              "rank_group": 1,
              "rank_absolute": 1,
              "position": "left",
              "xpath": null,
              "dataset_id": "L2cvMTFteHkyemdtNg==",
              "title": "Logan River Observatory: Right Hand Fork above confluence with Logan River Aquatic Site (RHF_CONF_A) Raw Data",
              "image_url": null,
              "scholarly_citations_count": null,
              "links": [
                {
                  "type": "link_element",
                  "title": "hydroshare.org",
                  "description": null,
                  "url": "http://www.hydroshare.org/",
                  "domain": "www.hydroshare.org"
                },
                {
                  "type": "link_element",
                  "title": "dataone.org",
                  "description": null,
                  "url": "http://search.dataone.org/",
                  "domain": "search.dataone.org"
                }
              ],
              "dataset_providers": [
                {
                  "type": "dataset_providers_element",
                  "title": "HydroShare",
                  "url": null,
                  "domain": null
                }
              ],
              "formats": [
                {
                  "type": "formats_element",
                  "format": "zip",
                  "size": null
                }
              ],
              "authors": [
                {
                  "type": "authors_element",
                  "name": "Logan River Observatory",
                  "url": null,
                  "domain": null
                }
              ],
              "licenses": [
                {
                  "type": "licenses_element",
                  "title": "Attribution 4.0 (CC BY 4.0)",
                  "url": "https://creativecommons.org/licenses/by/4.0/",
                  "domain": "creativecommons.org"
                }
              ],
              "updated_date": "2022-12-27 02:00:00 +00:00",
              "area_covered": [
                "North America",
                "Rocky Mountains",
                "Wasatch Range",
                "Right Hand Fork above confluence with Logan River"
              ],
              "period_covered": null,
              "dataset_description": {
                "text": "This dataset contains raw data for all of the variables measured for the aquatic site on Right Hand Fork above confluence with Logan Rive r(RHF_CONF_A). Each file contains a calendar year of data. The file for the current year is updated on a daily basis. The data values were collected by a variety of sensors at 15 minute intervals. The file header contains detailed metadata for the site and the variable and method of each column. This site is currently operated as part of the Logan River Observatory. Prior to 2018 this site was operated as part of the iUTAH GAMUT Network.\n",
                "links": null
              }
            },
            {
              "type": "dataset",
              "rank_group": 2,
              "rank_absolute": 2,
              "position": "left",
              "xpath": null,
              "dataset_id": "L2cvMTFuMDQ3X3B6aA==",
              "title": "Lake Simcoe Monitoring",
              "image_url": null,
              "scholarly_citations_count": 31,
              "links": [
                {
                  "type": "link_element",
                  "title": "canada.ca",
                  "description": null,
                  "url": "http://open.canada.ca/",
                  "domain": "open.canada.ca"
                },
                {
                  "type": "link_element",
                  "title": "arctic-sdi.org",
                  "description": null,
                  "url": "http://catalogue.arctic-sdi.org/",
                  "domain": "catalogue.arctic-sdi.org"
                }
              ],
              "dataset_providers": [
                {
                  "type": "dataset_providers_element",
                  "title": "Government of Ontario",
                  "url": null,
                  "domain": null
                }
              ],
              "formats": [
                {
                  "type": "formats_element",
                  "format": "pdf",
                  "size": null
                },
                {
                  "type": "formats_element",
                  "format": "html",
                  "size": null
                },
                {
                  "type": "formats_element",
                  "format": "zip",
                  "size": null
                }
              ],
              "authors": null,
              "licenses": [
                {
                  "type": "licenses_element",
                  "title": "Open Government Licence - Canada 2.0",
                  "url": "https://open.canada.ca/en/open-government-licence-canada",
                  "domain": "open.canada.ca"
                }
              ],
              "updated_date": "2022-12-30 02:00:00 +00:00",
              "area_covered": null,
              "period_covered": {
                "start_date": "1980-01-01 03:00:00 +00:00",
                "end_date": "2021-12-31 02:00:00 +00:00",
                "displayed_date": "Jan 1, 1980 - Dec 31, 2021"
              },
              "dataset_description": {
                "text": "The Lake Simcoe lake monitoring program provides measurements of chemical and physical water quality limits such as total phosphorus, nitrogen, chlorophyll a, pH, alkalinity, conductivity, dissolved organic and inorganic carbon, silica, other ions, water transparency, temperature and dissolved oxygen. Samples are collected biweekly during the spring, summer and fall. *[pH]: potential of hydrogen\n",
                "links": null
              }
            },
            {
              "type": "dataset",
              "rank_group": 3,
              "rank_absolute": 3,
              "position": "left",
              "xpath": null,
              "dataset_id": "L2cvMTFteHh6MjkwaA==",
              "title": "Logan River Observatory: Logan River at Wood Camp Bridge (LR_WCB_A) Raw Data",
              "image_url": null,
              "scholarly_citations_count": null,
              "links": [
                {
                  "type": "link_element",
                  "title": "hydroshare.org",
                  "description": null,
                  "url": "http://www.hydroshare.org/",
                  "domain": "www.hydroshare.org"
                },
                {
                  "type": "link_element",
                  "title": "dataone.org",
                  "description": null,
                  "url": "http://search.dataone.org/",
                  "domain": "search.dataone.org"
                },
                {
                  "type": "link_element",
                  "title": "dataone.org",
                  "description": null,
                  "url": "http://dataone.org/",
                  "domain": "dataone.org"
                }
              ],
              "dataset_providers": [
                {
                  "type": "dataset_providers_element",
                  "title": "HydroShare",
                  "url": null,
                  "domain": null
                }
              ],
              "formats": [
                {
                  "type": "formats_element",
                  "format": "zip",
                  "size": null
                }
              ],
              "authors": [
                {
                  "type": "authors_element",
                  "name": "Logan River Observatory",
                  "url": null,
                  "domain": null
                }
              ],
              "licenses": [
                {
                  "type": "licenses_element",
                  "title": "Attribution 4.0 (CC BY 4.0)",
                  "url": "https://creativecommons.org/licenses/by/4.0/",
                  "domain": "creativecommons.org"
                }
              ],
              "updated_date": "2022-12-26 02:00:00 +00:00",
              "area_covered": [
                "North America",
                "Rocky Mountains",
                "Wasatch Range",
                "Logan River at Wood Camp Bridge"
              ],
              "period_covered": null,
              "dataset_description": {
                "text": "This dataset contains raw data for all of the variables measured for the aquatic site on the Logan River at the Wood Camp Bridge (LR_WCB_A). Each file contains a calendar year of data. The file for the current year is updated on a daily basis. The data values were collected by a variety of sensors at 15 minute intervals. The file header contains detailed metadata for the site and the variable and method of each column. This site is currently operated as part of the Logan River Observatory.\n",
                "links": null
              }
            },
            {
              "type": "dataset",
              "rank_group": 4,
              "rank_absolute": 4,
              "position": "left",
              "xpath": null,
              "dataset_id": "L2cvMTFtZnQzajlnbQ==",
              "title": "Earth Challenge 2020 Plastics: Raw Data",
              "image_url": "https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcTEZy0oMWhWvUyIPRL9Hsj3bz362JQgK3NRH7qNJs505n14FXyI",
              "scholarly_citations_count": null,
              "links": [
                {
                  "type": "link_element",
                  "title": "kaggle.com",
                  "description": null,
                  "url": "http://www.kaggle.com/",
                  "domain": "www.kaggle.com"
                }
              ],
              "dataset_providers": null,
              "formats": [
                {
                  "type": "formats_element",
                  "format": "zip",
                  "size": 32351259
                }
              ],
              "authors": [
                {
                  "type": "authors_element",
                  "name": "Jonathan K.",
                  "url": null,
                  "domain": null
                }
              ],
              "licenses": [
                {
                  "type": "licenses_element",
                  "title": "CC0 1.0 Universal Public Domain Dedication",
                  "url": "https://creativecommons.org/publicdomain/zero/1.0/",
                  "domain": "creativecommons.org"
                }
              ],
              "updated_date": "2022-12-18 02:00:00 +00:00",
              "area_covered": null,
              "period_covered": null,
              "dataset_description": {
                "text": "The Earth Challenge 2020 app collects data on macroplastic pollution, or plastic pollution visible to the naked eye. Volunteers take a picture to document plastic pollution found in the environment, and indicate whether they have recycled it, left it, or thrown it away. In addition, volunteers can classify these images in accordance with a standardized classification schema.\n\nThis is the raw version of this data set. A version that leverages the OGC SensorThings standard is forthcoming.\n\nData are partially validated. Images are flagged for adult or racy content, and adult or racy images are removed. Data will be validated as images of plastic pollution when an updated data set is published that includes volunteer classifications.\n\nIf you would like to contribute data to Earth Challenge 2020, please visit the project website.  \n\nThis is the raw version of this data set. A version that leverages the OGC SensorThings standard is forthcoming.\n\nSource Description: Source Text\n\nSource Dataset Image: Image Source\n",
                "links": [
                  {
                    "type": "link_element",
                    "title": "Source Text",
                    "description": null,
                    "url": "https://www.google.com/url?q=https%3A%2F%2Fearthchallenge2020.earthday.org%2Fdatasets%2Fd5bb4e8642544bbd9a79e9346ed4dd78_0%3Fgeometry%3D34.096%252C-40.901%252C-16.881%252C61.320&source=datasetsearch",
                    "domain": null
                  },
                  {
                    "type": "link_element",
                    "title": "Image Source",
                    "description": null,
                    "url": "https://www.google.com/url?q=https%3A%2F%2Fwww.swissinfo.ch%2Feng%2Frace-for-water-odyssey_using-drones-to-hunt-for-the-oceans--plastic-pollution%2F41379106&source=datasetsearch",
                    "domain": null
                  }
                ]
              }
            },
            {
              "type": "dataset",
              "rank_group": 5,
              "rank_absolute": 5,
              "position": "left",
              "xpath": null,
              "dataset_id": "L2cvMTFteHkzMzR0ZA==",
              "title": "Logan River Observatory: Logan River Above Wood Camp Aquatic Site (LR_WC_A) Raw Data",
              "image_url": null,
              "scholarly_citations_count": null,
              "links": [
                {
                  "type": "link_element",
                  "title": "hydroshare.org",
                  "description": null,
                  "url": "http://www.hydroshare.org/",
                  "domain": "www.hydroshare.org"
                },
                {
                  "type": "link_element",
                  "title": "dataone.org",
                  "description": null,
                  "url": "http://dataone.org/",
                  "domain": "dataone.org"
                }
              ],
              "dataset_providers": [
                {
                  "type": "dataset_providers_element",
                  "title": "HydroShare",
                  "url": null,
                  "domain": null
                }
              ],
              "formats": [
                {
                  "type": "formats_element",
                  "format": "zip",
                  "size": null
                }
              ],
              "authors": [
                {
                  "type": "authors_element",
                  "name": "Logan River Observatory",
                  "url": null,
                  "domain": null
                }
              ],
              "licenses": [
                {
                  "type": "licenses_element",
                  "title": "Attribution 4.0 (CC BY 4.0)",
                  "url": "https://creativecommons.org/licenses/by/4.0/",
                  "domain": "creativecommons.org"
                }
              ],
              "updated_date": "2022-12-27 02:00:00 +00:00",
              "area_covered": [
                "North America",
                "Rocky Mountains",
                "Wasatch Range",
                "Logan River Above Wood Camp"
              ],
              "period_covered": null,
              "dataset_description": {
                "text": "This dataset contains raw data for all of the variables measured for the aquatic site on the Logan River Above Wood Camp (LR_WC_A). Each file contains a calendar year of data. The file for the current year is updated on a daily basis. The data values were collected by a variety of sensors at 15 minute intervals. The file header contains detailed metadata for the site and the variable and method of each column. This site is currently operated as part of the Logan River Observatory. Prior to 2018 this site was operated as part of the iUTAH GAMUT Network.\n",
                "links": null
              }
            },
            {
              "type": "dataset",
              "rank_group": 6,
              "rank_absolute": 6,
              "position": "left",
              "xpath": null,
              "dataset_id": "L2cvMTFyNGtscHd5dw==",
              "title": "Logan River Observatory: Temple Fork above confluence with Logan River Aquatic Site (TF_CONF_A) Raw Data",
              "image_url": null,
              "scholarly_citations_count": null,
              "links": [
                {
                  "type": "link_element",
                  "title": "hydroshare.org",
                  "description": null,
                  "url": "http://www.hydroshare.org/",
                  "domain": "www.hydroshare.org"
                },
                {
                  "type": "link_element",
                  "title": "dataone.org",
                  "description": null,
                  "url": "http://search.dataone.org/",
                  "domain": "search.dataone.org"
                }
              ],
              "dataset_providers": [
                {
                  "type": "dataset_providers_element",
                  "title": "HydroShare",
                  "url": null,
                  "domain": null
                }
              ],
              "formats": [
                {
                  "type": "formats_element",
                  "format": "zip",
                  "size": null
                }
              ],
              "authors": [
                {
                  "type": "authors_element",
                  "name": "Logan River Observatory",
                  "url": null,
                  "domain": null
                }
              ],
              "licenses": [
                {
                  "type": "licenses_element",
                  "title": "Attribution 4.0 (CC BY 4.0)",
                  "url": "https://creativecommons.org/licenses/by/4.0/",
                  "domain": "creativecommons.org"
                }
              ],
              "updated_date": "2022-12-24 02:00:00 +00:00",
              "area_covered": [
                "North America",
                "Rocky Mountains",
                "Wasatch Range",
                "Temple Fork above confluence with Logan River"
              ],
              "period_covered": null,
              "dataset_description": {
                "text": "This dataset contains raw data for all of the variables measured for the aquatic site on Temple Fork above confluence with Logan River (TF_CONF_A). Each file contains a calendar year of data. The file for the current year is updated on a daily basis. The data values were collected by a variety of sensors at 15 minute intervals. The file header contains detailed metadata for the site and the variable and method of each column. This site is currently operated as part of the Logan River Observatory. Prior to 2018 this site was operated as part of the iUTAH GAMUT Network.\n",
                "links": null
              }
            },
            {
              "type": "dataset",
              "rank_group": 7,
              "rank_absolute": 7,
              "position": "left",
              "xpath": null,
              "dataset_id": "L2cvMTFwYzBiYzZyeg==",
              "title": "Logan River Observatory: Temple Fork below Sawmill Spring Aquatic Site (TF_SAWM_A) Raw Data",
              "image_url": null,
              "scholarly_citations_count": null,
              "links": [
                {
                  "type": "link_element",
                  "title": "hydroshare.org",
                  "description": null,
                  "url": "http://www.hydroshare.org/",
                  "domain": "www.hydroshare.org"
                },
                {
                  "type": "link_element",
                  "title": "dataone.org",
                  "description": null,
                  "url": "http://search.dataone.org/",
                  "domain": "search.dataone.org"
                }
              ],
              "dataset_providers": [
                {
                  "type": "dataset_providers_element",
                  "title": "HydroShare",
                  "url": null,
                  "domain": null
                }
              ],
              "formats": [
                {
                  "type": "formats_element",
                  "format": "zip",
                  "size": 8
                }
              ],
              "authors": [
                {
                  "type": "authors_element",
                  "name": "Logan River Observatory",
                  "url": null,
                  "domain": null
                }
              ],
              "licenses": [
                {
                  "type": "licenses_element",
                  "title": "Attribution 4.0 (CC BY 4.0)",
                  "url": "https://creativecommons.org/licenses/by/4.0/",
                  "domain": "creativecommons.org"
                }
              ],
              "updated_date": "2022-12-27 02:00:00 +00:00",
              "area_covered": [
                "Temple Fork below Sawmill Spring",
                "North America",
                "Rocky Mountains",
                "Wasatch Range"
              ],
              "period_covered": null,
              "dataset_description": {
                "text": "This dataset contains raw data for all of the variables measured for the aquatic site on Temple Fork below Sawmill Spring (TF_SAWM_A). Each file contains a calendar year of data. The file for the current year is updated on a daily basis. The data values were collected by a variety of sensors at 15 minute intervals. The file header contains detailed metadata for the site and the variable and method of each column. This site is currently operated as part of the Logan River Observatory. Prior to 2018 this site was operated as part of the iUTAH GAMUT Network.\n",
                "links": null
              }
            },
            {
              "type": "dataset",
              "rank_group": 8,
              "rank_absolute": 8,
              "position": "left",
              "xpath": null,
              "dataset_id": "L2cvMTFwYzBmM2Q4bA==",
              "title": "Logan River Observatory: Dewitt Springs above confluence with Logan River Aquatic Site (DS_CONF_A) Raw Data",
              "image_url": null,
              "scholarly_citations_count": null,
              "links": [
                {
                  "type": "link_element",
                  "title": "hydroshare.org",
                  "description": null,
                  "url": "http://www.hydroshare.org/",
                  "domain": "www.hydroshare.org"
                },
                {
                  "type": "link_element",
                  "title": "dataone.org",
                  "description": null,
                  "url": "http://search.dataone.org/",
                  "domain": "search.dataone.org"
                }
              ],
              "dataset_providers": [
                {
                  "type": "dataset_providers_element",
                  "title": "HydroShare",
                  "url": null,
                  "domain": null
                }
              ],
              "formats": [
                {
                  "type": "formats_element",
                  "format": "zip",
                  "size": null
                }
              ],
              "authors": [
                {
                  "type": "authors_element",
                  "name": "Logan River Observatory",
                  "url": null,
                  "domain": null
                }
              ],
              "licenses": [
                {
                  "type": "licenses_element",
                  "title": "Attribution 4.0 (CC BY 4.0)",
                  "url": "https://creativecommons.org/licenses/by/4.0/",
                  "domain": "creativecommons.org"
                }
              ],
              "updated_date": "2022-12-27 02:00:00 +00:00",
              "area_covered": [
                "North America",
                "Rocky Mountains",
                "Wasatch Range",
                "Dewitt Springs above confluence with Logan River"
              ],
              "period_covered": null,
              "dataset_description": {
                "text": "This dataset contains raw data for all of the variables measured for the aquatic site on Dewitt Springs above confluence with Logan River (DS_CONF_A). Each file contains a calendar year of data. The file for the current year is updated on a daily basis. The data values were collected by a variety of sensors at 15 minute intervals. The file header contains detailed metadata for the site and the variable and method of each column. This site is currently operated as part of the Logan River Observatory. Prior to 2018 this site was operated as part of the iUTAH GAMUT Network.\n",
                "links": null
              }
            },
            {
              "type": "dataset",
              "rank_group": 9,
              "rank_absolute": 9,
              "position": "left",
              "xpath": null,
              "dataset_id": "L2cvMTFteHh6YjAxeQ==",
              "title": "Logan River Observatory: Spawn Creek above confluence with Temple Fork Aquatic Site (SPC_CONF_A) Raw Data",
              "image_url": null,
              "scholarly_citations_count": null,
              "links": [
                {
                  "type": "link_element",
                  "title": "hydroshare.org",
                  "description": null,
                  "url": "http://www.hydroshare.org/",
                  "domain": "www.hydroshare.org"
                },
                {
                  "type": "link_element",
                  "title": "dataone.org",
                  "description": null,
                  "url": "http://search.dataone.org/",
                  "domain": "search.dataone.org"
                }
              ],
              "dataset_providers": [
                {
                  "type": "dataset_providers_element",
                  "title": "HydroShare",
                  "url": null,
                  "domain": null
                }
              ],
              "formats": [
                {
                  "type": "formats_element",
                  "format": "zip",
                  "size": null
                }
              ],
              "authors": [
                {
                  "type": "authors_element",
                  "name": "Logan River Observatory",
                  "url": null,
                  "domain": null
                }
              ],
              "licenses": [
                {
                  "type": "licenses_element",
                  "title": "Attribution 4.0 (CC BY 4.0)",
                  "url": "https://creativecommons.org/licenses/by/4.0/",
                  "domain": "creativecommons.org"
                }
              ],
              "updated_date": "2022-12-23 02:00:00 +00:00",
              "area_covered": [
                "North America",
                "Rocky Mountains",
                "Wasatch Range",
                "Spawn Creek above confluence with Temple Fork"
              ],
              "period_covered": null,
              "dataset_description": {
                "text": "This dataset contains raw data for all of the variables measured for the aquatic site on Spawn Creek above confluence with Temple Fork (SPC_CONF_A). Each file contains a calendar year of data. The file for the current year is updated on a daily basis. The data values were collected by a variety of sensors at 15 minute intervals. The file header contains detailed metadata for the site and the variable and method of each column. This site is currently operated as part of the Logan River Observatory. Prior to 2018 this site was operated as part of the iUTAH GAMUT Network.\n",
                "links": null
              }
            },
            {
              "type": "dataset",
              "rank_group": 10,
              "rank_absolute": 10,
              "position": "left",
              "xpath": null,
              "dataset_id": "L2cvMTFwYzA4cmhqeg==",
              "title": "Logan River Observatory: South Logan Benson Canal at Benson Irrigation Company Flume, 2300 North 600 West Aquatic Site (SLB_600W_CNL) Quality Controlled Data",
              "image_url": null,
              "scholarly_citations_count": null,
              "links": [
                {
                  "type": "link_element",
                  "title": "hydroshare.org",
                  "description": null,
                  "url": "http://www.hydroshare.org/",
                  "domain": "www.hydroshare.org"
                },
                {
                  "type": "link_element",
                  "title": "dataone.org",
                  "description": null,
                  "url": "http://search.dataone.org/",
                  "domain": "search.dataone.org"
                }
              ],
              "dataset_providers": [
                {
                  "type": "dataset_providers_element",
                  "title": "HydroShare",
                  "url": null,
                  "domain": null
                }
              ],
              "formats": [
                {
                  "type": "formats_element",
                  "format": "zip",
                  "size": null
                }
              ],
              "authors": [
                {
                  "type": "authors_element",
                  "name": "Logan River Observatory",
                  "url": null,
                  "domain": null
                }
              ],
              "licenses": [
                {
                  "type": "licenses_element",
                  "title": "Attribution 4.0 (CC BY 4.0)",
                  "url": "https://creativecommons.org/licenses/by/4.0/",
                  "domain": "creativecommons.org"
                }
              ],
              "updated_date": "2022-12-27 02:00:00 +00:00",
              "area_covered": [
                "Logan",
                "North America",
                "Rocky Mountains",
                "South Logan Benson Canal at Benson Irrigation Company Flume",
                "2300 North 600 West"
              ],
              "period_covered": null,
              "dataset_description": {
                "text": "This dataset contains quality control level 1 (QC1) data for all of the variables measured for the aquatic site on the South Logan Benson Canal at Benson Irrigation Company Flume, 2300 North 600 West (SLB_600W_CNL). Each file contains all available QC1 data for a specific variable. Files will be updated as new data become available, but no more than once daily. These data have passed QA/QC procedures such as sensor calibration and visual inspection and removal of obvious errors. These data are approved by Technicians as the best available version of the data. See published script for correction steps specific to this data series. Each file header contains detailed metadata for site information, variable and method information, source information, and qualifiers referenced in the data. This site is currently operated as part of the Logan River Observatory.\n",
                "links": null
              }
            },
            {
              "type": "dataset",
              "rank_group": 11,
              "rank_absolute": 11,
              "position": "left",
              "xpath": null,
              "dataset_id": "L2cvMTFyNGtqbnMzeA==",
              "title": "Logan River Observatory: Logan River at Dewitt Springs Campground Aquatic Site (LR_DSC_A) Raw Data",
              "image_url": null,
              "scholarly_citations_count": null,
              "links": [
                {
                  "type": "link_element",
                  "title": "hydroshare.org",
                  "description": null,
                  "url": "http://www.hydroshare.org/",
                  "domain": "www.hydroshare.org"
                },
                {
                  "type": "link_element",
                  "title": "dataone.org",
                  "description": null,
                  "url": "http://search.dataone.org/",
                  "domain": "search.dataone.org"
                }
              ],
              "dataset_providers": [
                {
                  "type": "dataset_providers_element",
                  "title": "HydroShare",
                  "url": null,
                  "domain": null
                }
              ],
              "formats": [
                {
                  "type": "formats_element",
                  "format": "zip",
                  "size": null
                }
              ],
              "authors": [
                {
                  "type": "authors_element",
                  "name": "Logan River Observatory",
                  "url": null,
                  "domain": null
                }
              ],
              "licenses": [
                {
                  "type": "licenses_element",
                  "title": "Attribution 4.0 (CC BY 4.0)",
                  "url": "https://creativecommons.org/licenses/by/4.0/",
                  "domain": "creativecommons.org"
                }
              ],
              "updated_date": "2022-12-26 02:00:00 +00:00",
              "area_covered": [
                "North America",
                "Rocky Mountains",
                "Wasatch Range",
                "Logan River at Dewitt Springs Campground"
              ],
              "period_covered": null,
              "dataset_description": {
                "text": "This dataset contains raw data for all of the variables measured for the aquatic site on th eLogan River at Dewitt Springs Campground (LR_DSC_A). Each file contains a calendar year of data. The file for the current year is updated on a daily basis. The data values were collected by a variety of sensors at 15 minute intervals. The file header contains detailed metadata for the site and the variable and method of each column. This site is currently operated as part of the Logan River Observatory. Prior to 2018 this site was operated as part of the iUTAH GAMUT Network.\n",
                "links": null
              }
            }
          ]
        }
      ]
    }
  ]
}

Get data on a specific dataset by ID

The Dataset Info endpoint is based on the same Google Dataset Search engine. The difference is that the result data is extracted from the page of an individual dataset, which is displayed separately from the SERP. That means you can retrieve information about a particular dataset, including its content, providers, licenses, and description.

To make a request, you need to specify the dataset ID. There are two ways to find it:
1. If your research is based on the Google Dataset Search API’s results, you can find the "dataset_id" field in the "dataset" item.

Here we can see that the dataset ID is L2cvMTFxcGdsbDMwMQ==.

"items": [
            {
              "type": "dataset",
              "rank_group": 1,
              "rank_absolute": 1,
              "position": "left",
              "xpath": null,
              "dataset_id": "L2cvMTFxcGdsbDMwMQ==",
              "title": "Water Quality Data",
              "image_url": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRM49hCs8HKygZqHe5K9VDnBr1WN8JF1ZGdQMmtNdVFYu-D1Nao",
              "scholarly_citations_count": 2,
              {...}

2. You can also find it in the dataset URL. For example, here is a link to the “Water Quality Data” dataset in the search results:
https://datasetsearch.research.google.com/search?hl=en&query=water%20quality&docid=L2cvMTFxcGdsbDMwMQ%3D%3D&filters=bm9uZQ%3D%3D

In this case, the dataset ID is the value of the docid parameter, L2cvMTFxcGdsbDMwMQ, followed by %3D%3D, which is the URL-encoded form of “==”.

As you may have noticed, every ID in the API response ends with “==”. Our system accepts the dataset ID in the request either way, with or without the “==” ending.
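
If you collect dataset links rather than API responses, the ID can be pulled out of the URL programmatically. Below is a minimal Python sketch using only the standard library; the helper name dataset_id_from_url is ours and purely illustrative.

```python
from urllib.parse import urlparse, parse_qs


def dataset_id_from_url(url: str) -> str:
    """Return the dataset ID carried in the docid query parameter."""
    query = parse_qs(urlparse(url).query)
    # parse_qs percent-decodes values, so %3D%3D becomes "==".
    return query["docid"][0]


url = (
    "https://datasetsearch.research.google.com/search"
    "?hl=en&query=water%20quality"
    "&docid=L2cvMTFxcGdsbDMwMQ%3D%3D&filters=bm9uZQ%3D%3D"
)
dataset_id = dataset_id_from_url(url)
print(dataset_id)              # L2cvMTFxcGdsbDMwMQ==
print(dataset_id.rstrip("="))  # L2cvMTFxcGdsbDMwMQ
```

Either printed form can be used as the "dataset_id" value in the request.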

Example request:

[
  {
    "dataset_id": "L2cvMTFxcGdsbDMwMQ=="
  }
]
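
Here is a rough sketch of how such a request could be prepared in Python. Several details are assumptions that this article does not state: the host api.dataforseo.com, HTTP Basic authentication, and the "login" and "password" placeholders. The path mirrors the "path" field that the API returns in its response. The request is only built here, not sent.

```python
import base64
import json
import urllib.request

# Assumption: the host is not given in this article; the path follows the
# "path" field of the API response (v3/serp/google/dataset_info/live/advanced).
API_URL = "https://api.dataforseo.com/v3/serp/google/dataset_info/live/advanced"


def build_dataset_info_request(dataset_id: str, login: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for one dataset ID."""
    payload = json.dumps([{"dataset_id": dataset_id}]).encode("utf-8")
    # Assumption: credentials are passed with HTTP Basic authentication.
    token = base64.b64encode(f"{login}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_dataset_info_request("L2cvMTFxcGdsbDMwMQ==", "login", "password")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually send it: json.load(urllib.request.urlopen(request))
```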

Example response:

{
  "version": "0.1.20221214",
  "status_code": 20000,
  "status_message": "Ok.",
  "time": "4.0817 sec.",
  "cost": 0.002,
  "tasks_count": 1,
  "tasks_error": 0,
  "tasks": [
    {
      "id": "01161600-4426-0139-0000-3357191881e9",
      "status_code": 20000,
      "status_message": "Ok.",
      "time": "4.0241 sec.",
      "cost": 0.002,
      "result_count": 1,
      "path": [
        "v3",
        "serp",
        "google",
        "dataset_info",
        "live",
        "advanced"
      ],
      "data": {
        "api": "serp",
        "function": "live",
        "se": "google",
        "se_type": "dataset_info",
        "dataset_id": "L2cvMTFxcGdsbDMwMQ==",
        "device": "desktop",
        "os": "windows"
      },
      "result": [
        {
          "keyword": "L2cvMTFxcGdsbDMwMQ==",
          "se_domain": "datasetsearch.research.google.com",
          "language_code": "en",
          "check_url": "https://datasetsearch.research.google.com/search?docid=L2cvMTFxcGdsbDMwMQ%3D%3D&hl=en",
          "datetime": "2023-01-16 14:00:30 +00:00",
          "spell": null,
          "item_types": [
            "dataset"
          ],
          "se_results_count": 1,
          "items_count": 1,
          "items": [
            {
              "type": "dataset",
              "rank_group": 1,
              "rank_absolute": 1,
              "position": "left",
              "xpath": null,
              "dataset_id": "L2cvMTFxcGdsbDMwMQ==",
              "title": "Water Quality Data",
              "image_url": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRM49hCs8HKygZqHe5K9VDnBr1WN8JF1ZGdQMmtNdVFYu-D1Nao",
              "scholarly_citations_count": 2,
              "links": [
                {
                  "type": "link_element",
                  "title": "ca.gov",
                  "description": null,
                  "url": "http://data.ca.gov/",
                  "domain": "data.ca.gov"
                },
                {
                  "type": "link_element",
                  "title": "ca.gov",
                  "description": null,
                  "url": "http://data.cnra.ca.gov/",
                  "domain": "data.cnra.ca.gov"
                }
              ],
              "dataset_providers": [
                {
                  "type": "dataset_providers_element",
                  "title": "California Department of Water Resources",
                  "url": "http://www.water.ca.gov/",
                  "domain": "www.water.ca.gov"
                }
              ],
              "formats": [
                {
                  "type": "formats_element",
                  "format": "csv",
                  "size": null
                }
              ],
              "authors": null,
              "licenses": null,
              "updated_date": "2022-12-16 02:00:00 +00:00",
              "area_covered": null,
              "period_covered": null,
              "dataset_description": {
                "text": "The California Department of Water Resources (DWR) discrete “grab” water quality dataset contains DWR-collected, current and historical, chemical and physical parameters found in drinking water, groundwater, and surface waters throughout the state.\n",
                "links": null
              }
            }
          ]
        }
      ]
    }
  ]
}
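
To work with a response like the one above, you will typically walk the tasks, result, and items arrays and keep only the fields you need. The Python sketch below does that for a trimmed copy of the response. Treating any task whose status_code is not 20000 as failed is our assumption, and the function names are illustrative.

```python
import json

# Trimmed copy of the response above, keeping only the fields used below.
response_text = """
{
  "status_code": 20000,
  "tasks": [
    {
      "status_code": 20000,
      "result": [
        {
          "items": [
            {
              "type": "dataset",
              "dataset_id": "L2cvMTFxcGdsbDMwMQ==",
              "title": "Water Quality Data",
              "dataset_providers": [
                {"title": "California Department of Water Resources"}
              ],
              "formats": [{"format": "csv", "size": null}],
              "licenses": null,
              "updated_date": "2022-12-16 02:00:00 +00:00"
            }
          ]
        }
      ]
    }
  ]
}
"""


def iter_datasets(response: dict):
    """Yield every item of type "dataset" from tasks that returned 20000."""
    for task in response.get("tasks") or []:
        if task.get("status_code") != 20000:
            continue  # assumption: anything other than 20000 is a failed task
        for result in task.get("result") or []:
            for item in result.get("items") or []:
                if item.get("type") == "dataset":
                    yield item


def summarize(item: dict) -> dict:
    # Arrays such as "licenses" can be null, so fall back to empty lists.
    return {
        "dataset_id": item["dataset_id"],
        "title": item["title"],
        "providers": [p["title"] for p in item.get("dataset_providers") or []],
        "formats": [f["format"] for f in item.get("formats") or []],
        "licenses": [lic["title"] for lic in item.get("licenses") or []],
        "updated_date": item.get("updated_date"),
    }


summaries = [summarize(item) for item in iter_datasets(json.loads(response_text))]
print(summaries)
```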

How to use Google Dataset API in practice?

All kinds of people search for datasets, from students looking for data covering their senior topic to business analysts and data scientists developing new tools. However, they have one thing in common: in most cases, datasets are used for research.

Below we will cover some Google Dataset API use cases for dealing with a large number of datasets.

Research Projects

Conducting research always involves collecting a huge amount of data. A scientific approach to research builds on examining previous studies on the topic, so it is important that datasets are widely available and easy to cite and access. But when it comes to large-scale analysis, you may spend months manually selecting each dataset you need.

With the Google Dataset API you can build an automatic dataset monitoring system that will speed up the whole data analysis process. Moreover, the ability to set additional search parameters such as topics, file formats, and last updated date makes it easier to filter out unwanted data. It can be a powerful basis for developing data-driven tools.
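
As a rough illustration of the monitoring idea, the Python sketch below compares the items of a fresh API response with what was stored after the previous run and reports which datasets are new or have a later "updated_date". The storage format (a plain dict of ID to date string), the function names, and the earlier date in the sample are assumptions made for the example; the IDs and current dates come from the sample response shown earlier.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S %z"  # e.g. "2022-12-27 02:00:00 +00:00"


def parse_updated(value: str) -> datetime:
    return datetime.strptime(value, DATE_FORMAT)


def detect_changes(previous: dict, current_items: list) -> dict:
    """Compare stored dates (dataset_id -> updated_date) with fresh items."""
    new, updated = [], []
    for item in current_items:
        dataset_id = item["dataset_id"]
        if dataset_id not in previous:
            new.append(dataset_id)
            continue
        before, now = previous[dataset_id], item.get("updated_date")
        # Dates can be null in the API response; skip those comparisons.
        if before and now and parse_updated(now) > parse_updated(before):
            updated.append(dataset_id)
    return {"new": new, "updated": updated}


# Hypothetical state saved after the previous run.
previous_run = {"L2cvMTFwYzA4cmhqeg==": "2022-12-20 02:00:00 +00:00"}

# Items as returned by the Dataset Search endpoint (trimmed).
current_items = [
    {"dataset_id": "L2cvMTFwYzA4cmhqeg==", "updated_date": "2022-12-27 02:00:00 +00:00"},
    {"dataset_id": "L2cvMTFyNGtqbnMzeA==", "updated_date": "2022-12-26 02:00:00 +00:00"},
]

changes = detect_changes(previous_run, current_items)
print(changes)
```

Running a check like this on a schedule and storing the latest dates after each run would give you a simple change feed for the datasets you track.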

For example, you can develop a marketing research tool by analyzing and integrating datasets on market research, review platforms, customer behavior, and the influence of economic policy on business in different countries. Our API will collect the matching datasets, which business analysts can then use to derive valuable insights.

Machine Learning

Datasets are an integral part of the field of Machine Learning (ML). Major advances in this field can result from progress in learning algorithms, computer hardware, and the availability of high-quality training datasets.

High-quality datasets are difficult and costly to produce even for unsupervised ML algorithms. For supervised and semi-supervised learning, training datasets also need to be labeled, which takes a great deal of time and makes production much harder.

That is why it makes sense to use the Google Dataset API to build a collection of datasets containing tasks and labeled data for any type of ML algorithm. A lot of standard datasets are available on the Web for free, and for a more advanced approach, you can always search for commercial datasets. This way you will save the money and time it takes to develop datasets from scratch. In addition, you will be able to study how high-quality datasets are compiled, draw conclusions, and develop your own models.

Let’s take a look at an example of applying data to water quality research. The Institute of Electrical and Electronics Engineers (IEEE) provides datasets on deep learning studies, comparisons of ML algorithms, and training datasets for ML used in scientific research. On Google Dataset Search, you can find the “Dataset for Assessing Water Quality for Drinking and Irrigation Purposes using Machine Learning Models”, which can be used to train and test ML models that assess water quality by physico-chemical parameters. That is a vital step in expanding access to potable water for both scientific and commercial purposes.

The cost of using Google Dataset API

Four factors determine the cost of collecting data with Google Dataset API: the number of results you want to collect, the endpoint you will be using, the task execution method, and priority.

With the Google Dataset Search API, you are billed for each SERP containing up to 20 results. You can set the number of results to collect with the depth parameter in the POST request.
With the Google Dataset Info API, you are charged for every result. To calculate the cost of one result, multiply the price of a SERP by 3. In this case, the price depends only on the task execution method and priority.

The API has two main methods of delivering results: Standard and Live. The Standard method supports two task priorities: Normal and High. The chosen method and priority determine the task execution time, which ranges from instant results to a guaranteed turnaround time of up to 45 minutes.

Now let’s look at the cost of using each endpoint.

Google Dataset Search

Method and priority | Price per 20 blocks | Price per 1M blocks
Live | $0.002 | $100
Standard Normal | $0.0006 | $30
Standard High | $0.0012 | $60

Note that our system processes results in batches of 20, so we recommend setting the depth in multiples of 20. If you specify "depth": 21, you will be charged for 40 results.

Google Dataset Info

Method and priority | Price per 1 result | Price per 1M results
Live | $0.006 | $6000
Standard Normal | $0.0018 | $1800
Standard High | $0.0036 | $3600
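
The two pricing rules can be turned into a small estimator. The Python sketch below uses the per-SERP prices from the Google Dataset Search table, rounds the depth up to whole SERPs of 20 results, and applies the multiplier of 3 for Google Dataset Info. The dictionary keys and function names are illustrative labels of ours, not API parameters.

```python
import math
from decimal import Decimal

# Price of one SERP (up to 20 results), taken from the Dataset Search table.
PRICE_PER_SERP = {
    "live": Decimal("0.002"),
    "standard_normal": Decimal("0.0006"),
    "standard_high": Decimal("0.0012"),
}
RESULTS_PER_SERP = 20
INFO_MULTIPLIER = 3  # one Dataset Info result costs three SERPs


def search_cost(depth: int, method: str) -> Decimal:
    """Cost of one Dataset Search task for the requested depth."""
    serps = math.ceil(depth / RESULTS_PER_SERP)  # a partial SERP is billed in full
    return serps * PRICE_PER_SERP[method]


def info_cost(dataset_count: int, method: str) -> Decimal:
    """Cost of requesting Dataset Info for a number of dataset IDs."""
    return dataset_count * INFO_MULTIPLIER * PRICE_PER_SERP[method]


print(search_cost(20, "live"))  # 0.002
print(search_cost(21, "live"))  # 0.004, billed as 40 results
print(info_cost(1, "live"))     # 0.006
```

Decimal is used instead of float so that small prices add up without rounding noise.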

Conclusion

Scientists, governments, companies, and many others publish millions of datasets online. Google Dataset Search extracts dataset metadata from Web pages in order to make datasets discoverable. It is a valuable source for research, but without a system for collecting Google’s data automatically, doing it manually is time-consuming and labor-intensive if you are dealing with a large number of datasets.

DataForSEO developed a solution: with the Google Dataset API you can pull vast volumes of data from Google Dataset Search. Integrating this API into your system will improve your research with minimal investment and give you an opportunity to build products on top of it.

Access up-to-date data sources for your research with rapid API results, reasonable pricing, and 24/7 support. Register to try our Google Dataset API for free!
