Title: | Interact with the datos.gob.es API to download public data from all of Spain |
---|---|
Description: | Easily interact with the API from http://datos.gob.es to download over 19,000 public data files from the different provinces of Spain. |
Authors: | Jorge Cimentada [aut, cre], Jorge Lopez [aut] |
Maintainer: | Jorge Cimentada <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.0.1 |
Built: | 2024-12-08 03:31:33 UTC |
Source: | https://github.com/rOpenSpain/opendataes |
When new checks come up, add them in the same format: define the logical tests first and then add them to the if statement (see the sketch below).
data_list_correct(raw_json)
raw_json |
Raw JSON response from datos.gob.es |
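A minimal sketch of that pattern, with two hypothetical checks (the package's real checks may differ):

data_list_correct <- function(raw_json) {
  # Logical tests first...
  has_items <- length(raw_json$result$items) > 0
  first_is_list <- has_items && is.list(raw_json$result$items[[1]])
  # ...then combine them in the if statement
  if (has_items && first_is_list) TRUE else FALSE
}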
Extract the access URL to the actual data from a data_list
extract_access_url(data_list)
data_list |
A data_list similar to resp$result$items[[1]] that contains information on a dataset |
Extract data from a data_list.
extract_data(data_list, encoding, guess_encoding, ...)
data_list |
A data_list similar to resp$result$items[[1]] that contains information on a dataset |
encoding |
The encoding passed to read (all) the files. Most cases should be resolved with either 'UTF-8', 'latin1' or 'ASCII'. |
guess_encoding |
A logical stating whether to guess the encoding. This is set to TRUE by default.
Whenever guess_encoding is set to TRUE, the 'encoding' argument is ignored. If the encoding cannot be guessed, the 'encoding' argument is used as a fallback. |
... |
Arguments passed to the underlying file-reading functions. |
get_data accepts the end path of a dataset and searches for its access URL. If the dataset is a csv, xls, xlsx or xml file, it attempts to read it. If it succeeds, it returns the data frame; if not, it returns a data frame with only one column containing all available access URLs.
For example, this URL: http://datos.gob.es/es/catalogo/a02002834-numero-de-centros-segun-ancho-de-banda-de-la-conexion-a-internet-que-tiene-el-centro6 says that it has an XML file, but once you click on 'download XML' it redirects to a JavaScript-based website that holds the table. That file is unfortunately unreadable by the package.
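As a hedged illustration of the signature documented above, assuming resp is a parsed response from the API (see get_resp below):

data_list <- resp$result$items[[1]]
# Read all files of the dataset with an explicit encoding instead of guessing
dtf <- extract_data(data_list, encoding = "latin1", guess_encoding = FALSE)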
Extract the names of the dataset's files from a data_list. For example, elecciones2016.csv and elecciones2014.csv.
extract_dataset_name(data_list)
data_list |
A data_list similar to resp$result$items[[1]] that contains information on a dataset |
Extract description from data_list
extract_description(data_list)
data_list |
A data_list similar to resp$result$items[[1]] that contains information on a dataset |
Extract the end path of the dataset that directs to datos.gob.es from a data_list
extract_endpath(data_list)
data_list |
A data_list similar to resp$result$items[[1]] that contains information on a dataset |
Extract keywords from data_list
extract_keywords(data_list)
data_list |
A data_list similar to resp$result$items[[1]] that contains information on a dataset |
Extract access languages available from data_list
extract_language(data_list)
data_list |
A data_list similar to resp$result$items[[1]] that contains information on a dataset |
Extract all metadata from a data_list
extract_metadata(data_list)
data_list |
A data_list similar to resp$result$items[[1]] that contains information on a dataset |
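Assuming the same kind of data_list as above, the extractors on these pages can be applied directly; a brief sketch:

data_list <- resp$result$items[[1]]
extract_dataset_name(data_list) # e.g. elecciones2016.csv
extract_keywords(data_list)
extract_metadata(data_list)     # all of the metadata at once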
Extract the date when the dataset was last modified from a data_list. The date is currently exported as a string but should be turned into a date class (see the sketch below).
extract_modified_date(data_list)
data_list |
A data_list similar to resp$result$items[[1]] that contains information on a dataset |
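One hedged way to apply the conversion suggested above, assuming the string is in 'YYYY-MM-DD' form (the format actually returned by the API may differ):

as.Date(extract_modified_date(data_list), format = "%Y-%m-%d")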
Extract the publisher code of the dataset from data_list
extract_publisher_code(data_list)
data_list |
A data_list similar to resp$result$items[[1]] that contains information on a dataset |
Extract the publisher name of the dataset from data_list
extract_publisher_name(data_list)
data_list |
A data_list similar to resp$result$items[[1]] that contains information on a dataset |
Extract the release date of the dataset from a data_list. The date is currently exported as a string but should be turned into a date class.
extract_release_date(data_list)
data_list |
A data_list similar to resp$result$items[[1]] that contains information on a dataset |
Extract URL from datos.gob.es from data_list
extract_url(data_list)
data_list |
A data_list similar to resp$result$items[[1]] that contains information on a dataset |
Extract the file format of the access URL from a data_list. For example, csv or xml.
extract_url_format(data_list)
data_list |
A data_list similar to resp$result$items[[1]] that contains information on a dataset |
Make GET requests with repeated trials
get_resp(ch_url, attempts_left = 5, ...)
ch_url |
A url, preferably built with make_url or one of the path_* helpers documented below. |
attempts_left |
Number of attempts to request from the website before giving up |
... |
Arguments passed to the underlying GET request. |
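A minimal sketch of the retrying behavior described above, assuming httr is used for the request (the package's actual implementation may differ):

library(httr)

get_resp_sketch <- function(ch_url, attempts_left = 5, ...) {
  resp <- tryCatch(GET(ch_url, ...), error = function(e) NULL)
  # Return the parsed response on success
  if (!is.null(resp) && status_code(resp) == 200) return(content(resp))
  if (attempts_left <= 1) stop("Cannot connect to ", ch_url)
  Sys.sleep(1) # brief pause before the next trial
  get_resp_sketch(ch_url, attempts_left - 1, ...)
}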
Make GET requests over several pages of an API
get_resp_paginated(ch_url, num_pages = 1, page = 0, ...)
ch_url |
URL to request from, preferably built with make_url or one of the path_* helpers. |
num_pages |
Number of pages to request |
page |
The page at which the request should begin. This should rarely be used |
... |
Arguments passed to the underlying GET request. |
The parsed JSON object as a list, but inside the items slot it contains all the data lists obtained from the pages specified in num_pages.
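A hedged sketch of that pagination, where the name of the next-page field in the response is an assumption:

get_resp_paginated_sketch <- function(ch_url, num_pages = 1, ...) {
  all_items <- list()
  for (i in seq_len(num_pages)) {
    resp <- get_resp(ch_url, ...)
    all_items <- c(all_items, resp$result$items)
    ch_url <- resp$result[["next"]] # assumed field pointing to the next page
    if (is.null(ch_url)) break      # stop early when no pages are left
  }
  resp$result$items <- all_items    # accumulate every page's data lists
  resp
}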
Check if publisher is available in opendataes
is_publisher_available(data_list)
data_list |
A data_list similar to resp$result$items[[1]] that contains information on a dataset |
Build a custom url using the httr url class
make_url(path, param, ...)
path |
the end path of the dataset of interest |
param |
arguments for a query |
... |
any other arguments used to build the path correctly |
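make_url is internal, but a rough sketch of building a comparable url with httr's url class is shown below; the 'apidata' base path and the query parameter are assumptions:

library(httr)

url <- structure(
  list(
    scheme   = "http",
    hostname = "datos.gob.es",
    path     = "apidata/catalog/dataset/some-end-path", # hypothetical end path
    query    = list(`_pageSize` = 50)                   # hypothetical parameter
  ),
  class = "url"
)
build_url(url)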
Explore datasets by keywords and publishers in https://datos.gob.es/
openes_keywords(keyword, publisher)
keyword |
A character string specifying a keyword to identify a data set. For example, 'vivienda'. |
publisher |
A character string with the publisher code. Should be only one publisher code.
See publishers_available for the available publisher codes. |
openes_keywords works only for searching one keyword for a given publisher. For example, 'viviendas' for the Ayuntamiento de Barcelona. If there are no matches for a keyword-publisher combination, openes_keywords will raise an error stating that there are no matches.
openes_keywords returns a data frame with the following columns:
description: a short description of each of the matched datasets in Spanish (Spanish is the default if available; if not, the first non-Spanish language is chosen).
publisher: the entity that publishes the dataset. See openes_load_publishers for all available publishers.
is_readable: whether that dataset is currently readable by openes_load. See permitted_formats for the currently readable formats.
path_id: the end path that identifies the dataset in the https://datos.gob.es/ API.
url: the complete url of the dataset in https://datos.gob.es/. Note that this URL is not the access URL to the data but the dataset's homepage in https://datos.gob.es/.
In most cases the user will need to narrow down their search because the result of openes_keywords will contain too many datasets. Beware that before passing the result of this function to openes_load, the final data frame needs to be narrowed down to only one dataset (that is, one row) and its structure needs to be the same as the original output of openes_keywords (same column names, in the same order). See the examples below.
A tibble containing the matched datasets.
## Not run:
library(dplyr)
kw <- openes_keywords("vivienda", "l01080193") # Ayuntamiento de Barcelona
kw

# Notice how we narrow down to only 1 dataset
dts <-
  kw %>%
  filter(grepl("Precios", description)) %>%
  openes_load('ASCII')

# Notice that we had to specify the encoding because printing the dataset
# returns an error. If that happens to you, try figuring out the encoding with
# readr::guess_encoding(dts$data[[1]]) and specify the most likely encoding
# in `openes_load`
dts$metadata
dts$data
## End(Not run)
Extract data and metadata from a given data set of https://datos.gob.es/
openes_load(x, encoding = "UTF-8", guess_encoding = TRUE, ...)
x |
A character string with the end path of a dataset, or the output of openes_keywords narrowed down to a single dataset (one row). |
encoding |
The encoding passed to read (all) the files. Most cases should be resolved with either 'UTF-8', 'latin1' or 'ASCII'. |
guess_encoding |
A logical stating whether to guess the encoding. This is set to TRUE by default.
Whenever guess_encoding is set to TRUE, the 'encoding' argument is ignored. If the encoding cannot be guessed, the 'encoding' argument is used as a fallback. |
... |
Arguments passed to the underlying file-reading functions. |
openes_load can return two possible outcomes: either an empty list or a list with one slot called metadata and another slot called data. Whenever the path_id argument is an invalid dataset path, openes_load returns an empty list. When path_id is a valid dataset path, openes_load returns a list with the two slots described above.
For the metadata slot, openes_load returns a tibble with most of the available metadata of the dataset. The columns are:
keywords: the available keywords from the dataset's homepage.
language: the available languages of the dataset's metadata. Note that this does not mean that the dataset itself is in different languages, only the metadata.
description: a short description of the data being read.
url: the complete url of the dataset in https://datos.gob.es/. Note that this URL is not the access URL to the data but the dataset's homepage in https://datos.gob.es/.
date_issued: the date at which the dataset was uploaded.
date_modified: the date at which the dataset was last modified. If the dataset has only been uploaded once, this will return 'No modification date available'.
publisher: the entity that publishes the dataset. See openes_load_publishers for all available publishers.
publisher_data_url: the homepage of the dataset on the publisher's website. This is helpful for looking up the definitions of the columns in the dataset.
The metadata of the API can sometimes be returned in an incorrect order. For example, there are cases where several languages are available but the different descriptions are not in the same order as the languages. If you find any of these errors, consider raising the issue directly with https://datos.gob.es/, as the package extracts all metadata in the same order as the API returns it.
Whenever the metadata is in different languages, the resulting tibble will have as many rows as there are languages, containing the text in each language and repeating the same information wherever it is identical across languages (such as the dates, which are language agnostic). In case the API returns empty requests, both data and metadata will be empty tibbles with the same column names.
For the data slot, openes_load returns a list containing at least one tibble. If the requested dataset has file formats that openes_load can read (see permitted_formats), it will read those files. If the dataset has several files, it returns a list of the same length as there are files, where each slot in that list is a tibble with the data. If for some reason any of the files cannot be read, openes_load has a fallback mechanism that returns the format it attempted to read together with its URL, so that the user can try to read the dataset directly. In any case, the result will always be a list of tibbles, where each one is either the requested dataset (success) or a dataset with the format and url that could not be read (failure).
Inside the data slot, each tibble in the list is named after the dataset that was read. When there is more than one dataset, the user can visit the website in the url column of the metadata slot to see the names of all the datasets. This is handy, for example, when the same dataset is repeated across time and we want to figure out which data is which from the slot.
The API of https://datos.gob.es/ is not completely homogeneous because it is an aggregator of many different API's from different cities and provinces of Spain. openes_load can only read a limited number of file formats, but that number will keep increasing as the package evolves. You can check the available file formats in permitted_formats. If the file format of the requested path_id is not readable, openes_load will return, inside the data slot, a list with only one tibble containing all available formats with their respective data URLs, so that users can read the data manually.
In a similar vein, in order for openes_load to provide the safest behavior, it is very conservative about which publishers it can read from https://datos.gob.es/. Because some publishers do not have standardized datasets, reading from many different publishers can become very messy. openes_load currently reads files from selected publishers because they offer standardized datasets, which makes them safer to read. As the package evolves and data quality improves across publishers, more publishers will be included. See the publishers that the package can read in publishers_available.
If path_id is a valid dataset path, a list with two slots, metadata and data, each containing tibbles that hold either the metadata or the data itself. If path_id is not a valid dataset path, an empty list is returned. See the details section for some caveats.
# For a dataset with only one file to read
example_id <- 'l01080193-fecundidad-en-madres-de-15-a-19-anos-de-la-ciudad-de-barcelona1'
some_data <- openes_load(example_id)

# Print the file to get some useful information
some_data

# Access the metadata
some_data$metadata

# Access the data. Note that the name of the dataset is in the list slot.
# Whenever there are different files being read, you might want to visit the
# homepage of the dataset in datos.gob.es with some_data$metadata$url or go
# directly to the homepage of the dataset at the publisher's website with
# some_data$metadata$publisher_data_url
some_data$data

# For a dataset with many files
## Not run:
example_id <- 'l01080193-domicilios-segun-nacionalidad'
res <- openes_load(example_id)

# Note that you can see how many files were read in '# of files read'
res

# See how all datasets were read but we're not sure what each one means.
# Check the metadata and read the description. If that doesn't do it,
# go to the URL of the dataset from the metadata.
res$data

# Also note that some of the datasets were not read uniformly correctly. For
# example, some of these datasets were read with more columns or more rows.
# This is left to the user to fix. We could've added new arguments to the
# `...` but those would apply to ALL datasets and it then becomes too
# complicated.

# Encoding problems
long <- "l01080193-descripcion-de-la-causalidad-de-los-accidentes"
string <- "-gestionados-por-la-guardia-urbana-en-la-ciudad-de-barcelona"
id <- paste0(long, string)
pl <- openes_load(id)

# The dataset is read successfully but once we print it, there's an error
pl$data
#> $`2011_ACCIDENTS_CAUSES_GU_BCN_.csv`
#> Error in nchar(x[is_na], type = "width") :
#>   invalid multibyte string, element 1

# This error is due to an encoding problem. We can use readr::guess_encoding
# to determine the encoding and reread. This suggests an ASCII encoding
library(readr)
guess_encoding(pl$data[[1]])

pl <- openes_load(id, 'ASCII')

# Success
pl$data

# For exploring datasets with openes_keywords and piping to openes_load
library(dplyr)
kw <- openes_keywords("turismo", "l01080193") # Ayuntamiento de Barcelona
kw

dts <-
  kw %>%
  filter(is_readable == TRUE, grepl("Tipos de propietarios", description)) %>%
  openes_load()

dts$metadata
dts$data
## End(Not run)
Request all available publishers from https://datos.gob.es/
openes_load_publishers()
a tibble with two columns: publisher_code and publishers
openes_load_publishers()
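For instance, to look up a publisher code by partial name, using the two columns named above:

pubs <- openes_load_publishers()
pubs[grepl("Barcelona", pubs$publishers), ]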
Build a url with a complete begin-end date/ prefix URL
path_begin_end_date(start_date, end_date, param = NULL, ...)
start_date |
Start date in YYYYMMDD format |
end_date |
End date in YYYYMMDD format |
param |
Extra parameters to add to the url. For this function this is useless because there are no further paths at the distribution end point; the argument is kept for consistency |
... |
Extra arguments passed to make_url |
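A brief illustration with the documented YYYYMMDD format (the dates are hypothetical):

path_begin_end_date("20190101", "20191231")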
Build a url with a complete catalog/ prefix URL
path_catalog(path, param = NULL, ...)
path |
the end path of the dataset of interest |
param |
arguments for a query |
... |
any other arguments used to build the path correctly. See make_url |
Build a url with a complete catalog/dataset prefix URL
path_catalog_dataset(path, param = NULL, ...)
path |
the end path of the dataset of interest |
param |
arguments for a query |
... |
any other arguments used to build the path correctly. See make_url |
Build a url with an ID of a dataset
path_dataset_id(id, param = NULL, ...)
id |
dataset id from datos.gob.es such as 'l01080193-numero-total-de-edificios-con-viviendas-segun-numero-de-plantas' |
param |
Extra parameters to add to the url. |
... |
Extra arguments passed to make_url |
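For example, with the id shown above (the exact url returned depends on the API's base path):

path_dataset_id('l01080193-numero-total-de-edificios-con-viviendas-segun-numero-de-plantas')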
Build a url with a complete datasets/ prefix URL
path_datasets(param = NULL, ...)
param |
Extra parameters to add to the url. For this function this is useless because there are no further paths at the dataset end point; the argument is kept for consistency |
... |
Extra arguments passed to make_url |
Build a url with a complete distribution/ prefix URL
path_distribution(param = NULL, ...)
param |
Extra parameters to add to the url. For this function this is useless because there are no further paths at the distribution end point; the argument is kept for consistency |
... |
Extra arguments passed to make_url |
Build a url to search for a given keyword in the datos.gob.es API
path_explore_keyword(keyword)
keyword |
A string with the keyword to build the path with. |
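For example, with a keyword used elsewhere in this documentation:

path_explore_keyword("vivienda")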
Build a url with a complete publishers/ prefix URL
path_publishers(param = NULL, ...)
param |
Extra parameters to add to the url. For this function this is useless because there are no further paths at the publishers end point; the argument is kept for consistency |
... |
Extra arguments passed to make_url |
Current readable formats from https://datos.gob.es/
permitted_formats
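Because permitted_formats is exported as data, checking whether a given format is readable is a simple membership test (csv is documented as readable elsewhere in this manual):

"csv" %in% permitted_formats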
Available publishers that 'opendataes' can read
publishers_available
An object of class tbl_df (inherits from tbl, data.frame) with 10 rows and 2 columns.
a tibble with two columns: publishers and publisher_code
publishers_available
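Since publishers_available is a small tibble, it can be filtered directly; for example, to find the entry for the Ayuntamiento de Barcelona used throughout the examples:

publishers_available[grepl("Barcelona", publishers_available$publishers), ]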
Translate publisher code to publisher name
translate_publisher(code)
code |
A publisher code |
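For example, with the publisher code used in the examples above, which this manual maps to the Ayuntamiento de Barcelona:

translate_publisher("l01080193")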