library(spanishoddata)
TL;DR (too long, didn’t read): For analysing more than 1 week of data, use spod_convert()
to convert the data into DuckDB
and spod_connect()
to connect to it for analysis using {dplyr}
. Skip to the section about it.
The main focus of this vignette is to show how to get long periods of origin-destination data for analysis. First, we describe and compare the two ways to get the mobility data using origin-destination data as an example. The package functions and overall approaches are the same for working with other types of data available through the package, such as the number of trips, overnight stays and any other data. Then we show how to get a few days of origin-destination data with spod_get()
. Finally, we show how to download and convert multiple weeks, months or even years of origin-destination data into analysis-ready formats. See description of datasets in the Codebook and cookbook for v1 (2020-2021) Spanish mobility data and in the Codebook and cookbook for v2 (2022 onwards) Spanish mobility data.
There are two main ways to import the datasets:
as an in-memory object with spod_get()
;
as a connection to DuckDB or Parquet files on disk with spod_convert()
+ spod_connect()
. The latter is recommended for large datasets (more than 1 week), as it is much faster and more memory efficient, as we demonstarte below.
spod_get()
returns objects that are only appropriate for small datasets representing a few days of the national origin-destination flows. We recommend converting the data into analysis-ready formats (DuckDB
or Parquet
) using spod_convert()
+ spod_connect()
. This will allow you to work with much longer time periods (months and years) on a consumer laptop (with 8-16 GB of memory). See the section below for more details.
The mobility datasets available through {spanishiddata}
are very large. Particularly the origin-destination data, which contains millions of rows. These data sets may not fit into the memory of your computer, especially if you plan to run the analysis over multiple days, weeks, months, or even years.
To work with these datasets, we highly recommend using DuckDB
and Parquet
. These are systems for efficiently processing larger-than-memory datasets, while being user-firendly by presenting the data in a familiar data.frame
/tibble
object (almost). For a great intoroduction to both, we recommend materials by Danielle Navarro, Jonathan Keane, and Stephanie Hazlitt: website, slides, and the video tutorial. You can also find examples of aggregating origin-destination data for flows analysis and visualisation in our vignettes on static and interactive flows visualisation.
Learning to use DuckDB
and Parquet
is easy for anyone who have ever worked with {dplyr}
functions such as select()
, filter()
, mutate()
, group_by()
, summarise()
, etc. However, since there is some learning curve to master these new tools, we provide some helper functions for novices to get started and easily open the datasets from DuckDB
and Parquet
. Please read the relevant sections below, where we first show how to convert the data, and then how to use it.
The main considerations to make when choosing between DuckDB
and Parquet
(that you can get with spod_convert()
+ spod_connect()
), as well as CSV.gz
(that you can get with spod_get()
) are analysis speed, convenience of data analysis, and the specific approach you prefer when getting the data. We discuss all three below.
The data format you choose may dramatically impact the speed of analysis (e.g. filtering by dates, calculating number of trips per hour, per week, per month, per origin-destination pair, and any other data aggregation or manipulation).
In our tests (see Figure 1), we found that conducting an analysis using DuckDB
database provided a significant speed advantage over using Parquet
and, more importantly, raw CSV.gz
files. Specifically, when comparing a query to determine the mean hourly trips over 18 months for each zone pair, we observed that using DuckDB
database was up to 5 times faster than using Parquet
files and up to 8 times faster than using CSV.gz
files.
For reference, here is a simple query we used for speed comparison in Figure 1:
# data represents either CSV files acquired from `spod_get()`, a `DuckDB` database or a folder of Parquet files connceted with `spod_connect()`
|>
data group_by(id_origin, id_destination, time_slot) |>
summarise(mean_hourly_trips = mean(n_trips, na.rm = TRUE),
.groups = "drop")
Figure 1 also shows that DuckDB
format will give you the best performance even on low-end systems with limited memory and number of processor cores, conditional on a fast SSD storage. Also note, that if you do choose to work with long time periods using CSV.gz files via spod_get()
, you will need to balance the amount of memory and processor cores via the max_n_cpu
and max_mem_gb
arguments, otherwise the analysis may fail (see the grey area in the figure), when there are too many parallel processes running at the same time with limited memory.
Regardless of the data format (DuckDB
, Parquet
, or CSV.gz
), the functions you will need for data manipulation and analysis are the same. This is because the analysis is actually performed by the DuckDB
(Mühleisen and Raasveldt 2024) engine, which presents the data as if it were a regular data.frame
/tibble
object in R (almost). So from that point of view, there is no difference between the data formats. You can manipulate the data using {dplyr}
functions such as select()
, filter()
, mutate()
, group_by()
, summarise()
, etc. In the end of any sequence of commands you will need to add collect()
to execute the whole chain of data manipulations and load the results into memory in an R data.frame
/tibble
. We provide examples in the following sections. Please refer to the recommended external tutorials and our own vignettes in the Analysing large datasets section.
The choice between converting to DuckDB
and Parquet
could also be made based on how you plan to work with the data. Specifically whether you want to just download long periods or even all available data, or if you want to get the data gradually, as you progress through with the analysis.
If you plan to work with long time periods, we recommend DuckDB
, as it is one big file and it is easier to update it completely. For example you may be working with all 2020 data. Later you decide to add all of 2021 data. In this case it would be better to delete the database and create it from scratch.
If you only want certain dates, analyse them and add additional dates later, Parquet
may be better, as each day is saved in a separate file, just like the original CSV files. Therefore updating a folder of Parquet
files is as easy as just creating a new file only for the missing date.
If you only work with a few individual days, you may not notice the advantages of the DuckDB
or Parquet
formats. In this case, you can keep using the CSV.gz
format for the analysis using the spod_get()
function. This is also useful for quick tutorials, where you only need one or two days of data for demonstration purposes.
Make sure you have loaded the package:
library(spanishoddata)
Choose where {spanishoddata}
should download (and convert) the data by setting the data directory following command:
spod_set_data_dir(data_dir = "~/spanish_od_data")
The function above will also ensure that the directory is created and that you have sufficient permissions to write to it.
You can also set the data directory with an environment variable:
Sys.setenv(SPANISH_OD_DATA_DIR = "~/spanish_od_data")
The package will create this directory if it does not exist on the first run of any function that downloads the data.
To permanently set the directory for all projects, you can specify the data directory globally by setting the SPANISH_OD_DATA_DIR
environment variable, e.g. with the following command:
::edit_r_environ()
usethis# Then set the data directory globally, by typing this line in the file:
SPANISH_OD_DATA_DIR = "~/spanish_od_data"
You can also set the data directory locally, just for the current project. Set the ‘envar’ in the working directory by editing .Renviron
file in the root of the project:
file.edit(".Renviron")
spod_get()
As you might have seen in the codebooks for v1 and v2 data, you can get a single day’s worth of data as an in-memory object with spod_get()
:
<- c("2024-03-01")
dates <- spod_get(type = "od", zones = "distr", dates = dates)
d_1 class(d_1)
The output should look like this:
# Source: table<od_csv_clean_filtered> [?? x 19]
# Database: DuckDB v1.0.0 [... 6.5.0-45-generic:R 4.4.1/:memory:]
date time_slot id_origin id_destination distance activity_origin
<date> <int> <fct> <fct> <fct> <fct>
1 2024-03-01 19 01009_AM 01001 0.5-2 frequent_activity
2 2024-03-01 15 01002 01001 10-50 frequent_activity
Note that this is a lazily-evaluated in-memory object (note the :memory:
in the database path). This means that the data is not loaded into memory until you call collect()
on it. While this is useful for quick exploration of the data, we do not recommended this for large datasets, as we have demonstrated above.
DuckDB
databasePlease make sure you did all the steps in the Setup section above.
DuckDB
You can download and convert the data into DuckDB
database in two steps. For example, you select a few dates, and download the data manually (note: we use dates_2
to refer to the fact that we are using v2 data):
<- c(start = "2023-02-14", end = "2023-02-17")
dates_2 spod_download(type = "od", zones = "distr", dates = dates_2)
After that, you can convert any downloaded data (including the files that might have been downloaded previosly by running spod_get()
or spod_download()
with other dates or date intervals) into DuckDB
like so (dates = "cached_v2"
means use all downloaded files):
<- spod_convert(type = "od", zones = "distr", dates = "cached_v2", save_format = "duckdb", overwrite = TRUE)
db_2 # check the path to the saved `DuckDB` database db_2
The dates = "cached_v2"
(which can also be dates = "cached_v1"
for v1 data) argument instructs the function to only work with already-downloaded files. By default this resulting DuckDB
database for v2 origin-destination data for districts will be saved in the SPANISH_OD_DATA_DIR
directory under v2/tabular/duckdb/
with filename od_distritos.duckdb
(you can change this file path with the save_path
argument). The function returns the full path to the database file, which we save into db_2
variable. You can also any desired save location with the save_path
argument of spod_convert()
.
You can also convert any dates range or dates list to DuckDB
:
<- c(start = "2020-02-17", end = "2020-02-19")
dates_1 <- spod_convert(type = "od", zones = "distr", dates = dates_1, overwrite = TRUE) db_2
In this case, any missing data that has not yet been downloaded will be automatically downloaded, while 2020-02-17 will not be redownloaded, as we already requsted it when creating db_1
. Then the requested dates will be converted into DuckDB
, overwriting the file with db_1
. Once again, we save the path to the output DuckDB
database file into db_2
variable.
DuckDB
You can read the introductory information on how to connect to DuckDB
files here, however to simplify things for you we created a helper function. So to connect to the data stored in at path db_1
and db_2
you can do the following:
<- spod_connect(db_2) my_od_data_2
Just like before, with spod_get()
funciton that we used to download raw CSV.gz files and analyse them without any conversion, the resulting object my_od_data_2
is also a tbl_duckdb_connection
. So, you can treat it as regular data.frame
or tibble
and use {dplyr}
functions such as select()
, filter()
, mutate()
, group_by()
, summarise()
, etc. For analysis, please refer to the recommended external tutorials and our own vignettes in the Analysing large datasets section.
After finishing working with my_od_data_2
we advise that you “disconnect” this data using:
spod_disconnect(my_od_data_2)
This is useful to free-up memory and is neccessary if you would like to run spod_convert()
again and save the data to the same location. Otherwise, it is also helpful to avoid unnecessary possible warnings in terminal for garbage collected connections.
Parquet
Please make sure you did all the steps in the Setup section above.
Parquet
The process is exactly the same as for DuckDB
above. The only difference is that the data is converted to parquet
format and stored in SPANISH_OD_DATA_DIR
under v1/clean_data/tabular/parquet/
directory for v1 data (change this with the save_path
argument), and the subfolders are in hive-style format like year=2020/month=2/day=14
and inside each of these folders a single parquet
file will be placed containing the data for that day.
The advantage of this format is that you can “update” it quickly. For example, if you first downloaded the data for March and April 2020, converted this period into parquet
format, and then downloaded the data for May and June 2020, when you run the convertion function again, it will only convert the data for May and June 2020 and add it to the existing parquet
files. So you will save time and will not have to wait for March and April 2020 to be converted again.
Let us convert a few dates to parquet
format:
<- "od"
type <- "distr"
zones <- c(start = "2020-02-14", end = "2020-02-17")
dates <- spod_convert(type = type, zones = zones, dates = dates, save_format = "parquet") od_parquet
If we now request additional dates that overlap with the already converted data like so and specifiy argument overwrite = 'update'
we will update the existing parquet
files with the new data:
<- c(start = "2020-02-16", end = "2020-02-19")
dates <- spod_convert(type = type, zones = zones, dates = dates, save_format = "parquet", overwrite = 'update') od_parquet
That is, 16 and 17 Feboruary will not be converted again. Only the new data, that was not converted (18 and 19 February) will be converted, and these will be added to the existing folder structure ofparquet
files stored at the default save_path
location, which is <data_dir>/clean_data/v1/tabular/parquet/od_distritos
. Alternatively, you can set any other save location by setting the save_path
argument.
Parquet
Working with these parquet
files is exactly the same as with DuckDB
and Arrow
files. Just like before, you can use the same helper function spod_connect()
to connect to the parquet
files:
<- spod_connect(od_parquet) my_od_data_3
Mind you though, because we have first converted the data for a period between 14 and 17 February 2020, and then converted the data for a period between 16 and 19 February 2020 into the save default location, the od_parquet
contains the path to all this data, and therefore my_od_data_3
will connect you to all data.
You can check this like so:
|>
my_od_data_3 ::distinct(date) |>
dplyr::arrange(date) dplyr
For analysis, please refer to the recommended external tutorials and our own vignettes in the Analysing large datasets section.
To prepare origin-destination data v1 (2020-2021) for analysis over the whole period of data availability, please follow the steps below:
<- spod_get_valid_dates(ver = 1)
dates_v1 <- spod_get_valid_dates(ver = 2) dates_v2
Warning
Due to mobile network outages, the data on certain dates is missing. Kindly keep this in mind when calculating mean monthly or weekly flows.
Please check the original data page for currently known missing dates. At the time of writing, the following dates are missing: 26, 27, 30, 31 October, 1, 2 and 3 November 2023 and 4, 18, 19 April 2024. You can use
spod_get_valid_dates()
function to get all available dates.
Here the example is for origin-destination on district level for v1 data. You can change the type
to “number_of_trips” and the zones
to “municipalities” for v1 data. For v2 data, just use dates
starting with 2022-01-01 or the dates_v2
from above. Use all other function arguments for v2 in the same way as shown for v1, but also consult the v2 data codebook, as it has many more datasets in addition to “origin-destination” and “number_of_trips”.
<- "origin-destination"
type <- "districts"
zones spod_download(
type = type,
zones = zones,
dates = dates_v1,
return_local_file_paths = FALSE, # to avoid getting all downloaded file paths printed to console
max_download_size_gb = 50 # in Gb, this should be well over the actual download size for v1 data
)
<- "duckdb"
save_format
<- spod_convert_data(
analysis_data_storage type = type,
zones = zones,
dates = "cached_v1", # to just convert all data that was previously downloaded, no need to specify dates here
save_format = save_format,
overwrite = TRUE
)
This will convert all downloaded data to DuckDB
format for lightning fast analysis. You can change the save_format
to parquet
if you want to save the data in Parquet
format. For comparison overview of the two formats please see the Converting the data to DuckDB/Parquet for faster analysis.
By default, spod_convert_data()
will save the converted data in the SPANISH_OD_DATA_DIR
directory. You can change the save_path
argument of spod_convert_data()
if you want to save the data in a different location.
For this conversion, 4 GB or operating memory should be enough, the speed of the process depends on the number of processor cores and the speed of your disk storage. SSD is preferred. By default, the spod_convert_data()
will use all except one processor cores on your computer. You can adjust this with the max_n_cpu
argument of spod_convert_data()
. You can also increase the maximum amount of memory used with the max_mem_gb
argument, but this makes more difference during the analysis stage.
Finally, analysis_data_storage
will simply store the path to the converted data. Either a path to the DuckDB
database file or a path to the folder with Parquet
files.
For reference, converting the whole v1 origin-destination data to DuckDB
takes about 20 minutes with 4 GB of memory and 3 processor cores. The final size of the DuckDB
database is about 18 GB, in Parquet
format - 26 GB. The raw CSV files in gzip archives are about 20GB. v2 data is much larger, with origin-destination tables for 2022 - mid-2024 taking up 150+ GB in raw CSV.gz format.
You can pass the analysis_data_storage
path to spod_connect()
function, whether it is DuckDB
or Parquet
. The function will determine the data type automatically and give you back a tbl_duckdb_connection
1.
<- spod_connect(
my_data data_path = analysis_data_storage,
max_mem_gb = 16
)
Here we set max_mem_gb
to 16 GB. Generally, if you have more, feel free to increase it, but also consult the Figure 1 with our speed testing results in the Speed section. You can try other combinations of max_mem_gb
and max_n_cpu
arguments for your needs
Compared to conversion process, you might want to increase the available memory for the analysis step. The more, the better. You can control that with the max_mem_gb
argument.
You can manipulate my_data
using {dplyr}
functions such as select()
, filter()
, mutate()
, group_by()
, summarise()
, etc. In the end of any sequence of commands you will need to add collect()
to execute the whole chain of data manipulations and load the results into memory in an R data.frame
/tibble
. For analysis, please refer to the recommended external tutorials and our own vignettes in the Analysing large datasets section.
After finishing working with my_data
we advise that you “disconnect” to free up memory:
spod_disconnect(my_data)
For reference: this object also has classes: tbl_dbi
,tbl_sql
, tbl_lazy
,and tbl
.↩︎