vignettes/V1_data_template.Rmd
Researchers fit models to travel data from many sources. We have therefore designed a generalized data frame template that standardizes travel data from various sources into a long-form format compatible with the modeling and simulation tools in this package. The travel_data_sim object contains a simulated example that illustrates the structure of the data. This example data set contains simulated location information and observed numbers of trips among origin and destination locations and within home locations. The travel_data_template object is an empty template that can be populated from scratch.
Since the long-form data structure is designed to accommodate different types of data, some columns may be left blank. For example, in a travel survey each row may represent an individual, whereas in call data records each row may represent the total trip count for an origin and destination pair.
In terms of spatial data, if your data contain locations down to administrative level 3, then levels 4 and 5 can be left blank and the functions will ignore them. Likewise, if all administrative units are in the same country, then the adm0 columns (orig_adm0 and dest_adm0) can be left blank.
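To make the idea of blank columns concrete, here is a minimal sketch using only a subset of the template columns. The values are toy values chosen for illustration, and the data frames are built by hand rather than from the package objects.

```r
# Survey-style data (toy values): one row per individual,
# so the individual-level columns are filled in.
survey_rows <- data.frame(
  indiv_id  = c(1L, 2L),
  indiv_age = c(34, 27),
  orig_adm2 = c("O", "O"),
  orig_adm3 = NA_character_,   # no data below admin level 2, left blank
  dest_adm2 = c("G", "S"),
  trips     = c(1, 3),
  stringsAsFactors = FALSE
)

# Aggregated data such as call data records (toy values): one row per
# origin-destination pair, so the individual-level columns stay NA.
cdr_rows <- data.frame(
  indiv_id  = NA_integer_,
  indiv_age = NA_real_,
  orig_adm2 = c("O", "S"),
  orig_adm3 = NA_character_,
  dest_adm2 = c("G", "U"),
  trips     = c(2, 79),
  stringsAsFactors = FALSE
)

# Both data types fit the same set of columns
identical(names(survey_rows), names(cdr_rows))
```

In both cases the column set is the same; only the columns that carry information differ.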
str(travel_data_sim)
#> 'data.frame': 46 obs. of 28 variables:
#> $ date_start: Date, format: "2020-01-01" "2020-01-01" ...
#> $ date_stop : Date, format: "2020-01-08" "2020-01-08" ...
#> $ date_span : 'difftime' num NA NA NA NA ...
#> ..- attr(*, "units")= chr "days"
#> $ indiv_id : int NA NA NA NA NA NA NA NA NA NA ...
#> $ indiv_age : num NA NA NA NA NA NA NA NA NA NA ...
#> $ indiv_sex : logi NA NA NA NA NA NA ...
#> $ indiv_type: chr NA NA NA NA ...
#> $ orig_adm0 : chr "A" "A" "A" "A" ...
#> $ orig_adm1 : chr "B" "B" "B" "B" ...
#> $ orig_adm2 : chr "O" "S" "N" "C" ...
#> $ orig_adm3 : chr NA NA NA NA ...
#> $ orig_adm4 : chr NA NA NA NA ...
#> $ orig_adm5 : chr NA NA NA NA ...
#> $ orig_type : chr "County" "County" "County" "County" ...
#> $ orig_x : num -91.5 -89.4 -92.4 -89.1 -90.3 ...
#> $ orig_y : num 30.4 30.8 29.8 29.8 29.3 ...
#> $ orig_pop : num 6360 7515 2839 3961 609 ...
#> $ dest_adm0 : chr "A" "A" "A" "A" ...
#> $ dest_adm1 : chr "B" "B" "B" "B" ...
#> $ dest_adm2 : chr "G" "U" "L" "O" ...
#> $ dest_adm3 : chr NA NA NA NA ...
#> $ dest_adm4 : chr NA NA NA NA ...
#> $ dest_adm5 : chr NA NA NA NA ...
#> $ dest_type : chr "County" "County" "County" "County" ...
#> $ dest_x : num -90.2 -89.2 -89.6 -86.4 -87.6 ...
#> $ dest_y : num 30.8 29.8 31.1 31.2 30.2 ...
#> $ dest_pop : num 4048 7355 9542 8603 7596 ...
#> $ trips : num 2 79 2 0 23 13 247 0 6 7 ...
Variable | Class | Description |
---|---|---|
date_start | Date | beginning of the time interval for the trip count |
date_stop | Date | end of the time interval for the trip count |
date_span | difftime | time span in days |
indiv_id | integer | unique individual identifier |
indiv_age | numeric | age of participant |
indiv_sex | logical | gender of participant |
indiv_type | character | if individual participants belong to different groups |
orig_adm0 | character | name of highest administration level of origin location (Country) |
orig_adm1 | character | name of administration level 1 of origin location (e.g. Division, State) |
orig_adm2 | character | name of administration level 2 of origin location (e.g. District, County) |
orig_adm3 | character | name of administration level 3 of origin location (e.g. Sub-district, Province) |
orig_adm4 | character | name of administration level 4 of origin location (e.g. City, Municipality) |
orig_adm5 | character | name of administration level 5 of origin location (e.g. Town, Village, Community, Ward) |
orig_type | character | administrative type for the origin location (e.g. sub-district, community vs town, or urban vs rural) |
orig_x | numeric | longitude of origin location centroid in decimal degrees (centroid of smallest admin unit) |
orig_y | numeric | latitude of origin location centroid in decimal degrees (centroid of smallest admin unit) |
orig_pop | numeric | population size of lowest administrative unit for origin location |
dest_adm0 | character | name of highest administration level of destination location (Country) |
dest_adm1 | character | name of administration level 1 of destination location (e.g. Division, State) |
dest_adm2 | character | name of administration level 2 of destination location (e.g. District, County) |
dest_adm3 | character | name of administration level 3 of destination location (e.g. Sub-district, Province) |
dest_adm4 | character | name of administration level 4 of destination location (e.g. City, Municipality) |
dest_adm5 | character | name of administration level 5 of destination location (e.g. Town, Village, Community, Ward) |
dest_type | character | administrative type for the destination location (e.g. sub-district, community vs town, or urban vs rural) |
dest_x | numeric | longitude of destination location centroid in decimal degrees (centroid of smallest admin unit) |
dest_y | numeric | latitude of destination location centroid in decimal degrees (centroid of smallest admin unit) |
dest_pop | numeric | population size of lowest administrative unit for destination location |
trips | numeric | total number of observed trips made from origin to destination during time span |
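When populating the template from an external source, it can help to confirm that the column names match the table above. The following is a minimal sketch: check_template_cols is a hypothetical helper written for this example and is not part of the package, and the column names are transcribed from the table.

```r
# Expected column names, transcribed from the table above
template_cols <- c(
  "date_start", "date_stop", "date_span",
  "indiv_id", "indiv_age", "indiv_sex", "indiv_type",
  paste0("orig_adm", 0:5), "orig_type", "orig_x", "orig_y", "orig_pop",
  paste0("dest_adm", 0:5), "dest_type", "dest_x", "dest_y", "dest_pop",
  "trips"
)

# Hypothetical helper (not part of the package): report which expected
# columns are missing from a data frame and which columns are unexpected.
check_template_cols <- function(df, expected = template_cols) {
  list(
    missing = setdiff(expected, names(df)),
    extra   = setdiff(names(df), expected)
  )
}

# Example with a deliberately incomplete toy data frame
toy <- data.frame(orig_adm2 = "O", dest_adm2 = "G", trips = 2, n_calls = 5)
res <- check_template_cols(toy)
res$extra
#> [1] "n_calls"
```

Starting from travel_data_template avoids this problem entirely, so a check like this is mostly useful when columns are renamed or dropped during data cleaning.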
This data template can be populated by starting with the travel_data_template object and adding rows. The code below starts by adding information on trips from an origin to a destination.
# Travel among some locations
trip <- travel_data_template
n <- 30 # number of locations
trip[1:n,] <- NA # add rows for each location
# Time span of travel data
trip$date_start <- as.Date("2020-01-01")
trip$date_stop <- trip$date_start + 7
trip$date_span <- difftime(trip$date_stop, trip$date_start, units='days')
# Origin and destination info: some counties within the same state
trip$orig_adm0 <- trip$dest_adm0 <- 'A' # Country
trip$orig_adm1 <- trip$dest_adm1 <- 'B' # State
trip$orig_adm2 <- sample(LETTERS, n, replace=TRUE)
trip$dest_adm2 <- sample(LETTERS, n, replace=TRUE)
trip$orig_type <- trip$dest_type <- 'County' # Type of admin unit for lowest admin level
# Some fake coordinates in decimal degrees
trip$orig_x <- rnorm(n, -90, 2)
trip$orig_y <- rnorm(n, 30, 1)
trip$dest_x <- rnorm(n, -90, 2)
trip$dest_y <- rnorm(n, 30, 1)
# Population sizes of the origins and destinations
trip$orig_pop <- rnbinom(n, size=5, mu=5000)
trip$dest_pop <- rnbinom(n, size=10, mu=10000)
trip$trips <- rnbinom(n, size=1, mu=100) # Number of reported trips
# Drop rows where origin and destination are the same location
trip <- trip[!(trip$orig_adm2 == trip$dest_adm2),]
In some cases it may be easier to fill in stays (the number of trips within the origin or home location) in a different data frame and then merge the two.
# Stays in home location
stay <- travel_data_template
origins <- unique(trip$orig_adm2) # all unique origin locations
stay[seq_along(origins),] <- NA
# Time span of travel survey
stay$date_start <- trip$date_start[1]
stay$date_stop <- trip$date_stop[1]
stay$date_span <- difftime(stay$date_stop, stay$date_start, units='days')
stay$orig_adm0 <- stay$dest_adm0 <- 'A' # Country
stay$orig_adm1 <- stay$dest_adm1 <- 'B' # State
stay$orig_adm2 <- stay$dest_adm2 <- origins
stay$orig_type <- stay$dest_type <- 'County'
for (i in seq_along(origins)) {
  # Copy coordinates and population from the first trip row with this origin
  sel <- which(trip$orig_adm2 == stay$orig_adm2[i])[1]
  stay$orig_x[i] <- stay$dest_x[i] <- trip$orig_x[sel]
  stay$orig_y[i] <- stay$dest_y[i] <- trip$orig_y[sel]
  stay$orig_pop[i] <- stay$dest_pop[i] <- trip$orig_pop[sel]
}
# Number of reported trips within home county
stay$trips <- rnbinom(length(origins), size=10, mu=1000)
# Combine trips and stays
suppressMessages(
  travel_data <- dplyr::full_join(trip, stay)
)
head(travel_data, n=3)
#> date_start date_stop date_span indiv_id indiv_age indiv_sex indiv_type
#> 1 2020-01-01 2020-01-08 7 days NA NA NA <NA>
#> 2 2020-01-01 2020-01-08 7 days NA NA NA <NA>
#> 3 2020-01-01 2020-01-08 7 days NA NA NA <NA>
#> orig_adm0 orig_adm1 orig_adm2 orig_adm3 orig_adm4 orig_adm5 orig_type
#> 1 A B H <NA> <NA> <NA> County
#> 2 A B H <NA> <NA> <NA> County
#> 3 A B P <NA> <NA> <NA> County
#> orig_x orig_y orig_pop dest_adm0 dest_adm1 dest_adm2 dest_adm3 dest_adm4
#> 1 -88.60990 30.21513 3892 A B W <NA> <NA>
#> 2 -87.97855 29.39085 2787 A B D <NA> <NA>
#> 3 -89.88991 28.81821 4543 A B Q <NA> <NA>
#> dest_adm5 dest_type dest_x dest_y dest_pop trips
#> 1 <NA> County -89.78925 29.62802 9172 43
#> 2 <NA> County -87.43692 29.95134 9833 41
#> 3 <NA> County -89.75648 29.55376 10191 153
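One way to inspect long-form data like this is to summarize it as an origin-by-destination matrix. The sketch below uses base R's xtabs on a small toy data frame with hypothetical values; it is not the package's own tooling, only an illustration of how trips and stays sit together in the long format.

```r
# Toy long-form rows (hypothetical values) containing trips and stays
od_long <- data.frame(
  orig_adm2 = c("H", "H", "P", "P"),
  dest_adm2 = c("H", "P", "H", "P"),
  trips     = c(900, 43, 41, 1100),
  stringsAsFactors = FALSE
)

# Sum trips for each origin (rows) and destination (columns)
od_matrix <- xtabs(trips ~ orig_adm2 + dest_adm2, data = od_long)
od_matrix

# Stays (origin equals destination) fall on the diagonal
diag(od_matrix)
```

Rows with the same origin and destination, such as the stays added above, end up on the diagonal, while trips between different locations fill the off-diagonal cells.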