library("tibble")
library("jsonlite")
library("purrr")
#>
#> Attaching package: 'purrr'
#> The following object is masked from 'package:jsonlite':
#>
#> flatten
The audience for this article is the developers of boxr, who may let many weeks or months pass without actively thinking about how the functions in this package:
- are set up; there is some variation here
- could be set up; Ian still argues with himself here, approaching with a different view every time he works on this repository.
At its heart, the goal of this package is to abstract away the complexities of using the Box API. We assume that a new user starts using this package with some familiarity with the Tidyverse, and r-lib packages like fs, so we aim to provide them with a familiar way of doing things.
Providing familiarity, particularly to emulate an opinionated framework like Tidyverse, requires us (as boxr developers) to introduce opinions. Thus, we also wish to provide an “escape hatch”, which could be used by those who want to work outside of the Tidyverse, or outside of our opinions.
In the Tidyverse, the base unit of analysis is the data frame. Among boxr’s developers, it is uncontroversial that we should use data frames as much as possible. However, data frames come in different flavors:
- use tibble, or no.
- use nested data frames, or no.
Detour into Postel’s Law
I (Ian) am a firm believer that following Postel’s Law helps us (and our users) avoid hard-to-diagnose trouble. As you may know, Postel’s Law says to be “flexible in what you accept; strict in what you return”. In other words, we should strive to accept and interpret users’ input so long as the intent is clear, but we should specify very clearly what a function returns and adhere strictly to that specification.
A famous Tidyverse example is how subsetting a data.frame will, by default, return a vector rather than a data.frame if only one column is specified:
str(mtcars[, c("wt", "mpg")])
#> 'data.frame': 32 obs. of 2 variables:
#> $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
#> $ mpg: num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
str(mtcars[, "mpg"])
#> num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
To avoid this behavior you can specify drop = FALSE, but this is sometimes forgotten, even by experienced R users:
str(mtcars[, "mpg", drop = FALSE])
#> 'data.frame': 32 obs. of 1 variable:
#> $ mpg: num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
The tibble designs this problem away. Following Postel’s Law, subsetting a tibble always returns a tibble; if you want a vector, you have to call another function. It is strict with its output.
str(as_tibble(mtcars)[, "mpg"])
#> tibble [32 × 1] (S3: tbl_df/tbl/data.frame)
#> $ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
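If a vector is what you want, the request has to be explicit. A minimal sketch, assuming only the tibble package that is loaded at the top of this article:

```r
library(tibble)

tbl <- as_tibble(mtcars)

# `[` with a single column still returns a tibble
one_col <- tbl[, "mpg"]

# `[[` (or `$`) is the explicit way to ask for the underlying vector
mpg <- tbl[["mpg"]]

is_tibble(one_col)
is.numeric(mpg)
```

The caller decides when to leave the tabular form; the subsetting function never decides for them.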
As we figure out what our functions return, I want to keep Postel’s Law in mind.
Box API
The boxr package is an exercise in abstracting away the Box API; sometimes this abstraction helps developers like me forget that it is actually there. It’s there.
The API is classified according to endpoints and resources; I think of these as analogous to R functions and objects. The Box API is comprehensive; we cannot possibly aspire to cover it all. Instead, our goal is to provide easy access to as many day-to-day endpoints as we can, and provide a way to help you to access others if you need to.
Some of our functions call only one endpoint, e.g. box_ls() calls only the list items in folder endpoint. Others call multiple endpoints, e.g. box_fetch() calls the list-items endpoint, as well as the download file endpoint.
If a function calls a single endpoint (perhaps even repeatedly), it should return the response (or collection of responses) that the API returns. Consider the content of a sample response from the list-items endpoint:
content <-
fromJSON(
'{
"entries": [
{
"id": "12345",
"etag": "1",
"type": "file",
"sequence_id": "3",
"name": "Contract.pdf",
"sha1": "85136C79CBF9FE36BB9D05D0639C70C265C18D37",
"file_version": {
"id": "12345",
"type": "file_version",
"sha1": "134b65991ed521fcfe4724b7d814ab8ded5185dc"
}
}
],
"limit": 1000,
"offset": 2000,
"order": [
{
"by": "type",
"direction": "ASC"
}
],
"total_count": 5000
}',
simplifyVector = FALSE
)
The sample response shown on the Box web-page is different from the response that I actually get. The example JSON, in the "entries" element, does not quote numeric values, e.g. {"id": 0}, whereas the actual response does quote numeric values, e.g. {"id": "0"}.
While this may seem inconvenient, it may help us out because although elements like file id are nominally integers, they are often larger than R’s integer-maximum. For this reason, I think that from boxr’s perspective, id should remain a character string. That said, I think we can parse other things:
- other, smaller, numbers as integers, in this case "etag", "sequence_id".
- datetimes; these are elements that seem to end with "_at".
- logicals; these are elements that seem to start with "is_", "can_", or "has_".
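To make the point about ids concrete, here is a small check that uses one of the file ids that appears later in this article. It relies only on base R:

```r
file_id <- "721629732867"

# R integers are 32-bit, so this is the largest value an integer can hold
.Machine$integer.max

# coercing a large id to integer does not work: the result is NA (with a warning)
id_int <- suppressWarnings(as.integer(file_id))
is.na(id_int)

# keeping the id as a character string sidesteps the problem
id_chr <- file_id
nchar(id_chr)
```

A double could hold this particular id, but a character string avoids any surprises with printing or precision.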
Here’s the parsed content.
str(content)
#> List of 5
#> $ entries :List of 1
#> ..$ :List of 7
#> .. ..$ id : chr "12345"
#> .. ..$ etag : chr "1"
#> .. ..$ type : chr "file"
#> .. ..$ sequence_id : chr "3"
#> .. ..$ name : chr "Contract.pdf"
#> .. ..$ sha1 : chr "85136C79CBF9FE36BB9D05D0639C70C265C18D37"
#> .. ..$ file_version:List of 3
#> .. .. ..$ id : chr "12345"
#> .. .. ..$ type: chr "file_version"
#> .. .. ..$ sha1: chr "134b65991ed521fcfe4724b7d814ab8ded5185dc"
#> $ limit : int 1000
#> $ offset : int 2000
#> $ order :List of 1
#> ..$ :List of 2
#> .. ..$ by : chr "type"
#> .. ..$ direction: chr "ASC"
#> $ total_count: int 5000
In the content list, only the entries element has lasting information; the other elements deal with the pagination.
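As a sketch of how the pagination elements might be used, the loop below keeps requesting pages until the offset reaches total_count. Here fake_page() and collect_entries() are hypothetical names; fake_page() stands in for a call to the API and is not part of boxr. The real endpoint may behave differently (for example, Box also offers marker-based pagination).

```r
# hypothetical stand-in for one API call: returns a page shaped like the
# list-items response (entries, offset, limit, total_count)
fake_page <- function(offset, limit, all_items = as.character(1:5)) {
  total <- length(all_items)
  index <- seq_len(total)
  keep <- index > offset & index <= offset + limit
  list(
    entries = lapply(all_items[keep], function(id) list(id = id, type = "file")),
    offset = offset,
    limit = limit,
    total_count = total
  )
}

# collect the entries from every page, using offset-based pagination
collect_entries <- function(limit = 2) {
  entries <- list()
  offset <- 0
  repeat {
    page <- fake_page(offset, limit)
    entries <- c(entries, page$entries)
    # the next page starts where this one ended
    offset <- page$offset + page$limit
    if (offset >= page$total_count) break
  }
  entries
}

all_entries <- collect_entries()
length(all_entries)
```

Once the pages are combined, only the entries need to be carried forward; offset, limit, and total_count have done their job.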
# we could imagine this as a function that would contain all our parsing rules
parse_entry <- function(entry) {
# if we import tidyselect, we can use functions like `ends_with()`
entry <- purrr::map_at(entry, c("etag", "sequence_id"), as.numeric)
entry <- purrr::map_if(entry, is.list, parse_entry)
entry
}
entries <-
content$entries %>%
map(parse_entry)
str(entries)
#> List of 1
#> $ :List of 7
#> ..$ id : chr "12345"
#> ..$ etag : num 1
#> ..$ type : chr "file"
#> ..$ sequence_id : num 3
#> ..$ name : chr "Contract.pdf"
#> ..$ sha1 : chr "85136C79CBF9FE36BB9D05D0639C70C265C18D37"
#> ..$ file_version:List of 3
#> .. ..$ id : chr "12345"
#> .. ..$ type: chr "file_version"
#> .. ..$ sha1: chr "134b65991ed521fcfe4724b7d814ab8ded5185dc"
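The other rules listed above (datetimes ending in "_at"; logicals starting with "is_", "can_", or "has_") could be folded in by matching on element names. This is a sketch only: parse_box_datetime() and parse_entry_by_name() are hypothetical names, the sample entry is made up, and the datetime format is assumed to look like 2012-12-12T10:53:43-08:00. It also uses as.integer() where parse_entry() uses as.numeric().

```r
# hypothetical helper: convert an RFC 3339-style string to POSIXct
parse_box_datetime <- function(x) {
  # "%z" expects an offset like -0800, so drop the colon from -08:00
  x <- sub("([+-][0-9]{2}):([0-9]{2})$", "\\1\\2", x)
  x <- sub("Z$", "+0000", x)
  as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%S%z", tz = "UTC")
}

# hypothetical variant of parse_entry() that applies rules by element name
parse_entry_by_name <- function(entry) {
  nms <- names(entry)
  if (is.null(nms)) nms <- rep("", length(entry)) # unnamed lists

  is_nested <- vapply(entry, is.list, logical(1))
  is_int <- nms %in% c("etag", "sequence_id") & !is_nested
  is_dtm <- grepl("_at$", nms) & !is_nested
  is_lgl <- grepl("^(is|can|has)_", nms) & !is_nested

  entry[is_int] <- lapply(entry[is_int], as.integer)
  entry[is_dtm] <- lapply(entry[is_dtm], parse_box_datetime)
  entry[is_lgl] <- lapply(entry[is_lgl], as.logical)
  # recurse into nested lists, e.g. file_version
  entry[is_nested] <- lapply(entry[is_nested], parse_entry_by_name)

  entry
}

# made-up entry, for illustration only
entry <- list(
  id = "12345",
  etag = "1",
  sequence_id = "3",
  created_at = "2012-12-12T10:53:43-08:00",
  is_externally_owned = "true",
  file_version = list(id = "67890", type = "file_version")
)

parsed_entry <- parse_entry_by_name(entry)
str(parsed_entry)
```

Note that id is left alone, in line with the reasoning above that it should remain a character string.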
Here’s where things get interesting.
As it stands, many of boxr’s functions, e.g. box_ls(), will return the entries as a list of lists, attaching the S3 class boxr_object_list. It is minimally processed, allowing you to do with it as you please.
This S3 class has an as.data.frame() method which will convert the element into a data frame. (If you want a data frame 99% of the time, it is inconvenient to call as.data.frame() 99% of the time.)
It behaves much like the internal function we have, stack_rows_df():
boxr:::stack_rows_df(entries)
#> id etag type sequence_id name
#> 1 12345 1 file 3 Contract.pdf
#> sha1 file_version.id file_version.type
#> 1 85136C79CBF9FE36BB9D05D0639C70C265C18D37 12345 file_version
#> file_version.sha1
#> 1 134b65991ed521fcfe4724b7d814ab8ded5185dc
For those who prefer tibbles, we have another function, stack_rows_tbl():
boxr:::stack_rows_tbl(entries)
#> # A tibble: 1 × 7
#> id etag type sequence_id name sha1 file_version
#> <chr> <dbl> <chr> <dbl> <chr> <chr> <list>
#> 1 12345 1 file 3 Contract.pdf 85136C79CBF9FE36BB9D0… <named list>
A couple of things you might notice:
- stack_rows_df() returns a data.frame. List items are unnested; the nested item names are delimited with a period (.), e.g. file_version.id.
- stack_rows_tbl() returns a tibble. List items remain nested.
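To illustrate the difference, here is one way the two behaviors could be reproduced. These are not boxr’s implementations: stack_df_sketch() and stack_tbl_sketch() are hypothetical stand-ins, the entries are made up, and the sketch assumes the vctrs package is installed (it is a dependency of tibble).

```r
# made-up, already-parsed entries
entries <- list(
  list(id = "1", etag = 0, file_version = list(id = "10", type = "file_version")),
  list(id = "2", etag = 2, file_version = list(id = "20", type = "file_version"))
)

# hypothetical stand-in for stack_rows_df(): as.data.frame() flattens nested
# lists, joining parent and child names with "."
stack_df_sketch <- function(entries) {
  rows <- lapply(entries, as.data.frame, stringsAsFactors = FALSE)
  do.call(rbind, rows)
}

# hypothetical stand-in for stack_rows_tbl(): nested lists are wrapped so that
# each becomes a single cell in a list-column
stack_tbl_sketch <- function(entries) {
  rows <- lapply(entries, function(entry) {
    is_nested <- vapply(entry, is.list, logical(1))
    entry[is_nested] <- lapply(entry[is_nested], list)
    tibble::as_tibble(entry)
  })
  do.call(vctrs::vec_rbind, rows)
}

flat <- stack_df_sketch(entries)
nested <- stack_tbl_sketch(entries)

names(flat)
names(nested)
```

The flat form is convenient for printing and filtering; the nested form keeps each sub-object intact so that it can be unpacked later.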
boxr functions
Right now, we have a few different ways to deal with return objects:
- box_version_history(): calls a single endpoint, returns a data frame, but we modify the columns: combining type and id into version_id.
- box_collab_create(): calls a single endpoint, returns a list with an S3 class "boxr_collab". This S3 class has an as.data.frame() method, and an as_tibble() method.
- box_ls(): calls a single endpoint, returns a list with an S3 class "boxr_object_list". This S3 class has an as.data.frame() method.
- box_fetch(): calls multiple endpoints, returns a list with an S3 class "boxr_dir_wide_operation_result". This S3 class does not have an as.data.frame() method.
The goal is to find a way to harmonize this, without causing too many backward incompatibilities.
Ideas for how to proceed
I’m thinking out loud here to sketch out ways to proceed so that we provide a consistent return object:
- day-to-day users receive a data-frame-like return object, in some “optimally-wrangled” form.
- other users can emulate the process and get the information they need.
We will walk through a simplified reimagining of the box_ls() function.
Using `BOX_CLIENT_ID` from environment
Using `BOX_CLIENT_SECRET` from environment
boxr: Authenticated using OAuth2 as Ian LYTTLE (ian.lyttle@se.com, id: 196942982)
Single function to call the API
Let’s imagine a single function in the package that calls the API. It will be more involved than this, but it will give you an idea.
# this works for Ian's Box account - no-one else
dir_id <- "123053109701"
# returns a httr response object
box_api_response <- function(verb, endpoint) {
response <-
httr::RETRY(
verb,
glue::glue("https://api.box.com/2.0/{endpoint}"),
boxr:::get_token(),
terminate_on = boxr:::box_terminal_http_codes()
)
response
}
response <- box_api_response("GET", glue::glue("folders/{dir_id}/items/"))
response
Response [https://api.box.com/2.0/folders/123053109701/items/]
Date: 2020-10-17 01:04
Status: 200
Content-Type: application/json
Size: 640 B
Extract content
At this point, we have no idea if the response is any good or not, nor have we extracted the content.
box_content <- function(response, task = NULL) {
httr::stop_for_status(response, task = task)
text <- httr::content(response, as = "text", encoding = "UTF-8")
# we may want to deviate from the defaults
content <- jsonlite::fromJSON(text, simplifyDataFrame = FALSE)
content
}
This lets someone get a JSON list, or an error message if the response is bad.
content <- box_content(response, task = "get directory listing")
str(content)
List of 5
$ total_count: int 2
$ entries :List of 2
..$ :List of 7
.. ..$ type : chr "file"
.. ..$ id : chr "721629732867"
.. ..$ file_version:List of 3
.. .. ..$ type: chr "file_version"
.. .. ..$ id : chr "767453805267"
.. .. ..$ sha1: chr "c66f70f6c65f8cd381434a56165640d50fb3a9c2"
.. ..$ sequence_id : chr "0"
.. ..$ etag : chr "0"
.. ..$ sha1 : chr "c66f70f6c65f8cd381434a56165640d50fb3a9c2"
.. ..$ name : chr "another-attempt-at-dark-mode.pdf"
..$ :List of 7
.. ..$ type : chr "file"
.. ..$ id : chr "721628453889"
.. ..$ file_version:List of 3
.. .. ..$ type: chr "file_version"
.. .. ..$ id : chr "767454763288"
.. .. ..$ sha1: chr "69ad086c3f8d96b991b8f8bcce67b95708397b1a"
.. ..$ sequence_id : chr "2"
.. ..$ etag : chr "2"
.. ..$ sha1 : chr "69ad086c3f8d96b991b8f8bcce67b95708397b1a"
.. ..$ name : chr "ctz-widget.txt"
$ offset : int 0
$ limit : int 100
$ order :List of 2
..$ :List of 2
.. ..$ by : chr "type"
.. ..$ direction: chr "ASC"
..$ :List of 2
.. ..$ by : chr "name"
.. ..$ direction: chr "ASC"
Parse content
Now, it may be interesting to parse the content into a list. We can use the parse_entry() function from above. Note that some endpoints return an entries element, others don’t. This one does.
box_parse_entries <- function(entries) {
purrr::map(entries, parse_entry)
}
parsed <- box_parse_entries(content$entries)
str(parsed)
List of 2
$ :List of 7
..$ type : chr "file"
..$ id : chr "721629732867"
..$ file_version:List of 3
.. ..$ type: chr "file_version"
.. ..$ id : chr "767453805267"
.. ..$ sha1: chr "c66f70f6c65f8cd381434a56165640d50fb3a9c2"
..$ sequence_id : num 0
..$ etag : num 0
..$ sha1 : chr "c66f70f6c65f8cd381434a56165640d50fb3a9c2"
..$ name : chr "another-attempt-at-dark-mode.pdf"
$ :List of 7
..$ type : chr "file"
..$ id : chr "721628453889"
..$ file_version:List of 3
.. ..$ type: chr "file_version"
.. ..$ id : chr "767454763288"
.. ..$ sha1: chr "69ad086c3f8d96b991b8f8bcce67b95708397b1a"
..$ sequence_id : num 2
..$ etag : num 2
..$ sha1 : chr "69ad086c3f8d96b991b8f8bcce67b95708397b1a"
..$ name : chr "ctz-widget.txt"
Stack in tabular form
The parsed content (here at least) is a list of lists. We can stack this into a tibble from the parsed info:
tbl <- boxr:::stack_rows_tbl(parsed)
tbl
# A tibble: 2 x 7
type id file_version sequence_id etag sha1 name
<chr> <chr> <list> <dbl> <dbl> <chr> <chr>
1 file 72162973… <named list [… 0 0 c66f70f6c65f8cd381434a… another-attempt-at…
2 file 72162845… <named list [… 2 2 69ad086c3f8d96b991b8f8… ctz-widget.txt
Wrangle
For this function, we do not propose any post-processing of the stacked content. However, box_version_history() does this: combining type and id into version_id.
All together
We now have the building blocks for our reimagined box_ls() function:
box_dir_info <- function(dir_id) {
response <- box_api_response("GET", glue::glue("folders/{dir_id}/items/"))
entries <- box_content(response, task = "get directory listing")[["entries"]]
# The above is an oversimplification. In actuality, these two functions
# would be combined into one function that would take care of the pagination,
# something like:
#
# entries <-
# box_api_entries(
# "GET",
# endpoint = glue::glue("folders/{dir_id}/items/"),
# task = "get directory listing"
# )
#
# box_api_entries() would call box_api_response() and box_content()
parsed <- box_parse_entries(entries)
stacked <- boxr:::stack_rows_tbl(parsed)
# not doing anything here, but box_version_history() changes some columns
wrangled <- stacked
wrangled
}
box_dir_info(dir_id)
# A tibble: 2 x 7
type id file_version sequence_id etag sha1 name
<chr> <chr> <list> <dbl> <dbl> <chr> <chr>
1 file 72162973… <named list [… 0 0 c66f70f6c65f8cd381434a… another-attempt-at…
2 file 72162845… <named list [… 2 2 69ad086c3f8d96b991b8f8… ctz-widget.txt
There are five distinct steps, each of which could be adapted to particular circumstances, each of which could be exposed to the user so they can “roll their own”:
1. get the response from the Box API.
2. check the response and extract the content.
3. parse the content (convert strings to datetimes, etc.).
4. stack the parsed content into a canonical tabular form (data frame or tibble).
5. wrangle the stacked content (rename columns, etc.).
Also, there would be three “families” of functions:
- those that make potentially multiple calls to a single endpoint, where the response has entries (implying pagination), e.g. box_ls().
- those that make a single call to a single endpoint, e.g. box_collab_create().
- those that make all sorts of calls to all sorts of endpoints, e.g. box_fetch().
The point of this vignette, in its current form, is to sketch out how the first two families might work. The third family will require more consideration and considerably more coffee.
This could simplify the creation of new box functions, and perhaps let us simplify some existing ones. We could export box_api_response(), box_content(), box_parse_entries() (and box_parse_entry()), and stack_rows_tbl(); this would allow someone to access the Box API themselves, much more easily.
Of course, the functions would have better-thought-out names, and would be more complicated themselves. However, the areas of responsibility for each function would be the same.
Questions and lingering issues
What should be the canonical form of data that we return?
- I can see an argument for the tibble, given that we are Tidyverse-friendly already. I can also see the appeal of keeping things nested.
- I can also appreciate the appeal of the data frame, as Nate put it: “The US Dollar of data analysis”.
- Should we pick one of these, we can always provide a selection of helpers, e.g. box_tibble(), box_nest(), box_unnest(), box_data_frame(). These functions could be used to translate among the formats.
One way that we can avoid “breaking changes” is to create a new function with a new name for the new functionality. We can then “supersede” or “deprecate” the old function.
The problem comes when an old function has a really good name.
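As a sketch of the new-name approach, the old function can warn and then delegate to the new one. The names box_ls_new() and box_ls_old() are hypothetical and the function bodies are placeholders; base R’s .Deprecated() is used here as an assumption, and the lifecycle package offers similar tools.

```r
# hypothetical new function, returning the new canonical form
box_ls_new <- function(dir_id) {
  data.frame(dir_id = dir_id, stringsAsFactors = FALSE)
}

# hypothetical old function: warns about the deprecation, then delegates
box_ls_old <- function(dir_id) {
  .Deprecated("box_ls_new")
  box_ls_new(dir_id)
}

# existing code keeps working, but the caller sees a deprecation warning
result <- suppressWarnings(box_ls_old("0"))
```

This keeps old code running while steering users toward the new function. It does not solve the problem of the old function holding the better name.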
Documentation normalization
Another thing we would like to do is to make the documentation simpler for us to maintain. With this release, we take two steps in that direction:
- an internal function string_side_effects():
    boxr:::string_side_effects()
    #> [1] "Invisible `NULL`, called for side effects."
  This is useful to specify a return value:
    #' @return `r string_side_effects()`
- canonical parameter-definitions:
  - box_browse(): file_id, dir_id
  - box_dl(): local_dir, file_name, overwrite, version_id, version_no (file_id also available)
  - box_ul(): description (dir_id also available)
  This cuts down on the possibilities for invoking different functions when we need only invoke one or two:
    #' @inheritParams box_browse
- As we notice more duplication, we can add to this section.