An introduction to the cassowaryr package

About

The term scagnostics refers to scatter plot diagnostics, originally described by John and Paul Tukey (1985). Scagnostics are a collection of techniques for automatically extracting interesting visual features from pairs of variables. This package is a pure R implementation of the graph-theoretic scagnostics developed by Wilkinson, Anand, and Grossman (2005), and it is designed to integrate easily into a tidy data workflow.

The cassowaryr package provides functions to compute scagnostics on pairs of numeric variables in a data set.

The package’s primary use is as a step in exploratory data analysis, to give users an idea of the shape of their data and identify interesting pairwise relationships.

Installation

The package can be installed from CRAN using

install.packages("cassowaryr")

The development version can be installed from GitHub using

remotes::install_github("numbats/cassowaryr")

Examples

Calculating the scagnostics

The usage is illustrated with the package’s example data, the datasauRus dozen, which is also available in the datasauRus package. It contains several pairs of variables that have the same mean, variance and correlation but strikingly different visual features. We will use a handful of these pairs to show how to use the cassowaryr package. Here is a plot of the selected datasauRus dozen pairs:

library(cassowaryr)
library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# pick examples
exampledata <- datasaurus_dozen %>%
  filter(dataset %in% c("slant_up", "circle", "dots", "away"))


# plot them
exampledata %>%
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  facet_wrap("dataset") +
  theme_minimal() +
  theme(legend.position = "none", aspect.ratio = 1)
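
The claim that these pairs share their summary statistics can be checked directly. The following is a minimal sketch, assuming the datasaurus_dozen data frame has the columns dataset, x and y used above; the object name summary_stats is introduced here for illustration only.

```r
library(cassowaryr)
library(dplyr)

# Per-dataset means, standard deviations and correlation for the four
# selected examples; the rows should be close to identical.
summary_stats <- datasaurus_dozen %>%
  filter(dataset %in% c("slant_up", "circle", "dots", "away")) %>%
  group_by(dataset) %>%
  summarise(
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x   = sd(x),
    sd_y   = sd(y),
    cor_xy = cor(x, y)
  )

summary_stats
```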

There are several ways to calculate scagnostics from a data frame. If we simply have two variables on which we want to calculate several scagnostics, we pass them to calc_scags() as vectors.

calc_scags(exampledata$x, exampledata$y, scags=c("clumpy2", "convex", "grid")) %>%
  knitr::kable(digits=4, align="c")
grid    clumpy2  convex
0.1335  0        0.6987
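
Note that exampledata still contains all four data sets, so the call above describes the pooled scatter plot. As a sketch, the same function can be applied to a single data set by filtering first; the object names circle and circle_scags are illustrative.

```r
library(cassowaryr)
library(dplyr)

# Keep only the circle data set, then pass its two columns as vectors.
circle <- datasaurus_dozen %>%
  filter(dataset == "circle")

circle_scags <- calc_scags(circle$x, circle$y, scags = c("clumpy2", "convex"))
circle_scags
```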

If instead we have a data frame in long form, with two variables and a grouping variable, we can call calc_scags() inside group_by() and summarise() to get the scagnostics for each group.

longscags <- exampledata %>%
  group_by(dataset) %>%
  summarise(calc_scags(x, y, scags=c("clumpy2", "convex", "grid", "dcor")))
longscags %>%
  knitr::kable(digits=4, align="c")
dataset   grid    clumpy2  convex  dcor
away      0.0687  0.0903   0.5789  0.1326
circle    0.7470  0.0000   0.0102  0.2292
dots      0.1034  0.9698   0.0009  0.1266
slant_up  0.1074  0.8397   0.2886  0.1932
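
This pattern relies on summarise() accepting an unnamed expression that returns a data frame and spreading its columns across the result. The sketch below shows the mechanism with a hypothetical helper, toy_stats(), standing in for calc_scags(); it assumes, as the output above suggests, that calc_scags() returns one row per call. The data are made up.

```r
library(dplyr)

# Hypothetical stand-in for calc_scags(): returns a one-row tibble.
toy_stats <- function(x, y) {
  tibble(range_x = diff(range(x)), cor_xy = cor(x, y))
}

# Made-up long-form data: two groups, two numeric variables.
toy <- tibble(
  g = rep(c("a", "b"), each = 4),
  x = c(1, 2, 3, 4, 1, 2, 3, 4),
  y = c(2, 4, 6, 8, 8, 6, 4, 2)
)

# The unnamed data-frame result is unpacked into one column per statistic.
res <- toy %>%
  group_by(g) %>%
  summarise(toy_stats(x, y))

res
```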

Finally, if we have a wide data set consisting of only numerical variables, we can use calc_scags_wide() to compute the scagnostics for every pairwise combination of variables.

exampledata_wide <- datasaurus_dozen_wide[,c(1:2,5:6,9:10,17:18)]
widescags <- calc_scags_wide(exampledata_wide, scags=c("clumpy2", "convex", "grid", "dcor"))
head(widescags, 4) %>%
  knitr::kable(digits=4, align="c")
Var1      Var2    grid    clumpy2  convex  dcor
away_y    away_x  0.0703  0.0220   0.5514  0.1326
circle_x  away_x  0.0615  0.0000   0.6573  0.3839
circle_x  away_y  0.0526  0.1167   0.8117  0.1142
circle_y  away_x  0.0992  0.0000   0.2853  0.0818
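
calc_scags_wide() expects one column per variable. If the data start in long form, like datasaurus_dozen, they can be reshaped first. The sketch below uses a small made-up data frame and tidyr::pivot_wider(); the id column and the object names are illustrative, and it assumes every group has the same number of rows.

```r
library(dplyr)
library(tidyr)

# Made-up long-form data: a grouping column plus x and y.
long <- tibble(
  dataset = rep(c("a", "b"), each = 3),
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2, 4, 6, 6, 5, 4)
)

# Number the rows within each group so they can be lined up side by side,
# then spread x and y into one column per group (a_x, a_y, b_x, b_y).
wide <- long %>%
  group_by(dataset) %>%
  mutate(id = row_number()) %>%
  ungroup() %>%
  pivot_wider(
    id_cols = id,
    names_from = dataset,
    values_from = c(x, y),
    names_glue = "{dataset}_{.value}"
  ) %>%
  select(-id)

wide
```

The resulting wide data frame contains only numeric columns, which is the shape the text above describes for calc_scags_wide().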

Using the scagnostics

If the resulting scagnostic data set is small enough, we can find interesting scatter plots by simply looking at the table. Often, however, it is too large for that. To find pairwise plots that differ from the others, we can look for outliers on combinations of the scagnostics using an interactive scatter plot matrix (SPLOM). The code to do this (but not its output) is shown below:

library(GGally)
library(plotly)

splom_data <- widescags %>%
  mutate(lab = paste0(Var1, " , ", Var2)) %>%
  select(-c(Var1, Var2))

p <- ggpairs(splom_data, columns=c(1:4), aes(label=lab)) +
  theme_minimal()
ggplotly(p) 
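
A non-interactive complement is to score how far each pair sits from the centre of the scagnostic values. The sketch below is a crude illustration and not something the package provides: it rebuilds by hand the four rows of the wide results printed earlier (the names scags_tbl, scores and outlyingness are illustrative), standardises each scagnostic, and ranks pairs by their distance from the mean. With the real results, widescags could be used in place of scags_tbl.

```r
library(dplyr)

# The four rows of wide scagnostic results shown earlier, entered by hand.
scags_tbl <- tibble(
  Var1    = c("away_y", "circle_x", "circle_x", "circle_y"),
  Var2    = c("away_x", "away_x", "away_y", "away_x"),
  grid    = c(0.0703, 0.0615, 0.0526, 0.0992),
  clumpy2 = c(0.0220, 0.0000, 0.1167, 0.0000),
  convex  = c(0.5514, 0.6573, 0.8117, 0.2853),
  dcor    = c(0.1326, 0.3839, 0.1142, 0.0818)
)

# Standardise each scagnostic, then take the Euclidean distance from the
# centre as a rough outlyingness score and sort from most to least unusual.
scores <- scags_tbl %>%
  mutate(across(c(grid, clumpy2, convex, dcor),
                ~ as.numeric(scale(.x)), .names = "z_{.col}")) %>%
  mutate(outlyingness = sqrt(z_grid^2 + z_clumpy2^2 + z_convex^2 + z_dcor^2)) %>%
  arrange(desc(outlyingness))

scores %>% select(Var1, Var2, outlyingness)
```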

There are several functions that summarise the scagnostics results. top_scags() finds the top scagnostic for each pair of variables, while top_pairs() finds the top pair of variables for each scagnostic. Their usage is similar and looks like:

top_scags(widescags) %>%
  knitr::kable(digits=4, align="c")
Var1        Var2        scag     value
away_y      away_x      convex   0.5514
circle_x    away_x      convex   0.6573
circle_x    away_y      convex   0.8117
circle_y    away_x      convex   0.2853
circle_y    away_y      convex   0.3162
circle_y    circle_x    grid     0.7195
dots_x      away_x      clumpy2  0.9296
dots_x      away_y      clumpy2  0.9192
dots_x      circle_x    clumpy2  0.9352
dots_x      circle_y    clumpy2  0.9582
dots_y      away_x      clumpy2  0.9440
dots_y      away_y      clumpy2  0.9425
dots_y      circle_x    clumpy2  0.9314
dots_y      circle_y    clumpy2  0.9320
dots_y      dots_x      clumpy2  0.9713
slant_up_x  away_x      convex   0.7874
slant_up_x  away_y      convex   0.8264
slant_up_x  circle_x    dcor     0.8140
slant_up_x  circle_y    clumpy2  0.9082
slant_up_x  dots_x      clumpy2  0.9277
slant_up_x  dots_y      clumpy2  0.9332
slant_up_y  away_x      convex   0.8422
slant_up_y  away_y      convex   0.4896
slant_up_y  circle_x    convex   0.8402
slant_up_y  circle_y    dcor     0.8975
slant_up_y  dots_x      clumpy2  0.9290
slant_up_y  dots_y      clumpy2  0.9422
slant_up_y  slant_up_x  clumpy2  0.8430
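
To make the two kinds of summary concrete, the sketch below reproduces them by hand with dplyr and tidyr on the first four rows of the wide results printed earlier. It is an illustration of the idea, not the package's implementation; the object names scags_tbl, long_tbl, top_by_pair and top_by_scag are hypothetical.

```r
library(dplyr)
library(tidyr)

# The first four rows of the wide scagnostic results, entered by hand.
scags_tbl <- tibble(
  Var1    = c("away_y", "circle_x", "circle_x", "circle_y"),
  Var2    = c("away_x", "away_x", "away_y", "away_x"),
  grid    = c(0.0703, 0.0615, 0.0526, 0.0992),
  clumpy2 = c(0.0220, 0.0000, 0.1167, 0.0000),
  convex  = c(0.5514, 0.6573, 0.8117, 0.2853),
  dcor    = c(0.1326, 0.3839, 0.1142, 0.0818)
)

# One row per (pair, scagnostic) combination.
long_tbl <- scags_tbl %>%
  pivot_longer(-c(Var1, Var2), names_to = "scag", values_to = "value")

# Top scagnostic for each pair of variables (one row per pair).
top_by_pair <- long_tbl %>%
  group_by(Var1, Var2) %>%
  slice_max(value, n = 1, with_ties = FALSE) %>%
  ungroup()

# Top pair of variables for each scagnostic (one row per scagnostic).
top_by_scag <- long_tbl %>%
  group_by(scag) %>%
  slice_max(value, n = 1, with_ties = FALSE) %>%
  ungroup()

top_by_pair
top_by_scag
```

For these four rows the per-pair summary picks convex every time, which agrees with the first four rows of the table above.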

Drawing functions

Occasionally we will get unexpected results for a scagnostic. To diagnose such a result, the package has several draw functions that plot the graph-based objects used to construct the measures: draw_alphahull(), draw_convexhull() and draw_mst(). The plot below shows the MST for the dots pair of variables in datasaurus_dozen. As would be expected, the MST is difficult to define when all points are equidistant.

drawexample <- exampledata %>%
  filter(dataset== "dots")

draw_mst(drawexample$x, drawexample$y) + theme_minimal()
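
The other two draw functions can be used in the same way. The sketch below assumes that draw_alphahull() and draw_convexhull() accept x and y vectors and return ggplot objects, as draw_mst() does above; the circle data set and the object names are chosen only for illustration.

```r
library(cassowaryr)
library(dplyr)
library(ggplot2)

# Pick one data set to inspect.
circle <- datasaurus_dozen %>%
  filter(dataset == "circle")

# Draw the alpha hull and the convex hull of the same points.
p_alpha  <- draw_alphahull(circle$x, circle$y) + theme_minimal()
p_convex <- draw_convexhull(circle$x, circle$y) + theme_minimal()

p_alpha
p_convex
```

Comparing the two hulls side by side can help explain a surprising convex value, since the plots show the objects from which the measures are constructed.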

References

Tukey, J. W. and Tukey, P. A. (1985). “Computer graphics and exploratory data analysis: An introduction”, in Proceedings of the Sixth Annual Conference and Exposition: Computer Graphics ’85, 3:773-785. National Computer Graphics Association, Fairfax, VA.

Wilkinson, L., Anand, A. and Grossman, R. (2005). “Graph-theoretic scagnostics”, IEEE Symposium on Information Visualization, 2005. INFOVIS 2005., pp. 157-164, doi: 10.1109/INFVIS.2005.1532142.