| Title: | Compute Scagnostics on Pairs of Numeric Variables in a Data Set |
|---|---|
| Description: | Computes a range of scatterplot diagnostics (scagnostics) on pairs of numerical variables in a data set. These include the graph- and association-based scagnostics described by Leland Wilkinson and Graham Wills (2008) <doi:10.1198/106186008X320465> and the association-based scagnostics described by Katrin Grimm (2016, ISBN:978-3-8439-3092-5). Summary and plotting functions are provided. |
| Authors: | Harriet Mason [aut, cre] (ORCID: <https://orcid.org/0009-0007-4568-8215>), Stuart Lee [aut] (ORCID: <https://orcid.org/0000-0003-1179-8436>), Ursula Laa [aut] (ORCID: <https://orcid.org/0000-0002-0249-6439>), Dianne Cook [aut] (ORCID: <https://orcid.org/0000-0002-3813-7155>), Tina Rashid Jafari [aut] (ORCID: <https://orcid.org/0009-0008-3605-5341>) |
| Maintainer: | Harriet Mason <[email protected]> |
| License: | GPL-3 |
| Version: | 2.0.21 |
| Built: | 2026-06-02 18:47:24 UTC |
| Source: | https://github.com/numbats/cassowaryr |
All variables and pairs of variables have the same summary statistics but are very different data, as can be seen by visualisation.
A tibble with 44 observations and 3 variables
label of the data set, each set has 11 observations
variable for horizontal axis
variable for vertical axis
This function calculates a large number of scagnostics quickly and efficiently. The individual scagnostic calculation functions (sc_) are good for looking at a single scagnostic, but they are inefficient when computing more than one. This is because the sc_ functions recompute the graph object for each plot and scagnostic pair, even if the graph object is unchanged. Typing the functions over and over again also quickly becomes tedious. For this reason, the calc_scags function reuses the same graph object for all scagnostics.
calc_scags(
  x,
  y,
  scags = c("outlying", "stringy", "striated", "clumpy", "sparse", "skewed", "convex", "skinny", "monotonic"),
  out.rm = TRUE,
  binner = "hex",
  alpha = "rahman"
)

| x, y | numeric vectors |
|---|---|
| scags | collection of strings matching names of scagnostics to calculate: outlying, stringy, striated, grid, striped, clumpy, clumpy2, sparse, skewed, convex, skinny, monotonic, splines, dcor |
| out.rm | logical; if TRUE, iteratively trim large MST edges. If FALSE, the scagnostics will be computed on the entire data set with no outlier removal. |
| binner | an optional function that bins the x and y vectors prior to triangulation. Can be: |
| alpha | character, numeric, or function. Controls the alpha radius. Valid character values are: |
A data frame with all selected scagnostic values for a particular x, y pair.
calc_scags_wide
# Calculate selected scagnostics on a single pair
calc_scags(anscombe$x1, anscombe$y1, scags=c("monotonic", "outlying"))
# Compute on long form data, or subsets
# defined by a categorical variable
require(dplyr)
datasaurus_dozen |>
  group_by(dataset) |>
  summarise(calc_scags(x,y, scags=c("monotonic", "outlying", "convex")))
It is quite common to have data in a wide format, which is not suitable for the calc_scags function because it needs a long format. To save users time and energy, we also provide a wide version of the calc_scags function. This function computes all selected scagnostics for every pair of variables in the data frame.
calc_scags_wide(
  all_data,
  scags = c("outlying", "stringy", "striated", "clumpy", "sparse", "skewed", "convex", "skinny", "monotonic"),
  out.rm = TRUE,
  binner = "hex",
  alpha = "rahman"
)

| all_data | tibble of wide multivariate data |
|---|---|
| scags | collection of strings matching names of scagnostics to calculate: outlying, stringy, striated, grid, striped, clumpy, clumpy2, sparse, skewed, convex, skinny, monotonic, splines, dcor |
| out.rm | logical; if TRUE, iteratively trim large MST edges. If FALSE, the scagnostics will be computed on the entire data set with no outlier removal. |
| binner | an optional function that bins the x and y vectors prior to triangulation. Can be: |
| alpha | character, numeric, or function. Controls the alpha radius. Valid character values are: |
A data frame that gives the scagnostic scores for every possible pair of variables.
calc_scags
# Calculate selected scagnostics
data(pk)
calc_scags_wide(pk[,2:5], scags=c("outlying","monotonic"))
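The pairwise enumeration that a wide-format helper has to perform can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: `pairwise_scores` and `abs_spearman` are hypothetical names, and absolute Spearman correlation stands in for a real scagnostic.

```r
# Hypothetical helper (not part of cassowaryr): apply a scoring function
# to every unordered pair of numeric columns in a wide data frame.
pairwise_scores <- function(all_data, score_fun) {
  vars <- names(all_data)[vapply(all_data, is.numeric, logical(1))]
  pairs <- utils::combn(vars, 2)   # 2 x choose(p, 2) matrix of column names
  data.frame(
    Var1 = pairs[1, ],
    Var2 = pairs[2, ],
    score = apply(pairs, 2, function(p) {
      score_fun(all_data[[p[1]]], all_data[[p[2]]])
    })
  )
}

# Stand-in score (assumption): absolute Spearman correlation.
abs_spearman <- function(x, y) abs(stats::cor(x, y, method = "spearman"))

# Four x columns of the built-in anscombe data give choose(4, 2) = 6 pairs.
res <- pairwise_scores(datasets::anscombe[, 1:4], abs_spearman)
print(res)
```

With p numeric columns the result has p(p-1)/2 rows, which is why the wide form grows quickly as columns are added.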
From the datasauRus package. A modern update of Anscombe. All plots have the same x and y mean, variance and correlation, but look different visually.
All variables and pairs of variables have the same summary statistics but are very different data, as can be seen by visualisation.
A tibble with 1,846 observations and 3 variables
label of data set
variable for horizontal axis
variable for vertical axis
A tibble with 142 observations and 26 variables
x and y variables for away data
x and y variables for bullseye data
x and y variables for circle data
x and y variables for dino data
x and y variables for dots data
x and y variables for h_lines data
x and y variables for high_lines data
x and y variables for slant_down data
x and y variables for slant_up data
x and y variables for star data
x and y variables for v_lines data
x and y variables for wide_lines data
x and y variables for x_shape data
Identifies which observations are kept or removed by the outlier removal step for a chosen pair of numeric variables.
diagnose_outliers(data, x, y)

| data | A data frame or tibble. |
|---|---|
| x, y | Two numeric columns to use for the outlier removal diagnostic. |
A data frame with the same rows and columns as data, plus an outlier_status column. Values are "Kept" or "Removed".
data <- data.frame(
  id = 1:10,
  var1 = c(1, 2, 3, 4, 5, 6, 7, 20, 21, 22),
  var2 = c(1, 2, 3, 4, 5, 6, 7, 20, -20, 22)
)
diagnose_outliers(data, var1, var2)
These functions draw the graph objects that are used to compute the scagnostics. They are useful for debugging and for seeing the impact of parameter adjustments such as alpha, binning, or outlier removal. You can draw the MST, convex hull, and alpha hull with the respective draw_* function.
draw_alphahull(x, y, out.rm = TRUE, binner = "hex", alpha = "rahman", fill = FALSE)
draw_mst(x, y, out.rm = TRUE, binner = "hex")
draw_convexhull(x, y, out.rm = TRUE, binner = "hex", fill = FALSE)

| x, y | numeric vectors |
|---|---|
| out.rm | logical; if TRUE, iteratively trim large MST edges. If FALSE, the scagnostics will be computed on the entire data set with no outlier removal. |
| binner | an optional function that bins the x and y vectors prior to triangulation. Can be: |
| alpha | character, numeric, or function. Controls the alpha radius. Valid character values are: |
| fill | set to TRUE if you want the polygon filled |
A ggplot object that shows the respective graph object
require(dplyr)
require(ggplot2)
require(alphahull)
cl <- features |>
  filter(feature == "clusters")
# draw the alpha hull
draw_alphahull(cl$x, cl$y)
# draw the MST
draw_mst(cl$x, cl$y)
# draw the convex hull
draw_convexhull(cl$x, cl$y)
# You can utilise these functions to see the impact of parameter changes
draw_alphahull(cl$x, cl$y, alpha = "omega")
Simulated data with common features that might be seen in 2D data. Variables are feature, x, y.
A tibble with 1,013 observations and 3 variables, and 15 different patterns
label of data set
variable for horizontal axis
variable for vertical axis
There are 7 variables (x1-x7) and 2,100 observations. Variables 4 and 7 have the numbat. The rest are noise. Group A has the numbat, and group B is all noise.
Biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD.
A tibble with 195 observations and 24 variables
name
ASCII subject name and recording number
MDVP:Fo(Hz)
Average vocal fundamental frequency
MDVP:Fhi(Hz)
Maximum vocal fundamental frequency
MDVP:Flo(Hz)
Minimum vocal fundamental frequency
MDVP:Jitter,MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP
Several measures of variation in fundamental frequency
MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA
Several measures of variation in amplitude
NHR,HNR
Two measures of ratio of noise to tonal components in the voice
status
Health status of the subject: (one) Parkinson's, (zero) healthy
RPDE,D2
Two nonlinear dynamical complexity measures
DFA
Signal fractal scaling exponent
spread1,spread2,PPE
Three nonlinear measures of fundamental frequency variation
The data is available at The UCI Machine Learning Repository in ASCII CSV format. Each row of the CSV file contains an instance corresponding to one voice recording. There are around six recordings per patient; the name of the patient is given in the first column.
The data were originally analysed in: Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering.
This measure is used to detect clustering and is calculated through an iterative process. It was first defined in Graph Theoretic Scagnostics, Wilkinson, et al. (2005). First an edge J is selected and removed from the MST. From the two spanning trees that are created by this break, we select the largest edge from the smaller tree (K). The length of this edge (K) is compared to the removed edge (J) giving a clumpy measure for this edge. This process is repeated for every edge in the MST and the final clumpy measure is the maximum of this value over all edges.
sc_clumpy(x, y, out.rm = TRUE, binner = "hex")

| x, y | numeric vectors |
|---|---|
| out.rm | logical; if TRUE, iteratively trim large MST edges. If FALSE, the scagnostics will be computed on the entire data set with no outlier removal. |
| binner | an optional function that bins the x and y vectors prior to triangulation. Can be: |
A numeric object that gives the plot's clumpy score.
require(ggplot2)
require(dplyr)
# plot the feature
ggplot(features, aes(x=x, y=y)) +
  geom_point() +
  facet_wrap(~feature, ncol = 5, scales = "free")
# calculate using tidy code
features |>
  group_by(feature) |>
  summarise(clumpy = sc_clumpy(x,y))
# using two vectors
x <- datasaurus_dozen_wide$slant_up_x
y <- datasaurus_dozen_wide$slant_up_y
# plot it
ggplot() +
  geom_point(aes(x = x, y = y))
# calculate using vectors
sc_clumpy(x, y)
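The edge-removal procedure described for clumpy can be sketched in base R. This is a minimal illustration, not the package's code: it skips binning, outlier removal and any rescaling, the helper names (`mst_edges`, `component_of`, `clumpy_sketch`) are hypothetical, and the per-edge score 1 - length(K) / length(J) is assumed from the cited Wilkinson et al. definition. Scoring an edge whose smaller side is a single point as 0 is also an assumption.

```r
# Prim's algorithm on the full distance matrix (fine for small n).
# Returns one row per MST edge: from, to, length.
mst_edges <- function(x, y) {
  d <- as.matrix(stats::dist(cbind(x, y)))
  n <- nrow(d)
  in_tree <- c(TRUE, rep(FALSE, n - 1))
  best_dist <- d[1, ]      # cheapest known link from the tree to each vertex
  best_from <- rep(1, n)   # tree vertex that provides that link
  edges <- matrix(NA_real_, n - 1, 3,
                  dimnames = list(NULL, c("from", "to", "length")))
  for (k in seq_len(n - 1)) {
    cand <- which(!in_tree)
    j <- cand[which.min(best_dist[cand])]
    edges[k, ] <- c(best_from[j], j, best_dist[j])
    in_tree[j] <- TRUE
    closer <- !in_tree & d[j, ] < best_dist
    best_dist[closer] <- d[j, closer]
    best_from[closer] <- j
  }
  edges
}

# Logical vector marking the vertices reachable from `start` via `edges`.
component_of <- function(start, edges, n) {
  seen <- rep(FALSE, n)
  seen[start] <- TRUE
  repeat {
    grow <- seen[edges[, "from"]] | seen[edges[, "to"]]
    newly <- unique(c(edges[grow, "from"], edges[grow, "to"]))
    if (all(seen[newly])) break
    seen[newly] <- TRUE
  }
  seen
}

clumpy_sketch <- function(x, y) {
  edges <- mst_edges(x, y)
  n <- length(x)
  per_edge <- vapply(seq_len(nrow(edges)), function(j) {
    rest <- edges[-j, , drop = FALSE]            # remove edge J
    side <- component_of(edges[j, "from"], rest, n)
    if (sum(side) > n / 2) side <- !side         # keep the smaller tree
    inside <- side[rest[, "from"]] & side[rest[, "to"]]
    if (!any(inside)) return(0)                  # assumption: lone point scores 0
    1 - max(rest[inside, "length"]) / edges[j, "length"]
  }, numeric(1))
  max(per_edge)
}

# Two tight unit squares, nine units apart.
x <- c(0, 1, 0, 1, 10, 11, 10, 11)
y <- c(0, 0, 1, 1, 0, 0, 1, 1)
clumpy_value <- clumpy_sketch(x, y)
print(clumpy_value)
```

In this toy example the maximum comes from the long edge bridging the two squares, since every edge inside a square is short relative to it.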
A variant of clumpy intended to make the measure more robust to changes in binning. The scagnostic is defined in Improving the Robustness of Scagnostics, Wang, et al. (2020).
sc_clumpy_r(x, y, out.rm = TRUE, binner = "hex")

| x, y | numeric vectors |
|---|---|
| out.rm | logical; if TRUE, iteratively trim large MST edges. If FALSE, the scagnostics will be computed on the entire data set with no outlier removal. |
| binner | an optional function that bins the x and y vectors prior to triangulation. Can be: |
A numeric object that gives the plot's robust clumpy score.
require(ggplot2)
require(dplyr)
ggplot(features, aes(x=x, y=y)) +
  geom_point() +
  facet_wrap(~feature, ncol = 5, scales = "free")
features |>
  group_by(feature) |>
  summarise(clumpy = sc_clumpy_r(x,y))
sc_clumpy_r(datasaurus_dozen_wide$away_x, datasaurus_dozen_wide$away_y)
This measure is defined in the cassowaryr paper by Mason, et al. (2025). It is an alternative measure for clumpiness. It is the ratio of the between cluster edges and the within cluster edges. It is a good alternative measure to clumpy when binning is removed as a pre-processing step.
sc_clumpy2(x, y, out.rm = TRUE, binner = "hex")

| x, y | numeric vectors |
|---|---|
| out.rm | logical; if TRUE, iteratively trim large MST edges. If FALSE, the scagnostics will be computed on the entire data set with no outlier removal. |
| binner | an optional function that bins the x and y vectors prior to triangulation. Can be: |
A numeric object that gives the plot's adjusted clumpy score.
require(ggplot2)
require(dplyr)
# plot features
ggplot(features, aes(x=x, y=y)) +
  geom_point() +
  facet_wrap(~feature, ncol = 5, scales = "free")
# calculate clumpy2 on all features
features |>
  group_by(feature) |>
  summarise(clumpy2 = sc_clumpy2(x,y))
sc_clumpy2(datasaurus_dozen_wide$dots_x, datasaurus_dozen_wide$dots_y)
data <- features |>
  filter(feature == "clusters")
x <- data$x
y <- data$y
# calculate using vectors
sc_clumpy2(x, y)
A measure of how convex the shape of the data is. It was first defined in Graph Theoretic Scagnostics, Wilkinson, et al. (2005). It is computed as the ratio between the area of the alpha hull and the area of the convex hull. Unlike the other scagnostic measures, a high value on convex does not indicate an interesting scatter plot; rather, it usually indicates a lack of relationship between the two variables.
sc_convex(x, y, alpha = "rahman", out.rm = TRUE, binner = "hex")

| x, y | numeric vectors |
|---|---|
| alpha | character, numeric, or function. Controls the alpha radius. Valid character values are: |
| out.rm | logical; if TRUE, iteratively trim large MST edges. If FALSE, the scagnostics will be computed on the entire data set with no outlier removal. |
| binner | an optional function that bins the x and y vectors prior to triangulation. Can be: |
A numeric object that gives the plot's convex score.
require(ggplot2)
require(dplyr)
ggplot(features, aes(x=x, y=y)) +
  geom_point() +
  facet_wrap(~feature, ncol = 5, scales = "free")
features |>
  group_by(feature) |>
  summarise(convex = sc_convex(x,y))
sc_convex(datasaurus_dozen_wide$away_x, datasaurus_dozen_wide$away_y)
The distance correlation between X and Y defined by Székely, et al. in Measuring and testing dependence by correlation of distances. The measure was suggested as an association scagnostic in Katrin Grimm's PhD thesis (2016). Distance correlation is a measure of non-linear dependence which is 0 if and only if the two variables are independent. It is computed using an ANOVA like calculation on the pairwise distances between observations.
sc_dcor(x, y)

| x | numeric vector |
|---|---|
| y | numeric vector |
A "numeric" object that gives the plot's dcor score.
require(ggplot2)
require(tidyr)
require(dplyr)
data(anscombe)
anscombe_tidy <- anscombe |>
  pivot_longer(cols = everything(),
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)")
ggplot(anscombe_tidy, aes(x=x, y=y)) +
  geom_point() +
  facet_wrap(~set, ncol=2, scales = "free")
sc_dcor(anscombe$x1, anscombe$y1)
sc_dcor(anscombe$x2, anscombe$y2)
sc_dcor(anscombe$x3, anscombe$y3)
sc_dcor(anscombe$x4, anscombe$y4)
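The calculation on pairwise distances can be written out in a few lines of base R. This is a hedged sketch of the sample distance correlation as defined by Székely et al. (double-centred distance matrices), not necessarily how the package computes it; `dcor_sketch` is a hypothetical name.

```r
# Sample distance correlation from double-centred pairwise distance matrices.
dcor_sketch <- function(x, y) {
  double_centre <- function(v) {
    d <- as.matrix(stats::dist(v))
    # subtract row and column means, add back the grand mean
    d - outer(rowMeans(d), colMeans(d), "+") + mean(d)
  }
  A <- double_centre(x)
  B <- double_centre(y)
  dcov2 <- max(mean(A * B), 0)   # squared distance covariance (guard rounding)
  dvar_x <- mean(A * A)          # squared distance variances
  dvar_y <- mean(B * B)
  if (dvar_x * dvar_y == 0) return(0)
  sqrt(dcov2 / sqrt(dvar_x * dvar_y))
}

x <- -5:5
linear_dcor <- dcor_sketch(x, 2 * x + 1)   # exact linear relationship
parabola_dcor <- dcor_sketch(x, x^2)       # non-linear dependence
parabola_pearson <- stats::cor(x, x^2)     # Pearson misses the parabola
print(c(linear_dcor, parabola_dcor, parabola_pearson))
```

The parabola illustrates why the measure is useful as an association scagnostic: the Pearson correlation is essentially zero while the distance correlation is clearly positive.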
The grid scagnostic as defined in Adam Rahman's PhD thesis (2018). The scagnostic identifies grid-like structures by counting the number of 90 and 180 degree angles in the MST. This measure can be used as an effective alternative to striated when computing scagnostics without binning.
sc_grid(x, y, epsilon = 0.01, out.rm = TRUE, binner = "hex")
sc_striated2(x, y, epsilon = 0.01, out.rm = TRUE, binner = "hex")

| x, y | numeric vectors |
|---|---|
| epsilon | the error tolerance allowed when deciding if the MST angles are at a right angle or not |
| out.rm | logical; if TRUE, iteratively trim large MST edges. If FALSE, the scagnostics will be computed on the entire data set with no outlier removal. |
| binner | an optional function that bins the x and y vectors prior to triangulation. Can be: |
A numeric object that gives the plot's grid score.
require(ggplot2)
require(dplyr)
ggplot(features, aes(x=x, y=y)) +
  geom_point() +
  facet_wrap(~feature, ncol = 5, scales = "free")
features |>
  group_by(feature) |>
  summarise(grid1 = sc_grid(x,y), grid2 = sc_grid(x,y, epsilon=0.05))
sc_striated2(datasaurus_dozen_wide$away_x, datasaurus_dozen_wide$away_y)
The measure of Spearman correlation calculated using the stats package cor function with method='spearman'.
sc_monotonic(x, y)

| x | numeric vector |
|---|---|
| y | numeric vector |
A "numeric" object that gives the plot's monotonic score.
stats::cor
require(ggplot2)
require(tidyr)
require(dplyr)
data(anscombe)
anscombe_tidy <- anscombe |>
  pivot_longer(cols = everything(),
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)")
ggplot(anscombe_tidy, aes(x=x, y=y)) +
  geom_point() +
  facet_wrap(~set, ncol=2, scales = "free")
sc_monotonic(anscombe$x1, anscombe$y1)
sc_monotonic(anscombe$x2, anscombe$y2)
sc_monotonic(anscombe$x3, anscombe$y3)
sc_monotonic(anscombe$x4, anscombe$y4)
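For readers unfamiliar with the underlying call, the short base R sketch below shows what `stats::cor` with `method = "spearman"` computes: the Pearson correlation of the ranks. It only illustrates the raw Spearman coefficient; any further transformation the package may apply to the result is not shown.

```r
# Curved but largely increasing pair from the built-in anscombe data.
x <- datasets::anscombe$x2
y <- datasets::anscombe$y2

spearman <- stats::cor(x, y, method = "spearman")
pearson_on_ranks <- stats::cor(rank(x), rank(y))   # same quantity, by hand
pearson <- stats::cor(x, y)                        # linear association only

print(c(spearman = spearman,
        pearson_on_ranks = pearson_on_ranks,
        pearson = pearson))
```

Because only the ranks matter, any strictly increasing transformation of x or y leaves the Spearman value unchanged.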
A measure of proportion and severity of outliers in the dataset. It was first defined in Graph Theoretic Scagnostics, Wilkinson, et al. (2005). It is calculated by comparing the edge lengths of the outlying points in the MST with the total length of all the edges in the MST.
sc_outlying(x, y, binner = "hex")

| x, y | numeric vectors |
|---|---|
| binner | an optional function that bins the x and y vectors prior to triangulation. Can be: |
A numeric object that gives the plot's outlying score.
require(ggplot2)
require(dplyr)
# plot the feature
ggplot(features, aes(x=x, y=y)) +
  geom_point() +
  facet_wrap(~feature, ncol = 5, scales = "free")
# calculate using tidy code
features |>
  group_by(feature) |>
  summarise(outlying = sc_outlying(x,y))
# using two vectors
x <- datasaurus_dozen_wide$away_x
y <- datasaurus_dozen_wide$away_y
# plot it
ggplot() +
  geom_point(aes(x = x, y = y))
# calculate scag
sc_outlying(x, y)
A measure of skewness in the edge lengths of the MST (not in the distribution of the data). It was first defined in Graph Theoretic Scagnostics, Wilkinson, et al. (2005). It is the ratio of the 50th-to-90th percentile range to the 10th-to-90th percentile range of the MST edge lengths.
sc_skewed(x, y, out.rm = TRUE, binner = "hex")

| x, y | numeric vectors |
|---|---|
| out.rm | logical; if TRUE, iteratively trim large MST edges. If FALSE, the scagnostics will be computed on the entire data set with no outlier removal. |
| binner | an optional function that bins the x and y vectors prior to triangulation. Can be: |
A numeric object that gives the plot's skewed score.
require(ggplot2)
require(dplyr)
# plot the feature
ggplot(features, aes(x=x, y=y)) +
  geom_point() +
  facet_wrap(~feature, ncol = 5, scales = "free")
# calculate using tidy code
features |>
  group_by(feature) |>
  summarise(skewed = sc_skewed(x,y))
# using two vectors
x <- datasaurus_dozen_wide$away_x
y <- datasaurus_dozen_wide$away_y
# plot it
ggplot() +
  geom_point(aes(x = x, y = y))
# calculate using vectors
sc_skewed(x, y)
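The percentile ratio behind skewed is easy to check by hand. The sketch below applies it to a plain vector of edge lengths; `skewed_sketch` is a hypothetical helper, the input vectors are made up for illustration, and any sample-size adjustment or preprocessing the package applies is omitted.

```r
# (q90 - q50) / (q90 - q10) of a vector of MST edge lengths.
skewed_sketch <- function(edge_lengths) {
  q <- stats::quantile(edge_lengths, probs = c(0.1, 0.5, 0.9), names = FALSE)
  (q[3] - q[2]) / (q[3] - q[1])
}

even_lengths <- 1:11                   # evenly spread edge lengths
long_tail <- c(rep(1, 9), 5, 20)       # mostly short edges, a few long ones

skew_even <- skewed_sketch(even_lengths)
skew_tail <- skewed_sketch(long_tail)
print(c(skew_even, skew_tail))
```

Evenly spread lengths put the median halfway between the 10th and 90th percentiles, while a long right tail pushes the ratio towards 1.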
A measure of how “thin” the shape of the data is. It was first defined in Graph Theoretic Scagnostics, Wilkinson, et al. (2005). It is calculated as the ratio between the area and perimeter of the alpha hull, normalised such that 0 corresponds to a perfect circle and values close to 1 indicate a skinny polygon.
sc_skinny(x, y, alpha = "rahman", out.rm = TRUE, binner = "hex")

| x, y | numeric vectors |
|---|---|
| alpha | character, numeric, or function. Controls the alpha radius. Valid character values are: |
| out.rm | logical; if TRUE, iteratively trim large MST edges. If FALSE, the scagnostics will be computed on the entire data set with no outlier removal. |
| binner | an optional function that bins the x and y vectors prior to triangulation. Can be: |
A numeric object that gives the plot's skinny score.
require(ggplot2)
require(dplyr)
ggplot(features, aes(x=x, y=y)) +
  geom_point() +
  facet_wrap(~feature, ncol = 5, scales = "free")
features |>
  group_by(feature) |>
  summarise(skinny = sc_skinny(x,y))
sc_skinny(datasaurus_dozen_wide$away_x, datasaurus_dozen_wide$away_y)
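The normalisation can be illustrated on simple polygons. The sketch below assumes the form 1 - sqrt(4 * pi * area) / perimeter from the cited Wilkinson et al. definition and applies it to explicit polygons standing in for an alpha hull; the helper names are hypothetical and no alpha hull is computed here.

```r
# Shoelace area of a polygon whose vertices are given in order.
polygon_area <- function(px, py) {
  nxt <- c(2:length(px), 1)
  abs(sum(px * py[nxt] - px[nxt] * py)) / 2
}

# Sum of the side lengths, closing the polygon back to the first vertex.
polygon_perimeter <- function(px, py) {
  nxt <- c(2:length(px), 1)
  sum(sqrt((px[nxt] - px)^2 + (py[nxt] - py)^2))
}

# Assumed normalisation: 0 for a circle, approaching 1 for thin shapes.
skinny_sketch <- function(px, py) {
  1 - sqrt(4 * pi * polygon_area(px, py)) / polygon_perimeter(px, py)
}

theta <- seq(0, 2 * pi, length.out = 201)[-201]   # regular 200-gon
near_circle <- skinny_sketch(cos(theta), sin(theta))
thin_rectangle <- skinny_sketch(c(0, 100, 100, 0), c(0, 0, 1, 1))
print(c(near_circle, thin_rectangle))
```

For a circle of radius r, sqrt(4 * pi * pi * r^2) equals the circumference 2 * pi * r, which is why the score is 0 there.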
Identifies if the data is confined to a small number of locations on the plane. It was first defined in Scagnostics Distributions by Wilkinson & Wills (2008). It is calculated as the 90th percentile of MST edge lengths.
sc_sparse(x, y, out.rm = TRUE, binner = "hex")

| x, y | numeric vectors |
|---|---|
| out.rm | logical; if TRUE, iteratively trim large MST edges. If FALSE, the scagnostics will be computed on the entire data set with no outlier removal. |
| binner | an optional function that bins the x and y vectors prior to triangulation. Can be: |
A numeric object that gives the plot's sparse score.
require(ggplot2)
require(dplyr)
# plot the feature
ggplot(features, aes(x=x, y=y)) +
  geom_point() +
  facet_wrap(~feature, ncol = 5, scales = "free")
# calculate using tidy code
features |>
  group_by(feature) |>
  summarise(sparse = sc_sparse(x,y))
# using two vectors
x <- datasaurus_dozen_wide$dots_x
y <- datasaurus_dozen_wide$dots_y
# plot it
ggplot() +
  geom_point(aes(x = x, y = y))
# calculate using vectors
sc_sparse(x, y)
The sparse2 measure was created for cassowaryr. It calculates the sparsity of the plot as 1-area(ahull).
sc_sparse2(x, y, out.rm = TRUE, binner = "hex")

| x, y | numeric vectors |
|---|---|
| out.rm | logical; if TRUE, iteratively trim large MST edges. If FALSE, the scagnostics will be computed on the entire data set with no outlier removal. |
| binner | an optional function that bins the x and y vectors prior to triangulation. Can be: |
A numeric object that gives the plot's adjusted sparse score.
require(ggplot2)
require(tidyr)
require(dplyr)
data(anscombe_tidy)
ggplot(anscombe_tidy, aes(x=x, y=y)) +
  geom_point() +
  facet_wrap(~set, ncol=2, scales = "free")
sc_sparse2(anscombe$x1, anscombe$y1)
Measures functional non-linear dependence by fitting a penalised splines model on X using Y, and on Y using X. The measure was defined as an association scagnostic in Katrin Grimm's PhD thesis (2016). The variances of the residuals are scaled down by the axis so they are comparable, and finally the maximum is taken. Therefore the value will be closer to 1 if either relationship can be decently explained by a splines model.
sc_splines(x, y)

| x | numeric vector |
|---|---|
| y | numeric vector |
A "numeric" object that gives the plot's splines score.
require(ggplot2)
require(tidyr)
require(dplyr)
data(anscombe)
anscombe_tidy <- anscombe |>
  pivot_longer(cols = everything(),
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)")
ggplot(anscombe_tidy, aes(x=x, y=y)) +
  geom_point() +
  facet_wrap(~set, ncol=2, scales = "free")
sc_splines(anscombe$x1, anscombe$y1)
sc_splines(anscombe$x2, anscombe$y2)
sc_splines(anscombe$x3, anscombe$y3)
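The "fit both directions, scale the residual variance, take the maximum" idea can be sketched with base R. This is an illustration only: `stats::smooth.spline` is swapped in as a stand-in smoother for the penalised splines model, `splines_sketch` is a hypothetical name, scaling by the variance of the response and clipping at 0 are assumptions, and the simulated data are made up.

```r
splines_sketch <- function(x, y) {
  # Share of the response variance explained by a smoothing spline.
  explained <- function(pred, resp) {
    fit <- stats::smooth.spline(pred, resp)       # stand-in smoother (assumption)
    res <- resp - stats::predict(fit, pred)$y
    1 - stats::var(res) / stats::var(resp)        # residual variance, scaled
  }
  # Fit y on x and x on y, keep the better direction, clip at 0.
  max(0, explained(x, y), explained(y, x))
}

# A parabola with a small deterministic wiggle: y is a function of x,
# but x is not a function of y.
x <- seq(-3, 3, length.out = 50)
y <- x^2 + 0.1 * sin(seq_along(x))
spline_score <- splines_sketch(x, y)
print(spline_score)
```

Taking the maximum means a relationship counts as functional if it can be smoothed well in either direction.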
This measure identifies features such as discreteness by finding parallel lines. It was first defined in Graph Theoretic Scagnostics, Wilkinson, et al. (2005). It is calculated by counting the proportion of vertices with only two edges that have an inner angle approximately between 135 and 220 degrees.
sc_striated(x, y, out.rm = TRUE, binner = "hex")

| x, y | numeric vectors |
|---|---|
| out.rm | logical; if TRUE, iteratively trim large MST edges. If FALSE, the scagnostics will be computed on the entire data set with no outlier removal. |
| binner | an optional function that bins the x and y vectors prior to triangulation. Can be: |
A numeric object that gives the plot's striated score.
require(ggplot2)
require(dplyr)
# plot the features data
ggplot(features, aes(x=x, y=y)) +
  geom_point() +
  facet_wrap(~feature, ncol = 5, scales = "free")
# calculate using tidy code
features |>
  group_by(feature) |>
  summarise(striated = sc_striated(x,y)) |>
  arrange(striated)
# using just vectors of points
x <- datasaurus_dozen_wide$v_lines_x
y <- datasaurus_dozen_wide$v_lines_y
# plot it
ggplot() +
  geom_point(aes(x = x, y = y))
# calculate scagnostic
sc_striated(x, y)
Computes the stringy measure as defined in Graph-Theoretic Scagnostics (Wilkinson et al., 2005). It is the length of the longest shortest path through the MST divided by the sum of all edge lengths in the MST.
sc_stringy05(x, y, out.rm = TRUE, binner = "hex")
sc_stringy2(x, y, out.rm = TRUE, binner = "hex")

| x, y | numeric vectors |
|---|---|
| out.rm | logical; if TRUE, iteratively trim large MST edges. If FALSE, the scagnostics will be computed on the entire data set with no outlier removal. |
| binner | an optional function that bins the x and y vectors prior to triangulation. Can be: |
A numeric value giving the stringy05 score.
x <- datasaurus_dozen_wide$star_x
y <- datasaurus_dozen_wide$star_y
sc_stringy05(x, y)
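The ratio of the longest shortest path to the total MST length can be sketched in base R on raw points. This is an illustration only, with no binning or outlier removal; `mst_edges` and `stringy05_sketch` are hypothetical names, and the all-pairs path lengths are found with a simple Floyd-Warshall pass, which is adequate for small examples.

```r
# Prim's algorithm on the full distance matrix (fine for small n).
# Returns one row per MST edge: from, to, length.
mst_edges <- function(x, y) {
  d <- as.matrix(stats::dist(cbind(x, y)))
  n <- nrow(d)
  in_tree <- c(TRUE, rep(FALSE, n - 1))
  best_dist <- d[1, ]      # cheapest known link from the tree to each vertex
  best_from <- rep(1, n)   # tree vertex that provides that link
  edges <- matrix(NA_real_, n - 1, 3,
                  dimnames = list(NULL, c("from", "to", "length")))
  for (k in seq_len(n - 1)) {
    cand <- which(!in_tree)
    j <- cand[which.min(best_dist[cand])]
    edges[k, ] <- c(best_from[j], j, best_dist[j])
    in_tree[j] <- TRUE
    closer <- !in_tree & d[j, ] < best_dist
    best_dist[closer] <- d[j, closer]
    best_from[closer] <- j
  }
  edges
}

stringy05_sketch <- function(x, y) {
  edges <- mst_edges(x, y)
  n <- length(x)
  path <- matrix(Inf, n, n)          # path lengths along the tree
  diag(path) <- 0
  path[edges[, c("from", "to"), drop = FALSE]] <- edges[, "length"]
  path[edges[, c("to", "from"), drop = FALSE]] <- edges[, "length"]
  for (k in seq_len(n)) {            # Floyd-Warshall relaxation
    path <- pmin(path, outer(path[, k], path[k, ], "+"))
  }
  max(path) / sum(edges[, "length"])
}

# Points on a line: the longest path uses every edge.
line_score <- stringy05_sketch(c(0, 1, 3, 6), c(0, 0, 0, 0))
# A plus-shaped cross: the longest path uses only two of the four arms.
cross_score <- stringy05_sketch(c(0, 1, -1, 0, 0), c(0, 0, 0, 1, -1))
print(c(line_score, cross_score))
```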
This measure identifies a “stringy” shape with no branches, such as a thin line of data. The stringy06 function is defined in High-Dimensional Visual Analytics: Interactive Exploration Guided by Pairwise Views of Point Distributions (Wilkinson et al., 2006). It is calculated using the minimum spanning tree (MST) by comparing the number of vertices with degree two to the total number of vertices, dropping those of degree one.
sc_stringy06(x, y, out.rm = TRUE, binner = "hex") sc_stringy(x, y, out.rm = TRUE, binner = "hex")sc_stringy06(x, y, out.rm = TRUE, binner = "hex") sc_stringy(x, y, out.rm = TRUE, binner = "hex")
| x, y | numeric vectors |
|---|---|
| out.rm | logical; if TRUE, iteratively trim large MST edges. If FALSE, the scagnostics will be computed on the entire data set with no outlier removal. |
| binner | an optional function that bins the x and y vectors prior to triangulation. Can be: |
The name "stringy06" is used to distinguish this version from the earlier definition of the stringy measure.
A numeric object that gives the plot's stringy score.
require(ggplot2)
require(dplyr)
# plot the features data
ggplot(features, aes(x=x, y=y)) +
  geom_point() +
  facet_wrap(~feature, ncol = 5, scales = "free")
# calculate using tidy code
features |>
  group_by(feature) |>
  summarise(stringy = sc_stringy06(x,y))
# using just vectors of points
x <- datasaurus_dozen_wide$star_x
y <- datasaurus_dozen_wide$star_y
# plot it
ggplot() +
  geom_point(aes(x = x, y = y))
# calculate using vectors
sc_stringy06(x, y)
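The degree-based definition of stringy06 given above can be illustrated on small hand-written trees. The sketch below assumes the MST is already available as a two-column matrix of vertex indices (one row per edge); `stringy06_sketch()` is a hypothetical helper, and the formula is a direct reading of the description (degree-two vertices over all vertices except those of degree one), not a copy of the package code.

```r
# Hypothetical sketch: `edges` is a two-column matrix of vertex indices,
# `n` is the number of vertices in the tree.
stringy06_sketch <- function(edges, n) {
  # degree of each vertex = number of edge endpoints that mention it
  deg <- tabulate(c(edges[, 1], edges[, 2]), nbins = n)
  sum(deg == 2) / (n - sum(deg == 1))
}

# A path 1-2-3-4-5: every interior vertex has degree two.
path_edges <- cbind(1:4, 2:5)
path_score <- stringy06_sketch(path_edges, 5)

# A hub with four leaves: no vertex has degree two.
hub_edges <- cbind(1, 2:5)
hub_score <- stringy06_sketch(hub_edges, 5)

# A "Y" shape: hub 1 with three arms of two edges each.
y_edges <- rbind(c(1, 2), c(2, 3), c(1, 4), c(4, 5), c(1, 6), c(6, 7))
y_score <- stringy06_sketch(y_edges, 7)
```

With this reading, the unbranched path scores 1, the hub scores 0, and the "Y" shape falls in between at 0.75.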
This metric computes 1 - (the ratio of the number of unique values to the total number of data values) on a number of rotations of the data, and returns the smallest value. A large value means that there are only a few unique data values, and hence that the distribution is discrete.
sc_striped(x, y)
| x | numeric vector |
|---|---|
| y | numeric vector |
double
data("datasaurus_dozen_wide")
sc_striped(datasaurus_dozen_wide$v_lines_x, datasaurus_dozen_wide$v_lines_y)
sc_striped(datasaurus_dozen_wide$dino_x, datasaurus_dozen_wide$dino_y)
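The per-rotation quantity in the description above, 1 - (unique values / total values), can be sketched in base R. Several details here are assumptions rather than facts about the package: uniqueness is judged on the rotated horizontal coordinate, values are rounded to absorb floating-point noise, and the set of angles is arbitrary. `striped_by_angle()` is a hypothetical helper that returns one value per angle and leaves the aggregation across rotations to the caller.

```r
# Score for each rotation: 1 - (number of unique values / number of values).
# Assumption: uniqueness is judged on the rotated horizontal coordinate,
# rounded to `digits` decimal places.
striped_by_angle <- function(x, y,
                             angles = c(0, pi / 6, pi / 4, pi / 3),
                             digits = 8) {
  vapply(angles, function(a) {
    rotated <- round(x * cos(a) - y * sin(a), digits)
    1 - length(unique(rotated)) / length(rotated)
  }, numeric(1))
}

# Three vertical lines of ten points each.
x <- rep(1:3, each = 10)
y <- seq(0.13, by = 0.37, length.out = 30)
scores <- striped_by_angle(x, y)

# With no rotation only 3 of the 30 horizontal values are unique,
# so the first score is 1 - 3/30.
scores[1]
```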
This function performs the pre-processing required to calculate the scagnostic measures. This includes the binning, outlier removal, and calculation of the alpha value.
scree(x, y, out.rm = TRUE, binner = "hex", alpha = "rahman", ...)
| x, y | numeric vectors |
|---|---|
| out.rm | logical; if TRUE, iteratively trim large MST edges. If FALSE, the scagnostics will be computed on the entire data set with no outlier removal. |
| binner | an optional function that bins the x and y vectors prior to triangulation. Can be: |
| alpha | character, numeric, or function. Controls the alpha radius. Valid character values are: |
| ... | other args |
An object of class "scree" that consists of three elements:
del: the Delaunay-Voronoi tessellation from alphahull::delvor()
weights: the lengths of each edge in the Delaunay triangulation
alpha: the radius or alpha value that will be used to generate the alphahull
set.seed(232)
x <- runif(1000)
y <- runif(1000)
# make scree
sc0 <- scree(x,y)
sc1 <- scree(x,y, out.rm = FALSE) # no outlier removal
sc2 <- scree(x, y, binner = NULL) # no hexagonal binning
# see the difference made by binning and out.rm
draw_mst(sc0)
draw_mst(sc1)
draw_mst(sc2)
These functions provide summary statistics that can be computed from the scagnostic values returned by calc_scags. The top_pair function finds the top pair of variables for each of the scagnostics, while top_scag finds the highest-valued scagnostic for each pair of variables. While these computations are relatively straightforward for any R user to do themselves, including these summary functions in the package both streamlines a common calculation made with the scagnostic data and suggests this summary to new users of the package.
top_pair(scags_data)
top_scag(scags_data)
| scags_data | A dataset of scagnostic values that was returned by calc_scags or calc_scags_wide |
|---|---|
A data frame. For top_pair, each row will represent a scagnostic with its highest pair. For top_scag, each row will represent a pair of variables with its highest valued scagnostic.
calc_scags calc_scags_wide
require(dplyr)
# calculate scag data
scag_data <- datasaurus_dozen |>
  group_by(dataset) |>
  summarise(calc_scags(x,y, scags=c("monotonic", "outlying", "convex")))
# Calculate top_pair
scag_data |> top_pair()
# Calculate top_scag
scag_data |> top_scag()
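To make the two summaries concrete, the following base R sketch reproduces their logic on a small hand-made table. The data frame `scags`, its column names (`Var1`, `Var2`), and the values are hypothetical and are not taken from actual calc_scags output; the code only illustrates "best pair per scagnostic" and "best scagnostic per pair" and may differ from what top_pair and top_scag return.

```r
# Hypothetical wide table: one row per variable pair, one column per scagnostic.
scags <- data.frame(
  Var1 = c("a", "a", "b"),
  Var2 = c("b", "c", "c"),
  outlying = c(0.10, 0.40, 0.20),
  convex = c(0.80, 0.30, 0.50),
  monotonic = c(0.05, 0.90, 0.60)
)
measures <- c("outlying", "convex", "monotonic")

# Top pair per scagnostic: the row with the largest value in each column.
top_pair_sketch <- do.call(rbind, lapply(measures, function(m) {
  i <- which.max(scags[[m]])
  data.frame(scag = m, Var1 = scags$Var1[i], Var2 = scags$Var2[i],
             value = scags[[m]][i])
}))

# Top scagnostic per pair: the column with the largest value in each row.
values <- as.matrix(scags[measures])
best <- max.col(values, ties.method = "first")
top_scag_sketch <- data.frame(
  Var1 = scags$Var1,
  Var2 = scags$Var2,
  scag = measures[best],
  value = values[cbind(seq_len(nrow(values)), best)]
)

top_pair_sketch
top_scag_sketch
```

In this toy table the pair (a, c) is the top pair for both outlying and monotonic, while convex is the top scagnostic only for the pair (a, b).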