Title: | Determine How Representative Two Multidimensional Data Sets are |
---|---|
Description: | Compute the values of various parameters that evaluate how similar the structures of two multidimensional data sets are, as described in: Jouan-Rimbaud, D., Massart, D. L., Saby, C. A., Puel, C. (1998), <doi:10.1016/S0169-7439(98)00005-7>. The computed parameters evaluate three properties, namely, the direction of the data sets, the variance-covariance of the data points, and the location of the data sets' centroids. The package contains the workhorse function jrparams(), as well as two helper functions MBoxtest() and JRsMahaldist(), and four example data sets. |
Authors: | Harmen Draisma [aut, cre] |
Maintainer: | Harmen Draisma <[email protected]> |
License: | GPL-3 |
Version: | 1.0.1 |
Built: | 2024-10-31 21:07:07 UTC |
Source: | https://github.com/cran/represent |
This package contains the workhorse function jrparams(), as well as two helper functions MBoxtest() and JRsMahaldist(), and four example data sets. The jrparams() function computes the values of three types of parameters that assess the representativity of two multidimensional data sets. These parameters and the example data sets are described in a publication by Jouan-Rimbaud et al. (1998).
Package: | represent |
Type: | Package |
Version: | 1.0.1 |
Date: | 2023-11-02 |
License: | GPL-3 |
Harmen Draisma
Maintainer: Harmen Draisma <[email protected]>
Jouan-Rimbaud D, Massart DL, Saby CA, Puel C: Determination of the representativity between two multidimensional data sets by a comparison of their structure. Chemometrics and Intelligent Laboratory Systems 40 (1998) 129-144.
One of two 50 x 5 data sets as mentioned on p. 135 of Jouan-Rimbaud et al (1998). See DATASET.2 for the other 50 x 5 data set.
data(DATASET.1)
The format is: num [1:50, 1:5] 19.851 7.526 2.123 0.945 0.726 ...
Variable 1: 50 values uniformly distributed between 0 and 20 + noise. Variable 2: 50 values uniformly distributed between 2 and 20. Variable 3: 50 values uniformly distributed between 10 and 20. Variable 4: 50 values uniformly distributed between 5 and 20. Variable 5: 50 values uniformly distributed between 6 and 20.
Page 135 of: Jouan-Rimbaud D, Massart DL, Saby CA, Puel C: Determination of the representativity between two multidimensional data sets by a comparison of their structure. Chemometrics and Intelligent Laboratory Systems 40 (1998) 129-144.
data(DATASET.1)
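The column structure described above can be mimicked with base R. The following is only a sketch of that structure, not a reconstruction of the packaged data: the seed, the object name `sim`, and the noise model for variable 1 (Gaussian, sd = 0.1) are assumptions, since the noise term is not specified here.

```r
# Hypothetical re-creation of the documented column structure; NOT the packaged data.
set.seed(1)                        # arbitrary seed, for reproducibility only
n <- 50                            # number of observations, as documented
lower <- c(0, 2, 10, 5, 6)         # documented lower bounds of variables 1 to 5
upper <- rep(20, 5)                # documented common upper bound

# One uniformly distributed column per (lower, upper) pair; result is a 50 x 5 matrix
sim <- mapply(function(lo, hi) runif(n, min = lo, max = hi), lower, upper)

# Assumption: small Gaussian noise on variable 1 (the source does not give the noise model)
sim[, 1] <- sim[, 1] + rnorm(n, sd = 0.1)

dim(sim)                           # rows and columns of the simulated matrix
apply(sim[, -1], 2, range)         # observed ranges of variables 2 to 5
```

Comparing `str(sim)` with `str(DATASET.1)` after loading the package is a quick way to check that the layouts match.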
One of two 50 x 5 data sets as mentioned on p. 135 of Jouan-Rimbaud et al (1998). See DATASET.1 for the other 50 x 5 data set.
data(DATASET.2)
The format is: num [1:50, 1:5] 2.72 12.05 6.5 12.27 16.03 ...
Variable 1: 50 values uniformly distributed between 0 and 20 + noise. Variable 2: 50 values uniformly distributed between 2 and 20. Variable 3: 50 values uniformly distributed between 10 and 20. Variable 4: 50 values uniformly distributed between 5 and 20. Variable 5: 50 values uniformly distributed between 6 and 20.
Page 135 of: Jouan-Rimbaud D, Massart DL, Saby CA, Puel C: Determination of the representativity between two multidimensional data sets by a comparison of their structure. Chemometrics and Intelligent Laboratory Systems 40 (1998) 129-144.
data(DATASET.2)
One of two 50 x 10 data sets as mentioned on p. 135 of Jouan-Rimbaud et al (1998). See DATASET.4 for the other 50 x 10 data set.
data(DATASET.3)
The format is: num [1:50, 1:10] 19.851 7.526 2.123 0.945 0.726 ...
Variable 1: 50 values uniformly distributed between 0 and 20 + noise. Variable 2: 50 values uniformly distributed between 2 and 20. Variable 3: 50 values uniformly distributed between 10 and 20. Variable 4: 50 values uniformly distributed between 5 and 20. Variable 5: 50 values uniformly distributed between 6 and 20. Variables 6-10: five variables with values uniformly distributed between -0.5 and +0.5
Page 135 of: Jouan-Rimbaud D, Massart DL, Saby CA, Puel C: Determination of the representativity between two multidimensional data sets by a comparison of their structure. Chemometrics and Intelligent Laboratory Systems 40 (1998) 129-144.
data(DATASET.3)
One of two 50 x 10 data sets as mentioned on p. 135 of Jouan-Rimbaud et al (1998). See DATASET.3 for the other 50 x 10 data set.
data(DATASET.4)
The format is: num [1:50, 1:10] 2.72 12.05 6.5 12.27 16.03 ...
Variable 1: 50 values uniformly distributed between 0 and 20 + noise. Variable 2: 50 values uniformly distributed between 2 and 20. Variable 3: 50 values uniformly distributed between 10 and 20. Variable 4: 50 values uniformly distributed between 5 and 20. Variable 5: 50 values uniformly distributed between 6 and 20. Variables 6-10: five variables with values uniformly distributed between -0.5 and +0.5
Page 135 of: Jouan-Rimbaud D, Massart DL, Saby CA, Puel C: Determination of the representativity between two multidimensional data sets by a comparison of their structure. Chemometrics and Intelligent Laboratory Systems 40 (1998) 129-144.
data(DATASET.4)
This function computes three types of parameters to assess the representativity of two multidimensional data sets by a comparison of their structure. Representativity is expressed as similarity of: I) principal component analysis (PCA) loadings patterns; II) variance-covariance matrix structures; III) data set centroid locations. All parameters are computed in principal component (PC) space. These parameters are described in a publication by Jouan-Rimbaud et al (1998).
jrparams(BLOCK.1,BLOCK.2,ncomp=min(c(dim(BLOCK.1),dim(BLOCK.2))),Cscrit=0.6,Rscrit=0.6)
BLOCK.1 |
First multivariate data set (a numeric matrix) |
BLOCK.2 |
Second multivariate data set (a numeric matrix), to be compared with the first |
ncomp |
The number of PCs to compute the parameter values for |
Cscrit |
The value of the "C*" parameter corresponding to the value of Box's M statistic being equal to its critical value |
Rscrit |
The value of the "R*" parameter corresponding to the Mahalanobis distance being equal to its critical value |
For argument 'ncomp', the default is the smallest of the row and column counts of the two data sets to be compared. This number should be a proxy for the minimum of the 'ranks' (i.e., the actual dimensionalities) of both data sets.
The default settings for the values of arguments 'Cscrit' and 'Rscrit' correspond to the values as recommended by Jouan-Rimbaud et al (1998) in their equations (9a) and (13a), respectively.
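To illustrate why the 'ncomp' default is only a proxy, the sketch below evaluates the default expression from the usage line on two simulated matrices and compares it with their numerical ranks. The matrices, the object names, and the choice to compute the rank on column-centred data are assumptions made for illustration; the source does not state how jrparams() preprocesses its input.

```r
set.seed(42)                                   # arbitrary seed
BLOCK.1 <- matrix(rnorm(50 * 5), nrow = 50)    # hypothetical 50 x 5 data set
BLOCK.2 <- matrix(rnorm(30 * 5), nrow = 30)    # hypothetical 30 x 5 data set

# Make the second data set rank deficient: column 5 is a sum of columns 1 and 2
BLOCK.2[, 5] <- BLOCK.2[, 1] + BLOCK.2[, 2]

# Default expression for 'ncomp', copied from the usage line
ncomp_default <- min(c(dim(BLOCK.1), dim(BLOCK.2)))

# Numerical rank of each column-centred data set (centring is an assumption)
centred_rank <- function(X) qr(scale(X, center = TRUE, scale = FALSE))$rank
ncomp_rank <- min(centred_rank(BLOCK.1), centred_rank(BLOCK.2))

c(default = ncomp_default, rank_based = ncomp_rank)
```

When the two numbers differ, as they do here, passing the smaller value explicitly as 'ncomp' seems the safer choice.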
A numeric matrix whose rows contain the computed values of the six parameters described in Jouan-Rimbaud et al (1998). The nomenclature of that publication has been adopted here. Hence, the first two rows ("P" and "P*") of the output indicate the similarity of the PCA loadings patterns of both data sets. Rows 3 and 4 ("C" and "C*", respectively) indicate the similarity of the variance-covariance matrices. Finally, rows 5 and 6 ("R" and "R*") indicate the similarity of the data set centroid locations. For all parameters, values equal to 1 indicate perfect similarity. The number of columns of the output matrix depends on the value of 'ncomp'.
Unexpected results might occur if the two data sets to be compared are of different rank and the number of principal components to retain has not been passed explicitly to jrparams() (not tested).
The function performs principal component analysis itself, so one can just input the original data sets (containing the original manifest variables). In general it is wise to compute the parameter values only for the significant principal components. Significance of principal components for both data sets to be compared can be assessed using e.g. scree plots, as available for instance in the 'psych' package.
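A scree plot can also be drawn with base R alone. The sketch below uses prcomp() and screeplot() from the 'stats' package on simulated data; the data, the object names, and the 95% cumulative-variance cut-off are assumptions for illustration, not recommendations from the source.

```r
set.seed(7)                                       # arbitrary seed
# Hypothetical data: two strong latent directions plus small noise in six variables
scores   <- matrix(rnorm(50 * 2), nrow = 50)
loadings <- matrix(rnorm(2 * 6), nrow = 2)
X <- scores %*% loadings + matrix(rnorm(50 * 6, sd = 0.05), nrow = 50)

pca <- prcomp(X, center = TRUE, scale. = FALSE)   # PCA on centred data
explained <- pca$sdev^2 / sum(pca$sdev^2)         # proportion of variance per PC
round(cumsum(explained), 3)

# Scree plot: look for the "elbow" after which the variances level off
screeplot(pca, type = "lines", main = "Scree plot (simulated data)")

# One simple heuristic (an assumption): keep PCs up to 95% cumulative variance
k <- which(cumsum(explained) >= 0.95)[1]
k
```

Repeating this for both data sets and passing the smaller of the two counts as 'ncomp' is one possible way to restrict the comparison to the significant components.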
Harmen Draisma
Jouan-Rimbaud D, Massart DL, Saby CA, Puel C: Determination of the representativity between two multidimensional data sets by a comparison of their structure. Chemometrics and Intelligent Laboratory Systems 40 (1998) 129-144.
#Load example data sets, 50 observations x 5 variables
data(DATASET.1)
data(DATASET.2)
#Assess representativity using all principal components
#(default; will be fine if both sets are of equal rank)
jrparams(DATASET.1, DATASET.2)
#Positive control: check similarity of DATASET.1 with itself
#(values for all parameters should be unity)
jrparams(DATASET.1, DATASET.1)
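The number of principal components can also be set explicitly. The following sketch compares the two 50 x 10 example data sets with 'ncomp' restricted to 5; that value is an assumption chosen for illustration (variables 6-10 of these data sets are described as having a much smaller range), not a setting taken from the source. The call is guarded so that the code does nothing if the package is not installed.

```r
res <- NULL
if (requireNamespace("represent", quietly = TRUE)) {
  # Load the two 50 x 10 example data sets shipped with the package
  data("DATASET.3", package = "represent")
  data("DATASET.4", package = "represent")

  # Assumption: retain 5 PCs instead of the default of 10
  res <- represent::jrparams(DATASET.3, DATASET.4, ncomp = 5)
  print(res)
}
```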
Computes Mahalanobis distance according to Section 2.5.1 in Jouan-Rimbaud et al (1998). The distance takes into account both the distance between the centroids of two data sets, as well as the dispersion of the data around these centroids. This function is used by the function jrparams() to assess representativity of two multidimensional data sets.
JRsMahaldist(DATA)
DATA |
Matrix containing the weighted scores on the principal components (PCs) that have been computed for the two data sets to be compared. The first column of this matrix should contain a group indicator variable, which has a value equal to 1 for the first data set and a value equal to 2 for the second data set. The remaining columns contain the weighted PC scores for the two data sets. |
A list containing one element named "Ds": a 1 x 1 matrix whose only element is the value of the Mahalanobis distance.
The Mahalanobis distance is computed using the pooled variance-covariance matrix as defined in Section 2.4 of Jouan-Rimbaud et al (1998), and hence may differ somewhat from a 'regular' Mahalanobis distance as computed using e.g. the function mahalanobis() from the 'stats' package.
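The difference between a pooled and a single-group covariance matrix can be sketched in base R. The code below is not the implementation of JRsMahaldist(): it assumes the textbook pooled estimator, weighted by the group degrees of freedom, and it works with squared distances because that is what mahalanobis() returns. The simulated scores and all object names are hypothetical; only the layout of `DATA` (group indicator in column 1, scores in the remaining columns) follows the argument description above.

```r
set.seed(3)                                     # arbitrary seed
# Hypothetical weighted PC scores for two data sets (3 components each)
g1 <- matrix(rnorm(40 * 3), nrow = 40)
g2 <- matrix(rnorm(30 * 3, mean = 0.5), nrow = 30)

# Documented layout: column 1 = group indicator (1 or 2), other columns = scores
DATA <- cbind(group = c(rep(1, 40), rep(2, 30)), rbind(g1, g2))

X1 <- DATA[DATA[, 1] == 1, -1, drop = FALSE]
X2 <- DATA[DATA[, 1] == 2, -1, drop = FALSE]
n1 <- nrow(X1)
n2 <- nrow(X2)

# Assumption: textbook pooled covariance, weighted by degrees of freedom
Sp <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)

# Squared Mahalanobis distance between the two centroids, using the pooled matrix
d <- colMeans(X1) - colMeans(X2)
D2_pooled <- drop(t(d) %*% solve(Sp) %*% d)

# Same quantity via stats::mahalanobis() when given the pooled matrix
D2_stats <- mahalanobis(colMeans(X1), center = colMeans(X2), cov = Sp)

# A 'regular' call using only the covariance of the second group generally differs
D2_single <- mahalanobis(colMeans(X1), center = colMeans(X2), cov = cov(X2))

c(pooled = D2_pooled, stats_pooled = D2_stats, single_group = D2_single)
```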
Harmen Draisma
Section 2.5.1 in: Jouan-Rimbaud D, Massart DL, Saby CA, Puel C: Determination of the representativity between two multidimensional data sets by a comparison of their structure. Chemometrics and Intelligent Laboratory Systems 40 (1998) 129-144.
jrparams(); MBoxtest()
Computes Box's M-statistic according to Section 2.4 in Jouan-Rimbaud et al (1998). This statistic is used to compare the structures of the variance-covariance matrices of two multidimensional data sets. This function is used by the function jrparams().
MBoxtest(DATA, nmanvars)
DATA |
Matrix containing the weighted scores on the principal components (PCs) that have been computed for the two data sets to be compared. The first column of this matrix should contain a group indicator variable, which has a value equal to 1 for the first data set and a value equal to 2 for the second data set. The remaining columns contain the weighted PC scores for the two data sets. |
nmanvars |
Number of manifest variables in the original multidimensional data sets. |
A list with two elements:
MB |
Box's M-statistic |
Sp |
Pooled covariance matrix |
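For orientation, the textbook form of Box's M-statistic can be written in a few lines of base R. This is a sketch under stated assumptions, not the code of MBoxtest(): the function name `box_m_sketch` is hypothetical, the 'nmanvars' argument is left out because its role is not described here, and the formula used is the standard one, M = (N - g) ln|Sp| - sum((n_i - 1) ln|S_i|). The returned element names merely mirror the documented ones.

```r
# Hypothetical helper; expects the documented DATA layout
# (column 1 = group indicator, remaining columns = scores)
box_m_sketch <- function(DATA) {
  groups <- DATA[, 1]
  X <- DATA[, -1, drop = FALSE]
  ids <- sort(unique(groups))

  n <- sapply(ids, function(g) sum(groups == g))                 # group sizes
  S <- lapply(ids, function(g) cov(X[groups == g, , drop = FALSE]))

  # Pooled covariance matrix, weighted by degrees of freedom
  dof <- sum(n) - length(ids)
  Sp <- Reduce(`+`, Map(function(Si, ni) (ni - 1) * Si, S, n)) / dof

  # Log-determinant, computed on the log scale for numerical stability
  logdet <- function(A) as.numeric(determinant(A, logarithm = TRUE)$modulus)

  M <- dof * logdet(Sp) - sum((n - 1) * sapply(S, logdet))
  list(MB = M, Sp = Sp)
}

set.seed(11)                                    # arbitrary seed
a <- matrix(rnorm(60 * 3), nrow = 60)           # hypothetical scores, unit variance
b <- matrix(rnorm(60 * 3, sd = 3), nrow = 60)   # hypothetical scores, larger spread
grp <- c(rep(1, 60), rep(2, 60))

same   <- box_m_sketch(cbind(grp, rbind(a, a))) # identical covariance structures
differ <- box_m_sketch(cbind(grp, rbind(a, b))) # clearly different structures
c(same = same$MB, differ = differ$MB)
```

With identical groups the statistic is zero up to rounding error, and it grows as the covariance structures diverge.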
Harmen Draisma
Section 2.4 in: Jouan-Rimbaud D, Massart DL, Saby CA, Puel C: Determination of the representativity between two multidimensional data sets by a comparison of their structure. Chemometrics and Intelligent Laboratory Systems 40 (1998) 129-144.
jrparams()