MBPA4.0 AdvStatsFuncSpec

Overview of Stats
Available statistics features include:

 * Univariate
 * Frequency
 * Quantiles
 * Principal Component Analysis
 * Covariance

The Stats API is a programmatic API for configuring and running tasks that generate results and reports. Additional API calls can access the results.

Functional Descriptions
The main user APIs for configuring, running, and interrogating stats are the Stats, StatResults, and StatResultTable classes. These classes let you build complex Stats task configurations using the Groovy Builder pattern (see the Groovy Builder design pattern in Groovy in Action, Ch. 8).

The Stats builder allows you to specify a data source (a dataset) and transformations, such as binners and variable library invocations, to apply before calculating stats.

The combination of raw data and transformations is called the data pipeline.

You can then configure univariate, frequency, quantile, and covariance calculations to be made on the data pipeline. The results of these calculations can generate formatted reports, print to the console or to files, or be output to tables (in-memory runtime objects) that you can manipulate and interrogate.

Data Pipeline
A data pipeline is a stream of records. The records are read from data sources, and fields in the records can be transformed and replaced, or new fields can be added to records, via binners, variable libraries, and custom user code.

Data Sources
A data source block describes the path to a dataset, the optional sample weight to be used, and optional transformations to be applied to the data records.

The Stats API is able to calculate stats directly from .mbd dataset files (native files). To work with CSV, fixed width, SAS, or other non-Model Builder files, import data into a .mbd file.

Here is a simple example of a frequency calculation of the variable "age" in a census dataset.

def stats = new Stats
stats.configure {
    data(source:'file:/c:/temp/census.mbng')
    freq(vars:'age', table:'ageFreq')
}

This example reads records from the census dataset and feeds them to the frequency calculator. The frequency calculator collects frequency counts for the unique values of "age" and puts the results in a StatResultTable called "ageFreq".
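The counting the frequency calculator performs can be sketched in a few lines of Python (an illustrative sketch only, not the actual implementation; the record values are hypothetical):

```python
from collections import Counter

# Hypothetical records as they arrive from the data pipeline.
records = [{"age": 25}, {"age": 30}, {"age": 25}, {"age": 41}]

# Collect frequency counts for the unique values of "age",
# analogous to what freq(vars:'age', table:'ageFreq') produces.
age_freq = Counter(r["age"] for r in records)
```

Each unique value then becomes one row of the result table, paired with its count.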

The options for the data element are:

 * source -- a String indicating a path to a dataset, or a reference to a dataset instance
 * one or more transform specs

Data transformations
A data pipeline can include an arbitrary number of transformations. A transformation is added as a child element to the data command, as in this example:

def ageBinner = new Binning('numeric')
ageBinner.configure {
    spec '(0,50]'
    spec '(50,100+)'
}

def stats = new Stats
stats.configure {
    data(source:'file:/c:/temp/census.mbng') {
        binner(vars:'age=ageB', ref:ageBinner)
        // Compute length of string variable,
        // save the result to new "address_len" variable
        strlen(var:'address')
        // Rename "zip" variable to new "zipCode" variable
        rename(vars:'zip=zipCode')
    }
    freq(vars:'ageB', table:'ageFreq')
    uni(vars:'@n(*) addressLen', stats:'min max mean stddev', table:'uni1')
}

This example applies a binner to the "age" variable and places the binned value in each record as a new variable "ageB". The frequency calculator then computes counts of the binned values and generates a table called "ageFreq".
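The effect of the numeric binner can be sketched in Python (a sketch under assumed semantics: '(0,50]' means 0 < age <= 50 and '(50,100+)' means age > 50 with an open upper end; the records are hypothetical):

```python
def bin_age(age):
    # Mirrors the ageBinner specs '(0,50]' and '(50,100+)' from the example.
    if 0 < age <= 50:
        return "(0,50]"
    if age > 50:
        return "(50,100+)"
    return None  # value falls outside every bin

# The binned label is added to each record as the new variable "ageB",
# leaving the original "age" variable in place.
records = [{"age": 25}, {"age": 60}]
for r in records:
    r["ageB"] = bin_age(r["age"])
```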

The options for configuring binners are:


 * vars – a String describing a list of variables to apply the binner to. You can define the binned variable's name via the "=" syntax (for example, vars:'age=ageB'). If the name is omitted, the binned variable is added to the record with the suffix "_binned".
 * ref – a reference to a Binner instance, or a String naming a globally available system Binner such as "GeometricFine".
 * use – configures the return type from the binner. By default, the binner label is used. The available options are: 'label', 'index', 'numeric'.
 * suffix – overrides the default suffix "_binned" for binned result variables.

Some system-provided transforms include:

 * strlen – Calculates the length of a string variable. The newly created variable that holds the strlen result has the default suffix "_len".
 * copy – Duplicates a variable to a new variable. The newly created copy has the default suffix "_copied".
 * rename – Renames a variable. The current variable is removed; the newly created variable has the default suffix "_renamed".
 * zscale – Z-scales a variable with a provided mean and stddev. The new z-scaled variable has the default suffix "_Z". Other required settings for zscale are "mean" and "stddev", which take the numeric values used in the z-scale calculation.
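The strlen and zscale transforms can be sketched in Python (illustrative only; the record and the mean/stddev values are hypothetical):

```python
def strlen_transform(record, var, suffix="_len"):
    # strlen: store the string length under var + suffix (default "_len").
    record[var + suffix] = len(record[var])

def zscale_transform(record, var, mean, stddev, suffix="_Z"):
    # zscale: standardize with the supplied mean and stddev,
    # storing the result under var + suffix (default "_Z").
    record[var + suffix] = (record[var] - mean) / stddev

rec = {"address": "12 Main St", "income": 60000.0}
strlen_transform(rec, "address")
zscale_transform(rec, "income", mean=50000.0, stddev=10000.0)
```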

All of the above transforms accept these standard options:

 * vars: as defined for the binner transform.
 * suffix: a String used to override the transform's default suffix.

Weight Specifications
For Stats weights, simply specify the variable that holds the weight value for each record. The weight is applied to all stat table calculations, as this example illustrates:

def stats = new Stats
stats.configure {
    data(source:'file:/c:/temp/census.mbng') {
    }
    weight 'sampleWeightVar'
    freq(vars:'ageB', table:'ageFreq')
    uni(vars:'@n(*)', stats:'min max mean stddev', table:'uni1')
}
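Conceptually, each record then contributes to a calculation in proportion to its weight. Here is a Python sketch of a weighted mean (hypothetical values; the real task applies the weight across all configured stats):

```python
def weighted_mean(values, weights):
    # Each value contributes in proportion to its record's sample weight.
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

ages = [20.0, 40.0]
weights = [1.0, 3.0]  # hypothetical sampleWeightVar values
m = weighted_mean(ages, weights)
```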

Univariate Configuration
You can configure and run the Stats task to calculate univariate statistics. This includes min, max, mean, variance, standard deviation, kurtosis, and skewness of a series of numerical data.

The command for specifying univariate stats is uni. Each argument to the uni command is a name/value pair. The uni command has these options:


 * vars – The value for this argument is a String describing a list of variables in the dataset. The syntax of this list is very similar to SLIM and supports wild carding for names and types. See the variable list documentation for more information.
 * stats – the value for this option is a String containing a space-delimited list of stat names. If no stats option is specified, all available stats will be computed. The available univariate stats are:
    * min – the minimum value of all the non-missing data in this column
    * max – the maximum value of all the non-missing data in this column
    * mean – the mean (or average) of all the non-missing data in this column
    * stddev – the standard deviation of all the non-missing data in this column
    * variance – the variance of all the non-missing data in this column
    * kurtosis – the kurtosis of all the non-missing data in this column
    * skewness – the skewness of all the non-missing data in this column
    * count – the total count of all records read
    * countNum – the count of all non-missing values in this column
    * countMiss – the missing value count in this column
    * pctNum – the percentage of non-missing values (countNum / count * 100)
    * pctMiss – the percentage of missing values (countMiss / count * 100)
 * by - the value for this option is a String specifying a list of variables to use as the by variables. The syntax of this list is the same as the vars list.
 * table – a String that is the name of a result table to be generated by the stats task. A table can be retrieved from the StatsResults object.
 * report – a String representing the path to where the generated report will be written. The type of the report is derived from the suffix. For example, "c:/reports/census_freq.html" will be generated as an HTML report.

Here is a simple example:

def stats = new Stats
stats.configure {
    // specify the URL to the data file containing the
    // numeric variables 'income' and 'age'
    data(source:'file:/c:/temp/census.mbng')

    // calculate univariate stats for age and income and
    // put the results in a table called 'ageIncome' and
    // generate an HTML report
    uni(vars:'age income', stats:'min max', table:'ageIncome',
        report:'c:/work/jdoe/reports/ageIncome.html')
}
def results = stats.run

// retrieve the results
Table ageIncomeTable = results.ageIncome
println ageIncomeTable

output:

var     min  max
---
age     10   90
income  0    100000
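The calculations behind these stats can be sketched in plain Python (a sketch only: None marks a missing value, and the population form of variance is shown; the actual implementation may use the sample (n-1) form):

```python
import math

def univariate(column):
    # count is every record read; countNum excludes missing values.
    count = len(column)
    nums = [x for x in column if x is not None]
    count_num = len(nums)
    mean = sum(nums) / count_num
    variance = sum((x - mean) ** 2 for x in nums) / count_num
    return {
        "min": min(nums), "max": max(nums),
        "mean": mean, "variance": variance,
        "stddev": math.sqrt(variance),
        "count": count, "countNum": count_num,
        "countMiss": count - count_num,
        "pctNum": count_num / count * 100,
        "pctMiss": (count - count_num) / count * 100,
    }

row = univariate([10.0, 90.0, None])  # one row of the result table
```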

Univariate statistics tables
Univariate stats can be programmatically accessed via the StatResultTable. A StatResultTable is a list of rows where each variable selected in the uni command is a row and each column contains a univariate calculation.

You can then access individual univariate calculations by selecting a row and accessing a column. Using the ageIncome table generated above, you can access values like this:

// select the row where var column equals 'age'
def ageRow = ageIncomeTable.find{row -> row.var == 'age'}
println "age min: " + ageRow.min
println "age max: " + ageRow.max

output:

age min: 10
age max: 90

// select the row where var column equals 'income'
def incomeRow = ageIncomeTable.find{row -> row.var == 'income'}
println "income min: " + incomeRow.min
println "income max: " + incomeRow.max

output:

income min: 0
income max: 100000

Univariate by-table results
Univariate calculations can also be done with by variables. Each row pairs a unique combination of by-variable values with the univariate results. For example,

def stats = new Stats
stats.configure {
    // specify the URL to the data file containing the
    // numeric variables 'income', 'acctBal', and 'age'
    data(source:'file:/c:/temp/census.mbng')

    // calculate univariate stats for income and acctBal by age
    // and put the results in a table called 'ageIncome'
    uni(vars:'income acctBal', by:'age', stats:'min max mean', table:'ageIncome')
}

def results = stats.run
def ageIncomeTable = results.ageIncome
meanIncome = ageIncomeTable.find{it.age == 15 && it.var == 'income'}.mean
println 'mean income of 15 year olds: ' + meanIncome

meanAcctBal = ageIncomeTable.find{it.age == 15 && it.var == 'acctBal'}.mean
println 'mean account balance of 15 year olds: ' + meanAcctBal

The results of a by table are always tables within tables. Each by-level permutation defines a unique row in the table. The sub-table is then accessed via the .table accessor method. The univar sub-table then has one row per variable specified and one column per stat specified. Here is an example:

Contents of ageIncomeTable

Frequencies
You can configure and run the Stats task to calculate frequency statistics. Frequencies are counts of the occurrences of unique values of variables.

Frequency Configuration
The command for specifying frequency stats is freq. Each option for the freq command is a name/value pair. The freq command has these options:


 * vars – The value for this argument is a String describing a list of variables in the dataset. The syntax of this list is very similar to SLIM and supports wild carding for names and types. See the variable list documentation for more information.
 * by - the value for this option is a String specifying a list of variables to use as the by variables. The syntax of this list is the same as the vars list.
 * table – a String that is the name of a result table to be generated by the stats task. A table can be retrieved from the StatsResults object.
 * report – a String representing the path to where the generated report will be written. The type of the report is derived from the suffix. For example, "c:/reports/census_freq.html" will be generated as an HTML report.

Here is an example of configuring and running a frequency calculation.

def stats = new Stats
stats.configure {
    // specify the URL to the data file containing the string variable
    // 'US_state'
    data(source:'file:/c:/temp/census.mbng')

    // calculate freq stats for US states and put the results
    // in a table called 'stateFreq' and pretty print the results to HTML.
    freq(vars:'US_state', table:'stateFreq', report:'/work/jdoe/stateFreq.html')
}

def results = stats.run

// retrieve the results
def stateFreq = results.stateFreq

Frequency Table Results
Frequency tables are lists of rows, where each row contains the variable name, value, and count. To access individual cells in the table, you need to find the matching row. For example, using the table generated above, you can access the VT frequency count like this:

def row = stateFreq.find{ it.value == 'VT' }
println "VT count: " + row.count

output:

VT count: 100

If you have created a table with multiple vars, the results are a bit more complex. For example,

def stats = new Stats
stats.configure {
    // specify the URL to the data file containing the string variable
    // 'US_state'
    data(source:'file:/c:/temp/census.mbng')
    freq(vars:'US_state age', table:'ageStateFreq')
}
def results = stats.run
Table ageStateTable = results.ageStateFreq
def ageResults = ageStateTable.findAll{ it.var == 'age' }
def ageRow = ageResults.find{ it.value == '20' }
println "age count where age is 20: " + ageRow.count

output:

age count where age is 20: 53

Raw contents of ageStateFreq table

Frequency by-table results
By-table results are also placed into a StatResultTable. Like univariate by-var tables, each row is uniquely defined by the permutations of the by-var levels. Each row then has a table cell containing the frequency table for the vars specified. For example,

Contents of stateByAge table

You can then query the table to find individual rows to access counts:

Here is an example with two by variables:

Contents of stateByAgeByIncome table

You can then query the table to find individual rows to access counts:

Multidimensional frequency tables
Multidimensional tables let you see joint distributions of two or more variables. In the 2-dimensional case, the additional values of pct, pctCol, and pctRow are calculated. Respectively, these values are the percentage of total counts, percentage of column, and percentage of row.
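These three percentages can be sketched from the joint counts like this (a Python sketch with hypothetical zip1/reactivated counts):

```python
# Hypothetical joint counts: counts[(zip1, reactivated)] -> count
counts = {("0", "Y"): 30, ("0", "N"): 70, ("1", "Y"): 20, ("1", "N"): 80}

total = sum(counts.values())
row_totals, col_totals = {}, {}
for (r, c), n in counts.items():
    row_totals[r] = row_totals.get(r, 0) + n
    col_totals[c] = col_totals.get(c, 0) + n

def cell_stats(r, c):
    n = counts[(r, c)]
    return {
        "count": n,
        "pct": n / total * 100,             # percentage of total counts
        "pctRow": n / row_totals[r] * 100,  # percentage of the cell's row
        "pctCol": n / col_totals[c] * 100,  # percentage of the cell's column
    }

cell = cell_stats("0", "Y")
```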

This example shows the joint distribution of zip1 and reactivated.

'zip1Reactivated' Raw Results:

The raw data can be accessed with scripts like this:

Sample report

Zip1, Mailed, file: '/work/data/postmail.mbng'

For 3 or more dimensions, the only available value is the raw count. For example,

Raw results

Multidimensional frequency tables with by variables
Multidimensional frequency tables allow the user to subdivide a dataset with by variables. Here is an example of how to create a 2d freq table by income. The results are a table of tables.

Sample output of zipTable:

Code example of accessing 2d freq table with by var

Quantiles
Quantiles are points taken at regular intervals from the cumulative distribution function of a random variable. The system will provide a set of pre-canned quantiles which include: default, percentiles (100-quantiles), quartiles (4-quantiles), duo-deciles (20-quantiles), tails (as mapped to those provided by classic ModelBuilder).
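As a reference point for what the quantile tables report, here is a minimal Python sketch of nearest-rank quantiles (the actual implementation may interpolate between data points; the bounds and data are hypothetical):

```python
def quantiles(values, bounds):
    # For each bound q in [0, 1], return the value at that point of the
    # sorted data (nearest-rank style; no interpolation).
    xs = sorted(values)
    n = len(xs)
    return [xs[min(int(q * n), n - 1)] for q in bounds]

# Quartiles (4-quantiles) of the integers 1..100.
quartiles = quantiles(list(range(1, 101)), [0.25, 0.5, 0.75])
```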

Quantile Configuration
The primary end user API for quantiles will be the quantile element in the Groovy Stats builder.

The following keywords are supported by quantile:


 * table: takes a String naming the table that will hold the quantile result. This is optional; if omitted, the result table is assigned a default name with the prefix "quantile" followed by a numeric value (the next available number among tables with that prefix in the stat result).
 * vars: takes a collection of input variables, or a filter expression selecting input variables, to include in this quantile table.
 * by: takes a collection of input variables, or a filter expression selecting input variables, to use as by-variables in the quantile table.
 * bounds: defines the bounds for quantile tables. This value can be:
    * a String naming one of the system-provided quantile bounds. The name can be abbreviated to as few as its first three letters; for example, "per" instead of "percentiles".
    * a String containing space-delimited doubles specifying ad hoc bounds.
    * a re-usable Bounds instance defined by the user.
 * report: takes a String in URI format specifying the location where the quantile report will be generated.

For example:

Figure 4

Quantile Table Results
Each quantile computation's result is placed in a named StatResultTable. The result is itself a table, with the quantile result placed in a column named quantileResult. If the quantile has by-variables, the table has an additional column for each by-variable.

Quantile Result table without by-var

Quantile Result table with by-var

Each quantileResult instance can be viewed as a table with two or more columns, where the first column is "bounds" and each additional column is named for a variable with quantile calculations. Accessing the "bounds" column or a variable column yields an array of doubles corresponding to the quantile bounds and each variable's value at those bounds.

To quickly view a quantile result, you can simply invoke a println statement on the table result, such as:

println results.spending

println results.quantile1

Here's an example of the raw result produced by the println statement:

For a quantile with by-variables, the result is printed as:

Covariance
The user can compute variable covariance with the covar element in a Stats task. The covar element supports weight variables, by-variables, and binners on both by-variables and input variables. Optionally, on symmetric correlation matrices, the user can also perform PCA.

Covariance Configuration
The following keywords are supported by covar:

 * table: takes a String naming the table that will hold the covariance result. This is optional; if omitted, the result table is assigned a default name with the prefix "covar" followed by a numeric value (the next available number among tables with that prefix in the stat result).
 * report: takes a String in URI format specifying the location where the covariance report will be generated.
 * rows: takes a collection of input variables, or a filter expression selecting input variables, to use as row variables.
 * cols: takes a collection of input variables, or a filter expression selecting input variables, to use as column variables. If omitted, the column variables are the same as the "rows" variables, and the covariance is symmetric.
 * by: takes a collection of input variables, or a filter expression selecting input variables, to use as by-variables in the covariance table.
 * pca: accepts the strings ('yes', 'no'), booleans (true, false), or numbers (1, 0) to compute a PCA result when the input variables are symmetric. When the input is symmetric and pca is omitted, no PCA result is generated.
 * covar: accepts the strings ('yes', 'no'), booleans (true, false), or numbers (1, 0). If true, the PCA result (if generated) is computed from the covariance matrix instead of the default correlation matrix. The default is false.
 * missing: takes a String ('drop', 'pairwise') indicating how to handle records with missing values. The default behavior is to drop records with missing values.
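One cell of the covariance matrix, including the missing-value handling, can be sketched in Python (illustrative only; 'drop' and 'pairwise' coincide for a single pair of variables, but with more variables 'drop' removes a record missing any variable while 'pairwise' only requires the two variables of each cell to be present):

```python
def covariance(xs, ys):
    # One covar cell for variables x and y; None marks a missing value.
    # Pairwise handling: keep a record if both of this cell's values exist.
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    return sum((x - mx) * (y - my) for x, y in pairs) / n

cov = covariance([1.0, 2.0, 3.0, None], [2.0, 4.0, 6.0, 8.0])
```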

The following example illustrates how to use covar to produce a simple symmetric covariance matrix for the three variables 'A C D'. The selection of rows and columns works just like the SLIM vars list in frequency and univariate statistics. For example,

The following example illustrates how to configure a covariance generator to do PCA (Principal Component Analysis). The 'pca' option indicates a PCA analysis will be included in the result if and only if the covariance is symmetric. Because a table name is omitted, the Stats task will place the results in a table with an auto-generated name of "covar1".

This example shows how to configure a covar table with by-variables and the missing value handling as "pairwise".

By default, if only a rows option is provided, the covar matrices are symmetric. To configure asymmetric covariance, the user must provide a column list via the "cols" option. If the table is asymmetric, PCA is not computed and the pca option is ignored. The following example shows an asymmetric covar declaration with a single by-variable.

Covariance Results
Each covar request's result is placed in a named StatResultTable. The result is itself a table, with the covariance result placed in a column named covarResult. If the covar has by-variables, the table has an additional column for each by-variable.

Covariance Result table without by-var

Covariance Result table with by-var

Each CovarianceResult instance is also a table with the following columns:

Each PCA instance is also a table with n + 5 columns, where n is the number of by-var variables.

PCA without by-var

PCA with by-var

The following example shows how to access a covariance instance through the result table, and its PCA:

A quick snapshot of a covariance result is included here:

If PCA is produced as part of the covar statement:

Principal Component Analysis Projection
The PCA projection is a system-provided transform. You can project PCA as part of a data transform in either a Stats command or the Apply task. The PCA projection needs a PCA result from the covariance calculation. You must specify the number of eigenvectors (the subspace size) to use; the main projection vector is computed using the eigenvectors with the largest eigenvalues. The result vectors (main, coeff, and residual) of PCA can be either flattened out (by default) or kept as numeric arrays.
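The projection arithmetic can be sketched in Python (a sketch under stated assumptions: x is an already z-scaled record vector, and the eigenvectors are unit vectors sorted by descending eigenvalue; all inputs are hypothetical):

```python
def project(x, eigenvectors, k):
    # Project x onto the subspace of the top-k eigenvectors.
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))
    top = eigenvectors[:k]
    coeff = [dot(x, v) for v in top]                    # coefficients (c)
    main = [sum(c * v[i] for c, v in zip(coeff, top))   # main vector (m)
            for i in range(len(x))]
    residual = [xi - mi for xi, mi in zip(x, main)]     # residuals (r)
    return main, coeff, residual

# 2-D example: keep only the first eigenvector.
main, coeff, residual = project([3.0, 4.0], [[1.0, 0.0], [0.0, 1.0]], k=1)
```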

PCA Projection Configuration
The following keywords are supported in PCA projection.


 * data. The data node is used to load the target dataset. The arguments for this data section are very similar to those accepted by stats. Using a system-provided transform, you can also rename or copy variables on the fly so the new dataset can supply the variables expected by the PCA result.


 * project. Projection configuration options include:
    * ref: takes a PCA result from a covariance analysis.
    * vectors (v): takes a numeric value indicating the subspace size.
    * compute (c): the compute list (what will be included in the output). It takes any combination of c (coefficients), r (residuals), and m (main vectors).
    * zscaling (z): takes a boolean value [yes/no, true/false, 1/0] indicating whether z-scaling should be on. The default is on.
    * normalize (n): takes a boolean value [yes/no, true/false, 1/0] defining whether normalization should be applied. The default is to apply it.
    * print: takes a boolean value [yes/no, true/false, 1/0] indicating whether a summary should be displayed at the end. The default is to display the summary.
    * flat: takes a boolean value [yes/no, true/false, 1/0] indicating whether the projection result should be flattened out as an individual (double) variable for each component; if flat is false, the projection result is output as a double[] for each of the variables mains, coefficients, and residuals. The default is to flatten (flat = true).

The output dataset contains all original variables plus the projection values, in which:

Flat option = true (by default)

or

Flat option = false

The following example shows how to project a PCA projection onto a dataset using a pre-calculated PCA instance. This example makes use of all available options. The dataset can be the original dataset used to calculate the eigenvectors, or a new dataset. A new dataset may not have the variable names used in the PCA calculation, so be sure to rename variables in the new dataset to map to the variables expected by the PCA result.

Here's another example of projecting PCA with all default options: