Models
The Models tab groups the following Scenario Datasheets:
- Boosted Regression Tree
- GLM (Generalized Linear Model)
- Maxent
- Random Forest
- Model Outputs
In the SyncroSim UI, the Models tab can be accessed by right-clicking on a WISDM Scenario and selecting Properties from the context menu.
Boosted Regression Trees
The Boosted Regression Trees Datasheet contains information about the Boosted Regression Tree (BRT) algorithm, which is a machine-learning model and is described in and uses code from Elith et al., 2008 and Valavi et al., 2021. Arguments for this model correspond with model fitting inputs for the gbm.step function in the dismo R package, and defaults were set based on Valavi et al., 2021.
Fitting Method
The BRT fitting process can either be more user-specified or default-based. This argument allows the user to specify whether default values or user-specified values should be used in the fitting process for BRT. If this argument is set to “Use defaults and tuning”, the code will tune the learning rate and number of trees to find values that fit a model to the full dataset and all CV splits. If the argument is set to “Use values provided below”, tuning does not occur and only the values provided in the below arguments will be used for model fitting.
Learning Rate
A number between 0 and 1 that sets the weight applied to individual trees. A small learning rate means that each tree will have a smaller contribution to the overall model.
Number of trees added per stage
Sets the number of initial trees, or trees added to the previously grown trees per round of model tuning.
Bag Fraction
Sets the proportion of observations used in selecting variables.
Maximum number of trees
Sets the maximum number of trees to fit before stopping.
GLM
The GLM Datasheet contains information about the Generalized Linear Model (GLM) algorithm options.
Select Best Predictors
Determines whether to the GLM should use internal statistical methods to determine which predictors are significant to the model and remove predictors that are unimportant for the Scenario. If Select Best Predictors is set to “No”, the GLM will use all predictors offered to it.
Simplification Method
Defines how the predictors are chosen for the GLM. Two options are available: “AIC” (Akaike Information Criterion) or “BIC” (Bayesian Information Criterion). AIC and BIC measure how well the GLM fits the data based on the covariates selected during the covariate selection step, and evaluates changes to the criteria when adding or dropping covariates. For more information about these criteria, see Aho, Derryberry, & Peterson.
Consider Squared Terms
Determines whether the GLM should consider relationships between the predictors and the response in a non-linear relationship by squaring the predictor values.
Consider Interactions
Determines whether the GLM should take into account interactions among covariates during the modeling process.
Maxent
The Maxent Datasheet contains information about the Maxent algorithm, a machine learning technique that minimizes relative entropy between the probability density of the species and of the environment.
Memory allocation (GB)
Sets the memory that should be allocated to the Maxent algorithm, in gigabytes.
Maximum number of background points
Sets the maximum number of background points that will be generated during the modeling process for the Maxent algorithm. More background points will lead to a higher entropy of the model, and only one background point can be generated per raster cell.
Number of processing threads
Sets the number of processing threads used for the Maxent algorithm. If this argument is not specified, the number of processing threads for Maxent will be set relative to the number of cores on the machine on which the Scenario is being run.
View maxent user interface during run
Determine whether to show the Maxent user interface.
Save maxent formatted input/output files
Determines whether to save Maxent’s input and output files to the library folders.
Random Forest
The Random Forest Datasheet contains information about the Random Forest (RF) algorithm - a machine-learning ensemble classifier. More information about Random Forest can be found in the randomForest R package documentation.
Evaluate covariate importance
Determines whether the RF algorithm should evaluate covariate importance during the modeling process by calculating the change in fit statistics for trees that include each covariate.
Calculate casewise importance
Determines whether the RF algorithm should calculate the importance of each covariate during the classification process and measure how significantly covariates influence the output. For more information about covariate and casewise importance in the RF algorithm, see Classification and Regression with Random Forest.
Number of variables sampled at split
Optional. Sets the number of variables randomly sampled as candidates in each split of the random forest.
Maximum number of nodes
Optional. Sets the maximum number of nodes the RF algorithm can have.
Number of trees
Sets the number of trees for the RF algorithm.
Node size
Optional. Sets the size of the terminal nodes. A larger node size causes smaller trees to be grown and will take less time.
Normalize votes
Determines whether each decision tree’s vote should be divided by the total number of votes to ensure that the sum of the votes is 1. Votes will be expressed as fractions.
Calculate proximity
Determines whether proximity measures should be calculated. Proximity measures are based on the frequency that pairs of data points are in the same terminal nodes.
Sample with replacement
Determines whether each decision tree should be trained on a bootstrap sample of the data, so some observations can be repeated or left out.
Model Outputs
The Model Outputs Datasheet contains information about the outputs of the models. These outputs will be visible in the Maps tab in the lower left panel when the Scenario’s results have been added and can be exported from the Export tab. The outputs may include:
- Response Curves
- Standard Residuals Plots
- Residual Smooth Plot
- Residual Smooth RDS
- Text Output
- Calibration Plot
- ROC/AUC Plot
- AUCPR Plot
- Confusion Matrix
- Variable Importance Plot
- Variable Importance Data
- Maxent Files
Note that not all outputs are generated per model run. The outputs that are generated depend on the type of model selected and the type of field data being used (e.g., presence/absence, count).
Model RDS
Defines the names of the model .rds files (which hold R objects) for algorithms that have completed execution.
Response Curves
Model Response Curves show the relationship between each predictor included in the model and the fitted values of the model. In general, these response curves should be smooth positive and negative relationships in agreement with the biological relationships that occur in the environment. While some exceptions occur, bumpy response curves can indicate the need for an algorithm to be optimized.
Standard Residuals Plots
Model Residuals plots show the plotted model deviance residuals for the algorithm.
Residual Smooth Plot
Model Residual Smooth Plots show the spatial relationship between the model deviance residuals. With the assumption that the residuals will be independent from each other, a spatial pattern in the deviance residuals could indicate an issue with the model fit, and these patterns can be seen in clusters of higher or lower residuals (Dormann et al., 2007).
Residual Smooth RDS
Contains a file path to the .rds file holding the information pertaining to the Residual Smooth Plots.
Text Output
The Text Outputs contain information about the algorithm’s settings and results. For example, the text output for GLM contains information about the model family and simplification method used as well as the number of covariates in the final model and evaluation statistics, along with other information.
Calibration Plot
The Calibration Plot shows the relationship between the predicted values from the model and the actual observations from the test proportion of the data. Calibration plots showing a higher agreement between the predicted probability of presence and the probability of presence indicate good calibration of the model, but the AUC of the model should be considered along with the calibration plot (Pearce & Ferrier, 2000; Elith et al., 2010).
ROC/AUC Plot
The ROC/AUC Plots show the relationship between sensitivity (true positives) and specificity (False positives) in the model algorithm. The output of these curves depend on whether the Split Data For Model Training and Testing and Use Cross Validation for Model Selection arguments were set to “Yes” in the Validation Options datasheet in the Data Preparation tab. It shows the sensitivity versus specificity of the training split and testing split or the training split and cross validation mean, along with AUC values. Better curves are generally those that arch far above the diagonal of the plot and those that have smaller discrepancies between the training and testing/validation data.
AUCPR Plot
The AUCPR plots show the relationship between recall (True positives / (True positives + False negatives)) and precision (True positives / (True positives + False Positives)) of the algorithm along with the Training split AUC and cross validation mean AUC. For more information about ROC/AUC and AUCPR plots, see (Davis & Goadrich, 2006).
Confusion Matrix
The Confusion Matrix Shows the number of values classified by the algorithm as a presence or an absence compared to the observed number of presences and absences in the data. The algorithm will output a confusion matrix for the train data, and a confusion matrix for the cross validation or test data. Each matrix consists of 4 cells, which predicted values on the left and observed values on the bottom. The top left cell shows the number of values predicted to be a presence that were observed to be a presence. The top right cell shows the number of values predicted to be a presence that were actually observed as absences. The bottom left cell shows the number of values predicted to be an absence that were actually observed as presences. The bottom right cell shows the number of values predicted to be an absence that were also observed to be absences. These values contribute to the calculation of the statistics at the bottom of the matrix: percent correctly classified, sensitivity, specificity, true skill statistic, and Cohen’s Kappa. The values in the Confusion Matrix can help identify how well or poorly the algorithm has made its predictions.
Variable Importance Plot
The Variable Importance Plot shows the relative influence of each predictor in the model. Variable importance is calculated by permuting the values of the predictors in the dataset and predicting the model to the new dataset with the permuted predictor values and measuring the mean drop in the AUC value using 5 different permutations of the predictor. The importance is measured by the change in AUC when each predictor is permuted and appears on the x-axis with individual variables on the y-axis. The importance is shown for the cross-validation, train, and test data.
Variable Importance Data
Exporting the Variable Importance Data outputs a .csv file containing variable importance values for the train data, test data, and each cross validation split.
Maxent Files
The Maxent Files are files that are output while the Maxent algorithm is running and are output in a .zip folder.