Since nonlinear PCA is an unsupervised method,
standard techniques for model selection, such as
cross-validation or, more generally, the use of an independent test set, fail when applied to nonlinear PCA.
Instead, the complexity of the nonlinear PCA model can be validated by using the
error in missing data estimation as the criterion for model selection.
This is motivated by the idea that only the model of optimal complexity
is able to predict missing values with the highest accuracy.
Keywords: model selection, model complexity, validation
1) split your data into "traindata" and "testdata"
% example: Gaussian data (linear data)
traindata = randn(2,1000);
testdata  = randn(2,1000);
2) choose a specific model complexity and train the nonlinear PCA model (download nonlinear PCA)
weightdecay = 0.001;
[c,net,network] = nlpca(traindata, 1, ...
    'mode',                     'symmetric', ...
    'type',                     'inverse', ...
    'units_per_layer',          [1, 6, size(traindata,1)], ...
    'weight_decay',             'yes', ...
    'weight_decay_coefficient', weightdecay, ...
    'max_iteration',            5000);
3) get the validation error based on missing data estimation
% set randomly one value per sample-column as missing
[s,idx] = sort(rand(size(testdata)));
testdataNaN = testdata;
testdataNaN(idx==1) = NaN;

% reconstruct test data including missing values
pc_test    = nlpca_get_components(net, testdataNaN);
data_recon = nlpca_get_data(net, pc_test);
e = (data_recon - testdata).^2;

% as validation error, we use only the missing data reconstruction error
testerrorNaN = mean(e(isnan(testdataNaN)));
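Steps 2 and 3 can be repeated over a range of model complexities to select the best one. A minimal sketch, assuming the nlpca toolbox functions shown above are on the path, the variables traindata, testdata, and testdataNaN from the previous steps exist, and using the weight decay coefficient as the complexity parameter (the candidate values below are illustrative):

```matlab
% sketch: select model complexity via missing data estimation error
candidates = [0.0001 0.001 0.01 0.1];   % illustrative weight decay coefficients
validerror = zeros(size(candidates));
for k = 1:numel(candidates)
    % train one model per candidate complexity (step 2)
    [c,net] = nlpca(traindata, 1, ...
        'mode',                     'symmetric', ...
        'type',                     'inverse', ...
        'units_per_layer',          [1, 6, size(traindata,1)], ...
        'weight_decay',             'yes', ...
        'weight_decay_coefficient', candidates(k), ...
        'max_iteration',            5000);
    % validate by missing data estimation (step 3)
    pc_test    = nlpca_get_components(net, testdataNaN);
    data_recon = nlpca_get_data(net, pc_test);
    e = (data_recon - testdata).^2;
    validerror(k) = mean(e(isnan(testdataNaN)));
end
[~,best] = min(validerror);             % lowest missing data error wins
bestweightdecay = candidates(best);
```

The model whose missing data reconstruction error is lowest is taken as the model of optimal complexity.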
Classical train and test errors cannot be used for validation, as shown in Scholz (2012).
Please use the following lines only for comparison, not for validating the nonlinear PCA model.
% get classical train error
data_recon = nlpca_get_data(net);
e = (data_recon - traindata).^2;
trainerror = mean(mean(e));

% get classical test error
pc_test    = nlpca_get_components(net, testdata);
data_recon = nlpca_get_data(net, pc_test);
e = (data_recon - testdata).^2;
testerror  = mean(mean(e));