Grupo de Ingeniería Estadística Multivariante GIEM

Publication Search Results

Now showing 1 - 6 of 6
  • Publication
    PCA model building with missing data: New proposals and a comparative study
    (Elsevier, 2015-08-15) Folch-Fortuny, Abel; Arteaga Moreno, Francisco Javier; Ferrer Riquelme, Alberto José; Dpto. de Estadística e Investigación Operativa Aplicadas y Calidad; Escuela Técnica Superior de Ingeniería Industrial; Grupo de Ingeniería Estadística Multivariante GIEM; Ministerio de Ciencia e Innovación; Ministerio de Economía y Competitividad
    [EN] This paper introduces new methods for building principal component analysis (PCA) models with missing data: projection to the model plane (PMP), known data regression (KDR), KDR with principal component regression (PCR), KDR with partial least squares regression (PLS) and trimmed scores regression (TSR). These methods are adapted from their PCA model exploitation version to deal with the more general problem of PCA model building when the training set has missing values. A comparative study is carried out comparing these new methods with the standard ones, such as the modified nonlinear iterative partial least squares (NIPALS), the iterative algorithm (IA), the data augmentation method (DA) and the nonlinear programming approach (NLP). The performance is assessed using the mean squared prediction error of the reconstructed matrix and the cosines between the actual principal components and the ones extracted by each method. Four data sets, two simulated and two real ones, with several percentages of missing data, are used to perform the comparison.
  • Publication
    PLS model building with missing data: New algorithms and a comparative study
    (John Wiley & Sons, 2017-07) Folch-Fortuny, Abel; Arteaga, Francisco; Ferrer Riquelme, Alberto José; Dpto. de Estadística e Investigación Operativa Aplicadas y Calidad; Escuela Técnica Superior de Ingeniería Industrial; Grupo de Ingeniería Estadística Multivariante GIEM; Ministerio de Economía y Competitividad; Ministerio de Ciencia e Innovación
    [EN] New algorithms to deal with missing values in predictive modelling are presented in this article. Specifically, two trimmed scores regression adaptations are proposed, one from principal component analysis model building with missing data (MD) and the other from partial least squares regression model exploitation with missing values. Using these methods, practitioners can impute MD both in the explanatory/predictor and the dependent/response variables. Partial least squares is used here to build the multivariate calibration models; however, any regression method can be used after MD imputation. Four case studies, with different latent structures, are analysed here to compare the trimmed scores regression-based methods against state-of-the-art approaches. The MATLAB code for these methods is also provided for direct implementation at http://mseg.webs.upv.es, under a GNU license.
  • Publication
    Cross-validation in PCA models with the element-wise k-fold (ekf) algorithm: Practical Aspects
    (Elsevier, 2014-02-15) Camacho Páez, José; Ferrer Riquelme, Alberto José; Dpto. de Estadística e Investigación Operativa Aplicadas y Calidad; Escuela Técnica Superior de Ingeniería Industrial; Grupo de Ingeniería Estadística Multivariante GIEM; Ministerio de Ciencia e Innovación
    This is the second paper of a series devoted to providing theoretical and practical results and new algorithms for the selection of the number of Principal Components (PCs) in Principal Component Analysis (PCA) using cross-validation. The study is especially focused on the element-wise k-fold (ekf), which is among the most used algorithms for that purpose. In this paper, a taxonomy of PCA applications is proposed and it is argued that cross-validatory algorithms computing the prediction error in observable variables, like ekf, are only suited for a class of applications. A number of cross-validation methods, several of which are original, are compared in two applications of this class: missing data imputation and compression. The results show that the ekf is especially suited for missing data applications while other traditional cross-validation methods, those by Wold and by Eastment and Krzanowski, are not found to provide useful outcomes in either of the two applications. These results are of special value considering that the methods investigated are implemented in the main commercial software packages for chemometrics. Finally, the choice of the missing data algorithm within ekf is also investigated.
  • Publication
    Enabling network inference methods to handle missing data and outliers
    (BioMed Central, 2015-09-03) Folch-Fortuny, Abel; Fernández Villaverde, Alejandro; Ferrer Riquelme, Alberto José; Rodríguez Banga, Julio; Dpto. de Estadística e Investigación Operativa Aplicadas y Calidad; Escuela Técnica Superior de Ingeniería Industrial; Grupo de Ingeniería Estadística Multivariante GIEM; European Commission; Ministerio de Ciencia e Innovación; Xunta de Galicia; Ministerio de Economía y Competitividad
    [EN] Background: The inference of complex networks from data is a challenging problem in biological sciences, as well as in a wide range of disciplines such as chemistry, technology, economics, or sociology. The quantity and quality of the data greatly affect the results. While many methodologies have been developed for this task, they seldom take into account issues such as missing data or outlier detection and correction, which need to be properly addressed before network inference. Results: Here we present an approach to (i) handle missing data and (ii) detect and correct outliers based on multivariate projection to latent structures. The method, called trimmed scores regression (TSR), enables network inference methods to analyse incomplete datasets by imputing the missing values coherently with the latent data structure. Furthermore, it replaces the faulty values in a dataset with proper estimates. We provide an implementation of this approach, and show how it can be integrated with any network inference method as a preliminary data curation step. This functionality is demonstrated with a state-of-the-art network inference method based on mutual information distance and entropy reduction, MIDER. Conclusion: The methodology presented here enables network inference methods to analyse a large number of incomplete and faulty datasets that could not be reliably analysed so far. Our comparative studies show the superiority of TSR over other missing data approaches used by practitioners. Furthermore, the method allows for outlier detection and correction.
  • Publication
    Cross-validation in PCA models with the element-wise k-fold (ekf) algorithm: theoretical aspects
    (Wiley, 2012-07) Camacho Páez, José; Ferrer Riquelme, Alberto José; Dpto. de Estadística e Investigación Operativa Aplicadas y Calidad; Escuela Técnica Superior de Ingeniería Industrial; Grupo de Ingeniería Estadística Multivariante GIEM; Ministerio de Ciencia e Innovación; Universitat de Girona; Ministerio de Economía y Competitividad; Universitat Politècnica de València
    [EN] Cross-validation has become one of the principal methods to adjust the meta-parameters in predictive models. Extensions of the cross-validation idea have been proposed to select the number of components in principal components analysis (PCA). The element-wise k-fold (ekf) cross-validation is among the most used algorithms for principal components analysis cross-validation. This is the method programmed in the PLS_Toolbox, and it has been stated to outperform other methods under most circumstances in a numerical experiment. The ekf algorithm is based on missing data imputation, and it can be programmed using any method for this purpose. In this paper, the ekf algorithm with the simplest missing data imputation method, trimmed score imputation, is analyzed. A theoretical study is conducted to identify in which situations the application of ekf is adequate and, more importantly, in which situations it is not. The results presented show that the ekf method may be unable to assess the extent to which a model represents a test set and may lead to discarding principal components with important information. In a second paper of this series, other imputation methods are studied within the ekf algorithm.
  • Publication
    Missing Data Imputation Toolbox for MATLAB
    (Elsevier, 2016-03-15) Folch Fortuny, Abel; Arteaga Moreno, Francisco Javier; Ferrer Riquelme, Alberto José; Dpto. de Estadística e Investigación Operativa Aplicadas y Calidad; Escuela Técnica Superior de Ingeniería Industrial; Grupo de Ingeniería Estadística Multivariante GIEM; Ministerio de Economía y Competitividad; Ministerio de Ciencia e Innovación
    [EN] Here we introduce a user-friendly graphical interface to deal with missing values called Missing Data Imputation (MDI) Toolbox. This MATLAB toolbox allows imputing missing values, following missing completely at random patterns, exploiting the relationships among variables. In this way, principal component analysis (PCA) models are fitted iteratively to impute the missing data until convergence. Different methods, using PCA internally, are included in the toolbox: trimmed scores regression (TSR), known data regression (KDR), KDR with principal component regression (KDR-PCR), KDR with partial least squares regression (KDR-PLS), projection to the model plane (PMP), iterative algorithm (IA), modified nonlinear iterative partial least squares regression algorithm (NIPALS) and data augmentation (DA). MDI Toolbox presents a general procedure to impute missing data, and can thus be used to infer PCA models with missing data, to estimate the covariance structure of incomplete data matrices, or to impute the missing values as a preprocessing step of other methodologies.
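
Illustrative sketch (not part of the repository records): several of the abstracts above describe fitting PCA models iteratively to impute missing data until convergence. The Python fragment below is a minimal sketch of that general idea, under loud assumptions: it keeps a single principal component, stores the data as plain lists with `None` marking missing entries, and finds the loading vector by power iteration. The names `first_loading` and `impute_pca_rank1` are hypothetical; this is not the MDI Toolbox code, nor the exact IA, TSR or KDR algorithms of the listed papers.

```python
import math


def first_loading(Xc, n_power=200):
    """Leading loading vector of a centered matrix, by power iteration on Xc'Xc.

    Assumption: the uniform starting vector is not orthogonal to the true loading.
    """
    n, m = len(Xc), len(Xc[0])
    p = [1.0 / math.sqrt(m)] * m
    for _ in range(n_power):
        t = [sum(Xc[i][j] * p[j] for j in range(m)) for i in range(n)]  # scores
        q = [sum(Xc[i][j] * t[i] for i in range(n)) for j in range(m)]  # Xc' t
        norm = math.sqrt(sum(v * v for v in q))
        if norm == 0.0:
            break
        p = [v / norm for v in q]
    return p


def impute_pca_rank1(X, tol=1e-10, max_iter=500):
    """Fill None entries by refitting a one-component PCA until the fills settle."""
    n, m = len(X), len(X[0])
    missing = [(i, j) for i in range(n) for j in range(m) if X[i][j] is None]
    Z = [row[:] for row in X]
    # Step 0: start from the mean of the observed values in each column.
    for j in range(m):
        observed = [X[i][j] for i in range(n) if X[i][j] is not None]
        col_mean = sum(observed) / len(observed)
        for i in range(n):
            if Z[i][j] is None:
                Z[i][j] = col_mean
    for _ in range(max_iter):
        # Step 1: center the completed matrix and fit the PCA model.
        means = [sum(Z[i][j] for i in range(n)) / n for j in range(m)]
        Xc = [[Z[i][j] - means[j] for j in range(m)] for i in range(n)]
        p = first_loading(Xc)
        # Step 2: overwrite only the missing cells with the model reconstruction.
        change = 0.0
        for i, j in missing:
            score = sum(Xc[i][k] * p[k] for k in range(m))
            new_value = means[j] + score * p[j]
            change = max(change, abs(new_value - Z[i][j]))
            Z[i][j] = new_value
        # Step 3: stop once the imputed values no longer move.
        if change < tol:
            break
    return Z


# Toy data: three exactly collinear variables, one entry removed (true value 3.0).
X_demo = [[x, 2 * x, 3 - x] for x in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]
X_demo[2][0] = None
Z_demo = impute_pca_rank1(X_demo)
print(Z_demo[2][0])
```

Because the toy matrix is exactly rank one after centering, the imputed entry should approach the removed value of 3.0, while observed entries are never modified. A practical implementation would retain several components and use a linear algebra library rather than hand-written power iteration.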