This study addresses the challenge of imputing missing values in incomplete process data collected from high-cost data acquisition environments. Such missingness reflects shortfalls in data completeness, accuracy, and consistency, which significantly affect the quality of critical-to-quality (CTQ) attributes in manufacturing processes. We systematically evaluate three state-of-the-art imputation methods on real-world industrial data: Multiple Imputation by Chained Equations (MICE), the machine learning-based missForest algorithm, and a deep learning-based one-dimensional convolutional neural network (1D-CNN). Our analysis aims to identify the most effective imputation technique for the complex and noisy process datasets typical of manufacturing settings. The results highlight the strengths and limitations of each method and provide practical guidance for selecting an imputation approach that improves the reliability of quality prediction and decision-making in industrial applications.
This thesis studies two imputation methods for observations with missing values: the MCMC method and the EM algorithm. The performance of the two methods in linear (or quadratic) discriminant analysis is evaluated under various types of incomplete observations. Based on simulated experiments, the effects of imputation using the EM algorithm and the MCMC method are evaluated and compared in terms of the probability of misclassification and the root mean squared error (RMSE). The cases of incomplete observations are differentiated by missing rate, sample size, and the distance between the two classification groups. The studies show that both the probability of misclassification and the RMSE are lower for the EM algorithm than for the MCMC method, so imputation using the EM algorithm is more efficient than imputation using the MCMC method. However, when the sample size is small and the rate of missing values is extremely high, omitting all observation vectors with missing values from the analysis yields a lower probability of misclassification than either the EM algorithm or the MCMC method.
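As an illustration only, the following Python sketch shows what EM-based imputation and an RMSE check can look like in the simplest possible setting. It is not the thesis's procedure: the bivariate normal model, the assumption that only the second variable has missing values (missing completely at random), the function names `simulate`, `em_impute_bivariate`, and `rmse`, and all parameter values are assumptions made here. The MCMC method and the discriminant analysis step are not shown.

```python
import math
import random


def simulate(n, rho, missing_rate, seed):
    """Draw n bivariate normal pairs with correlation rho (hypothetical setup).

    Returns x1, the true x2, and a copy of x2 in which entries are replaced
    by None completely at random with probability missing_rate.
    """
    rng = random.Random(seed)
    x1, x2_true, x2_obs = [], [], []
    for _ in range(n):
        a = rng.gauss(0.0, 1.0)
        b = rho * a + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        x1.append(a)
        x2_true.append(b)
        x2_obs.append(None if rng.random() < missing_rate else b)
    return x1, x2_true, x2_obs


def em_impute_bivariate(x1, x2, n_iter=50):
    """Fill None entries of x2 using EM under a bivariate normal model.

    Assumes x1 is fully observed, so its mean and variance are fixed;
    EM only updates the mean and variance of x2 and the covariance.
    """
    n = len(x1)
    observed = [v for v in x2 if v is not None]
    mu1 = sum(x1) / n
    s11 = sum((a - mu1) ** 2 for a in x1) / n
    # Start from the observed-data moments of x2 and zero covariance.
    mu2 = sum(observed) / len(observed)
    s22 = sum((b - mu2) ** 2 for b in observed) / len(observed)
    s12 = 0.0

    for _ in range(n_iter):
        # E-step: expected x2 and x2^2 for each row given current parameters.
        beta = s12 / s11
        resid_var = s22 - beta * s12
        e_x2, e_x2sq = [], []
        for a, b in zip(x1, x2):
            if b is None:
                m = mu2 + beta * (a - mu1)
                e_x2.append(m)
                e_x2sq.append(m * m + resid_var)
            else:
                e_x2.append(b)
                e_x2sq.append(b * b)
        # M-step: update parameters from the expected sufficient statistics.
        mu2 = sum(e_x2) / n
        s22 = sum(e_x2sq) / n - mu2 ** 2
        s12 = sum(a * m for a, m in zip(x1, e_x2)) / n - mu1 * mu2

    # Impute each missing value with its conditional mean under the fit.
    beta = s12 / s11
    return [b if b is not None else mu2 + beta * (a - mu1)
            for a, b in zip(x1, x2)]


def rmse(truth, estimate, indices):
    """Root mean squared error over the given (originally missing) indices."""
    return math.sqrt(
        sum((truth[i] - estimate[i]) ** 2 for i in indices) / len(indices)
    )
```

A comparison in the spirit of the simulated experiments would call `simulate` over a grid of sample sizes and missing rates, impute with `em_impute_bivariate`, and report `rmse` on the masked positions. Extending the sketch to general missingness patterns, more variables, or two classification groups would require the full multivariate form of the E-step.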