https://doi.org/10.31449/inf.v47i3.4156
Informatica 47 (2023) 373–382

Comparative Study of Missing Value Imputation Techniques on E-Commerce Product Ratings

Dimple Chehal, Parul Gupta, Payal Gulati, Tanisha Gupta
Department of Computer Engineering, J.C. Bose University of Science and Technology, YMCA, Faridabad, India
E-mail: dimplechehal@gmail.com, parulgupta_gem@yahoo.com, gulatipayal@yahoo.co.in, tanishagupta067@gmail.com

Keywords: imputation techniques, missing value, recommender system, sparsity, user ratings

Received: May 5, 2022

Missing data is a common problem, as it adds ambiguity to data interpretation, and missing values in a dataset represent a loss of vital information. It is one of the most frequent data quality concerns; missing values are typically recorded as NaNs, blanks, or other placeholders. Missing values create imbalanced observations and biased estimates, and sometimes lead to misleading results. To deliver an efficient and valid analysis, appropriate remedies therefore need to be considered. By filling in the missing values, a complete dataset can be created and the challenge of dealing with complex patterns of missingness can be avoided. In the present study, eight different imputation methods have been compared: SimpleImputer, KNN Imputation (KNN), Hot Deck, Linear Regression, MissForest, Random Forest Regression, DataWig, and Multivariate Imputation by Chained Equation (MICE). The comparison has been performed on the Amazon cell phone dataset using three metrics: R-Squared (R²), Mean Squared Error (MSE), and Mean Absolute Error (MAE). Based on the findings, KNN had the best outcome for R², while DataWig had the worst. In terms of MSE and MAE, the Hot Deck imputation approach fared best, whereas MissForest performed worst for MAE. The Hot Deck imputation approach therefore seems to be of interest and should be investigated further in practice.

Povzetek: A comparison of missing value imputation techniques on e-commerce product ratings.

1 Introduction

Missing data occurs frequently in research fields such as clinical trials, climatology, and medicine, as it adds a layer of ambiguity to data interpretation [9], [19], [1], [5]. Nowadays, most databases present a problem of incomplete data. Missing values in a dataset mean a loss of important information. These are values that are not present in the data set and are recorded as NaNs, blanks, or other placeholders. Missing values create imbalanced observations and biased estimates, and in some cases can lead to misleading results. There can be multiple reasons for missing values in a dataset, such as failure to capture data, incorrect measurements or defective equipment, data corruption, sample mishandling, a low signal-to-noise ratio, measurement inaccuracy, non-response, or a deleted anomalous result [15], [10]. Building a machine learning model on a dataset containing missing values can have a major impact on the model as well as on its outcomes. Missing values can be of both continuous and categorical types. To get more precise results, multiple techniques can be used to fill in missing values. Many approaches for dealing with missing data have been presented in recent years, and they can be categorized as deletion and imputation. There are three common deletion approaches: list-wise deletion, pair-wise deletion, and feature deletion. The common approach in list-wise deletion, or case elimination, is to omit the cases with missing values and analyze the remaining data. Pair-wise deletion, on the other hand, removes data only when the specific data points required to test a hypothesis are missing. If there is missing data elsewhere in the data set, the existing values are still employed in statistical testing. Pair-wise deletion therefore retains more information than list-wise deletion, since it uses all observed values [11]. A small illustration of both approaches is given below.
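For concreteness, a minimal pandas sketch of the two row-level deletion approaches, using hypothetical toy columns; note that pandas' corr() applies pair-wise deletion per column pair by default:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing entries (NaN).
df = pd.DataFrame({"rating":   [5.0, np.nan, 3.0, 4.0],
                   "vote":     [10.0, 2.0, np.nan, 7.0],
                   "verified": [1.0, 0.0, 1.0, np.nan]})

# List-wise (case) deletion: drop every row that has any missing value.
listwise = df.dropna()

# Pair-wise deletion: for one analysis, keep every row that is complete
# for the variables involved; here the rating-vote correlation keeps
# rows 0 and 3 even though they have gaps in other columns.
pairwise_corr = df[["rating", "vote"]].dropna().corr()
```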
Imputation, on the other hand, is the process of identifying missing values and replacing them with substitute values [13], [6]. The method of missing value imputation is depicted in Figure 1. The experiment begins with the selection of a dataset, which is characterized as incomplete or complete based on the quantity of missing data it contains. When a dataset is classified as incomplete, it is split into two parts: complete data and missing data. Imputation methods employ the complete part of the dataset to impute the missing values. After that, a full dataset with no missing values is obtained. The performance of the imputation methods is computed by comparing the completed dataset with the experimental dataset using performance measures.

Figure 1: Missing value imputation process.

Single imputation and multiple imputation are the two subgroups of imputation techniques. In single imputation, only one value is generated for each missing cell and is used in place of the original value, although no imputation method can recover the exact value [18], [25]. The workflow for single imputation is depicted in Figure 2. First, the type of missing data is determined, and then single imputation is chosen from the two alternatives of single and multiple imputation; it is further separated into explicit and implicit modeling. In explicit modeling the assumptions are explicit: the predictive distribution is based on a formal statistical model, such as the multivariate normal, and mean imputation and regression imputation fall under this heading. Implicit modeling covers Hot Deck imputation, substitution, and cold deck imputation.

Figure 2: Workflow for single imputation.

In multiple imputation, several values are generated to fill each missing cell. Many complete data sets with various imputed values are generated, after which each data set is analyzed independently and the results are computed. In contrast to single imputation, multiple imputation accounts for the statistical uncertainty in the imputations [21], [7]. The workflow for multiple imputation is depicted in Figure 3. First, the type of missing data is determined, and then multiple imputation is chosen from the two alternatives. Multiple imputation generates several completed data sets, each of which is analyzed; the per-set results are then combined into a single value for each missing cell of the incomplete dataset. As a result, there are three separate phases in the multiple imputation technique:

a. The missing data are imputed M times, resulting in M complete data sets.
b. The M full data sets are then analyzed.
c. For the final imputation result, the outcomes of all M imputed data sets are pooled (a minimal sketch follows Figure 3).

Figure 3: Workflow for multiple imputation.
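The three phases can be illustrated with a deliberately simplified sketch. The random-draw imputation below is only a stand-in for a real imputation model; the point is the impute-analyze-pool structure:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([4.0, np.nan, 5.0, 3.0, np.nan, 4.0])  # toy column with gaps
observed = y[~np.isnan(y)]
M = 5  # number of imputed data sets

estimates = []
for _ in range(M):
    completed = y.copy()
    # Phase a: impute the missing cells (here: random draws from observed values).
    completed[np.isnan(completed)] = rng.choice(observed, np.isnan(y).sum())
    # Phase b: analyze each completed data set (here: the column mean).
    estimates.append(completed.mean())

# Phase c: pool the M per-set results into one final estimate.
pooled = np.mean(estimates)
```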
The existing imputation techniques have been compared using the R-Squared (R²), Mean Absolute Error (MAE), and Mean Squared Error (MSE) metrics. There are three main types of missing values:

a. Missing completely at random (MCAR)
b. Missing at random (MAR)
c. Not missing at random (NMAR)

The missing data mechanism states the relationship between missingness and the values of the variables in the dataset. A dataset Y can be viewed as a combination of an observed part and a missing part (Y_obs and Y_mis, respectively). In the first type, missing completely at random (MCAR), the probability of a value being missing depends neither on Y_obs nor on Y_mis [11], [8], [14]. In the second type, missing at random (MAR), the probability of a missing value may depend on the observed values Y_obs, but not on the missing values Y_mis themselves. MCAR and MAR are called ignorable mechanisms, because the missingness process does not need to be modeled explicitly when imputing. The final form, not missing at random (NMAR), is non-ignorable: the probability of a missing occurrence depends on the unobserved values themselves [14].

This study is divided into six sections. A brief review of related work is provided in Section 2. Different missing value patterns are explained in Section 3. In Section 4, a description of the dataset as well as the data analysis is given. The paper's results, together with the evaluation criteria used, are explained in Section 5, and the study's conclusion is given in Section 6.

2 Related work

There are multiple techniques to impute missing values. The first and oldest is SimpleImputer, in which the mean of a single column is computed from its non-missing cells and used to fill the missing cells of that column. SimpleImputer leads to poor imputation because it ignores correlations between different features [14]. Whenever the variables have a non-linear connection, linear regression-based imputation may underperform; Classification and Regression Trees (CART) provide a conditional model for imputation [3]. Random forest extensions have also yielded encouraging results [22]. Decision tree-based imputation techniques are non-parametric algorithms that do not assume a distribution for the data. K-Nearest Neighbors (K-NN) based imputation is one of the most often used non-parametric techniques. This technique replaces each missing element in dimension d with the mean of the K nearest neighbors' d-th dimension, computed over the observed values [24]. Sequential K-NN is a K-NN extension that begins by imputing missing values from observations with the fewest missing dimensions and then moves on to the remaining unknown entries while reusing the previously imputed values [12]. Iterative K-NN uses an iterative procedure to re-estimate the imputations and select the closest neighbors based on the previous iteration's estimates.
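As a point of reference for the basic K-NN scheme described above, a minimal example using scikit-learn's KNNImputer (the feature values are hypothetical toy data):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Each row is an observation; NaN marks a missing entry in some dimension.
X = np.array([[5.0, 10.0],
              [4.0, np.nan],
              [np.nan, 8.0],
              [3.0, 2.0]])

# Replace each missing entry with the mean of that dimension over the
# K nearest neighbors (K=2 here), as in the scheme described above.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
```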
Single imputation approaches produce a single completed data set that may be used for statistical analysis, whereas multiple imputation imputes several times (each set may differ), conducts the statistical analysis on each set, and combines the results. This strategy can capture the variability in the missing data and, as a result, produce potentially more accurate estimates for the wider statistical problem. Multiple imputation approaches, on the other hand, are slower and necessitate pooling of results, which may not be appropriate for some applications. The process for generating the several estimates of missing data varies within the multiple imputation framework. A common multiple imputation method, multivariate imputation by chained equations (MICE), generates estimates using predictive mean matching, Bayesian linear regression, logistic regression, and other techniques [4].

Missing data imputation is still an active research topic because of its importance. Although several approaches exist, many have serious shortcomings alongside their advantages. In the event of missing values, information management is critical. Planning, organizing, structuring, processing, regulating, assessing, and reporting information operations are all part of the information management cycle. The major goal of information management is to produce and manage data in order to gain better insights; hence, in missing value imputation, missing data is handled using various strategies, both single and multiple imputation, in order to gain a better understanding of datasets and to draw meaningful and statistically significant conclusions. When well-managed information is fed into an algorithm, the algorithm's performance improves, ultimately assisting in the resolution of current technological issues.

3 Missing data patterns and imputation approaches

Missing data patterns describe which values in the dataset are missing and which are observed. Univariate, monotone, and non-monotone missing data patterns are the three types:

a. Univariate: When only one variable has missing data, the dataset follows a univariate missing data pattern; all missing values lie in a single column [17].
b. Monotone: Data is monotone when it is ordered and the pattern is frequently connected with longitudinal studies in which participants drop out and never return. This pattern is easier to detect because it is more visible and distinguishable [2].
c. Non-monotone: Data is non-monotone when missing values in one variable or column have no effect on the values or the missing values of the other columns [20].

Missing value imputation is an essential part of data analysis, since it ensures that the dataset is complete and that results are computed correctly. There are mainly two types of imputation techniques: single imputation and multiple imputation. In this experiment, the techniques SimpleImputer, KNN Imputation (KNN), Hot Deck, Linear Regression, MissForest, Random Forest Regression, DataWig, and Multivariate Imputation by Chained Equation (MICE) are compared and evaluated. The advantages and disadvantages of these techniques are shown in Table 1.

a. Imputation using SimpleImputer: SimpleImputer is a scikit-learn class that aids with missing data imputation in datasets used for predictive modeling [16], [23]. It substitutes a placeholder for the NaN values. SimpleImputer employs a variety of strategies to impute values, one of which is the use of the mean or median to replace missing values. In this technique, the mean or median of the non-missing values is computed, and the missing values in the column are imputed with it. This technique is best applied to numerical values rather than categorical ones. Mean imputation is quick and simple to implement, and it preserves the mean of the observed data: if the data is missing completely at random (MCAR), the estimate of the mean remains unbiased. However, mean imputation is less accurate than the other imputation techniques considered here.
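A minimal usage sketch of scikit-learn's SimpleImputer with the mean strategy (the column values are toy data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[4.0], [np.nan], [5.0], [3.0]])  # a numeric column with a gap

# strategy="mean" replaces each NaN with the mean of the non-missing values;
# "median" and "most_frequent" are the other common strategies.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)  # the NaN becomes (4 + 5 + 3) / 3 = 4.0
```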
b. Imputation using KNNImputer: KNNImputer, from the scikit-learn Python machine learning library, performs nearest neighbor imputation [16]. In KNN imputation, the distance between data points is measured and a number of contributing neighbors is chosen for each prediction. The number of nearest neighbors used to predict a missing value is controlled by the parameter K, which has a direct impact on the KNN algorithm's performance. A high K value reduces the impact of random error on the variance, but it also increases the risk of missing important small-scale patterns. When selecting an appropriate value of K, it is critical to strike a balance between overfitting and underfitting.

c. Hot Deck imputation: Hot Deck imputation selects, for each record with a missing value, one value at random from the records with similar values on all other variables. All records in the dataset with similar values in the other variables are searched, and one of them is selected at random and used to impute the missing value [17]. The benefit is that this method introduces no outliers into the dataset, since every imputed value is an actually observed one. A minimal sketch of the idea is given below.
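Scikit-learn ships no Hot Deck imputer, so the following is a minimal pandas sketch under the assumption that "similar records" means records sharing the same value of a grouping column (here a hypothetical verified flag); one donor value is drawn at random per missing cell:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"verified": [1, 1, 1, 0, 0],
                   "rating":   [5.0, 4.0, np.nan, 3.0, np.nan]})

def hot_deck(s: pd.Series) -> pd.Series:
    # Donors are the observed values among records similar to the recipient.
    donors = s.dropna().to_numpy()
    out = s.copy()
    n_missing = out.isna().sum()
    if len(donors) and n_missing:
        # Draw one donor value at random for each missing cell.
        out[out.isna()] = rng.choice(donors, n_missing)
    return out

# Records are "similar" when they share the same verified value.
df["rating"] = df.groupby("verified")["rating"].transform(hot_deck)
```

Because every fill-in is a real observed rating from a similar record, the imputed column keeps the shape of the original distribution, which is the property claimed above.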
d. Imputation using Linear Regression: Regression imputation is a two-step procedure in which a regression model is first constructed using all of the available complete data points; the fitted model is then used to impute the missing data. In linear regression, a regression equation is formed in which the best predictors serve as independent variables, whereas the variables with missing data are treated as dependent variables. The missing values are predicted from the independent variables using the regression equation. Values for the missing variable are inserted in an iterative procedure, and then all cases are used to forecast the dependent variable. These steps are repeated until the predicted values are almost identical from one step to the next, at which point the procedure has converged.

e. Imputation using MissForest: MissForest is a machine learning imputation method based on the random forest algorithm [22]. First, the missing cells are pre-filled using median/mode imputation. The rows with observed values act as training rows and the rows with missing values are the ones to be predicted; the training rows are fed into a random forest model that predicts the missing values. The predicted values then replace the pre-filled ones, resulting in a complete dataset free of missing values. To enhance the imputation, the entire procedure is repeated over several iterations. MissForest can handle numerical, categorical, and mixed data types. In Python it is available through the missingpy library.

f. Imputation using Random Forest Regression: Random Forest is a meta-estimator technique that fits several decision trees on various sub-samples of the dataset and employs averaging to increase predictive accuracy and control over-fitting. Random forest regression is a supervised ensemble learning approach for regression. The ensemble learning method combines predictions from several machine learning models to obtain a more accurate forecast than a single model. For regression problems, the mean or average forecast of the individual trees is computed; this is known as aggregation. Instead of depending on individual decision trees, the main idea is to aggregate numerous decision trees to determine the final outcome. As its base learners, Random Forest uses several decision trees; rows and features are sampled at random from the dataset to build a sample dataset for each tree, a process known as bootstrapping.

g. Imputation using Deep Learning (DataWig): DataWig is a machine learning package that employs deep neural networks to impute missing values in a dataset [2]. DataWig combines deep learning feature extraction with automatic hyperparameter tuning. The approach applies to both categorical and non-numerical data. DataWig first determines the type of each column, which is then translated to a numerical representation. DataWig can be trained on both the CPU and the GPU. It typically works on a single column at a time, with the target column holding information about the column to be imputed supplied ahead of time.

h. Imputation using Multivariate Imputation by Chained Equation (MICE): In multiple imputation, many imputations are created for each missing value; the missing values are filled multiple times, creating multiple complete datasets. One well-known algorithm for multiple imputation is Multiple Imputation by Chained Equation (MICE). MICE works under the assumption that the missing data is missing at random (MAR) or missing completely at random (MCAR); implementing MICE when the data is not MAR can result in biased estimates. MICE is a very flexible technique and can handle multiple variables and complexities of varying types at a time. It employs a divide-and-conquer strategy to impute missing values in the dataset's variables, focusing on one variable at a time: once the emphasis is placed on that variable, all of the other variables in the data set are used to forecast the missingness in that variable. The prediction is made with a regression model whose form is dictated by the nature of the focal variable.
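The MICE algorithm itself is not part of scikit-learn, but scikit-learn's IterativeImputer follows the same chained-equations idea: each incomplete column is modeled as a function of the remaining columns in a round-robin fashion. A minimal sketch with toy values; note that swapping in a random-forest estimator gives a MissForest-style variant of items e and f above:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

X = np.array([[5.0, 10.0, 1.0],
              [4.0, np.nan, 1.0],
              [np.nan, 8.0, 0.0],
              [3.0, 2.0, np.nan]])

# Default estimator is BayesianRidge: each column with missing values is
# regressed on the other columns, cycling until the fill-ins stabilize.
mice_like = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)

# A random-forest estimator approximates MissForest-style imputation.
rf_like = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=10, random_state=0).fit_transform(X)
```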
Table 1: Advantages and disadvantages of imputation techniques

1. SimpleImputer
   Advantages: a simple and quick procedure; suitable for small numerical datasets.
   Disadvantages: correlation between features is not taken into account; not very precise.

2. KNNImputer
   Advantages: more accurate than SimpleImputer.
   Disadvantages: operates by memorizing the entire training dataset; sensitive to outliers.

3. Hot Deck imputation
   Advantages: the imputed data keep the same distribution shape as the actual data; good for categorical data.
   Disadvantages: not good for small sample sizes.

4. Linear Regression
   Advantages: effective for numeric data.
   Disadvantages: performs poorly if the predictive power of the model is poor.

5. MissForest
   Advantages: the loop over the missing data points is repeated numerous times, each iteration improving on the previous one; works with both numerical and categorical data; no preprocessing needed.
   Disadvantages: time-consuming, since the number of iterations depends on the size of the dataset; expensive to operate.

6. Random Forest Regression
   Advantages: outlier resistant; handles non-linear data well; lower risk of over-fitting; performs well on huge datasets.
   Disadvantages: training is slow; not recommended for linear problems with many sparse features.

7. Deep Learning (DataWig)
   Advantages: works with categorical data; supports both CPUs and GPUs.
   Disadvantages: slow when dealing with large datasets; imputes a single column at a time.

8. Multivariate Imputation by Chained Equation (MICE)
   Advantages: unbiased estimates, which are more reliable than ad hoc responses to missing data.
   Disadvantages: assumes the missing data is missing at random (MAR) or missing completely at random (MCAR).

4 Experiments on rating predictions

This section details the dataset used and its corresponding analysis.

4.1 Dataset description

In this study, the publicly accessible Amazon dataset of cell phones and accessories has been used. In the 5-core dataset, all users and items have at least five reviews. It consists of 1,048,570 rows and 12 columns. The 12 columns are: overall (product rating), verified (whether the purchase is verified by Amazon), reviewTime (time of review submission), reviewerID (ID of each reviewer), asin (product ID), style (sparse value describing the product's color), reviewerName (name of the reviewer), reviewText (review text), summary (review summary), unixReviewTime (review time as UNIX time), vote (total number of votes earned by a product), and image (product image link). The primary columns of interest are verified, vote, and overall. Because the data is massive and the vote column is sparse, the dataset was preprocessed to ensure that every product has a vote value, which reduced it to 90,714 rows and 12 columns.

4.2 Data analysis

The principle of the analysis is depicted in Figure 4. Initially, there were no missing values in the dataset. Missing values amounting to about 4% were therefore created in the overall column of the original dataset (Amazon 5-core) according to the MCAR model, and imputation was performed using the different strategies. These simulated missing values were imputed using the eight techniques and assessed with three evaluation criteria (R², MSE, and MAE). R-squared is a statistical measure that represents the degree of goodness of fit of a regression model, with 1 as the ideal value. The corruption step is sketched below.

Figure 4: Principle of analysis.
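A sketch of the corruption step under MCAR, assuming the preprocessed ratings sit in a pandas DataFrame with a column named overall (the column name follows the dataset description above; the frame variable is hypothetical). Masking cells with a fixed probability, independently of all values, is exactly what MCAR requires:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def make_mcar(df: pd.DataFrame, column: str = "overall", frac: float = 0.04):
    """Blank out about `frac` of `column` completely at random, keeping the truth."""
    corrupted = df.copy()
    mask = rng.random(len(df)) < frac          # each cell masked with prob. 4%
    corrupted.loc[mask, column] = np.nan
    ground_truth = df.loc[mask, column]        # held out for later evaluation
    return corrupted, ground_truth

# corrupted, truth = make_mcar(ratings_df)     # ratings_df: the 90,714-row frame
```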
5 Results and discussion

Missing values were imputed using eight distinct imputation approaches. With the use of the vote and verified columns, all of the strategies imputed the missing values present in the overall column. Three assessment metrics, R², MSE, and MAE, were used to measure the performance of the techniques; Table 2 compares all eight approaches on these metrics.

a. R-squared: The closer the R-squared value is to 1, the better the model fits; the value can be negative when the fitted model is worse than the average fitted model. R-squared is obtained by dividing the sum of squares of residuals from the regression model (SS_RES) by the total sum of squares of errors from the average model (SS_TOT) and subtracting this ratio from 1, as in equation (1):

$R^2 = 1 - \frac{SS_{RES}}{SS_{TOT}} = 1 - \frac{\sum_{j} (y_j - \hat{y}_j)^2}{\sum_{j} (y_j - \bar{y})^2}$    (1)

Results for the R-squared (R²) metric: R² usually has a range of 0 to 1; Figure 5 shows the graph for R². The eight approaches yielded values ranging from about -0.5 to 1; negative R² values indicate that the fitted model is worse than the average fitted model. KNN, with a value of 0.9742, is the approach that produced the best R² value. When computing the missing value in KNN, K is set to 4, implying that the value for a missing point is computed from its four nearest neighbors. DataWig, on the other side, with an R² of -0.5311, had the poorest performance. SimpleImputer, Hot Deck, MICE, and Random Forest Regression all produced positive values: 0.9744, 0.9929, 0.9745, and 0.9744, respectively. Linear Regression and MissForest, on the other hand, produced negative R² values of -0.4356 and -0.0259, respectively.

Figure 5: Graphical representation of the comparison of imputation techniques with respect to R-squared.

b. Mean Squared Error: The Mean Squared Error (MSE) is one of the most basic and often used loss functions. To calculate the MSE, take the difference between the model's predictions and the ground truth, square it, and average it over the whole dataset; N denotes the number of samples tested. The value of the MSE can never be negative because the errors are always squared. An advantage of the MSE is that it helps ensure that the trained model contains no outlier predictions with large errors, as the squaring gives such errors more weight. The MSE is defined by equation (2):

$MSE = \frac{1}{N} \sum_{j=1}^{N} (y_j - \hat{y}_j)^2$    (2)

Results for the Mean Squared Error (MSE) metric: The MSE ranges from 0 to infinity. Figure 6 shows the graph for MSE; the value for MissForest is out of range compared to the other points and hence is not depicted in the graph. Because a low MSE means the real and predicted values are close, the lower the MSE value, the higher the accuracy of the predicted values. MissForest (1207.2801) produced the highest MSE, while Hot Deck (0.0145) produced the lowest. The MSEs are smaller than 1 for SimpleImputer (0.0514), KNN (0.0529), Random Forest Regression (0.0515), and MICE (0.0513), while Linear Regression (1.2888) and DataWig (1.3746) have MSEs greater than 1.

Figure 6: Graphical representation of the comparison of imputation techniques with respect to MSE.

c. Mean Absolute Error: For the Mean Absolute Error (MAE), the difference between the model's predictions and the ground truth is taken, its absolute value is applied, and the result is averaged over the entire dataset. The advantage of the MAE directly compensates for the disadvantage of the MSE: because the absolute value is used, all errors are weighted on the same linear scale, so the loss function does not place excessive emphasis on outliers and provides a general, consistent evaluation of how well the model performs. The MAE is defined by equation (3):

$MAE = \frac{1}{N} \sum_{j=1}^{N} |y_j - \hat{y}_j|$    (3)

Results for the Mean Absolute Error (MAE) metric: The MAE ranges from 0 to infinity; Figure 7 shows the graph for MAE. The MAE is calculated in stages: the prediction error is obtained by subtracting the predicted value from the actual value, each error is converted to its absolute value, and the mean of all absolute errors is taken. The best MAE was achieved by Hot Deck (0.0052), while the poorest was achieved by MissForest (7.6032). MICE (0.0410), SimpleImputer (0.0411), KNN (0.0245), and Random Forest Regression (0.0410) produced MAEs below 1, while Linear Regression (1.0319) and DataWig (1.0768) exceeded 1. In general, the result of measuring the difference between any two continuous variables is referred to as the Mean Absolute Error.

Figure 7: Graphical representation of the comparison of imputation techniques with respect to MAE.
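All three metrics are available in scikit-learn, so the evaluation step takes only a few lines. A sketch, assuming y_true holds the held-out original ratings and y_pred the corresponding imputed values (both hypothetical here):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Hypothetical held-out ratings and the values an imputer produced for them.
y_true = np.array([5.0, 3.0, 4.0, 5.0, 2.0])
y_pred = np.array([4.8, 3.1, 4.0, 4.7, 2.4])

r2 = r2_score(y_true, y_pred)               # equation (1)
mse = mean_squared_error(y_true, y_pred)    # equation (2)
mae = mean_absolute_error(y_true, y_pred)   # equation (3)
```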
As shown in Table 2, the Hot Deck imputation technique provides the most promising outcomes and should be considered further, while MissForest produced the worst results. All of the other strategies produced outcomes that might be improved over time with simple adjustments.

Table 2: Performance comparison of imputation techniques

Techniques                  R²        MSE        MAE
SimpleImputer               0.9744    0.0514     0.0411
KNN                         0.9742    0.0529     0.0245
Hot Deck                    0.9929    0.0145     0.0052
Linear Regression          -0.4356    1.2888     1.0319
MissForest                 -0.0259    1207.2801  7.6032
Random Forest Regression    0.9744    0.0515     0.0410
DataWig                    -0.5311    1.3746     1.0768
MICE                        0.9745    0.0513     0.0410

6 Conclusion

When a value in a dataset goes missing, important information is lost; to avoid this, missing values are imputed. The term "imputing values" refers to the statistical computation of a value for a missing cell based on surrounding values or on values from the same column. Post-imputation checking is significant in data analysis because it ensures that the dataset is complete and that the findings are computed and arranged accurately. Eight techniques have been explored in this experiment to compute missing values for the Amazon dataset. Only three columns (overall, verified, and vote) have been utilized to conduct the experiment; the overall column contains the missing values and hence is the most essential one. After imputing the missing values, the outcomes have been evaluated using three evaluation parameters: R², MAE, and MSE. The Hot Deck imputation technique has surpassed all other techniques in terms of imputation results. The performance metrics for Hot Deck are within range, whereas MissForest's values are out of range, making it the lowest performing technique.

References

[1] Afrifa-Yamoah, E. et al. 2020. Missing data imputation of high-resolution temporal climate time series data. Meteorological Applications. 27, 1 (2020), 1–18. DOI: https://doi.org/10.1002/met.1873.
[2] Bießmann, F. et al. 2019. DataWig: Missing value imputation for tables. Journal of Machine Learning Research. 20 (2019), 1–6.
[3] Burgette, L.F. and Reiter, J.P. 2010. Multiple imputation for missing data via sequential regression trees. American Journal of Epidemiology. 172, 9 (Nov. 2010), 1070–1076. DOI: https://doi.org/10.1093/aje/kwq260.
[4] Chhabra, G. et al. 2019. A review on missing data value estimation using imputation algorithm. Journal of Advanced Research in Dynamical and Control Systems. 11, 7 Special Issue (2019), 312–318.
[5] Cismondi, F. et al. 2013. Missing data in medical databases: Impute, delete or classify?
Artificial Intelligence in Medicine. 58, 1 (2013), 63–72. DOI: https://doi.org/10.1016/j.artmed.2013.01.003.
[6] Ghazanfar, M.A. and Prugel-Bennett, A. 2013. The advantage of careful imputation sources in sparse data-environment of recommender systems: Generating improved SVD-based recommendations. Informatica (Slovenia). 37, 1 (2013), 61–92.
[7] Graham, J.W. et al. 2003. Methods for handling missing data. Handbook of Psychology. (2003). DOI: https://doi.org/10.1002/0471264385.wei0204.
[8] Heitjan, D.F. and Basu, S. 1996. Distinguishing "missing at random" and "missing completely at random." American Statistician. 50, 3 (1996), 207–213. DOI: https://doi.org/10.1080/00031305.1996.10474381.
[9] Jakobsen, J.C. et al. 2017. When and how should multiple imputation be used for handling missing data in randomised clinical trials - a practical guide with flowcharts. BMC Medical Research Methodology. 17, 1 (2017), 1–10. DOI: https://doi.org/10.1186/s12874-017-0442-1.
[10] Kaiser, J. 2014. Dealing with missing values in data. Journal of Systems Integration. (2014), 42–51. DOI: https://doi.org/10.20470/jsi.v5i1.178.
[11] Kang, H. 2013. The prevention and handling of the missing data. Korean Journal of Anesthesiology. 64, 5 (2013), 402. DOI: https://doi.org/10.4097/kjae.2013.64.5.402.
[12] Kim, K.Y. et al. 2004. Reuse of imputed data in microarray analysis increases imputation efficiency. BMC Bioinformatics. 5, 1 (Oct. 2004), 1–9. DOI: https://doi.org/10.1186/1471-2105-5-160.
[13] Lin, W.C. and Tsai, C.F. 2020. Missing value imputation: a review and analysis of the literature (2006–2017). Artificial Intelligence Review. 53, 2 (2020), 1487–1509. DOI: https://doi.org/10.1007/s10462-019-09709-4.
[14] Little, R.J.A. and Rubin, D.B. 2014. Statistical Analysis with Missing Data. (Jan. 2014), 1–381. DOI: https://doi.org/10.1002/9781119013563.
[15] Mandel J, S.P. 2015. A comparison of six methods for missing data imputation. Journal of Biometrics & Biostatistics. 06, 01 (2015), 1–6. DOI: https://doi.org/10.4172/2155-6180.1000224.
[16] McAuley, J. et al. 2015. Image-based recommendations on styles and substitutes. SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. (2015), 43–52. DOI: https://doi.org/10.1145/2766462.2767755.
[17] Myers, T.A. 2011. Goodbye, listwise deletion: Presenting hot deck imputation as an easy and effective tool for handling missing data. Communication Methods and Measures. 5, 4 (2011), 297–310. DOI: https://doi.org/10.1080/19312458.2011.624490.
[18] Plaia, A. and Bondì, A.L. 2006. Single imputation method of missing values in environmental pollution data sets. Atmospheric Environment. 40, 38 (2006), 7316–7330. DOI: https://doi.org/10.1016/j.atmosenv.2006.06.040.
[19] Ropper, A.H. et al. 2012. Hyperosmolar therapy for raised intracranial pressure. New England Journal of Medicine. 367, 26 (2012), 2554–2557. DOI: https://doi.org/10.1056/nejmc1212351.
[20] Schuetz, C.G. 2008. Using neuroimaging to predict relapse to smoking: role of possible moderators and mediators.
International Journal of Methods in Psychiatric Research. 17, Suppl 1 (2008), S78–S82. DOI: https://doi.org/10.1002/mpr.
[21] Sinharay, S. et al. 2001. The use of multiple imputation for the analysis of missing data. Psychological Methods. 6, 3 (2001), 317–329. DOI: https://doi.org/10.1037/1082-989x.6.4.317.
[22] Stekhoven, D.J. and Bühlmann, P. 2012. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics. 28, 1 (Jan. 2012), 112–118. DOI: https://doi.org/10.1093/bioinformatics/btr597.
[23] Tan, Y. et al. 2018. Probability matrix decomposition based collaborative filtering recommendation algorithm. Informatica (Slovenia). 42, 2 (2018), 265–271.
[24] Troyanskaya, O. et al. 2001. Missing value estimation methods for DNA microarrays. Bioinformatics. 17, 6 (Jun. 2001), 520–525. DOI: https://doi.org/10.1093/bioinformatics/17.6.520.
[25] Zhang, Z. 2016. Missing data imputation: Focusing on single imputation. Annals of Translational Medicine. 4, 1 (2016). DOI: https://doi.org/10.3978/j.issn.2305-5839.2015.12.38.