https://doi.org/10.31449/inf.v47i2.4155 Informatica 47 (2023) 275–284

Predicting the Usefulness of E-Commerce Products' Reviews Using Machine Learning Techniques

Dimple Chehal, Parul Gupta, Payal Gulati
Department of Computer Engineering, J.C. Bose University of Science and Technology, YMCA, Faridabad, India
E-mail: dimplechehal@gmail.com, parulgupta_gem@yahoo.com, gulatipayal@yahoo.co.in

Keywords: classification, e-commerce, machine learning, recommender system, usefulness, user reviews

Received: May 5, 2022

User-generated reviews are an essential component of e-commerce platforms. The presence of a large number of these reviews creates an information overload problem, making it difficult for other users to make their purchase decisions. A review voting mechanism, in which users can vote a review as helpful or not, addresses this issue. The helpful votes on a review reflect its usefulness to other users. As voting on usefulness is optional, not all reviews receive this vote. Furthermore, reviews posted recently are not yet associated with any vote(s). The aim of this paper is to predict the usefulness of user reviews through machine learning techniques. Using the Amazon product review dataset of cell phones, classification models are built on eight features and compared on seven performance measures. According to the results, all the classification models performed well except Linear Discriminant Analysis (LDA). The classification performance of Logistic Regression, Decision Tree, Random Forest, AdaBoost, and Gradient Boost was unaffected by feature selection or outlier removal. The performance of LDA improved after feature selection but decreased after outlier removal, whereas the Extra Trees (ET) and K Nearest Neighbors (KNN) classifiers improved in both cases.

Povzetek (translated from Slovenian): The use of machine learning techniques to predict useful e-commerce product reviews.

1 Introduction

Online consumer reviews have evolved into electronic word of mouth (eWoM) for e-commerce users and stakeholders [32], [30]. Product reviews comprise the detailed experiences of customers with products. They help consumers in their purchase decisions and indicate improvements required in product quality, thereby helping business organizations improve product sales. Mining customer reviews through sentiment analysis or topic modeling techniques reveals a customer's inclination towards a product. This helps in building the customer profile and understanding his/her preferences for unseen products. Many platforms, such as Amazon, Yelp, TripAdvisor, IMDB and Netflix, host a large number of user reviews [35]. However, the ever-growing number of products, customers and product reviews on e-commerce platforms has led to an information overload problem and made it infeasible for customers to browse all the product reviews. To overcome this problem, the option for customers to vote a review as helpful was introduced. While the rating of a product depicts a user's experience with the product, the votes gained by a review indicate its usefulness. A natural solution is therefore to browse user reviews in order of their helpfulness or usefulness. However, owing to factors such as the enormous volume of electronic word of mouth, the voluntary nature of the helpfulness voting mechanism, differing levels of visibility, and review recency, not all reviews receive this vote [5], [27]. Hence, the objective of this study is to categorize product reviews according to their usefulness.
This will not only help customers identify reviews as useful or useless even when a review has not gained any votes, but can also be fed as input to a recommender system for generating useful recommendations. Through this study, the following questions have been answered:

• Which is the most efficient machine learning algorithm for forecasting the usefulness of a product review?
• Which features help in determining the usefulness of a product review?

Answers to the above questions have been obtained with the help of the cell phone and accessories dataset taken from Amazon [3]. Eight different machine learning models, namely Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), AdaBoost (ADA), Gradient Boost (GB), Extra Trees (ET), K Nearest Neighbors (KNN) and Linear Discriminant Analysis (LDA), have been trained and tested on existing and derived features and evaluated on seven metrics: Area under the Curve (AUC), Accuracy (ACC), F1-score (F1), Precision (P), Recall (R), Matthews Correlation Coefficient (MCC) and Kappa score [19]. The best model has been fine-tuned for prediction of review usefulness. This study's contributions are stated as follows:

1. Features such as overall rating, user review, review summary, review votes, word count of review, character count of review, review's sentiment score and average word length of review have been used to predict a review's usefulness.
2. Along with the features already existing in the Amazon dataset, such as overall rating, user review, review summary and review votes, other features have been derived from the user review and used in combination as input to the prediction model.
3. This study enables customers to identify useful reviews, and e-commerce managers, merchants and retailers to improve the listing of product reviews based on review usefulness.

The rest of the study is structured as follows: Section 2 covers related work; Section 3 details the research methodology; Section 4 discusses the results of the experiments conducted on the dataset; and Section 5 concludes by discussing limitations and future work.

2 Related work

Online user testimonials have gained much-needed prominence in the literature, as they instill trust in other potential consumers in the online community [9], [17]. Product reviews can be viewed as a type of passive recommendation process, or as visibility of user sentiment for past purchases [12]. A critical management choice for policy-makers is how to manage user reviews so as to improve their efficacy. The academic evidence on review usefulness is largely driven and aided by review hosting platforms, which explicitly collect users' opinions on reviews' helpfulness. For instance, on Amazon, customers not only access the rating and text content of each user review, but also view the number of votes the review has obtained from fellow users and the number of helpful votes [35], [25]. Consumers benefit from informative reviews while making buying decisions. Some customers believe that strongly favorable and unfavorable reviews are useful, because such polarized accounts help to validate or invalidate purchase options. Others, on the contrary, find mixed reviews useful because they illustrate both the positives and negatives of the product under consideration.
The perceived importance of a review to the end user is also conveyed through the review's usefulness [28], [18]. This functionality, in particular, makes use of crowd-sourcing to assess the usefulness of reviews [6]. Every review includes the question, "Was this review helpful to you?", and customers who read the review may upvote or downvote it [9], [12]. The research on review usefulness is roughly classified into two categories: prediction-based techniques to ascertain a review's usefulness, and studies seeking to understand what makes a review useful. Machine learning classifiers, regression and deep learning approaches have been used to predict review helpfulness in the past [10], [8], [14], [16]. The review length, review timestamp, reviewer's expertise and manner of writing reviews have all been used previously to predict helpful reviews [5]. Early studies also identified review usefulness through indicators such as review length and review star rating [24]. Using the deviation from the mean review length of a product, along with the review's polarity and rating relative to the same user or the same item, to estimate review helpfulness helped in filtering high-quality ratings, thereby improving the collaborative item recommendation process [27]. Moderate-length reviews outperform brief and lengthy ones, as review length has an inverted-U-shaped impact on usefulness [15]. Further, the more a review matches the language style of the target user, the more readable it is said to be; as a result, readability is classified as a domain-specific indicator [22]. The semantic analysis of reviews comprises a wide range of methodologies that make use of structural characteristics, such as the count of product features cited in a review and its length [34]. The most useful reviews are said to be medium in length, have a lower score, and be negative or neutral in polarity [13]. Both critical reviews containing data on service failures and favorable reviews highlighting essential product functionalities, technical elements and aesthetics are seen as beneficial for usefulness prediction [1]. Besides the semantic aspect, neutral-polarity reviews are also regarded as useful [31]. The inclusion of adjectives, status and action verbs, as well as grammatical structure, are vital predictors of helpfulness, particularly when paired with factors such as review age, rating, readability and subjectivity [21]. Highly readable reviews have proven to be the most beneficial. Based on previously performed emotion-based analysis, it has been concluded that male readers were more inclined towards reviews with positive emotions, whereas female readers were more attracted to reviews with negative emotions. Previous findings also indicated that several features, such as the polarity of the review title, the sentiment and polarity of the review, and the cosine similarity between the product review and the product title, are contributing factors in determining the usefulness of user reviews [24]. As per the literature review, previous studies are deficient in combining natural language processing tools and machine learning techniques for the estimation of review usefulness. Considering the above, this study employs user voting as the target label to build the helpfulness or usefulness prediction system. The depiction of helpfulness votes differs across platforms: some platforms show the most helpful votes for a review, whereas others represent usefulness with the "X of Y" concept. In prior methods, a ratio of 0.6 was considered the helpfulness threshold in the "X of Y" approach of the consumer voting mechanism.
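For illustration, this "X of Y" labeling rule can be expressed in a few lines of code. The sketch below is not taken from any of the cited implementations, and the function name and signature are hypothetical; it simply applies the 0.6 helpfulness-ratio threshold mentioned above.

```python
# Hypothetical sketch of "X of Y" helpfulness labeling: a review voted
# helpful by X of its Y voters is labeled helpful when X / Y >= 0.6.
def is_helpful(helpful_votes: int, total_votes: int, threshold: float = 0.6) -> bool:
    """Label a review helpful when its helpfulness ratio meets the threshold."""
    if total_votes == 0:
        return False  # unvoted reviews cannot be labeled under this scheme
    return helpful_votes / total_votes >= threshold

print(is_helpful(5, 7))  # 5 of 7 helpful votes -> ratio ~0.71 -> True
```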
Review usefulness, in particular, is critical in product rankings and recommendations [12]. Prediction of review usefulness enables users to compose meaningful reviews and assists retailers in intelligent website management by guiding its users in their purchase decisions [24]. The incorporation of a usefulness estimation model can also increase the effectiveness of a collaborative filtering-based recommender system by optimizing the selection of data for user rating estimation, making it a great resource for identifying user reviews relevant for decision-making [27]. Table 1 lists the key takeaway points from the existing literature.

Table 1: Comparison of existing studies on identification of useful reviews

| S.No | Paper | Model | Dataset | Input features | Performance metrics | Key points | Classification/Regression |
|------|-------|-------|---------|----------------|---------------------|------------|---------------------------|
| 1 | [20] | MLP, CNN with TransE | Amazon dataset: CDs, Electronics, Video Games, Books | Product, Review, Reviewer features | Accuracy, F1-score | Dependence solely on hand-crafted features leads to poor accuracy. Along with CNN, another technique is required for mapping between the reviewer, product and reviews. | Regression |
| 2 | [7] | R: LNR; C: Log Reg; both: DT, RF, GBT, NN | Yelp shopping reviews | Product, Review, Reviewer features | RMSE, MSE, RAE, RSE, RRSE, MAE, R2 and CC (R); Accuracy, AUC, Precision, Recall and F1-score (C) | Authors examine the impact of friends on review usefulness by introducing social network features. For classification, reviews receiving more than 3 votes are marked as helpful, 0 votes as unhelpful, and discarded otherwise. | Classification, Regression |
| 3 | [11] | MLP, CNN | SiteJabber.com, ConsumerAffairs.com (domains: Dating, Wedding Dresses, Marketplace, Car Insurance, Travel Agencies, Mortgages) | Review features | Accuracy | Adjacent or neighbor reviews impact a user's helpfulness perception of a review. For classification, reviews receiving more than 2 helpful votes are labeled as helpful, and unhelpful otherwise. | Classification |
| 4 | [27] | Linear Support Vector Regression, RF Regression | Yelp hotel stores reviews, Yelp food stores reviews | Review features | Pearson and Spearman correlation values | Deviations in star ratings, review length and review polarity with respect to user and item impact usefulness. Authors do not consider reviewer features. Random Forest was the better helpfulness predictor. Integration of such an estimation model improves CF system performance. | Regression |
| 5 | [24] | Multivariate adaptive regression, 'C' and 'R' tree, RF, NN, deep NN | Amazon multidomain sentiment analysis dataset | Review, Reviewer, Product features | MSE, RMSE, RRSE | Review-type characteristics stand out as effective in determining a review's helpfulness compared to reviewer and product characteristics. Combining all three characteristics yields the best performance. | Regression |
| 6 | [2] | DT, RF | Amazon Product dataset (Books, Office Products) | Review, Reviewer features | Accuracy, F-measure | Helpfulness threshold ratio set to 0.6. Features such as text, reviewer and readability perform better than summary features. RF performed better than decision trees. | Classification |
| 7 | [26] | MLP, CART, Multivariate adaptive regression, Generalized Linear Model, Ensemble model | Contributed dataset of 34 product categories from Amazon.com | Review, Reviewer, Product features | MSE, RAE, RMSE, RRSE, MAE | The more the comments, polarity and sentiment in a review, the more helpful votes it receives. Reviews with at least 10 votes are selected. Best results were obtained using hybrid features, with the ensemble model performing best. | Regression |
| 8 | [34] | RF | Dataset from JD.com | Review features, informativeness, length | Accuracy, AUC | Classification thresholds for search and experience products should differ. A threshold of 4 for search products such as electronics and of 2 for experience products such as skin care gave the best model performance. | Classification |
In Table 1, AUC stands for Area Under the Curve, 'C' for classification, CNN for Convolutional Neural Network, CC for Correlation Coefficient, DT for Decision Tree, GBT for Gradient Boosted Tree, Log Reg for Logistic Regression, 'R' for regression, RAE for Relative Absolute Error, RF for Random Forest, RMSE for Root Mean Square Error, R2 for R-squared, RSE for Relative Squared Error, RRSE for Root Relative Squared Error, MAE for Mean Absolute Error, MLP for Multilayer Perceptron, MSE for Mean Squared Error and NN for Neural Network.

3 Research methodology

The review-based recommendation methods in the studied literature utilize review contents but do not incorporate the associated helpfulness or usefulness scores. Incorporating this information enables better exploitation of the user reviews. Since many reviews do not have helpfulness scores, it is essential to predict the usefulness of these reviews [16]. The steps undertaken as part of predicting the usefulness of user reviews are shown in Figure 1 and described below.

3.1 Data collection

Data collection and processing are the initial steps in all machine learning methods [36]. The Amazon cell phone and accessories dataset has been considered for this task [29], [3]. The dataset, of size (1048572, 12), has the columns described in Table 2; a sample record looks as follows:

    {
      "reviewerID": "A62MUEQU8I52E",
      "asin": "B007SJZUSI",
      "reviewerName": "H. Moyer",
      "vote": 3,
      "style": {"Color:": "Gold"},
      "reviewText": "Not a huge capacity power bank but a very good capacity for its very compact size. Exactly what I need, to have with me all of the time, just in case. One micro USB power input port for charging it, and one standard USB port for charging another device, either one using the same most standard cable in the industry. For most of us, power banks are for emergency only, so multiple output ports just add size unnecessarily. One state of charge gauge with 4 LED indicator lights, and one pushbutton. Very simple.",
      "overall": 5.0,
      "summary": "SIMPLE, COMPACT, AND POWERFUL FOR ITS SIZE",
      "unixReviewTime": 1490659200.0,
      "reviewTime": "03 28, 2017",
      "image": nan,
      "verified": True
    }

Table 2: Dataset description

| Column name | Column description |
|-------------|--------------------|
| reviewerID | The reviewer's unique identifier, e.g., A284QS51P9P9V1 |
| asin | The product's unique identifier, e.g., B00UVSNVHA |
| reviewerName | The name of the user/reviewer |
| vote | The count of helpful votes received by a review |
| style | A dictionary of the product metadata |
| reviewText | The text contained in the review |
| overall | The star rating given to a product |
| summary | The textual summary of a product review |
| unixReviewTime | The time at which the review was generated (Unix time) |
| reviewTime | The time at which the review was generated (raw) |
| image | The product images that users post when they review the product |
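A minimal sketch of this data collection step is given below. It assumes the gzipped JSON-lines distribution of the dataset available at the URL in [3]; the file name and the pandas-based loading are illustrative choices, not taken from the paper.

```python
# Sketch: load the Amazon "Cell Phones and Accessories" reviews, one JSON
# object per line, into a pandas DataFrame (file name is an assumption).
import gzip
import json

import pandas as pd

def load_reviews(path: str) -> pd.DataFrame:
    """Parse one JSON review per line into a DataFrame."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh]
    return pd.DataFrame(records)

df = load_reviews("Cell_Phones_and_Accessories.json.gz")
# Keep only the columns used in this study (see Section 3.2).
df = df[["overall", "reviewText", "summary", "vote"]]
```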
Figure 1: Research method

3.2 Pre-processing

The dataset consists of 12 columns and 1048572 rows. In order to categorize the reviews as useful or useless, the following pre-processing steps have been undertaken:

1. Out of the 12 columns available, only the columns overall, reviewText, summary and vote have been utilized.
2. The reviewText column has been converted to lower case and punctuation has been removed.
3. After performing the feature engineering steps mentioned below, stop words have been removed using Python's nltk library.
4. Step 3 has been followed by stemming, in which the Porter stemmer has been applied to the reviewText column.

3.3 Feature engineering

Apart from the columns considered during the pre-processing phase, the below mentioned columns have been derived:

1. Word count: the number of words in a review.
2. Char count: the number of characters in a review.
3. Avg word count: the average word length of a review.
4. Sentiment score: the polarity of a review, ranging from minus one (extremely negative) to plus one (extremely positive), determined with the help of Python's vaderSentiment library.

3.4 Preliminary analysis

1. The top ten most frequently occurring words after removal of stop words from the dataset are shown in Table 3:

Table 3: Top ten frequently occurring words

| Word | Frequency |
|------|-----------|
| phone | 165691 |
| case | 117779 |
| one | 62104 |
| screen | 57831 |
| like | 51122 |
| use | 43841 |
| great | 39611 |
| battery | 39595 |
| would | 38616 |
| good | 37078 |

As seen in Table 3, since the dataset is related to cell phones, the top ten frequently occurring words are related to this domain: users have provided reviews mostly mentioning the phone, case, screen and battery. To obtain these words, the frequency of each word in the user reviews was computed and the top ten words were extracted.

2. The ten least frequently occurring words in the dataset, each with only a single occurrence, are: Performancebattery, gummybearlike, amazonsunvalleytek, knive, terd, hh, nomy, 4siphone, Loosey, caseseems.

3. The distribution of the overall ratings provided by users is given in Table 4. The review dataset contains a majority of user reviews with the highest product rating, 5, followed by rating 4. The dataset contains more one-star ratings than three-star and two-star ratings.

Table 4: Distribution of user ratings in the dataset

| Rating | Count | Percentage (%) |
|--------|-------|----------------|
| 5 | 49894 | 55.02 |
| 4 | 15243 | 16.81 |
| 3 | 8094 | 8.93 |
| 2 | 5509 | 6.08 |
| 1 | 11942 | 13.17 |

Supervised learning algorithms require input and output examples for training the model. In order to predict review usefulness, a target column has been added that identifies each review as useful or not: to help the classification models learn whether a review is useful or useless, all reviews with more than 10 votes have been marked as useful, and useless otherwise. A sketch of these steps is given below.
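The following sketch pulls the pre-processing, feature engineering and labeling steps together, assuming `df` is the DataFrame from Section 3.1. The paper does not publish its implementation, so the derived column names and minor details (e.g., coercing the vote column to numeric) are illustrative assumptions.

```python
# Sketch of Sections 3.2-3.4: clean text, derive features, create the label.
import string

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

nltk.download("stopwords", quiet=True)

# 3.2 (steps 1-2): lower-case the review text and strip punctuation.
df["reviewText"] = df["reviewText"].fillna("").str.lower()
df["reviewText"] = df["reviewText"].str.translate(
    str.maketrans("", "", string.punctuation))

# 3.3: derived features, computed before stop-word removal and stemming.
analyzer = SentimentIntensityAnalyzer()
df["word_count"] = df["reviewText"].str.split().str.len()
df["char_count"] = df["reviewText"].str.len()
df["avg_word_len"] = df["char_count"] / df["word_count"].clip(lower=1)
df["sentiment"] = df["reviewText"].map(
    lambda t: analyzer.polarity_scores(t)["compound"])  # in [-1, +1]

# 3.2 (steps 3-4): remove stop words, then apply the Porter stemmer.
stop = set(stopwords.words("english"))
stemmer = PorterStemmer()
df["reviewText"] = df["reviewText"].map(
    lambda t: " ".join(stemmer.stem(w) for w in t.split() if w not in stop))

# 3.4: target label -- more than 10 helpful votes => useful (1), else 0.
df["vote"] = pd.to_numeric(df["vote"], errors="coerce").fillna(0)
df["useful"] = (df["vote"] > 10).astype(int)
```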
3.5 Model selection

Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), AdaBoost (ADA), Gradient Boost (GB), Extra Trees (ET), K Nearest Neighbors (KNN) and Linear Discriminant Analysis (LDA) were used to categorize the usefulness of user reviews [23], [4]. All the models have been implemented in Python using the pycaret library.

3.6 Data setup

Classification estimators were used in this study to predict the usefulness of user reviews. The target type is binary, with the two possible values useful and useless. The data has been partitioned 70:30 into training and testing sets. To allow row shuffling during the train-test split, the data split shuffle was set to true. The predictive models' performance was evaluated using stratified ten-fold cross-validation.
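A minimal sketch of this setup with pycaret is shown below; it assumes the pycaret 2.x classification API, and the `session_id` is an illustrative choice for reproducibility.

```python
# Sketch of Sections 3.5-3.6: 70:30 shuffled split, stratified ten-fold CV,
# and comparison of the eight classifiers on the metrics of Section 4.
from pycaret.classification import compare_models, setup

clf = setup(
    data=df,                    # predictors plus the 'useful' target
    target="useful",            # binary: useful vs. useless
    train_size=0.7,             # 70:30 train-test partition
    data_split_shuffle=True,    # shuffle rows during the split
    fold_strategy="stratifiedkfold",
    fold=10,                    # stratified ten-fold cross-validation
    session_id=42,              # assumed seed for reproducibility
)

best = compare_models(
    include=["lr", "dt", "rf", "ada", "gbc", "et", "knn", "lda"])
```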
4 Result and discussion

Usefulness is treated as the dependent variable, while overall, reviewText, summary, vote, word count, character count, average word length and sentiment score are treated as independent variables. A model's performance can be assessed using a variety of evaluators, some of which are more appropriate than others. The models have been assessed in terms of accuracy (calculated using (1)), area under the curve, precision (calculated using (2)), recall (calculated using (3)), F1-score (calculated using (4)), kappa score (calculated using (5)) and MCC (calculated using (6)), as shown in Table 5. The table also displays the time taken (TT), in seconds, to train each model.

• Accuracy: the most widely used performance metric, calculated as the number of correct predictions over all predictions [33]:

  Accuracy = ((TP + TN) / (TP + TN + FP + FN)) × 100    (1)

where TP stands for true positive, TN for true negative, FP for false positive and FN for false negative.

• Area under the Curve: the Receiver Operating Characteristic curve plots sensitivity against (1 − specificity); AUC summarizes this curve as a single numeric value. Its ranges are commonly interpreted as excellent (1 to 0.90), good (0.90 to 0.80), fair (0.80 to 0.70), poor (0.70 to 0.60) and fail (0.60 to 0.50).

• Precision:

  Precision = TP / (TP + FP)    (2)

• Sensitivity: the ratio of actually positive instances that are identified correctly; also known as the true positive rate or recall:

  Sensitivity = TP / (TP + FN)    (3)

• F1 Score: an accuracy metric that considers the trade-off between precision and recall:

  F1 Score = 2 × (Precision × Recall) / (Precision + Recall)    (4)

• Kappa: handles multi-class as well as imbalanced-class problems:

  Kappa = (p_o − p_e) / (1 − p_e)    (5)

where p_o and p_e denote the observed and expected agreement, respectively. In general, kappa reflects how a classifier performs compared to one that simply guesses at random according to each class's frequency. Cohen's kappa is never greater than 1, and a value of zero means the classifier is useless.

• Matthews Correlation Coefficient (MCC): assesses the quality of a binary classification and is a balanced measure even for unbalanced datasets. It outputs a value between minus one and plus one, where plus one indicates complete agreement between predicted and observed values, minus one indicates total disagreement, and zero indicates random predictions [33]:

  MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (6)

Table 5: Performance of ML models

| Model | Accuracy | AUC | Recall | Precision | F1-Score | Kappa | MCC | TT (sec) |
|-------|----------|-----|--------|-----------|----------|-------|-----|----------|
| LR | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 11.5 |
| DT | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.19 |
| RF | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 2.47 |
| ADA | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.23 |
| GB | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 6.48 |
| ET | 0.9657 | 0.9995 | 0.9997 | 0.9612 | 0.98 | 0.8596 | 0.8683 | 9.14 |
| KNN | 0.9263 | 0.9471 | 0.9912 | 0.9263 | 0.9576 | 0.6762 | 0.7004 | 1.92 |
| LDA | 0.5788 | 0.517 | 0.6078 | 0.6427 | 0.6233 | 0.3348 | 0.3439 | 27.73 |

As shown in Table 5 and Figure 2, most of the classification models perform decently when contrasted on the evaluation parameters. In order to test the models' robustness, ten-fold cross-validation has been employed. Due to a lack of sufficient system RAM, the models have been fed a random sample of 5000 rows, leading to the above performance; the near-perfect performance of these models can be attributed to the size of the data being fed to them. Also, the methods' black-box nature diminishes the interpretability of the results. In comparison to the others, LDA is unable to provide a reasonable prediction. The Decision Tree model takes the least amount of time, 0.19 seconds, to generate the above results. The models have been trained again after performing feature selection and outlier removal to check for improvement in their performance.

Figure 2: Performance of ML models (bar chart of accuracy, AUC, recall, precision, F1-score, kappa and MCC for LR, DT, RF, ADA, GB, ET, KNN and LDA)

In order to improve the performance and reduce the training time of the above models, feature selection has been performed. Upon performing feature selection, the accuracy of the LDA model jumps to 0.8411 and its AUC increases to 0.732, while its recall, precision, F1-score, kappa and MCC turn out to be 0.892, 0.842, 0.866, 0.638 and 0.658, respectively. The threshold value used for feature selection is set to 0.8, and the classic method of permutation feature importance is used. Even after performing feature selection, the performance of the LR, DT, RF, ADA and GB classifiers remains unaffected, as shown in Table 6. As seen from Table 5 and Table 6, the training time of most models reduced: LR reduced to 6.32 from 11.5 seconds (without feature selection), DT remained the same at 0.19, ADA remained the same at 0.23, ET reduced to 9.05 from 9.14, KNN to 1.90 from 1.92 and LDA to 26.75 from 27.73 seconds. Only two models, RF and GB, had their training time increase, to 2.62 from 2.47 and to 6.51 from 6.48 seconds, respectively; this increase is marginal.
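In pycaret (2.x API assumed), feature selection of this kind can be toggled directly in the setup call; the following sketch shows the reported 0.8 threshold with the classic permutation-importance method.

```python
# Sketch: re-run the comparison with pycaret's built-in feature selection.
from pycaret.classification import compare_models, setup

clf_fs = setup(
    data=df,
    target="useful",
    train_size=0.7,
    data_split_shuffle=True,
    fold_strategy="stratifiedkfold",
    fold=10,
    feature_selection=True,              # enable feature selection
    feature_selection_method="classic",  # permutation feature importance
    feature_selection_threshold=0.8,     # threshold used in this study
    session_id=42,
)
best_fs = compare_models(
    include=["lr", "dt", "rf", "ada", "gbc", "et", "knn", "lda"])
```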
Table 6: Performance of ML models after feature selection

| Model | Accuracy | AUC | Recall | Precision | F1-Score | Kappa | MCC | TT (sec) |
|-------|----------|-----|--------|-----------|----------|-------|-----|----------|
| LR | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 6.32 |
| DT | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.19 |
| RF | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 2.62 |
| ADA | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.23 |
| GB | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 6.51 |
| ET | 0.9634 | 0.99 | 0.99 | 0.959 | 0.979 | 0.848 | 0.858 | 9.05 |
| KNN | 0.9729 | 0.991 | 0.996 | 0.973 | 0.984 | 0.892 | 0.895 | 1.90 |
| LDA | 0.8411 | 0.732 | 0.892 | 0.842 | 0.866 | 0.638 | 0.658 | 26.75 |

Table 7 presents the performance of the classifiers after removal of outliers. Outliers in the training data have been removed using Singular Value Decomposition, with the outlier threshold set to 0.05; that is, five percent of the outliers have been removed from the training dataset. Again, the performance of the LR, DT, RF, ADA and GB classifiers remained unaffected. While the accuracy of the ET and KNN classifiers increased, that of LDA decreased significantly. This implies that the ET, KNN and LDA classifiers are affected by the removal of outliers, whereas the rest of the classifiers are not affected by this processing step.

Table 7: Performance of ML models after outlier removal

| Model | Accuracy | AUC | Recall | Precision | F1-Score | Kappa | MCC | TT (sec) |
|-------|----------|-----|--------|-----------|----------|-------|-----|----------|
| LR | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 6.25 |
| DT | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.18 |
| RF | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 2.37 |
| ADA | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.22 |
| GB | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 6.42 |
| ET | 0.9759 | 0.9999 | 0.99 | 0.9726 | 0.9861 | 0.8969 | 0.9019 | 6.43 |
| KNN | 0.9801 | 0.9938 | 0.9979 | 0.9793 | 0.9885 | 0.9168 | 0.9192 | 1.83 |
| LDA | 0.7523 | 0.7173 | 0.7673 | 0.8541 | 0.806 | 0.4644 | 0.4877 | 23.48 |

As shown in Table 5, Table 6 and Table 7, LR, DT, RF, ADA and GB perform perfectly on the sample dataset provided to the models, both with and without feature selection and outlier removal. The LDA model showed a performance improvement after feature selection but a degradation after outlier removal, while the accuracy of the ET and KNN models improved after the removal of outliers. The sketch below shows how outlier removal can be enabled in the same setup.
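Under the same pycaret 2.x assumption, outlier removal is a setup flag; the sketch below applies the 0.05 threshold reported above.

```python
# Sketch: re-run the comparison with SVD-based outlier removal enabled.
from pycaret.classification import compare_models, setup

clf_out = setup(
    data=df,
    target="useful",
    train_size=0.7,
    data_split_shuffle=True,
    fold_strategy="stratifiedkfold",
    fold=10,
    remove_outliers=True,       # drop training outliers before fitting
    outliers_threshold=0.05,    # remove five percent of training outliers
    session_id=42,
)
best_out = compare_models(
    include=["lr", "dt", "rf", "ada", "gbc", "et", "knn", "lda"])
```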
5 Limitations and future work

In this study, review usefulness prediction models were built and compared using collected features from the publicly available Amazon cell phone and accessories dataset, such as overall, reviewText, summary and vote, as well as derived features such as word count, character count, average word count and sentiment score. Seven performance measures, namely accuracy, area under the curve, precision, recall, F1-score, Kappa score and MCC, were used to compare eight machine learning models: Logistic Regression, Decision Tree, Random Forest, AdaBoost, Gradient Boost, Extra Trees, K Nearest Neighbors and Linear Discriminant Analysis. All the classification models performed well except LDA. Feature selection and outlier removal techniques had no effect on the classification performance of Logistic Regression, Decision Tree, Random Forest, AdaBoost and Gradient Boost. The performance of LDA improved after feature selection but decreased after outlier removal, whereas ET and KNN showed improvement in both cases.

The results of this research can help e-commerce platforms gain more clarity on the usefulness of online reviews: they can automatically analyze the usefulness of product reviews by utilizing prediction models as stated above. A system that uses ML models to predict useful reviews will benefit all stakeholders, including end users, product owners and e-commerce platform regulators. Such a system would be especially beneficial where a review has received no votes; in that instance, stakeholders might utilize the models' predictions to find reviews of interest or usefulness. This would ultimately save a significant amount of time otherwise spent going through the enormous number of available user reviews. This study was limited by a lack of sufficient system RAM, so the models were fed a random sample of 5000 rows. Also, the methods' black-box nature diminishes the interpretability of the results. The study can be strengthened by improving the prediction models through the removal of fake reviews, incorporating emoticons into online review helpfulness prediction, employing unsupervised instead of supervised learning techniques, and developing deep learning models.

Availability of data

The dataset is available through the URL: https://jmcauley.ucsd.edu/data/amazon/

References

[1] Ahmad, S.N. and Laroche, M. 2017. Analyzing electronic word of mouth: A social commerce construct. International Journal of Information Management. 37, 3 (Jun. 2017), 202–213. DOI: https://doi.org/10.1016/j.ijinfomgt.2016.08.004.
[2] Akbarabadi, M. and Hosseini, M. 2020. Predicting the helpfulness of online customer reviews: The role of title features. International Journal of Market Research. 62, 3 (2020), 272–287. DOI: https://doi.org/10.1177/1470785318819979.
[3] Amazon Review Data: 2018. https://jmcauley.ucsd.edu/data/amazon/. Accessed: 2021-05-14.
[4] Ampomah, E.K. et al. 2021. Stock market decision support modeling with tree-based AdaBoost ensemble machine learning models. Informatica. 44, 4 (Mar. 2021), 477–489. DOI: https://doi.org/10.31449/inf.v44i4.3159.
[5] Arif, M. et al. 2019. A survey of customer review helpfulness prediction techniques. Advances in Intelligent Systems and Computing. Springer International Publishing. 215–226.
[6] Bilal, M. et al. 2019. Profiling and predicting the cumulative helpfulness (quality) of crowd-sourced reviews. Information (Switzerland).
[7] Bilal, M. et al. 2021. Profiling reviewers' social network strength and predicting the "helpfulness" of online customer reviews. Electronic Commerce Research and Applications. 45, (Jan. 2021), 101026. DOI: https://doi.org/10.1016/j.elerap.2020.101026.
[8] Chen, C. et al. 2019. Multi-domain gated CNN for review helpfulness prediction. The Web Conference 2019 - Proceedings of the World Wide Web Conference, WWW 2019. (2019), 2630–2636. DOI: https://doi.org/10.1145/3308558.3313587.
[9] Chua, A.Y.K. and Banerjee, S. 2016. Helpfulness of user-generated reviews as a function of review sentiment, product type and information quality. Computers in Human Behavior. 54, (2016), 547–554. DOI: https://doi.org/10.1016/j.chb.2015.08.057.
[10] Dey, D. and Kumar, P. 2019. A novel approach to identify the determinants of online review helpfulness and predict the helpfulness score across product categories. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Springer. 365–388.
[11] Du, J. et al. 2021. Neighbor-aware review helpfulness prediction. Decision Support Systems. April (2021), 113581. DOI: https://doi.org/10.1016/j.dss.2021.113581.
[12] Enamul Haque, M. et al. 2018. Helpfulness prediction of online product reviews. Proceedings of the ACM Symposium on Document Engineering 2018, DocEng 2018. (2018). DOI: https://doi.org/10.1145/3209280.3229105.
[13] Eslami, S.P. et al. 2018. Which online reviews do consumers find most helpful? A multi-method investigation. Decision Support Systems. 113, (Sep. 2018), 32–42. DOI: https://doi.org/10.1016/j.dss.2018.06.012.
[14] Fan, M. et al. 2019. Product-aware helpfulness prediction of online reviews. The Web Conference 2019 - Proceedings of the World Wide Web Conference, WWW 2019. 2, Ccl (2019), 2715–2721. DOI: https://doi.org/10.1145/3308558.3313523.
[15] Fink, L. et al. 2018. Longer online reviews are not necessarily better. International Journal of Information Management. 39, (Apr. 2018), 30–37. DOI: https://doi.org/10.1016/j.ijinfomgt.2017.11.002.
[16] Ge, S. et al. 2019. Helpfulness-aware review based neural recommendation. CCF Transactions on Pervasive Computing and Interaction. 1, 4 (Dec. 2019), 285–295. DOI: https://doi.org/10.1007/s42486-019-00023-0.
[17] Hamad et al. 2018. Review helpfulness as a function of linguistic indicators. IJCSNS International Journal of Computer Science and Network Security. 18, 1 (2018), 234–240.
[18] Hong, H. et al. 2017. Understanding the determinants of online review helpfulness: A meta-analytic investigation. Decision Support Systems. 102, (2017), 1–11. DOI: https://doi.org/10.1016/j.dss.2017.06.007.
[19] Kaddoura, S. et al. 2022. A systematic review on machine learning models for online learning and examination systems. PeerJ Computer Science. 8, (May 2022), e986. DOI: https://doi.org/10.7717/peerj-cs.986.
[20] Kong, L. et al. 2022. Predicting product review helpfulness - A hybrid method. IEEE Transactions on Services Computing. 15, 4 (2022), 2213–2225. DOI: https://doi.org/10.1109/TSC.2020.3041095.
[21] Krishnamoorthy, S. 2015. Linguistic features for review helpfulness prediction. Expert Systems with Applications. 42, 7 (May 2015), 3751–3759. DOI: https://doi.org/10.1016/j.eswa.2014.12.044.
[22] Liu, A.X. et al. 2019. It's not just what you say, but how you say it: The effect of language style matching on perceived quality of consumer reviews. Journal of Interactive Marketing. 46, (May 2019), 70–86. DOI: https://doi.org/10.1016/j.intmar.2018.11.001.
[23] Luo, Y. and Xu, X. 2019. Predicting the helpfulness of online restaurant reviews using different machine learning algorithms: A case study of Yelp. Sustainability (Switzerland). 11, 19 (2019). DOI: https://doi.org/10.3390/su11195254.
[24] Malik, M.S.I. 2020. Predicting users' review helpfulness: the role of significant review and reviewer characteristics. Soft Computing. 24, 18 (Sep. 2020), 13913–13928. DOI: https://doi.org/10.1007/s00500-020-04767-1.
[25] Malik, M.S.I. and Hussain, A. 2018. An analysis of review content and reviewer variables that contribute to review helpfulness. Information Processing and Management. 54, 1 (2018), 88–104. DOI: https://doi.org/10.1016/j.ipm.2017.09.004.
[26] Malik, M.S.I. and Hussain, A. 2020. Exploring the influential reviewer, review and product determinants for review helpfulness. Artificial Intelligence Review. 53, 1 (2020), 407–427. DOI: https://doi.org/10.1007/s10462-018-9662-y.
[27] Mauro, N. et al. 2021. User and item-aware estimation of review helpfulness. Information Processing and Management.
[28] Mitra, S. and Jenamani, M. 2021. Helpfulness of online consumer reviews: A multi-perspective approach. Information Processing and Management. 58, 3 (2021), 102538. DOI: https://doi.org/10.1016/j.ipm.2021.102538.
[29] Ni, J. et al. 2020. Justifying recommendations using distantly-labeled reviews and fine-grained aspects.
EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference. (2020), 188–197. DOI: https://doi.org/10.18653/v1/d19-1018.
[30] Orimaye, S.O. et al. 2016. Learning sentiment dependent Bayesian network classifier for online product reviews. Informatica (Slovenia). 40, 2 (2016), 225–235.
[31] Salehan, M. and Kim, D.J. 2016. Predicting the performance of online consumer reviews: A sentiment mining approach to big data analytics. Decision Support Systems. 81, (Jan. 2016), 30–40. DOI: https://doi.org/10.1016/j.dss.2015.10.006.
[32] Saumya, S. et al. 2018. Ranking online consumer reviews. Electronic Commerce Research and Applications. 29, (2018), 78–89. DOI: https://doi.org/10.1016/j.elerap.2018.03.008.
[33] Sidhu, R.K. et al. 2020. Machine learning based crop water demand forecasting using minimum climatological data. Multimedia Tools and Applications. 79, 19–20 (2020), 13109–13124. DOI: https://doi.org/10.1007/s11042-019-08533-w.
[34] Sun, X. et al. 2019. Helpfulness of online reviews: Examining review informativeness and classification thresholds by search products and experience products. Decision Support Systems. 124, (Sep. 2019), 113099. DOI: https://doi.org/10.1016/j.dss.2019.113099.
[35] Wu, J. 2017. Review popularity and review helpfulness: A model for user review effectiveness. Decision Support Systems. 97, (2017), 92–103. DOI: https://doi.org/10.1016/j.dss.2017.03.008.
[36] Yenkikar, A. et al. 2022. Semantic relational machine learning model for sentiment analysis using cascade feature selection and heterogeneous classifier ensemble. PeerJ Computer Science. 8, (Sep. 2022), e1100. DOI: https://doi.org/10.7717/peerj-cs.1100.