UPORABNA INFORMATIKA 83 2021 - πtevilka 2 - letnik XXIX zNANStVENI prISpEVkI LeaRNING-BaSed LINk PRedICTION ANAL YSIS FOR F ACeBOOK100 NeTWORK T im Poštuvan 1 , 2 , Semir Salkić 1 , Lovro Šubelj 1 1 University of Ljubljana, Faculty of Computer and Information Science, V ečna pot 113, 1000 Ljubljana, Slovenia 2 University of Ljubljana, Faculty of Mathematics and Physics, Jadranska ulica 19, 1000 Ljubljana, Slovenia tim.postuvan@gmail.com, ss9343@student.uni-lj.si, lovro.subelj@fri.uni-lj.si Izvleček Facebook je ena izmed najzanimivejših in najširše uporabljenih socialnih in medijskih platform. Njegovi podatki so konkretno vplivali na raziskovanje socialnih omrežij in na tehnike napovedovanja povezav , ki so nepogrešljive pri analizi omrežij. Članek prvi natančno analizira napovedovanja povezav na Facebook100 omrežju. Na različnih množic značilk preučimo zmogljivost večih algoritmov stroj- nega učenja. Za pridobitev značilk uporabimo vložitve omrežij in tehnike, ki temeljijo na topologiji omrežja, kot so node2vec in vektor metrik podobnosti. Poleg tega uporabimo tudi značilke vozlišč, ki so na voljo za Facebook 100 omrežje, a le redko najdene pri drugih naborih podatkov . Razložimo tudi uporabljene pristope in nazorno predstavimo rezultate. Na koncu primerjamo naše modele, kjer podamo njihove končne zmogljivosti in klasifikacijske natančnosti. Ključne besede: napovedovanje povezav , socialna omrežja, klasifikacija, nadzorovano učenje, izbira značilk Abstract In social network science, Facebook is one of the most interesting and widely used social networks and media platforms. Its data has significantly contributed to the evolution of social network research and link prediction techniques, which are important tools in link mining and analysis. This paper gives the first comprehensive analysis of link prediction on the Facebook100 network. W e stu- dy performance and evaluate multiple machine learning algorithms on different feature sets. T o derive the features, we use network embeddings and topology-based techniques such as node2vec and vectors of similarity metrics. In addition, we also employ node- -based features, which are available for the Facebook100 network, though rarely found in other datasets. The adopted approaches are discussed and results are clearly presented. Lastly , we compare and review the applied models, where overall performance and classification rates are presented. Keywords: Link prediction, social networks, classification, supervised learning, feature selection 1 INTROdUCTION Social networks became an important focus in our research. We are witnessing exponential user expan- sion on social platforms (e.g. Facebook, Twitter and LinkedIn). People are joining these platforms and generating substantial amounts of data, which can reveal interesting clues about user behavior, socie- ty and psychology. We can see significant increase in research on the topic of social networks and link prediction. In the last decade we are experiencing a rise in this overlapping topic of network science and data science, which is primarily used to analyze and understand social networks [Wang et al., 2014]. Ac- cording to Facebook estimations in March 2020, they have 1.73 billion DAU (Daily Active Users). Much work has been done to understand complexities and challenges of social networks, where considerable knowledge has been obtained ([Pow et al., 2012], [Scott, 1988]). In that manner we are using the Fa- cebook100 dataset (2005), which includes a complete set of people from Facebook networks for 100 diffe- rent colleges and universities in the USA. This paper UPORABNA INFORMATIKA 84 2021 - πtevilka 2 - letnik XXIX Tim Poštuvan, Semir Salkić, Lovro Šubelj: LEARNING-BASED LINK PREDICTION ANAL YSIS FOR FACEBOOK100 NETWORK is the first comprehensive study of link prediction te- chniques applied to the Facebook100 dataset. We are using a learning-based approach with local probabi- listic models such as logistic regression. Besides that, we are utilizing standardized link prediction classi- fication models such as: random forests, artificial ne- ural networks and kernel-based models (e.g. SVM). Having in mind that we have two main approaches to link prediction, we chose learning-based approa- ches instead of similarity-based approaches because we want to compare performance, stability and clas- sification accuracy of multiple classification models instead of showcasing different measures of node proximity. We are using standardized similarity-ba- sed methods to derive notable features, which are used as model input. The presented methods are no- de2vec graph embedding and an ensemble of topolo- gy-based metrics such as Jaccard Coefficient, Adamic Adar Coefficient and Preferential Attachment Index. Using these methods, we build classification models which are clearly presented, focusing on performan- ce and generalization, with appropriate discussion and results. The paper is structured as follows. In section 2 we present related work in the field of link prediction on social networks. Section 3 briefly describes the Face- book100 data. In sections 4 and 5 we present used feature sets and datasets, which are utilized for tra- ining and testing. Process and discussion of feature selection for each dataset are stated in the section 6. Sections 7 and 8 give an overview of achieved results, while also providing comprehensive model analysis. Insightful discussion of results which explains intere- sting nature of the data is explained in section 9. We conclude the paper with an overview in section 10. 2 ReLA TeD WORK Link prediction has recently become very popular for prediction of future relationships between indivi- duals of social networks. Consequently, a great varie- ty of different approaches were invented. In the past decade, many efforts have been made by psychologi- sts, computer scientists, physicists and economists to solve the link prediction problem in social networks. According to Wang et al. [Wang et al., 2014] there are two ways to predict links: similarity-based approa- ches and learning-based approaches. Similarity-ba- sed approaches calculate a similarity score for every pair of nodes, where higher score means higher pro- bability that the corresponding nodes will be connec- ted in the future. Learning-based approaches are treating the link prediction problem as a binary classification task [Hasan et al., 2006]. Therefore, typical machine le- arning models can be employed for solving the pro- blem. These include classifiers like random forest [Bre- iman, 2001], multilayer perceptron or support vector machine (SVM) [Cortes and Vapnik, 1995], as well as probabilistic models. The learning-based approaches use non–connected pairs of nodes as instances with features describing nodes and the class label. Pairs of nodes which have potential to become connected are labeled as positive and the others as negative. Their feature set consists of similarity features from the similarity-based approaches and features derived from domain knowledge (e.g. textual information about members of social networks). Using combina- tion of both can remarkably improve the link predic- tion performance. Scellato et al. [Scellato et al., 2011] considered social features, place features and global features in location-based social networks for link pre- diction based on a supervised learning framework. Both types of approaches rely on various metrics, which use information of nodes, topology of network and social theory to calculate similarity between a pair of nodes. Metrics consist of three categories: node-ba- sed, topology-based and social theory-based metrics. Node-based metrics use the attributes and actions of individuals to assess similarity of node pairs. They are very useful in link prediction; however, it is usu- ally hard to get the data because of privacy issues. Most metrics are based on the topological infor- mation and are called topology-based metrics. They are most commonly used for prediction, because they are generic and domain independent. Topology- -based metrics are further divided into the following subcategories: neighbor-based, path-based and ran- dom walk-based metrics. Neighbour based metrics assume that people tend to form new relationships with people that are closer to them. The most famo- us are Common Neighbors [Newman, 2001], Jaccard Coefficient [Salton and McGill, 1986], Adamic Adar Coefficient [Adamic and Adar, 2003] and Preferenti- al Attachment Index [Barabási et al., 2002]. The first three all use the same idea that two nodes are more likely to be connected if they share a lot of common neighbours. On the other hand, Preferential Atta- chment Index assumes that nodes with higher de- UPORABNA INFORMATIKA 85 2021 - πtevilka 2 - letnik XXIX Tim Poštuvan, Semir Salkić, Lovro Šubelj: LEARNING-BASED LINK PREDICTION ANAL YSIS FOR FACEBOOK100 NETWORK gree have higher probability of forming new edges. Neighbor-based metrics capture local neighbor- hood but do not consider how nodes are reachable from one another. Path-based metrics incorporate this information by considering paths between no- des. They are more suitable to small networks and are not scalable to big networks. Examples of path- -based metrics are Local Path [Lü et al., 2009] and Katz metric [Katz, 1953]. Local Path metric makes use of information of local paths with length two and three, while giving more importance to the paths of length two. Katz metric calculates the similarity by summing all the paths connecting the two nodes, giving higher weight to shorter paths. Social interactions between members of social ne- tworks can also be modeled by random walk, which uses transition probabilities from a node to its neigh- bors to denote the destination of a random walker from the current node. Examples of random walk-ba- sed metrics are Hitting Time and SimRank [Jeh and Wi- dom, 2002]. Hitting time metric calculates similarity based on the expected number of steps required for a random walk starting at a node to reach the other node. SimRank metric computes similarity according to the assumption that two nodes are alike if they are connected to structurally similar nodes. Social theory-based metrics take advantage of clas- sical social theories, such as community, triadic closu- re, strong and weak ties and homophily, improving performance by capturing additional social interacti- on information. Liu et al. [Liu et al., 2013] proposed a link prediction model based on weak ties and degree, closeness and betweenness node centralities. When designing a feature set, the choice of features tremendously influences the performance of link pre- diction. Sometimes is it hard to find appropriate fe- atures, hence it is desirable that an algorithm learns important features on its own. Network embedding methods aim at learning low-dimensional latent representation of nodes in a network. Embeddings should follow the principle that similar nodes in the network have similar embedding representations. The advantage of node embedding as a technique is enormous since it does not require feature enginee- ring by domain experts. Network embeddings me- thods can be broadly categorized into four classes: methods based on random walks, matrix factoriza- tion, neural networks, and probabilistic approaches. For the purpose of this paper methods based on ran- dom walks are the most relevant. Methods based on random walks determine si- milarities using random walks on the original net- work. The Skip–Gram model, described in Mikolov et al. [Mikolov et al., 2013], is then usually used to generate node embeddings from the random walks. Examples of such methods are DeepWalk [Perozzi et al., 2014] and node2vec [Grover and Leskovec, 2016]. DeepWalk was the first technique for network em- beddings, inspired by deep learning. It uses random walks with fixed transition probabilities to measure node similarity, while embeddings are derived using the Skip-Gram model. Node2vec is a generalization of DeepWalk which uses supervised random walks for node neighbourhood exploration. The random walk is controlled by a return parameter p and an in- -out parameter q. Then similarly Skip–Gram model is used, but this time approximated via negative sam- pling, for embedding generation. Evaluation of the methods plays the crucial role in machine learning task in general. T o estimate perfor- mance of the link prediction approaches more evalu- ation criteria exists. While some papers utilize Preci- sion@N p for a range of N p values [Zhang et al., 2018], others use AUROC [Grover and Leskovec, 2016]. 3 daTa We study the Facebook social network of friendships at one hundred American colleges and universities at a single moment of time in September 2005 ([Traud et al., 2012], [Traud et al., 2011]). The network con- sists of one hundred independent networks, where every network corresponds to one university. Frien- dships are recorded only between people from the same university. Besides the information about frien- dships, network also contains limited demographic information. The following information is available for each user: student/faculty status flag, gender, ma- jor, second major/minor (if applicable), dorm/house, year and high school. Network is unweighted and undirected. The whole network consists of 3.2 milli- on nodes with 23.7 million links between them [Ros- si and Ahmed, 2015]. Maximum degree of a single node is approximately 4900 and minimum degree is only 1, with an average of 15. According to statistics network appears to be disassortative but this is only the consequence of its size. It also has high average clustering coefficient 0.097, which is characteristic of social networks. UPORABNA INFORMATIKA 86 2021 - πtevilka 2 - letnik XXIX Tim Poštuvan, Semir Salkić, Lovro Šubelj: LEARNING-BASED LINK PREDICTION ANAL YSIS FOR FACEBOOK100 NETWORK T abela 1: Structural information of train and unseen data dataset n m d C Train 4943.5 206247.6 77.26787 0.2808 Unseen 3517.8 140793.2 80.3072 0.2689 Table 1 contains structural measurements for used network sets. All of the presented measurements are averaged over all networks in the corresponding set. The presented values are: average clustering coeffici- ent (C), average degree (d), average number of nodes (n) and average number of edges (m). We can see by the number of nodes and edges that train set is larger than unseen dataset. Clustering coefficient and ave- rage degree are high, which is one of expected cha- racteristics of social networks. Because Facebook100 dataset is enormous, lack of computing power prevented us from considering the complete dataset for analysis. Therefore, we have de- cided that we would perform analysis only on a sub- set of networks. We selected ten networks as seen data and five networks as unseen data. Seen data was used for standard train and test set, where training and te- sting edges came from the same ten networks. On the other hand, unseen data consisted of five networks, which were not shown to the models during training. We used it to evaluate if our models are transferable to new data. Degree distribution of the analyzed networks is presented in figures 1 and 2. Degree distribution is plotted on a log-log scale for all networks in both sets respectively. For cleaner overview, we used interpo- lation (univariate spline) to showcase distribution of all networks. It is visible on both figures that all net- works follow power law, which is expected for social networks. Having in mind that we are using real life social networks, it can be concluded that they are sca- le-free networks by degree distribution results. Howe- ver, it is interesting to point out existence of big hubs, nodes with very high degree. They are visible on right side of the distribution graph. This is one of the rea- sons why interpolation at the end of the plot has an unexpected minimum. 4 FeaTURe SeT Feature engineering probably plays the most im- portant role when coping with a machine learning problem. Informative features crucially effect model accuracy; hence the process of feature engineering is usually very time consuming. In learning-based link prediction each pair of nodes is described using a combination of node-based, topology-based and node embedding features, depending on approach. In this paper we are using three different feature sets, from which three datasets were constructed as it will be described in the section 5. 4.1 Node-based features Node-based features use domain-specific informati- on about individuals. Facebook100 dataset has alre- ady a basic set of features, however, not all of them are useful for link prediction task. Almost all features had to be transformed, in order to describe node pa- irs, instead of individuals. From some features, for example dormitory information, new features had to created, because otherwise model would not be transferable between networks. Problem arises from Figure 1: Degree distribution of train set Figure 2: Degree distribution of unseen data set UPORABNA INFORMATIKA 87 2021 - πtevilka 2 - letnik XXIX Tim Poštuvan, Semir Salkić, Lovro Šubelj: LEARNING-BASED LINK PREDICTION ANAL YSIS FOR FACEBOOK100 NETWORK the fact that different universities use different nu- merations of their dormitories. Considering the abo- ve constraints, we derived the following features:  same_dorm: binary value, indicating whether the nodes live in the same dormitory  same_year: binary value, indicating if the nodes started college in the same year  year_diff: numerical value, stating the absolu- te difference between the years, when the nodes started college  high_school_1, high_school_2: numerical values, stating indices of nodes’ high schools  major_1, major_2: numerical values, stating indi- ces of nodes’ majors  same_faculty: binary value, indicating whether the nodes have the same faculty status  same_gender: binary value, indicating if the no- des have the same gender Since networks are undirected, each pair of nodes must be uniquely represented using above features. Representation should not depend on order of the pair, thus major_1 and major_2 are ordered in a way that the value of major_1 is not greater than the value of major_2. The same holds for high_school_1 and high_school_2. Like the majority of datasets Facebook100 does not contain all information about all individuals. Therefore, missing values had to be handled. We de- cided that imputing is reasonable only for the featu- re year_diff, where missing values were substituted with the mean. Values of other features were left in- tact but as soon as one of the nodes in the pair had a missing value, the corresponding binary values was automatically zero. 4.2 T opology-based features The most commonly used features for link predicti- on are topology-based features. They are particularly useful, when you do not have any problem specific information, because they are generic and domain independent. Although Facebook100 dataset has ad- ditional domain specific data, topology-based featu- res still have great impact on model accuracy. In this paper we are using the following topology-based fe- atures:  Jaccard Coefficient [Salton and McGill, 1986]. Ja- ccard Coefficient normalizes the size of common neighbors. According to Jaccard Coefficient a pair of nodes is assigned a higher value when the no- des share a higher proportion of common neigh- bors relative to total number of their neighbours. JC(x, y) = |Г(x) ∩ Г(y)| |Г(x) ∪ Г(y)| (1) where Γ(x) is a set of neighbours of node x.  Adamic Adar Coefficient [Adamic and Adar, 2003]. Adamic Adar Coefficient measure is close- ly related to Jaccard Coefficient. It is calculated as a weighted sum of common neighbours, where common neighbours with fewer neighbours have greater impact. The rationale behind it is that high degree nodes are more likely to occur in common neighbourhood, thus they should contribute less than low degree nodes. AA(x, y) = Г(x) ∩ Г(y) 1 log|Г(z)| (2)  Preferential Attachment Index [Barabási et al., 2002]. The measure is based on the concept that nodes with higher degree have higher probability of forming new edges. PA (x, y) = |Г(x)|Г(y)| (3)  Resource Allocation Index [Zhou et al., 2009]. Re- source Allocation Index metric is very similar to Adamic Adar Index. The only difference is that Resource Allocation Index punishes high degree nodes more. RAI(x, y) = z ∈ Г(x) Г(y) 1 |Г(z)| (4) 4.3 Node embedding features Network embeddings methods aim to learn low-di- mensional latent representation of the nodes in a net- work. T o generate a dataset comprising of every node in a network we are able to use these representations as features. This can be used for a wide variety of tasks such as classification, clustering, link predicti- on, and visualization. Using node2vec [Grover and Leskovec, 2016] we were able to generate our embed- dings dataset. The key point is that node2vec is based on ran- dom node walks performed in a biased manner across the network. With this generic approach we are able to sample any network in a search for vec- tor representation of its structural properties. With UPORABNA INFORMATIKA 88 2021 - πtevilka 2 - letnik XXIX Tim Poštuvan, Semir Salkić, Lovro Šubelj: LEARNING-BASED LINK PREDICTION ANAL YSIS FOR FACEBOOK100 NETWORK the introduction of search bias α we are able to con- trol our search in breadth-first search or depth-first search manner. If we choose «in-out parameter” (q) , walks are more biased to visit nodes further from the start node, thus expressing the nature of exploration. Fixing «return parameter” (p = 1) ensures that we are less likely to visit same node twice, which in return adopts the strategy of modern exploration (avoids 2-hop redundancy in sampling). As stated in the case study by Grover & Lesko- vec [Grover and Leskovec, 2016] for social structures it is beneficial to tune node2vec hyperparameters to discover communities of nodes which are interac- ting with each other. Capturing this type of beha- vior using embedding representation significantly benefits the link prediction task. To find the best set of hyperparameters, we employed grid search over more than 80 different settings that deemed reasona- ble to us. Each setting was evaluated on network of Caltech using logistic regression and the best com- bination of hyperparameters was selected. It is im- portant to note here, that we did not consider only AUROC of the logistic regression model, but also the complexity of the embedding. The embedding dimension should be as small as possible while car- rying all relevant information. The hyperparameters that were most consistent with the two criteria: 64 dimensions, 50 walks per node, q = 0.8 and 20 nodes in each walk. Since node2vec approach yields em- beddings for nodes, we used Hadamard product to express vector representations for edges. 5 daTaSeTS Firstly, we had to preprocess all graphs (seen as well as unseen) to obtain train and test node pairs, which will be predicted by models and using which performance will be evaluated. Node pairs can be either positive or negative instances for link prediction task, depending on whether there is an edge between the nodes or not (the nodes are friends or not). For every graph we used the standard approach of generating an incomple- te train graph G train = (V, E train ) from the original graph G = (V, E ). The connected node pairs {i, j} ∈ E\E train , which are present in the original graph but not inclu- ded in the train graph, are used as positive instances for link prediction task. Positive instances were sam- pled randomly from the original graph’s edge set E. We decided to sample 2% of edges in original graph G. Since dataset should contain positive as well as nega- tive instances, we had to obtain also negative instances – pairs of nodes that are not connected by an edge. It is assumed that if two nodes are not connected by an edge, Figure 3: Correlation matrix of node-based and topology-based features UPORABNA INFORMATIKA 89 2021 - πtevilka 2 - letnik XXIX Tim Poštuvan, Semir Salkić, Lovro Šubelj: LEARNING-BASED LINK PREDICTION ANAL YSIS FOR FACEBOOK100 NETWORK they are not friends. Negative instances were obtained by randomly selecting node pairs {i, j} ∉ E, which are not in the original graph’s edge set. T o get a balanced da- taset the number of negative instances is the same as the number of positive ones. The rest of the graph (G train ) was used to calculate topology-based and node em- bedding features. Positive train and test edges were not included in this graph from which features were derived, otherwise we would introduce unjust label informa- tion in features. In our experiments, all unseen data instances were used for testing models’ ability to adapt to new graphs. However, seen data was further split into tra- in and test data. We used standard division: 80% of it was used as the train data and the remaining 20% was used as the test data. Within each of the them there is approximately the same number of positive and negative instances. Using this data three datasets were created: ba- seline, topological and embedding dataset. Each dataset represents node pairs using a different com- bination of features. Baseline dataset is the simplest one and contains only topology-based features. A bit more complex is topological dataset, which in addi- tion to the topology-based features makes use of no- de-based features as well. Node pairs in embedding dataset are described using node-based features and Hadamard product of the corresponding nodes’ em- beddings. 6 FeaTURe SeLeCTION Contemporary datasets usually have abundance of data, which is not always relevant to the problem. Hence, datasets should be preprocessed before mo- dels are used on them. Preprocessing takes place mainly to reduce the size of the dataset and achieve more efficient analysis, as well as removing redun- dant features, which have negative impact on the performance of the model. The aim of feature selec- tion is to maximize relevance and minimize redun- dancy of the features. Our feature sets are not enormous, thus feature selection was done solely for the sake of performance improvement of the models. We are using recursive feature elimination with cross-validated selection (RFECV) in combination with linear kernel support vector machine (SVM) to get reduced feature sets. This method recursively considers smaller and smal- ler sets of features, while after every iteration prunes the least important features according to the chosen model. It belongs to wrapper methods for feature se- lection, since it appraises subsets of features based on performance of the modelling algorithm. According to Jovic ´ et al. [Jovic et al., 2015] wrapper methods have been empirically proven to yield better results than other methods because subsets are evaluated using real models. 6.1 Baseline dataset The above feature selection method recognized Ada- mic Adar Coefficient, Jaccard Coefficient and Reso- urce Allocation Index as the most informative fe- atures. The most relevant feature is Adamic Adar Coefficient and the least relevant one is Preferential Attachment Index. This is completely coherent with random forest feature importance shown in figure 4. Adamic Adar Coefficient is the most relevant feature, although Jaccard Coefficient has higher correlation with labels, which can be observed in figure 3. All selected features are highly correlated with label, whereas Preferential Attachment Index is not. This is probably the reason why Preferential Attachment In- dex is the only feature which was not selected. From correlation matrix it is also evident that Adamic Adar Coefficient and Resource Allocation Index are almost perfectly correlated, which is expected because of the similarity in their definitions. Nonetheless, adding it results in a slightly better performance, thus the al- gorithm decides to keep it. Figure 4: Feature importance of baseline dataset according to random forest classifier UPORABNA INFORMATIKA 90 2021 - πtevilka 2 - letnik XXIX Tim Poštuvan, Semir Salkić, Lovro Šubelj: LEARNING-BASED LINK PREDICTION ANAL YSIS FOR FACEBOOK100 NETWORK 6.2 Topological dataset The advantages of feature selection are more evident on topological dataset, because it has more features. This time algorithm selected the following features: all four topology-based features, same year, same facul- ty, same dorm, major_1 and major_2. On figure 5 it is clearly shown that topology-based features are far more important than other node-based features. This is also consistent with correlation matrix, since to- pology- based features have the highest correlations with labels. They are so informative due to phenome- non called triadic closure. The triadic closure states that in social networks connections tend to form be- tween people who share common friends, which is precisely what these topology-based features are de- scribing. Among the node-based features same year, same faculty and same dorm were selected, all having relatively high correlation with label. Particularly high correlation has same year, which is expected, as college students often form friendships with their classmates. Because of this major_1 and major_2 are also relevant. Feature same faculty exploits the fact that students’ and professors’ social circles are rarely overlapping. 6.3 embedding dataset Feature selection on embedding dataset was espe- cially hard due to artificial features from node2vec. Because hyperparameters of node2vec were care- fully tuned, we assumed that node embeddings are optimal. We selected the minimum embedding size which still performed well, while the quality of a set of hyperparameters was evaluated on the link pre- diction task. This was the best we could do with avai- lable computational resources, since there were too many features to utilize the feature selection method described above. Hence, we filtered only node-based features [Liao et al., 2017] and are using only a few crucial ones: same year and same dorm. We did not select same faculty, although it is more important than same dorm if considered on its own. We decided so, because same faculty has high correlation with same year and correlated features usually have negative impact on performance of the model. 7 ReSUL TS Evaluation of our datasets was conducted using an ensemble of classification models. We used simpler models like logistic regression and random forest, as well as more complex ones – support vector machi- nes (SVM) and neural networks (NN), which are ca- pable of modeling more complex non-linear functi- ons. Hyperparameters of all models were optimized as described in section 8. Link prediction task was tested on all three datasets, on test and unseen data, and all aforementioned models. Performance of the models was evaluated using Area Under the Rece- iver Operating Characteristics (AUROC), which is one of the most common evaluation metrics for link prediction. T abela 2:AUROC values for logistic regression, random forest, support vector machine (Svm) and neural network (NN) on test data dataset Logistic regression Random forest Svm NN Baseline 0.9401 0.9227 0.9628 0.9618 Topological 0.9570 0.9173 0.9639 0.9623 embeding 0.9365 0.9145 0.9412 0.9389 Figure 5: Feature importance of topological dataset according to random forest classifier Table 2 contains AUROC scores for all combinati- ons of the datasets and models on the test data. Si- milarly, table 3 states the same values, but on unseen data. These tables reveal that support vector machine (SVM) and neural network (NN) are the best models for the link prediction task. Their performance is al- most exactly the same, although they are based on completely different concepts. This is indicating that all relevant information from datasets is used for prediction. Only a little worse did logistic regression, which is very surprising, since it is much simpler than SVM and NN. Even more unexpected is that it outper- formed random UPORABNA INFORMATIKA 91 2021 - πtevilka 2 - letnik XXIX Tim Poštuvan, Semir Salkić, Lovro Šubelj: LEARNING-BASED LINK PREDICTION ANAL YSIS FOR FACEBOOK100 NETWORK forest, which is a non-linear model. We hypothesize that this is a consequence of linear separability of the data, but more about this will be written in section 9. All models appear to be stable, since there is only a slight decrease in performance, when applied to unseen data. Difference is negligible for baseline and topological dataset, whereas noteworthy on embed- ding dataset. The models were able to extract more useful in- formation from topological dataset than baseline and embedding ones. Baseline dataset has only a bit worse results, showing that additional node-based features have minimal influence on performance. Difference is visible for logistic regression, whereas SVM and NN have the same score on both datasets. Shockingly, em- bedding dataset gets the worst results. However, this might be the consequence of the chosen evaluation metric. For example, logistic regression on embedding dataset gets 0.87 F 1 score, while baseline and topologi- cal datasets get only 0.76 and 0.79. F 1 score reflects ba- lance between recall and precision, whereas AUROC does not consider the predicted labels, but only mea- sures whether predicted values of negative instances are smaller than predicted values of positive instan- ces. Consequently, metrics are not necessarily cohe- rent, however, we decided to trust AUROC since it is a more standard link prediction metric. 8 mOdeL aNaL ySIS Here we used previously obtained data and results to optimize our models. Prior to that, data was standar- dized to have variance 1.0 and mean 0.0. With this approach, we have done model analysis to interpret best combinations of hyperparameters, which are useful to understand and discover patterns in data. 8.1 Logistic regression Using grid search cross-validation on logistic regressi- on we saw that different approaches require different configurations. In the case of the baseline approach features equally impact the decision process, which is reflected in features’ coefficients. For this dataset we used ridge regularization (L 2 ) with default regulariza- tion strength of 1. In the case of topological dataset, we noticed that regularization significantly influences performance. We used lasso regularization (L 1 ) regularization with immense regularization strength of 1000. In this case regularization is crucial for prevention of overfitting. Having in mind that our features are measurements which are not calculated in a fixed interval, we are be- nefiting from the L 1 property of data sparsity. Model coefficients are imbalanced, where major study featu- res (i.e. major_1 and major_2) are given low coefficient values, which shows that most of the information is contained in the rest of the features. In embedding dataset we noticed that addition of node-based data does not have any benefits. This co- uld be the consequence of correlations within features of embedding dataset. Maybe node2vec features con- tain structural information, using which node-based feature can be well approximated. For embedding dataset L 1 regularization with strength 150 was used. Lasso regularization improves model performance on unseen networks. This is achieved with generalization of the obtained knowledge from social network onto new unseen networks. As expected, coefficients show that all features of embedding vectors are equally im- portant. 8.2 Random forest Random forest did not perform well on our datasets. We can justify that by the fact that this approach lacks mechanism for regularization. Higher number of di- mensions in respect to number of samples (unbalan- ced training and unseen data) is causing our decision tree models to overfit. Grid search in this case did not yield specific results, as well as tuning of parameters failed to find feature dependent information. This behavior is shown in our model comparison where it is expected to experience better benchmarks on dif- ferent linear models such as logistic regression and SVM. We noticed that unseen networks’ AUROC sco- res are the lowest over all datasets, therefore we can conclude that random forest model did not respond well to our problem. 8.3 Support vector machine (Svm) For support vector machine (SVM) only kernels were carefully tuned. Best fit for each dataset was chosen T abela 3: AUROC values for logistic regression, random forest, support vector machine (Svm) and neural network (NN) on unseen data dataset Logistic regression Random forest Svm NN Baseline 0.9263 0.9031 0.9570 0.9560 Topological 0.9478 0.8901 0.9563 0.9538 embeding 0.9229 0.9047 0.9217 0.9218 UPORABNA INFORMATIKA 92 2021 - πtevilka 2 - letnik XXIX Tim Poštuvan, Semir Salkić, Lovro Šubelj: LEARNING-BASED LINK PREDICTION ANAL YSIS FOR FACEBOOK100 NETWORK using grid search. Grid search consisted of linear, polynomial and Gaussian kernel, so the model co- uld work with arbitrary dimensional data. It turned out that for baseline and topological datasets linear kernel was the best option, while embedding dataset required Gaussian kernel. 8.4 Neural network (NN) Choosing the right hyperparameters for neural net- works (NN) was very complicated and tedious task, since neural networks have a lot of different para- meters. Nevertheless, correctly setting them can yie- ld much better performance in comparison to other models. For some of the parameters like loss and op- timization functions default settings were selected. Because learning-based link prediction is a binary classification task, binary cross-entropy loss functi- on and Adam optimizer were utilized. Hidden and output activation functions were selected using ran- dom experimentation. The best results yielded ReLU as hidden activation function and sigmoid as output activation function. Lastly, architecture of neural ne- twork had to be defined, which was done using grid search. We tried a great variety of different depths and numbers of nodes per layer, but in the end ar- chitectures with only two hidden layer and small number of nodes were the best performing on all datasets. 9 dISCUSSION Embedding dataset yielded worse results than topo- logical and baseline ones. This could be so because of greater linear separability of topological and baseline datasets as SVM results from section 7 were implying. It is much easier to train models on linearly separable data than on complex one with a lot of non-linearity, such as node2vec. Baseline and topological data seem to be well li- nearly separable, because SVM works best on them when combined with linear kernel. Also, logistic re- gression performs considerably better than random forest. This is highly unusual for non-linear datasets, but in this case random forest harder adjusts to linear data, while logistic regression is linear by default. Trying to prove our hypothesis that topological dataset is more linearly separable than embedding one, we visualized data using linear discriminant analysis (LDA). If any linearity exists in the dataset, it should be visible in the low-dimensional space. T o visualize data we uniformly sampled 200 samples from test data of both datasets and calculated LDA decision boundary for binary classification. LDA analysis seems to show that the data is approximately equally linearly separable in topolo- gical dataset (figure 6) as well as in embedding one (figure 7). Although, accuracy score of primitive LDA classifier yielded slightly better results on topological test set 87%, in comparison to embedding one which had score of 86.5%. A more concrete evidence of wor- se linear separability is probably the fact that SVM with Gaussian kernel performed better on embed- ding dataset than SVM with the linear kernel. Nonetheless, it appears that embedding dataset is still linearly separable to some degree. This is com- pletely unexpected as node2vec produces inheren- tly non-linear embeddings. These interesting results Figure 6: Lda on topological dataset Figure 7: LDA on embedding dataset UPORABNA INFORMATIKA 93 2021 - πtevilka 2 - letnik XXIX Tim Poštuvan, Semir Salkić, Lovro Šubelj: LEARNING-BASED LINK PREDICTION ANAL YSIS FOR FACEBOOK100 NETWORK pose a new research question that could be examined in future work. Besides assumed linearity, topological dataset co- uld also give better results due to small number of fe- atures (9 features), whereas embedding dataset was significantly more complex (66 features). It is harder to train models on data with a lot of features, becau- se models automatically require more parameters, so worse performance of embedding dataset could also stem from that. 10 CONCLUSION In the presented paper we conducted the first com- prehensive analysis of link prediction on the Face- book100 network. We can conclude that results are unexpectedly good for link prediction task of this nature. After all, we are predicting friendships on completely separated social networks. It is visible that models successfully generalized to unseen data, even though train set is only 50% bigger than unseen data set. Therefore, it is safe to say that our models succeeded in their task. For optimization of AUROC score baseline and topological approaches are the best. It turns out that simplicity has benefits in terms of higher classificati- on scores. In these two cases node-based features did not really effect performance, except when combined with logistic regression. Although SVM and NN got better results, we recommend logistic regression in combination with topological approach because the model is easier to train and interpret. When very high AUROC scores are important (e.g. link predic- tion on medical data), we suggest SVM with linear kernel and baseline approach. It gets almost the same results on unseen data, even though it is simpler mo- del than NN. We have shown that collecting data from multiple social networks yields promising datasets, which can be used for modelling of various predictors in similar social structures. Besides that, in this paper we have shown that use of regularization can be a solution in the case of social networks, when lack of the training data is present. Using this approach, we can obtain data insights globally. For future work it would be interesting to show how much linearly separable is embedding dataset. Even more fascinating would be to find the true un- derlying cause of the observed behavior. 11 ReFeReNCeS [1] [Adamic and Adar, 2003] Adamic, L. and Adar, E. (2003). Frien- ds and neighbors on the web. Social Networks, 25:211–230. [2] [Barabási et al., 2002] Barabási, A., Jeong, H., Néda, Z., Ra- vasz, E., Schubert, A., and Vicsek, T. (2002). Evolution of the social network of scientific collaborations. Physica A: Statisti- cal Mechanics and its Applications, 311(3-4):590–614. [3] [Breiman, 2001] Breiman, L. (2001). Machine learning, volume 45, number 1 - springerlink. Machine Learning, 45:5–32. [4] [Cortes and Vapnik, 1995] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Mach. Learn., 20(3):273–297. [5] [Grover and Leskovec, 2016] Grover, A. and Leskovec, J. (2016). Node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conferen- ce on Knowledge Discovery and Data Mining, KDD ’16, page 855–864, New York, NY, USA. Association for Computing Ma- chinery. [6] [Hasan et al., 2006] Hasan, M. A., Chaoji, V., Salem, S., and Zaki, M. (2006). Link prediction using supervised learning. In In Proc. of SDM 06 workshop on Link Analysis, Counterterro- rism and Security. [7] [Jeh and Widom, 2002] Jeh, G. and Widom, J. (2002). Simrank: A measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Kno- wledge Discovery and Data Mining, KDD ’02, page 538–543, New York, NY, USA. Association for Computing Machinery. [8] [Jovic et al., 2015] Jovic, A., Brkic ´, K., and Bogunovic, N. (2015). A review of feature selection methods with applications. pages 1200–1205. [9] [Katz, 1953] Katz, L. (1953). A new status index derived from sociometric analysis. Psychometrika, 18(1):39– 43. [10] [Liao et al., 2017] Liao, L., He, X., Zhang, H., and Chua, T.-S. (2017). Attributed social network embedding. IEEE Transacti- ons on Knowledge and Data Engineering, PP. [11] [Liu et al., 2013] Liu, H., Hu, Z., Haddadi, H., and Tian, H. (2013). Hidden link prediction based on node centrality and weak ties. EPL (Europhysics Letters), 101:18004. [12] [Lü et al., 2009] Lü, L., Jin, C.-H., and Zhou, T. (2009). Simila- rity index based on local paths for link prediction of complex networks. Physical review. E, Statistical, nonlinear, and soft matter physics, 80:046122. [13] [Mikolov et al., 2013] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. Proceedings of Workshop at ICLR, 2013. [14] [Newman, 2001] Newman, M. E. J. (2001). Clustering and preferential attachment in growing networks. Physical revi- ew. E, Statistical, nonlinear, and soft matter physics, 64 2 Pt 2:025102. [15] [Perozzi et al., 2014] Perozzi, B., Al-Rfou, R., and Skiena, S. (2014). Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Con- ference on Knowledge Discovery and Data Mining, KDD ’14, page 701–710, New York, NY, USA. Association for Compu- ting Machinery. [16] [Pow et al., 2012] Pow, J., Gayen, K., Elliott, L., and Raeside, R. (2012). Understanding complex interactions using social net- work analysis. Journal of clinical nursing, 21:2772–2779. [17] [Rossi and Ahmed, 2015] Rossi, R. A. and Ahmed, N. K. (2015). The network data repository with interactive graph analytics and visualization. In AAAI. [18] [Salton and McGill, 1986] Salton, G. and McGill, M. J. (1986). Introduction to Modern Information Retrieval. McGraw-Hill, Inc., USA. UPORABNA INFORMATIKA 94 2021 - πtevilka 2 - letnik XXIX Tim Poštuvan, Semir Salkić, Lovro Šubelj: LEARNING-BASED LINK PREDICTION ANAL YSIS FOR FACEBOOK100 NETWORK [19] [Scellato et al., 2011] Scellato, S., Noulas, A., and Mascolo, C. (2011). Exploiting place features in link prediction on loca- tion-based social networks. In KDD, pages 1046–1054. ACM. [20] [Scott, 1988] Scott, J. (1988). Social network analysis. Socio- logy, 22. [21] [Traud et al., 2011] Traud, A. L., Kelsic, E. D., Mucha, P. J., and Porter, M. A. (2011). Comparing community structure to cha- racteristics in online collegiate social networks. SIAM Rev., 53(3):526–543. [22] [Traud et al., 2012] Traud, A. L., Mucha, P. J., and Porter, M. A. (2012). Social structure of Facebook networks. Phys. A, 391(16):4165–4180. [23] «[Wang et al., 2014] Wang, P., Xu, B., Wu, Y., and Zhou, X. (2014). Link prediction in social networks: the state-of-the- -art. Science China Information Sciences, 58. [24] [Zhang et al., 2018] Zhang, Z., Cui, P., Wang, X., Pei, J., Yao, X., and Zhu, W. (2018). Arbitrary-order prox- imity preserved network embedding. In Proceedings of the 24th ACM SI- GKDD International Conference on Knowledge Discovery and Data Mining, KDD ’18, page 2778–2786, New York, NY, USA. Association for Computing Machinery. [25] [Zhou et al., 2009] Zhou, T., Lü, L., and Zhang, Y.-C. (2009). Predicting missing links via local information. The European Physical Journal B - Condensed Matter and Complex Sy- stems, 71:623–630.  T im Poštuvan is in the final year of his bachelor’ s degree study . He is studying Computer Science and Mathematics at Faculty of Mathematics and Physics and Faculty of Computer and Information Science, University of Ljubljana. His primary research interests lie in machine learning on graphs, however , he is fond of other areas of artificial intelligence as well.  Semir Salkic´ received a bachelor’ s degree from the Faculty of Electrical Engineering, University of Sarajevo in 2018. He is currently studying Data Science Master Programme at Faculty of Computer and Information Science, University of Ljubljana. He is working as a lead software and embedded engineer at Research Center at Faculty of Applied Sciences Kiel, Germany . His research is focused on practical applications of IoT structures with machine learning.  Lovro Šubelj je docent na Fakulteti za računalništvo in informatiko Univerze v Ljubljani. Diplomiral je leta 2008 na Fakulteti za matematiko in fiziko in Fakulteti za računalništvo in informatiko ter doktoriral leta 2013 na temo analize velikih omrežij. Je avtor ali soavtor več kot šestdeset znanstvenih prispevkov in patentov ter urednik prestižnih mednarodnih znanstvenih revij. Njegovo preteklo delo je bilo izbrano kot izjemen znan - stveni dosežek v Sloveniji ter predstavljeno na uglednih mednarodnih univerzah kot sta Stanford in UCSD. Sodeloval je že pri številnih uspešno zaključenih raziskovalnih in razvojnih projektih v sodelovanju s podjetji Petrol, Celtra, Optilab, Iskratel in drugimi.