ELEKTROTEHNIŠKI VESTNIK 89(1-2): 64–72, 2022
ORIGINAL SCIENTIFIC PAPER

Topic extraction by clustering word embeddings on short online texts

David Nabergoj 1,†, Alessandro D'Alconzo 2, Danilo Valerio 2, Erik Štrumbelj 1
1 Faculty of Computer and Information Science, University of Ljubljana, Slovenia
2 Siemens AG Austria
† E-mail: david.nabergoj@student.uni-lj.si

Received 8 March 2022; accepted 21 March 2022

Abstract. We demonstrate our topic extraction method, in which topics are treated as clusters of word embeddings. The OPTICS algorithm is used to find small and arbitrarily shaped clusters of embeddings produced by a fastText model. The result is a set of dominant and non-dominant domain-specific topics. The method focuses on short online posts, which are difficult to analyze with traditional topic extraction approaches because of the scarcity of word collocations. We test the method on a dataset of posts from Twitter, LinkedIn, and company blogs related to industrial automation. The method significantly outperforms traditional topic extraction approaches by finding relevant and understandable topics with related tokens.

Keywords: topic extraction, industrial automation, text mining

Finding topics in short online texts by clustering word embeddings (Slovenian abstract, translated). We present a new method for finding topics in texts that treats topics as clusters of word embeddings obtained with a fastText model. Using the OPTICS algorithm, our method finds clusters of arbitrary shapes that represent dominant as well as less noticeable domain-specific topics. The method is suitable for short online texts, which are difficult to analyze with classical topic extraction approaches because of the small number of collocations. We test the method on a dataset of posts from Twitter, LinkedIn, and the blogs of various companies related to industrial automation. Our method finds relevant and understandable topics with sensible words and performs substantially better than classical approaches.

1 INTRODUCTION

When doing market research, companies are interested in what their customers, suppliers, and competitors are doing and talking about. Gathering relevant information is a resource-intensive task, especially if done manually. Instead, our work is aimed at automatically extracting topics from large collections of social media and blog posts. In particular, we propose an approach that can also identify less dominant and domain-specific topics that are often missed by standard approaches to topic extraction.

Topic extraction is typically performed using latent Dirichlet allocation (LDA) [1] and latent semantic analysis (LSA) [2]. Another commonly used probabilistic model is the hierarchical Dirichlet process (HDP) [3]. However, such techniques have been found to perform poorly on shorter texts [4], which appear on social media and are an important source of information. Recent research in this area has focused on improving existing probabilistic models for Twitter posts [5], [6] and general documents [7].

Other approaches find topics by clustering vector representations of documents or tokens. Early research describes clustering documents represented by TF-IDF [8] or bag-of-words features [9], [10]. More recent research focuses on clustering word embeddings [11]–[14]. Language models that produce such embeddings have gained a lot of traction because of their ability to capture contextual information better than bag-of-words-based approaches.
The original proposed method was word2vec [15], but many improvements have since been made with models like fastText [16].

Topic extraction methods typically find dominant topics – those that are easily noticeable and popular. This task becomes difficult with short documents because of the scarcity of word collocations, so additional modeling considerations and domain knowledge may be required. It would be useful to find not only dominant but also non-dominant topics in such datasets. None of the listed work specifically addresses the issue of finding non-dominant topics within a set of short documents.

In this paper, we present a topic extraction method based on clustering fastText word embeddings with the OPTICS algorithm [17]. Each cluster corresponds to a topic, represented by a set of related tokens. The method proposes many plausible topics, which the user can review reasonably quickly to determine their relevance. It can handle short and medium-length texts with misspellings, which are very common on microblogging websites like Twitter and LinkedIn. We are able to discover dominant and non-dominant topics without using domain knowledge in the modeling process.

In Section 2, we describe the processes for generating embeddings with fastText and clustering them with OPTICS. We describe the evaluation and results in Section 3. We discuss the method's behavior, potential improvements, and future work in Section 4.

2 METHODS

The two main stages in our approach are generating token embeddings and clustering them. We first describe how fastText is used to generate token embeddings and then how OPTICS can propose potentially relevant topics by clustering these embeddings.

2.1 Generating token embeddings

A word embedding maps a word from a set of documents to a real-valued vector. Because some words frequently appear together, we treat them more generally as tokens, which are sequences of characters. Similarity between tokens can be expressed as the Euclidean distance between their embeddings. This reveals which tokens are semantically or syntactically related.

We use fastText to generate token embeddings. FastText is a library for text classification and representation. It contains algorithms that transform text into continuous vectors that can later be used on language-related tasks. We refer to models built using this library as fastText models. These models can be trained in a matter of seconds, compared to other models that require several hours or even days.* The embedding quality is only slightly worse than that of state-of-the-art approaches. The fast training time facilitates experimentation and lets us analyze large datasets quickly.

* https://research.fb.com/blog/2016/08/fasttext/, accessed March 1, 2022

The models are trained using subword information about all character n-grams of length two or more. Such information is helpful because our text originates from microblogging services, which are known to often contain misspellings. A correct and an incorrect spelling of a word would otherwise be treated as entirely different tokens, even though a person would understand that they refer to the same concept. Similarly, this information also helps relate words with the same lemma or stem, which would otherwise need to be grouped using additional lemmatization or stemming steps. Considering subword information can hence alleviate collocation scarcity issues by closely relating similarly spelled tokens.
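To make this concrete, the following is a minimal sketch of the training step using the gensim FastText implementation mentioned in Section 3.3. The corpus variable and the epoch count are illustrative assumptions; the vector size, skip-gram objective, hierarchical softmax, and minimum n-gram length follow the settings described in this paper.

    # Minimal sketch: train a fastText model on the pre-processed corpus.
    # `sentences` (a list of token lists) and `epochs` are assumptions.
    from gensim.models import FastText

    model = FastText(
        sentences,       # pre-processed corpus: list of token lists
        vector_size=40,  # 40-dimensional embeddings (Section 3.3)
        sg=1,            # skip-gram objective
        hs=1,            # hierarchical softmax
        min_n=2,         # character n-grams of length two or more
        max_n=6,         # gensim's default upper n-gram length (an assumption)
        epochs=10,       # illustrative
    )
    vector = model.wv["robotics"]  # embedding lookup

Because embeddings are composed from character n-grams, a lookup such as model.wv["robotcs"] would still return a vector, typically close to that of the correct spelling.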
2.2 Clustering token embeddings

We need to make some important considerations when choosing a clustering algorithm. The number of topics in a set of documents is usually unknown, so we either need to guess it or determine it automatically. It is likely that some embeddings do not belong to any topic; these may be treated as noise. The embedding space may be difficult to understand, and algorithms that rely on finding clusters of specific shapes may be inappropriate. We assume that embeddings form dense regions if they refer to similar tokens, but the shape of these regions may be arbitrary. With these considerations in mind, density-based clustering algorithms are an appropriate choice for embedding clustering. In particular, we choose OPTICS as the clustering algorithm because of its ability to find clusters of varying density and shape, as well as its ability to detect noisy embeddings.

OPTICS can be understood as an extended DBSCAN algorithm [18]. In DBSCAN, density-based clusters are defined as sets of density-connected objects. Put simply, density-connected objects can be reached from one another through a chain of objects that are less than ϵ apart. Each cluster contains core objects and border objects. Core objects have at least MinPts neighbors in their ϵ-neighborhood, and border objects have fewer. The idea of OPTICS is to extend this concept so that several ϵ parameters are effectively used simultaneously, which allows finding clusters of different densities. Such varying-density clusters can be found consistently if a specific processing order is obeyed. OPTICS generates this order but does not explicitly assign cluster memberships. The authors proposed an algorithm to assign memberships automatically using the ξ parameter. By plotting the reachability distances of the ordered objects, clusters appear as dents in the plot, as visualized in Figure 1. The ξ parameter determines the necessary steepness of a dent at its beginning and end for its objects to be treated as a cluster. The noise set consists of the objects that do not appear in any detected dent. Higher ξ values find only the most significant clusters, whereas lower ξ values also find less significant clusters at the expense of finding more noisy ones.

Figure 1: Toy visualization of the OPTICS reachability plot. The colored points were identified as clusters, whereas the gray points represent noise.

Our method works by iteratively clustering token embeddings at different parameter settings of the OPTICS algorithm. We start with an initial value of MinPts, which helps us identify the topics with the most tokens. In each iteration, we decrease this value by 1 to find topics with fewer tokens. We set the hyperparameter ξ > 0 and choose a threshold for the minimal cluster size; topics with fewer tokens than this threshold are not proposed. Selecting a small minimum cluster size and a small initial value of MinPts can reveal non-dominant topics, because they may contain fewer tokens than dominant topics.

The entire procedure is described in Algorithm 1. We associate tokens with their embeddings, so a topic can be represented as the set of tokens in a cluster. We first cluster the embeddings at the initial value of MinPts. The identified clusters are stored as sets of topic tokens. We remove the embeddings that belong to these clusters and keep only the noisy ones. We then decrease the value of MinPts by 1 and repeat the clustering process on the noisy embeddings. The algorithm stops once the value of MinPts falls below 2. Removing clustered embeddings ensures that they do not appear again in future iterations. The user then determines which of the proposed topics are relevant.

Algorithm 1: Topic extraction with OPTICS.
  Input: set of embeddings E, initial MinPts value m.
  Result: set of proposed topics P.
  Let P be the set of proposed topics.
  while m ≥ 2 and E ≠ ∅ do
    cluster E with OPTICS(MinPts = m)
    move the found topics from E into P
    m ← m − 1
  end
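A minimal Python sketch of Algorithm 1, assuming the scikit-learn OPTICS implementation named in Section 3.3 and parallel tokens/embeddings arrays obtained from the fastText model, might look as follows. The function name and defaults are illustrative, and the loop condition is slightly tightened (at least m embeddings must remain) so that every OPTICS call stays valid.

    # Sketch of Algorithm 1: iterative OPTICS clustering of token embeddings.
    # `tokens` (list of str) and `embeddings` (array of shape [n, 40]) are
    # assumed to be parallel arrays from the trained fastText model.
    import numpy as np
    from sklearn.cluster import OPTICS

    def extract_topics(tokens, embeddings, initial_min_pts=10,
                       xi=0.05, min_cluster_size=3):
        proposed = []                        # P: the set of proposed topics
        remaining = np.arange(len(tokens))   # indices of unclustered embeddings
        m = initial_min_pts
        while m >= 2 and len(remaining) >= m:
            labels = OPTICS(min_samples=m, xi=xi,
                            min_cluster_size=min_cluster_size,
                            metric="euclidean").fit_predict(embeddings[remaining])
            for c in range(labels.max() + 1):      # each found cluster is a topic
                members = remaining[labels == c]
                proposed.append({tokens[i] for i in members})
            remaining = remaining[labels == -1]    # keep only the noisy embeddings
            m -= 1                                 # look for smaller topics next
        return proposed

Each returned set of tokens is one proposed topic for the user to inspect.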
3 EVALUATION AND RESULTS

In this section, we describe our evaluation and the dataset used to test our method. Our dataset consists of online posts about industrial automation. This choice was motivated by a business task from Siemens, where the goal was to create an overview of trending topics in the industrial automation sector using open media data. We use our approach to find topics within the posts and compare the results to several baselines. Finally, we list some interesting identified topics and provide a visualization of them.

3.1 Evaluation dataset

We obtained a dataset of text documents relating to industrial automation. These are posts from technological companies on Twitter, LinkedIn, and their blogs. The list of these companies was provided by industrial automation experts at Siemens. The content mainly consists of current and upcoming events and fields, technological and financial progress updates, and internal company events. Most posts were made between the years 2010 and 2020. Judging by a manual check of some randomly chosen documents, the most noticeable topics are sustainability, safety, maintenance, climate, smart technology, robotics, and artificial intelligence.

We use standard text pre-processing techniques to transform the documents into a form more suitable for model training and analysis. Each sentence in a document is represented as a sequence of tokens, a convention used in many text analysis algorithm implementations. We summarize the pre-processing procedure in Figure 2.

Figure 2: Document pre-processing flowchart for blog posts, Twitter posts, and LinkedIn posts. The input is a text document; the output is a list of sentences, where each sentence is a list of tokens.

We first transform the documents to lowercase. We perform tokenization with the TweetTokenizer, PunktSentenceTokenizer, and TreebankWordTokenizer classes from the nltk library [19]. We split Twitter posts into tokens using the TweetTokenizer class. We split other documents into sentences with the PunktSentenceTokenizer class and transform them into tokens using the TreebankWordTokenizer class. Each Twitter and LinkedIn post corresponds to a single sentence, but blog posts may consist of several sentences. We do not retain any punctuation-only tokens. We remove "#" characters from LinkedIn and Twitter posts. From Twitter posts, we also remove mentions (tokens that start with "@"), URL tokens, emojis, and digits. We remove all remaining punctuation within tokens and also remove the zero-length tokens generated by these transformations.
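A minimal sketch of this pipeline, assuming the nltk tokenizers named above, might look as follows. The regular expressions, the function name, and the pre-trained Punkt model path are illustrative assumptions (the Punkt model is available via nltk.download("punkt")).

    # Sketch of the Figure 2 pre-processing pipeline.
    import re
    import nltk
    from nltk.tokenize import TweetTokenizer, TreebankWordTokenizer

    tweet_tok = TweetTokenizer(strip_handles=True)  # strips @mentions from tweets
    sent_tok = nltk.data.load("tokenizers/punkt/english.pickle")  # PunktSentenceTokenizer
    word_tok = TreebankWordTokenizer()

    def preprocess(document, source):
        text = document.lower()
        if source == "twitter":
            text = re.sub(r"https?://\S+|[#\d]", " ", text)  # urls, '#', digits
            sentences = [tweet_tok.tokenize(text)]           # one sentence per tweet
        else:
            if source == "linkedin":
                text = text.replace("#", "")
            sentences = [word_tok.tokenize(s) for s in sent_tok.tokenize(text)]
        cleaned = []
        for sent in sentences:
            # strip punctuation (and emoji) inside tokens, drop zero-length tokens
            tokens = [re.sub(r"[^\w]", "", t) for t in sent]
            cleaned.append([t for t in tokens if t])
        return [s for s in cleaned if s]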
The result is a collection of sentences, where each sentence is a sequence of tokens. We treat these sentences as individual documents. The final dataset contains 32831 documents – 14077 Twitter posts, 14916 LinkedIn posts, and 3838 posts from company blogs.

3.2 Evaluation methodology

Our evaluation is qualitative, which is not uncommon in topic modeling and clustering. We justify this with the following observations:
• there is no labeled text dataset meant for finding non-dominant domain-specific topics,
• many tokens do not belong to any topic, so metrics requiring class balance are inappropriate,
• some sought topics are non-dominant, so metrics making use of frequencies are inappropriate,
• our method is not probabilistic, so metrics such as perplexity are inappropriate,
• the choice of relevant topics is subjective,
• the space of embeddings is not well understood.

We evaluate our method by manually checking the list of proposed topics. Most of the proposed topics should contain semantically or syntactically related tokens, but there may also be some unrelated ones. Depending on the dataset and the user, we also wish to see relevant topics that vary in dominance. Besides qualitatively checking to what extent these criteria are satisfied, we also propose some goal topics. These are topics that we expect to be present in our dataset based on checking a random sample of documents. Our method should be able to identify these goal topics.

3.3 Settings and baselines for comparison

The trained fastText model produces 40-dimensional token embeddings. With this dimensionality, our method found relevant dominant and non-dominant topics with related tokens on our dataset. Training is performed using the skip-gram model [20] and hierarchical softmax. These are beneficial because they focus on infrequent words, which may form domain-specific topics. We set the initial MinPts value to 10 and the minimum cluster size to 3. We set the ξ parameter to 0.05 and use the Euclidean distance to determine core and border embeddings.

We use LDA, LSA, and HDP models as baselines for comparison. To achieve the best baseline performance, we first transform tokens with the WordNetLemmatizer class [21] from the nltk library. This transforms most words into their base form, which is desirable when dealing with bag-of-words-based representations. We remove stopwords for topic extraction with the baseline methods; this is not necessary to achieve good performance with our method. We train LSA with TF-IDF features, and LDA and HDP with standard bag-of-words features. We use the fastText wrapper and the LDA, LSA, and HDP implementations from the gensim library [22], and the OPTICS implementation from the scikit-learn library [23]. A sketch of the baseline setup is given below.
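This is a minimal sketch using the gensim and nltk components named above; the corpus variable, the nltk stopword list, and the topic count of 100 (the number of topics the baselines proposed in our comparison, see Table 3) are assumptions of the sketch.

    # Sketch of the baseline setup: lemmatize, remove stopwords, then train
    # LDA and HDP on bag-of-words features and LSA on TF-IDF features.
    # `sentences` is the pre-processed corpus (a list of token lists).
    from nltk.corpus import stopwords           # requires nltk.download("stopwords")
    from nltk.stem import WordNetLemmatizer     # requires nltk.download("wordnet")
    from gensim.corpora import Dictionary
    from gensim.models import HdpModel, LdaModel, LsiModel, TfidfModel

    lemmatizer = WordNetLemmatizer()
    stop = set(stopwords.words("english"))      # an illustrative stopword list
    docs = [[lemmatizer.lemmatize(t) for t in sent if t not in stop]
            for sent in sentences]

    dictionary = Dictionary(docs)
    bow = [dictionary.doc2bow(d) for d in docs]  # bag-of-words features

    lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=100)
    hdp = HdpModel(corpus=bow, id2word=dictionary)
    tfidf = TfidfModel(bow)                      # TF-IDF features for LSA
    lsa = LsiModel(corpus=tfidf[bow], id2word=dictionary, num_topics=100)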
3.4 Identified topics

We used our method to extract topics from the evaluation dataset. The identified relevant topics are listed in Table 1. For comparison, we also list a sample of the proposed LDA topics in Table 2. The full results are provided in the appendix. The LDA and LSA methods are able to identify a few relevant tokens, but almost no topics. HDP proposes topics with more interesting but unrelated tokens; examples of such tokens are "#PowerGrids", "#GenderEquality", "fossil-fueled", and "dewatering". We also compare the methods with respect to their ability to identify selected goal topics in Table 3. Our method identifies dominant and non-dominant topics that are more relevant than the baseline topics and contain tokens that are more closely related.

EV, chargers, #EVCharging
electric, charging, stations, charger
hybrid, vehicles, buses, vehicle
gas, oil, natural, LNG
solar, PV, rooftop
wind, farm, turbine, turbines
water, treatment, wastewater
#WaterQuality, #SurfaceWater, #FishFarming
hydraulic, filtration, roller
drilling, drill, saws
thermostat, radiator, thermostats
bioprocesses, single-use, bioprocessing
Cobots, #MobileRobots, #CollaborativeRobots
IIoT, #EmpowerTheField, IoE
#SmartHome, #VoiceActivated, #IFA15
SJI, #PCBdesign, circuitboard, PCB
CX5230, #UltraCompact, #IndustrialPC, C6030
#EnergyTransition, #NetZeroCarbon
#ClimateAction, neutral, #ClimateChange
#ZeroWaste, #GlobalRecyclingDay
#CarbonNeutral, #CarbonNeutrality, neutrality
#EmissionsFree, ship, emissions-free
#FoodWastage, #FoodLoss, #FoodWaste
e-voter, #VotingElectronic, e-voters
inclusive, workplace, culture
#WomenInScience, #GenerationEquality

Table 1: Sample of relevant proposed topics identified by our method. Rows correspond to topics, represented by their tokens. For ease of understanding, we present tokens in their original rather than their pre-processed form. The topics contain related tokens and are relevant in the industrial automation domain. They are mostly about electric vehicles, energy, industrial equipment and technology, robotics, the internet of things (IoT), climate change, and workplace culture.

IoT, good, installed, innovation, questions
January, experience, fair, panel, robotics
learn, service, watch, offerings, solution
stand, SPS, hall, sign, accelerate
earnings, global, industries, increased
general, operations, best, exchange, article
close, support, application, download, benefits
future, three, key, role, include

Table 2: Sample of proposed topics identified by LDA. Rows correspond to topics, represented by their tokens. The topics are uninformative. There are some useful keywords like "IoT" and "robotics", but they are not related to the other tokens in their topics.

Goal topic:  Electric vehicles | Climate change | Robotics | Clean energy | Smart technology
LDA:         EV, car, talk | increase, carbon, footprint | AI, factory, robotics | / | IoT, IIoT, cost
LSA:         / | / | robotics, safety, robot | project, digitalization, #EnergyEfficiency | IoT, good, #EnergyEfficiency
HDP:         / | / | / | energy, production, international | IoT, solutions, technology
Our method:  EV, chargers, EV charging | neutral, #ClimateAction, #ClimateChange | robots, cobots, #MobileRobots | #GridEnergy, #NetZeroCarbon, #EnergyTransition | IFA15, #SmartHome, #VoiceActivated

Table 3: Topic extraction results on the industrial automation dataset. We selected five goal topics observed in a small random sample of documents and compared how well each method identifies them. The LDA, LSA, and HDP methods proposed 100 topics each. Each cell contains the three most appropriate tokens of the best proposed topic with respect to a goal topic. The baseline methods were sometimes unable to propose an appropriate topic; such cells are marked with "/". All methods produced more than three tokens per topic. For the baseline methods, all unlisted tokens were completely unrelated to the goal topic. Our method identified more relevant topics than the individual baseline methods and also produced related tokens (listed and unlisted).

Some of the identified topics are immediately noticeable in the dataset, namely those relating to sustainable energy, robotics, and climate change. However, the e-voting, fish farming, and bioprocessing topics appear rarely in the dataset, and a person would need to invest significantly more time reading the documents to identify them manually.

We also observed the proposed topics across different iterations of the algorithm. We find that most of the relevant topics are identified in later iterations, when the value of MinPts is small. This is visualized in Figure 3. Earlier iterations proposed topics whose tokens were mostly links, numbers, financial terms and figures, and non-English words. Later iterations proposed more topics. Most of these were irrelevant or had unrelated tokens, but there were also relevant topics whose tokens were related.

Figure 3: The number of proposed and relevant topics per iteration when processing the evaluation dataset. The starting value of MinPts is 10 and decreases by 1 in each iteration. Only three relevant topics are identified in the first six iterations. Most of the relevant topics are found in iteration 8, when MinPts is set to 3.

3.5 Topic visualization

We can present the identified topics in a way that reveals how they are related to each other. Since the underlying embeddings are high-dimensional, it is standard practice to visualize them using dimensionality reduction methods. We do so with Isomap [24], a non-linear dimensionality reduction technique that extends upon multi-dimensional scaling [25]. This is appropriate because token embeddings likely contain essential nonlinear structures, which are invisible to linear dimensionality reduction techniques. The rough preservation of distances is important for understanding the similarity between topics. We manually group the identified topics into nine broader categories to facilitate visualization and interpretation, as well as to deal with topic overlap. We visualize the topics in Figure 4; a sketch of the projection step follows the caption.

Figure 4: Isomap visualization of the identified topic groups. Topic tokens are represented as circles with size corresponding to their frequency in the corpus. The plot shows topic importance and inter-topic relationships between the years 2010 and 2020 according to industrial automation companies. For example, the clean energy and climate change topics are close to electric vehicles, representing the sentiment that electric vehicles are a step towards a cleaner world. The AI, robotics, and workplace culture topics are close together, possibly due to recent discussions of AI replacing human workers or simply incorporating AI into the standard industrial workflow.
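The projection behind Figure 4 could be sketched with scikit-learn's Isomap; the input arrays (topic_embeddings, token_counts, group_labels), the neighbor count, and the plotting details are illustrative assumptions rather than the exact code used for the figure.

    # Sketch: project topic-token embeddings to 2-D with Isomap and plot them,
    # scaling each marker by the token's corpus frequency.
    import matplotlib.pyplot as plt
    from sklearn.manifold import Isomap

    # topic_embeddings: array [n_tokens, 40]; token_counts: corpus frequencies;
    # group_labels: integer id of each token's manually assigned category.
    coords = Isomap(n_neighbors=10, n_components=2).fit_transform(topic_embeddings)
    plt.scatter(coords[:, 0], coords[:, 1], s=token_counts, c=group_labels)
    plt.title("Isomap visualization of grouped topics")
    plt.show()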
4 DISCUSSION

We proposed a method that clusters fastText token embeddings with OPTICS on a dataset of short online industrial automation text documents. The method outperforms standard topic extraction techniques by proposing dominant and non-dominant domain-specific topics. The topics consist of semantically related tokens, so they can be easily understood by a person.

A key observation and requirement is that similar token embeddings form dense regions. This criterion can be met by embedding-based language models. The use of fastText in particular facilitates this even further, because it can place similar embeddings close to each other solely based on their character-level similarity. In the extreme case of single-token documents, this would result in topics of similarly spelled words. On the other hand, a set of documents with appropriate word collocations would combine word- and character-level features to produce more informative topics. We can reduce the impact of character-level similarity by increasing the minimum length of the n-grams considered in fastText model training.

A dataset may contain non-dominant relevant topics. There will not be many tokens belonging to such topics. As a result, these will be found at small values of the MinPts and minimum cluster size clustering hyperparameters. However, the user will also have to inspect many other, potentially irrelevant topics with a small number of tokens. In our experimentation, the highest number of produced topics was 225 at MinPts = 2, so checking the proposed topics was manageable. However, the number of proposed topics may be higher on other datasets, and the issue needs to be considered further to avoid sifting through too many irrelevant topics. A small improvement could be made by removing stopwords before clustering, but this may affect token neighborhoods and cause some small clusters not to be detected.

The OPTICS algorithm also outputs a hierarchy of clusters. In our dataset, this hierarchy was not noticeable in the reachability distance plot. We can inspect the generated hierarchy for a particular value of MinPts, but our method would need an additional procedure to combine the hierarchies at different values of MinPts into a single one. We also tried running OPTICS once (not iteratively) with ξ = 0.01 and observed fewer relevant topics than with the original method. Such an approach did not produce a clear hierarchy, but it might do so on a different dataset where word-level similarity is more significant than in short online posts.

An interesting direction for future work is using existing pre-trained language models instead of training one from scratch. This could provide better embedding spaces, which allow for topic interpretations with a hierarchy. It would be useful to reduce the number of irrelevant proposed topics as much as possible, likely by using domain knowledge. It would also be useful to evaluate the method quantitatively. One suggestion is to take a dataset and ask different users to select subjectively relevant topics from the list of proposed topics. This can be done at different hyperparameter settings, and the user-specified topics may serve as targets in a numeric evaluation. The method performs well on industrial automation posts, but testing it on data from different domains would help further assess its performance.

ACKNOWLEDGEMENTS

Our research was partially supported by the Slovenian Research Agency (ARRS) research core funding P5-0410.

REFERENCES

[1] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
[2] Thomas K. Landauer, Peter W. Foltz, and Darrell Laham. An introduction to latent semantic analysis. Discourse Processes, 25(2-3):259–284, 1998.
[3] Yee W. Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Sharing clusters among related groups: Hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems, pages 1385–1392, 2005.
[4] Liangjie Hong and Brian D. Davison. Empirical study of topic modeling in Twitter. In Proceedings of the First Workshop on Social Media Analytics, pages 80–88, 2010.
[5] Yuan Wang, Jie Liu, Yalou Huang, and Xia Feng. Using hashtag graph-based topic model to connect semantically-related words without co-occurrence in microblogs. IEEE Transactions on Knowledge and Data Engineering, 28(7):1919–1933, 2016.
[6] Chi-Yu Liu, Zheng Liu, Tao Li, and Bin Xia. Topic modeling for noisy short texts with multiple relations. In SEKE, pages 610–609, 2018.
[7] Min Shi, Jianxun Liu, Dong Zhou, Mingdong Tang, and Buqing Cao. WE-LDA: A word embeddings augmented LDA model for web services clustering. In 2017 IEEE International Conference on Web Services (ICWS), pages 9–16. IEEE, 2017.
[8] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2020.
[9] B. Shen and Yingsi Zhao. Optimization and application of OPTICS algorithm on text clustering. Journal of Convergence Information Technology, 8:375–383, 2013.
[10] Ahmed Rafea and Nada A. Mostafa. Topic extraction in social media. In 2013 International Conference on Collaboration Technologies and Systems (CTS), pages 94–98. IEEE, 2013.
[11] Kazuma Hashimoto, Georgios Kontonatsios, Makoto Miwa, and Sophia Ananiadou. Topic detection using paragraph vectors to support active learning in systematic reviews. Journal of Biomedical Informatics, 62:59–65, 2016.
[12] Suraj Subramanian and Deepali Vora. Unsupervised text classification and search using word embeddings on a self-organizing map. International Journal of Computer Applications, 156:35–37, December 2016.
[13] Xiangfeng Dai, Marwan Bikdash, and Bradley Meyer. From social media to public health surveillance: Word embedding based clustering method for Twitter classification. In SoutheastCon 2017, pages 1–7. IEEE, 2017.
[14] Guilherme Raiol De Miranda, Rodrigo Pasti, and Leandro Nunes de Castro. Detecting topics in documents by clustering word vectors. In International Symposium on Distributed Computing and Artificial Intelligence, pages 235–243. Springer, 2019.
[15] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[16] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.
[17] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander. OPTICS: Ordering points to identify the clustering structure. ACM SIGMOD Record, 28(2):49–60, 1999.
[18] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD), pages 226–231. AAAI Press, 1996.
[19] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media, 2009.
[20] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[21] George A. Miller. WordNet: An Electronic Lexical Database. MIT Press, 1998.
[22] Radim Řehůřek and Petr Sojka. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.
[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[24] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[25] Joseph B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 1964.

David Nabergoj received his M.Sc. degree in 2021 from the University of Ljubljana, Faculty of Computer and Information Science. He is currently researching normalizing flows and Bayesian statistics as part of a group at the Berkeley Center for Cosmological Physics.

Alessandro D'Alconzo is a senior data scientist at Siemens AG, Vienna. He received his PhD degree in Information Engineering from the Polytechnic of Bari (Italy) in 2007. His main research interests are in the field of network traffic monitoring and the application of machine learning to industrial and business analytics domains.

Danilo Valerio is a senior research scientist at Siemens Technology, Vienna. He received his PhD degree in Computer Science from the Faculty of Informatics in Vienna. His main research interests are in the field of machine learning in the domains of urban, industrial, and sales analytics.

Erik Štrumbelj is an associate professor at the Faculty of Computer and Information Science, University of Ljubljana. His main research interests lie in Bayesian statistics and machine learning.

APPENDIX

We provide a comprehensive list of identified topics for our method and the baseline methods: latent semantic analysis (LSA), latent Dirichlet allocation (LDA), and the hierarchical Dirichlet process (HDP). We present the results for our method in Table 4 and the results for LDA, LSA, and HDP in Tables 5, 6, and 7, respectively. Tokens are presented in their original form for ease of understanding; the actual output tokens are in lowercase and without "#" characters.
EV, chargers, #EVCharging
cars, car, batteries
electric, charging, stations, charger
turbochargers, turbocharger, turbocharging
supercharger, supercharging, superchargers
hybrid, vehicles, buses, vehicle
#HybridElectric, E-Fan X, E-plane
gas, oil, natural, LNG
solar, PV, rooftop
wind, farm, turbine, turbines
SWT, #WindTurbines, #WindTurbine
water, treatment, wastewater
#WaterQuality, #SurfaceWater, #FishFarming
hydraulic, filtration, roller
drilling, drill, saws
thermostat, radiator, thermostats
Cobots, #MobileRobots, #CollaborativeRobots
collaborative, robot, LD
intelligence, artificial, nanotechnology
e-voter, #VotingElectronic, e-voters
Dow, Jones, indices, #RobecoSam
Nasdaq, quoted, OMX
IIoT, #EmpowerTheField, IoE
#PCbased, PCs, PC, PC-based
SJI, #PCBdesign, circuitboard, PCB
#EnergyTransition, #NetZeroCarbon, #GridEnergy
#ClimateAction, neutral, #ClimateChange
#ZeroWaste, #GlobalRecyclingDay
#CarbonNeutral, #CarbonNeutrality, neutrality
#EmissionsFree, ship, emissions-free
EP100, #NetZero, #CO2neutral
#EUEnergyDay, #SustainableDevelopment
#FoodWastage, #FoodLoss, #FoodWaste
friendly, efficient, environmentally, #EnergyEfficient
#PanelBuilder, #PanelBuilding
inclusive, workplace, culture
#WomenInScience, #IWD2020, #GenerationEquality
Boost40, TTTech, cybernetics
CX5230, #UltraCompact, #IndustrialPC, C6030
SPS, drives, IPC
#SmartHome, #VoiceActivated, #IFA15
Russwurm, #WebOfSystems, WOS
biomethane, #EndressHauser, biomethan
bioprocess, bioprocesses, single-use, bioprocessing

Table 4: All topics identified by our method. Rows correspond to topics, represented by their tokens. The tokens of each topic are related, and many of them are hashtags (tokens that start with the "#" character). Some topics are semantically similar to others.
share, located, deliver, focus, construction
read, ensure, addition, designed, July
acquisition, position, website, operating, latest
local, costs, leading, reduce, receive
forward, flow, cost, country, intelligent
day, help, decreased, ready, change
control, software, renewables, center, robots
annual, vice, president, Europe, easy
existing, safety, great, discover, higher
engineering, approx, total, potential, efficient
handling, services, large, plants, orders
binding, return, long, save, standards
order, plant, capacity, people, Australia
register, join, webinar, facility, continue
tonnes, year, received, scope, full
February, source, electric, range, cooperation
learn, service, watch, offerings, solution
stand, SPS, hall, sign, accelerate
earnings, global, industries, increased
group, capital, CEO, business, EBIT
india, increase, happy, changes, core
contribute, signed, awarded, advanced, enable
product, growing, quality, high, unit
January, experience, fair, panel, robotics
Copenhagen, effective, start, friday, efficiency
supply, China, work, electrical, link
international, IPC, technical, option, twincat
shares, blog, based, products, increasing
head, field, type, aging, library
supervisory, report, board, group, executive
IoT, good, installed, innovation, questions
financial, processing, place, drive, industry
contract, well, largest, strengthen, years
process, design, project, training, countries
booth, find, presented, June, infrastructure
general, operations, best, exchange, article
close, support, application, download, benefits
future, three, key, role, include
production, press, release, Yokogawa, projects
offer, free, working, exclusive, #LinesShare

Table 5: Sample of topics identified by LDA. Rows correspond to topics, represented by their tokens. Most topics are completely uninformative. There are some useful keywords like "IoT" and "robotics", but they are generally unrelated to the other tokens in their topics.

–, address, questions, co., a/s
visible, LinkedIn, –, join, follow
authority, supervisory, Danish, financial
learn, read, hall, booth, solutions
thx, retweets, favs, follow, learn
learn, read, hall, booth, find
read, hall, booth, forward, stand
hall, booth, read, energy, 's
's, find, DKK, energy, order
find, 's, DKK, energy, solutions
DKK, energy, 's, solutions, industry
's, DKK, energy, solutions, safety
forward, stand, hall, year, future
future, plant, power, industry, forward
video, watch, solutions, find, industry
booth, stand, forward, year, SPS
future, solutions, power, energy, industry
technology, solutions, power, safety, control
industry, solutions, future, power, year
technology, year, register, plant, free
industry, forward, register, year, booth
register, join, technology, solutions, forward
day, SPS, safety, report, help
safety, IIoT, digital, join, technology
control, help, report, annual, year
activate, larger, view, link, year
control, technology, year, day, activate
hannover, SPS, drives, IPC, power
safety, check, hannover, help, register
Hannover, help, join, power, stand
#LifeIsOn, check, digital, join, IoT
digital, help, safety, check, weg
ready, join, register, business, digital
join, technology, plant, register, control
digital, check, control, ready, lifeison
EcoStruxure, digital, check, day, business
help, order, check, business, join
business, work, digital, ready, people
weg, ready, production, register, robotics
ready, EcoStruxure, day, people, free

Table 6: Sample of topics identified by LSA. Rows correspond to topics, represented by their tokens. Most topics are completely uninformative, as in LDA. Many tokens appear in several topics.
periscope, tonnages, percent, grandir
Leterrier, anticipare, targets, #Expo2020
industry-leading, 's, #VoiceActivated, ontvang
negative, #WOTC19, topped, skier
–, #DigitalEngineering, elevate, supervisory
proceso, GDSN, Dornbirner, ZRH
Pinterest, #ThinkTank, #ITxpo, sushi-sensor
Sydney, banana, Venkat, beruf
sequences, arena, anderen, eigentlich
busters, Caxitu, Turbocor, APIs
Santos, core, software-suite, 17
stick, hin, te, abreast
counterpart, walk, #BeeBetter, banken
Jurvetson, runden, entre, genau
invalides, group, dear, anticipare
plant, foothold, –, #SqueezeOut
intracellular, bronze, IoTSWC17, innovative
wenige, drawn, garantendoti, higher
Goodknecht, start, kijken
clothes, dijital, Canadian, roses
functionality, #PowerGrids, Lalbagh, #IndustrialDrives
#EPlanInfoCenterApp, space-savings, learn, Johnson
awaken, terry, sophisticated, Adelaide
Denice, #EatonEngaged, AFDB, signature
railways, #DeviceConfiguration, Pearson, frohe
cyanidation, self-diagnosis, oprex, #GenderEquality
withholding, wet, fossil-fueled, cons
Biraschi, uninsured, seated, provisions
host, svenska, Udo, AICCE
significative, bridge, tube, revisit
inaction, AP02, stabilise, dewatering
//eh.digital/2ldlg65, #EarlyDetection, #UseCases, oyu
proposal, booth, Stahl, built
#WorkTruckWeek, electrical-, arguably, visible
Highgrade, kennt, offering-, Brullon
WBCSD, dreiteiliger, //eh.digital/2ngktwn, ayudan
#EndressHauser, #DigitalTag, retweets, learn
#WeAreStillIn, indutry, #AutoID, shatter
future, foothold, #vision2020, Nooriabad
intuitive, weir, pitfalls, revitalize

Table 7: Sample of topics identified by HDP. Rows correspond to topics, represented by their tokens. While the topics are not informative, they do contain some relevant tokens, such as "#PowerGrids", "#GenderEquality", "fossil-fueled", and "dewatering".