COMBINING AVAILABLE DATASETS FOR BUILDING NAMED ENTITY RECOGNITION MODELS OF CROATIAN AND SLOVENE

Nikola LJUBEŠIĆ, Marija STUPAR, Tereza JURIĆ, Željko AGIĆ
Department of Information and Communication Sciences, Faculty of Humanities and Social Sciences, University of Zagreb

Ljubešić, N., Stupar, M., Jurić, T., Agić, Ž. (2013): Combining Available Datasets for Building Named Entity Recognition Models of Croatian and Slovene. Slovenščina 2.0, 1 (2): 35-57. URL: http://www.trojina.org/slovenscina2.0/arhiv/2013/2/Slo2.0_2013_2_03.pdf.

The paper presents efforts in developing freely available models for named entity recognition and classification in Croatian and Slovene text. Our experiments focus on finding the most informative set of linguistic features, taking into account the availability of language tools and resources for the languages in question. Besides classic linguistic features, distributional similarity features calculated from large unannotated monolingual corpora are exploited as well. We performed two batches of experiments: the first on a self-built dataset, on which the optimal set of features was sought, and a second batch on additional, much larger datasets obtained at a later point, on which we verified the findings of the first batch. On the initial datasets, using distributional information improves the results by 7-8 points of F1, while adding morphological information improves the results by an additional 3-4 points in both languages. The second batch of experiments shows that morphosyntactic and distributional information lose importance as the dataset size increases significantly. The best-performing models, which use distributional information only, are made publicly available for both academic and non-academic use, along with test sets for comparison with existing and future systems.

Keywords: named entity recognition, distributional similarity, Croatian language, Slovene language

1 INTRODUCTION

Named entity recognition and classification (NERC), nowadays often called just named entity recognition (NER), is a subtask of information extraction. It aims to locate and classify text elements into predefined categories and is regularly used as a component of more complex natural language processing systems, relying on statistical or rule-based models. State-of-the-art systems tend to be open-domain and language-independent. This paper presents our efforts in creating NER models for Croatian and Slovene, available for free academic and non-academic use. Besides performing initial experiments on datasets we developed ourselves, we also experiment with multiple recently published datasets for both languages. This gives us a clearer picture of the underlying phenomena and allows us to publish models of greater robustness and higher accuracy than those built on our initial datasets only. The tool we use to build the models is the Stanford Named Entity Recognizer (Stanford NER), nowadays a frequently used tool for NER. It is an implementation of Conditional Random Fields (CRF) sequence models and is available under the GNU GPL license and free for academic use (Finkel et al. 2005). Besides the many feature extractors that come with the tool, it is designed to work with the clustering method proposed by Clark (2003), which combines standard distributional similarity with morphological similarity in order to cover infrequent words for which distributional information alone is unreliable.
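Since the models are trained with Stanford NER, the sketch below illustrates how such a training run can be configured. It is a minimal example under stated assumptions, not the authors' exact setup: the file names, the particular feature flags and the cluster lexicon name are illustrative, while the property keys themselves (trainFile, serializeTo, map, useDistSim, distSimLexicon) are standard Stanford NER options.

```python
# Minimal sketch of training a Stanford NER CRF model with distributional
# similarity features. File names and the chosen feature flags are
# illustrative, not the authors' exact configuration.
import subprocess

properties = """\
# two-column training file: token<TAB>IOB2-label, one token per line
trainFile = hr.train.tsv
serializeTo = hr-ner-distsim.ser.gz
map = word=0,answer=1
# standard word, n-gram and context features
useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
maxLeft = 1
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC
# cluster IDs produced by Clark's tool, stored as a word/cluster lexicon
useDistSim = true
distSimLexicon = hr-100Mw-400.clusters
"""

with open("hr.distsim.prop", "w", encoding="utf-8") as f:
    f.write(properties)

# Train the model; requires Java and the Stanford NER jar on the classpath.
subprocess.run(
    ["java", "-cp", "stanford-ner.jar",
     "edu.stanford.nlp.ie.crf.CRFClassifier", "-prop", "hr.distsim.prop"],
    check=True,
)
```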
This paper is structured as follows: in Section 2 we give an overview of related work, and in Section 3 we present the datasets used in our research. Section 4 describes the experimental setup, Section 5 the results of our initial experiments and Section 6 the results of the experiments on the additional datasets. We lay out the main conclusions in Section 7.

2 RELATED WORK

To our knowledge, there has been some effort in developing NER systems for South Slavic languages, most of it rule-based; statistical approaches have emerged only recently. A rule-based system for Croatian described in (Bekavac 2005) uses regular grammars for the recognition and classification of names over annotated texts. The system contains a module for sentence segmentation, a lexicon of common words, specialized lists of names and transducers for the automatic recognition of certain word forms. A statistical approach described in (Bošnjak 2007) uses a semi-supervised method based on lists of names and an entity extraction system. For Serbian, a rule-based system relying on lexical recognition has been developed (Vitas, Pavlović-Lažetić 2008). The authors point out certain differences between English and Serbian that make the task of building a successful system for Serbian challenging, difficulties shared by the other Slavic languages, which require a more thorough preparation of the system due to their rich inflection. None of the presented systems is available for academic usage, which hinders researchers investigating tasks that require NER as a preprocessing step. One of the main intentions of our research is to improve this situation.

In the process of building a good NER system, features are considered as important as the selection of the machine learning algorithm. The aim is to find an optimal set of features that ensures the highest system accuracy with minimum complexity in classifier building. Several NER approaches use a very large number of features (Mayfield et al. 2003), but the inclusion of additional features beyond a certain point can even yield worse results. In this paper we use the Stanford NER property files obtained in our previous research (Filipić et al. 2012), along with its findings about the best-performing settings for Croatian and Slovene, which include POS, MSD and distributional similarity features. The only work we are aware of that examines the usage of distributional features in Stanford NER is (Faruqui, Padó 2010). The paper describes the process of building and optimizing NER models for German; by using distributional features F1 is improved by 6% in-domain and 9% out-of-domain.

The latest approach in Croatian NER research, called CroNER (Glavaš et al. 2012), is based on supervised learning using CRF. The observed classes include personal names, ethnics, percentages, locations, organizations, dates, and monetary and temporal expressions. On a Vjesnik corpus containing 310,000 tokens the system reaches an (exact) F1 measure of 87.42%. The system uses gazetteer-based features for personal names, ethnicity, city, state, street and organization names, and a rich set of lexical and morphological features specific to Croatian. The authors also defined a special named entity type to cover instances of possessive adjectives. According to the authors, two different methods for document-level consistency of NE labels are implemented: post-processing rules (a hard consistency constraint) and a two-stage CRF (a soft consistency constraint).
Post-processing rules are hand-crafted patterns designed to extract or re-label named entities omitted or misclassified by the CRF model. The two-stage CRF aims to consolidate NE label predictions on the document and corpus level by employing a second CRF model that uses features computed from the output of the first CRF model.

The NER model for Slovene (Štajner et al. 2012), developed using Mallet (McCallum 2002), also implements a CRF supervised learning algorithm. The research was carried out on the Slovene SSJ500k corpus (http://www.slovenscina.eu/tehnologije/ucni-korpus), annotated with morphosyntactic tags and three named entity classes, and has shown that the inclusion of morphosyntactic tag features benefits named entity extraction. The system reaches a precision of 77% and a recall of 76%, with stronger performance on personal and geographical named entities than on other entities, since the class of other entities (everything that is neither a person nor a location) is very diverse and difficult to predict.

3 CORPORA

The two initial datasets used in the first batch of our experiments, one Croatian (HR) and one Slovene (SL), were built during a student project (Filipić et al. 2012) from data taken from specific Internet domains of the Croatian and Slovene web corpora hrWaC and slWaC (Ljubešić, Erjavec 2011). The Croatian corpus (HR) contains 59,212 tokens taken from four different Internet domains covering two general newspaper portals, nacional.hr and jutarnji.hr, the ICT portal bug.hr and the business news portal poslovni.hr. The data was annotated during a student project in which data diversity was given special emphasis. The Slovene corpus (SL) is almost two thirds the size of the Croatian one, containing 37,032 tokens from a single general news portal, rtvslo.si. While selecting the Slovene data the main goal was to build a usable dataset with the limited annotation capacities available.

These are exciting times for natural language processing in both Croatia and Slovenia: after we finished our initial batch of experiments, three additional datasets - two for Croatian and one for Slovene - emerged, all published under quite permissive licenses. The additional Croatian datasets are the SETimes and Vjesnik corpora. SETimes is a newspaper-domain corpus consisting of general news articles written in Croatian, originally extracted from the "Southeast European Times" web portal. It contains 178,982 tokens and has the highest density of named entities of all the datasets used. The Vjesnik corpus covers two main text domains - internal affairs and a mixture of other domains, evenly distributed between culture, foreign affairs and other news, lifestyle and sports. The corpus contains 104,494 tokens. Text collection was performed with a custom crawler; the texts were then cleaned, sentence split and tokenized using the Apache OpenNLP tools (http://opennlp.apache.org/) and POS/MSD annotated using the CroTag MSD tagger (Agić et al. 2008). Related experiments with Croatian NER using the Vjesnik corpus are described in (Agić, Bekavac 2013). The additional resource obtained for Slovene is a part of the SSJ500k corpus, available under the Creative Commons CC-BY-NC-SA license. It is manually lemmatized and morphosyntactically tagged, and in part dependency parsed and annotated for named entities. Named entities are classified into four classes: names of persons, locations and organizations, and other entities.
In our research we use only the part of the corpus annotated with named entities, which is 118,609 tokens in size. The amount of data in all datasets for both languages is given in Tables 1 and 2. All corpora were tagged using the IOB2 standard following the CoNLL 2003 annotation guidelines (http://www.cnts.ua.ac.be/conll2003/ner/annotation.txt), where each row represents a token in the text with its linguistic annotation and its predefined named entity category. IOB2 labels show whether a word is at the beginning (B), inside (I) or outside (O) of a named entity. Both initial datasets (HR and SL) were annotated with the four traditional categories - location (LOC), organization (ORG), person (PERS) and miscellaneous (MISC). The additional Slovene dataset (SSJ) contains the same four categories, while the two additional Croatian datasets (SETimes and Vjesnik) have only the basic three categories annotated - location (LOC), organization (ORG) and person (PERS). Possessive adjectives indicating named entities are additionally annotated in the SETimes, Vjesnik and SSJ datasets, as is the case in the initial HR and SL datasets (Filipić et al. 2012). Basic part-of-speech information (the first letter of the MULTEXT-East MSD tag) (Erjavec et al. 2003) was manually annotated on the HR corpus since related work shows that these features are useful for the task. The Slovene SL corpus was MSD-tagged and lemmatized with the freely available ToTaLe tagger (Erjavec et al. 2005) trained on JOS corpus data (Erjavec et al. 2010).

          HR        SETimes   Vjesnik
Token #   59,212    178,982   104,494
ORG       839       4,686     1,875
PERS      602       3,761     2,317
LOC       590       5,746     2,055
MISC      632       -         -
ALL       2,663     14,193    6,247
Density   0.045     0.0793    0.0598

Table 1: Size of the Croatian corpora and the number of annotated named entities.

          SL        SSJ
Token #   37,032    118,609
ORG       311       804
PERS      1,086     2,008
LOC       716       1,284
MISC      378       406
ALL       2,491     4,502
Density   0.067     0.038

Table 2: Size of the Slovene corpora and the number of annotated named entities.

To be able to use POS information on unseen Croatian data, we trained a model for the HunPos tagger (Halácsy et al. 2007) on the initial Croatian dataset. We performed a simple test of the resulting model by dividing the dataset into a training and a test set with a 9:1 ratio; the accuracy obtained on the test set was 95.1%. We publish the tagger trained on all available data along with the NER models and the benchmark datasets. To our knowledge, this is the first freely available part-of-speech tagger for Croatian (a full MSD tagger and lemmatizer have since been developed and published at http://nlp.ffzg.hr/resources/models/).

The expected difference in the diversity of the initial datasets can be clearly observed from the number of annotated named entities in each corpus. First of all, although the SL corpus has 37% less textual material than HR, it has just 6% fewer named entities, showing the higher density of named entities one would expect from a straightforward newspaper dataset. Furthermore, when we look at the types of named entities, we can observe that the SL dataset contains many more person names and slightly more locations, while the Croatian dataset contains more organization names and named entities labelled with the miscellaneous category. This confirms our assumption that the Croatian dataset is much more diverse and will thereby present a harder task for supervised classification in the initial experiments. The additional Croatian datasets are newspaper datasets and, unsurprisingly, show a high density of named entities.
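To make the IOB2 encoding concrete, the short sketch below converts entity spans into IOB2 labels. The example sentence and the spans_to_iob2 helper are purely illustrative and not taken from any of the corpora.

```python
# Minimal sketch of IOB2 encoding: B- marks the first token of an entity,
# I- the following tokens of the same entity, and O everything else.
def spans_to_iob2(tokens, spans):
    """spans: list of (start, end, label) token index ranges, end exclusive."""
    labels = ["O"] * len(tokens)
    for start, end, label in spans:
        labels[start] = "B-" + label
        for i in range(start + 1, end):
            labels[i] = "I-" + label
    return labels

tokens = ["Ivan", "Horvat", "radi", "u", "Zagrebu", "."]
spans = [(0, 2, "PERS"), (4, 5, "LOC")]
for token, label in zip(tokens, spans_to_iob2(tokens, spans)):
    print(token, label)
# Ivan B-PERS / Horvat I-PERS / radi O / u O / Zagrebu B-LOC / . O
```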
This density is even higher than in the initial Croatian dataset, despite the fact that the additional datasets do not contain the miscellaneous category. The additional Slovene dataset is rather sparse in named entities, having roughly half the density of the initial Slovene dataset. This should not come as a surprise since the SSJ dataset is a reference corpus and therefore does not contain only newspaper texts, as all the other datasets do.

We divided the initial corpora into development and test sets by shuffling documents and producing test sets of similar size for both languages. The decision to build test sets of similar size was guided by the idea of publishing those test sets as benchmark datasets for both languages. For that reason the HR development set contains 53,142 tokens while the SL one contains 29,686 tokens, i.e. 56% of the amount of Croatian data. An additional insight into the properties of the two initial datasets is obtained by calculating the vocabulary transfer between identically sized portions of the development and test sets. The numbers are given in Table 3. Vocabulary transfer is calculated as the percentage of named entity tokens and types in the test set that are already present in the development set (see the sketch below). Two interesting properties can be observed here. First of all, the Slovene vocabulary transfer is higher than the Croatian one, pointing to the expected lower content diversity of the Slovene data. Secondly, there is almost no difference between token and type transfer on the Croatian data, showing that the diversity of named entities is really high. Namely, this points to the fact that almost none of the named entities from the development set that appear in the test set appear more than once in the Croatian test set, which is not the case in the Slovene data.

Corpus   Token transfer   Type transfer
HR       10.7%            10.6%
SL       17.3%            12.4%

Table 3: Vocabulary transfer for the initial corpora on identically sized portions of the development and test sets.

We built additional test sets once we obtained the additional datasets, by adding a similar amount of data from each dataset to a joint test set. The Croatian test set includes 6,730 tokens from the Vjesnik corpus and 6,736 tokens from the SETimes corpus, along with the previously mentioned initial HR test set. Since the MISC category is not present in the additional Croatian datasets, the extended Croatian test set naturally does not contain that category. The additional Slovene test set contains 6,981 tokens from the SSJ corpus, along with the initial SL test set. The MISC category is retained in that test set since both datasets contain it. For calculating the distributional similarity of tokens from large monolingual corpora, portions of the hrWaC and slWaC web corpora were used. For Croatian we built a 100Mw corpus and for Slovene a 50Mw corpus, both containing data from large news portals.

4 EXPERIMENTAL SETUP

Since different annotation levels were available on the initial Croatian and Slovene datasets, in the first batch of experiments we evaluated different settings for each language on the HR and SL corpora. Besides part-of-speech information for both languages, MSD and lemma information was present on the SL data as well. On HR data we experimented with POS information ("POS") and distributional information ("DISTSIM") calculated from 10Mw, 50Mw and 100Mw corpora, while on Slovene data we experimented with POS, MSD ("MSD") and lemma ("LEMMA") information and distributional information obtained from 10Mw and 50Mw corpora.
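The following is a minimal sketch of the vocabulary transfer measure described in Section 3, assuming that named entities are given as lists of surface strings; the example values are invented for illustration.

```python
# Minimal sketch of the vocabulary transfer measure from Section 3: the share
# of named entity tokens (mentions) and types (distinct strings) in the test
# set that already occur in the development set. The example data is invented.
def vocabulary_transfer(dev_entities, test_entities):
    dev_types = set(dev_entities)
    token_transfer = sum(e in dev_types for e in test_entities) / len(test_entities)
    test_types = set(test_entities)
    type_transfer = len(test_types & dev_types) / len(test_types)
    return token_transfer, type_transfer

dev = ["Zagreb", "Ljubljana", "Ivan Horvat", "Zagreb"]
test = ["Zagreb", "Zagreb", "Maribor", "Ana Novak"]
print(vocabulary_transfer(dev, test))  # (0.5, 0.333...)
```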
We thereby performed 8 initial experiments on HR data and 11 initial experiments on SL data (we eliminated the experiments varying the availability of lemma information once it proved to be non-informative). All the experiments were performed on the development sets of both datasets via 5-fold cross-validation that takes document borders into account. By respecting document borders we were trying to keep the vocabulary transfer as low as possible and thereby obtain the most realistic results, i.e., differences between the experimental settings. Distributional similarity was calculated using Clark's cluster_neyessen tool (Clark 2003) with default settings (numberStates=5, frequencyCutoff=5, iterations=10). The number of resulting clusters was set to the best-performing values in (Faruqui, Padó 2010), i.e., 100 clusters were built for the 10Mw corpora and 400 clusters for the 50Mw and 100Mw corpora. The first twenty elements of example clusters calculated from the Croatian 100Mw and Slovene 50Mw corpora are given in Table 4. The Croatian cluster contains exclusively country and city names in the locative (or dative) case. The Slovene cluster contains first names of people, of both Slovene and English origin, in the nominative case. These examples show very clearly how the cluster ID can be used as a very informative feature in the supervised training procedure. After identifying the best-performing settings on the development sets we calculated our final results by training a system on the whole development set and testing it on the left-out initial test set.

Croatian (100Mw): njemačkoj rijeci londonu sarajevu osijeku italiji zadru francuskoj haagu austriji parizu dubrovniku vukovaru španjolskoj milanu bruxellesu rimu beču moskvi berlinu
Slovene (50Mw): tomaž simon goran martina dejan jan nina tom saša mojca vesna jurij eva nataša maria jernej daniel richard thomas damjan žiga

Table 4: First 20 elements of sample clusters obtained with Clark's tool on the 100Mw Croatian and the 50Mw Slovene corpus. The Croatian cluster contains exclusively country and city names, and the Slovene cluster contains first names of people of both Slovene and English origin.

Obtaining additional datasets for both languages at a later point enabled us to perform an additional batch of experiments and re-examine the findings of the initial ones. We built additional test sets containing left-out data from all datasets, as described in the previous section, and performed calculations with the few most promising settings from the initial batch of experiments. The important difference between the two languages in the second batch of experiments is that the Croatian dataset does not contain the miscellaneous category while the Slovene dataset does. Finally, we compared the results obtained with different amounts of annotated data for both languages under the best-performing settings to identify the gain we can expect from adding more annotated data.

5 RESULTS OF THE FIRST BATCH OF EXPERIMENTS

The results obtained by 5-fold cross-validation on both development sets are presented for Croatian in Figure 1 and for Slovene in Figure 2. The results of the cross-validation folds are averaged by calculating their harmonic mean. Regarding the statistical significance of the results, we perform a one-tailed paired t-test over pairs of results we find interesting. On the Croatian results we can already observe in the second experiment (POS) that basic morphological information improves F1 by 4.5% in this simple setting (p = 0.002).
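The one-tailed paired t-test used for these comparisons can be reproduced along the following lines; this is a minimal sketch with invented per-fold F1 values, and it assumes SciPy 1.6 or newer for the alternative parameter.

```python
# Minimal sketch of a one-tailed paired t-test over per-fold F1 scores of two
# settings evaluated on the same 5 cross-validation folds. The per-fold values
# are invented for illustration; they are not the paper's actual fold results.
from scipy.stats import ttest_rel

f1_pos = [0.57, 0.55, 0.58, 0.56, 0.54]    # hypothetical per-fold F1, POS setting
f1_clean = [0.52, 0.51, 0.54, 0.53, 0.50]  # hypothetical per-fold F1, CLEAN setting

# alternative="greater" requires SciPy >= 1.6; with older versions, halve the
# two-tailed p-value when the mean difference is in the expected direction.
stat, p = ttest_rel(f1_pos, f1_clean, alternative="greater")
print(f"t = {stat:.3f}, one-tailed p = {p:.4f}")
```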
Our third experiment (DISTSIM 10M) shows that using distributional information obtained from a 10 million token corpus improves the result as much as the part-of-speech information does, with similar significance (p = 0.005). By combining both features we improve our results by 8.5%, significantly more than when using only one of those features (p < 0.001).

Figure 1: F1 results obtained via 5-fold cross-validation on the Croatian development set.

By calculating distributional information on five and ten times more data we obtain improvements of 2% and 3% when not using part-of-speech information and of 1% and 2% when using part-of-speech information. The differences between neighbouring corpus sizes (10 and 50; 50 and 100) are not statistically significant, but the difference between using the 10Mw and the 100Mw corpus is (p = 0.007). We see a steady rise in performance as the unlabelled monolingual corpus grows, motivating us to perform similar calculations on much larger datasets in the future.

The results on the Slovene data regarding the categories present in the Croatian data are rather similar, backing up those findings. There are two types of information in the Slovene data that we did not have for Croatian - MSD and lemma. By using MSD rather than only POS information the results do improve by an additional 1%, but the difference is statistically insignificant (p = 0.21). On the contrary, adding lemma information to the MSD features decreases the result significantly, by 5.5% (p = 0.007). One could expect such an outcome since lemmatization performs worst on named entities. By adding more distributional information, i.e., moving from a 10Mw to a 50Mw corpus, we achieve an improvement of 5%, which is even steeper than the one obtained on Croatian data and highly significant (p < 0.001).

Figure 2: F1 results obtained via 5-fold cross-validation on the Slovene development set.

This could be explained by the greater simplicity of this dataset and its similarity to the monolingual corpus used for the distributional similarity calculation, suggesting that for datasets of narrower domains such additional data sources yield a larger improvement. On both datasets we can observe that, when distributional similarity from larger corpora is used, including additional features like POS or MSD accounts for a smaller increase in the results. When comparing the results on the Croatian and Slovene datasets one observes right away that the results on the Slovene data are much better, although the dataset is only a little over half the size. This can be traced back to the fact that the Slovene dataset covers a narrower domain, has a higher vocabulary transfer and contains more named entities of the person and location types, which are considered easier to recognize and classify. On the other hand, the resulting Croatian model is expected to be more robust and should perform better across different domains.
HR DISTSIM 100Mw
Entity   P        R        F1       TP    FP    FN
LOC      0.8049   0.7021   0.7500   33    8     14
MISC     0.7436   0.3867   0.5088   29    10    46
ORG      0.6742   0.6250   0.6486   60    29    36
PERS     0.9032   0.5185   0.6588   28    3     26
Totals   0.7500   0.5515   0.6356   150   50    122

HR POS DISTSIM 100Mw
Entity   P        R        F1       TP    FP    FN
LOC      0.8293   0.7234   0.7727   34    7     13
MISC     0.7778   0.4667   0.5833   35    10    40
ORG      0.6989   0.6771   0.6878   65    28    31
PERS     0.8500   0.6296   0.7234   34    6     20
Totals   0.7671   0.6176   0.6843   168   51    104

SL DISTSIM 50Mw
Entity   P        R        F1       TP    FP    FN
LOC      0.7423   0.7273   0.7347   72    25    27
MISC     0.5000   0.2143   0.3000   15    15    55
ORG      0.8947   0.3617   0.5152   17    2     30
PERS     0.8966   0.8509   0.8731   234   27    41
Totals   0.8305   0.6884   0.7528   338   69    153

SL MSD DISTSIM 50Mw
Entity   P        R        F1       TP    FP    FN
LOC      0.7957   0.7475   0.7708   74    19    25
MISC     0.4688   0.2419   0.3191   15    17    47
ORG      0.8947   0.3617   0.5152   17    2     30
PERS     0.8619   0.8400   0.8508   231   37    44
Totals   0.8180   0.6977   0.7531   337   75    146

Table 5: Test results of the four best performing models (P - precision, R - recall, F1 - F1 measure, TP - true positives, FP - false positives, FN - false negatives).

The results given in Figures 1 and 2 are obtained via cross-validation, i.e., by evaluating five models built on different data on five different evaluation sets. We chose two settings per dataset for final testing on the left-out test set. The first uses distributional information only, removing the need for morphological annotation of the data, while the second uses both distributional and morphological information. We present precision, recall, F1, true positives, false positives and false negatives by category in Table 5. On both datasets and in both settings the number of false negatives is higher than the number of false positives, with higher precision than recall as a direct consequence. On the Slovene data the best-performing categories are PERS, LOC, ORG and then MISC. On the Croatian data LOC tends to perform best, with ORG and PERS forming a tie and MISC traditionally being the worst category. The somewhat unexpected order of category performance on the Croatian dataset can probably be traced back to the wider domain of that dataset.

6 RESULTS OF THE SECOND BATCH OF EXPERIMENTS

In the second batch of experiments we used the secondary test sets consisting of the left-out parts of all datasets used for training the models. We experimented with part-of-speech and distributional information since these features proved to be the most promising in the first batch of experiments. An additional reason not to include full MSD information is the result of an experiment in which we assessed the usefulness of MSD information on larger datasets such as the SETimes corpus, which is partially manually and partially automatically annotated with full morphosyntactic information. We used 5-fold cross-validation on the dataset and the differences between using POS and MSD were consistently below 1%. This is in line with our overall finding from the initial batch of experiments that additional linguistic features lose importance as the amount of annotated data increases. The results for the specific datasets are given in Table 6. The distributional information proves to be of greater importance than the part-of-speech information on all datasets. Combining the two improves the results only slightly compared to using distributional information alone. This is a very useful finding since it enables us to build final models that do not rely on part-of-speech information and thereby do not require such pre-processing.
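The per-category scores in Table 5 follow directly from the TP, FP and FN counts; below is a minimal sketch, checked against the LOC row of the HR POS DISTSIM 100Mw block.

```python
# Minimal sketch of precision, recall and F1 computed from true positive,
# false positive and false negative counts, as reported in Table 5.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# LOC row of the "HR POS DISTSIM 100Mw" block: TP=34, FP=7, FN=13
p, r, f1 = prf(34, 7, 13)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")  # P=0.8293 R=0.7234 F1=0.7727
```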
            HR       Vjesnik   SETimes   SL       SSJ      HARMN
CLEAN       0.525    0.721     0.801     0.579    0.598    0.630
POS         0.577    0.732     0.811     0.573    0.587    0.642
DIST        0.624    0.796     0.844     0.635    0.666    0.702
POS DIST    0.663    0.786     0.846     0.641    0.647    0.707

Table 6: F1 test results on all available corpora based on the secondary test sets.

Distributional similarity yields a larger improvement on the smaller datasets, the difference being 9.95% on HR, 7.53% on the Vjesnik dataset and just 4.35% on the largest and densest SETimes dataset. POS features also seem to lose significance on the bigger datasets. The SSJ corpus, although more than three times bigger than the SL corpus, yields only slightly better performance in recognizing named entities. The reason for this is the low density of named entities in that dataset, showing that newspaper corpora are better suited than reference corpora for annotating and modelling this phenomenon. The overall lower results on the Slovene datasets can be traced back to their smaller size, the inclusion of the miscellaneous category and the lower usefulness of the SSJ dataset for the task at hand. The harmonic mean of the F1 measures over all corpora (HARMN) sums up our main findings - POS information has a small positive impact while DIST information has a very significant impact, improving the result by 7 points. Combining the two does not yield enough improvement to justify the pre-processing step of part-of-speech tagging.

We performed one final experiment in which we combined all Croatian and all Slovene datasets into one dataset per language. Those results are shown in Figure 3, which depicts the F1 evaluation result as a function of the number of annotated named entities in each dataset. This plot also shows the possible impact of adding more training data.

Figure 3: F1 measure as a function of the number of named entities (datasets shown: HR, Vjesnik, SETimes, HR+Vjesnik+SETimes, SL, SSJ, SL+SSJ).

On the Croatian datasets we can observe a typical logarithmic behaviour with obvious room for improvement by annotating even more data. The Slovene datasets, compared to the Croatian ones, are all small and close to each other in the amount of annotated named entities, and are obviously still in the strong growth phase, so building a larger Slovene dataset should be an even higher priority than building a larger Croatian one. The slower rise of the Slovene learning curve can be traced back to two specific properties of those datasets - the inclusion of the miscellaneous category and the lower density of named entities, which provides fewer positive examples in the training set and leaves more room for errors in the test set. Detailed results on the combined corpora are given in Table 7. The combined Croatian corpora with DISTSIM features and three named entity classes (ORG, PERS, LOC) yielded an F1 score of 89.8%. The results show high recall for all named entity categories. The best-performing category was LOC, followed by PERS and lastly ORG.
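The HARMN column of Table 6 is the harmonic mean of the per-corpus F1 scores; a minimal sketch, checked against the CLEAN row, follows.

```python
# Minimal sketch of the harmonic mean used for the HARMN column of Table 6,
# checked against the CLEAN row (HR, Vjesnik, SETimes, SL, SSJ).
def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)

clean = [0.525, 0.721, 0.801, 0.579, 0.598]
print(round(harmonic_mean(clean), 3))  # 0.63
```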
HR + Vjesnik + SETimes, DISTSIM 100Mw
Entity   P        R        F1       TP     FP    FN
LOC      0.9056   0.9467   0.9257   355    37    20
ORG      0.8875   0.8282   0.8568   347    44    72
PERS     0.9083   0.9269   0.9175   317    32    25
Totals   0.9002   0.8970   0.8986   1019   113   117

SL + SSJ, DISTSIM 50Mw
Entity   P        R        F1       TP     FP    FN
LOC      0.6794   0.8114   0.7396   142    67    33
MISC     0.2917   0.1538   0.2014   14     34    77
ORG      0.7391   0.3493   0.4744   51     18    95
PERS     0.8224   0.8674   0.8443   301    65    46
Totals   0.7341   0.6693   0.7002   508    184   251

Table 7: Test results on the combined corpora (P - precision, R - recall, F1 - F1 measure, TP - true positives, FP - false positives, FN - false negatives).

The combined Slovene corpora with DISTSIM features and four named entity classes (ORG, PERS, LOC, MISC) showed a smaller improvement, due to the corpora size and the number of observed named entity classes. The overall F1 result was 70.02%. The hardest class is, expectedly, MISC, while the best results are obtained for the PERS class. The low recall for the ORG and especially the MISC class could indicate the system's partial inability to distinguish between those two classes.

7 CONCLUSION

In this paper we have presented the process of building freely available models for named entity recognition and classification for Croatian and Slovene. We built two initial datasets, one for Croatian, which is larger and covers a broader domain, and one for Slovene, which is smaller and covers just the general news domain. We searched for the optimal set of features on the development sets via five-fold cross-validation. Lemmata have proven to be of no use for a morphologically complex language such as Slovene since lemmatization tends to work worst on word classes such as named entities. On the other hand, morphological information such as POS tags or full MSD tags proved to be valuable, with the latter being more informative; that type of information improved the F1 measure in a 3-5% window. Clustering tokens from a large monolingual corpus by contextual and morphological properties has proven to be beneficial, improving the results by 3-4% when using 10Mw corpora, and the results continue to improve steadily with clusters calculated from larger corpora. Combining both morphological and clustering information proved to be the winning combination, with an overall improvement of 10% on the datasets of both languages. By omitting the morphological information, for which pre-processing is required, we still get an improvement of 8%.

The second batch of experiments included two additional datasets for Croatian and one for Slovene. By repeating the most promising settings from the first batch on this collection of datasets we gained a better insight into the best-performing settings. The results have shown that the impact of part-of-speech information is much lower than that of distributional similarity. Both features lose importance as the dataset size increases. Combining the two features proved to be very similar to using distributional information only, which is why the final models we publish do not require part-of-speech tagging but do include the distributional information. By analysing the relation between dataset size and the obtained results we conclude that for both languages additional annotated data would yield an improvement. The Slovene model in particular could easily be improved with additional data of higher named entity density than the SSJ corpus.
Finally, we are releasing three models - two for Croatian and one for Slovene - all of them using only distributional information as an additional feature and thereby not relying on any pre-processing except tokenization. Of the two Croatian models, one covers the MISC category but is trained on a much smaller amount of data, while the other does not cover the MISC category but is trained on a much larger amount of data and is thereby more accurate. The Slovene model covers all four traditional categories. The models - together with the initial and the extended test sets - can be obtained from http://nlp.ffzg.hr/resources/models/ner/. In the future we plan to increase the amount of annotated training data by exploiting semi-supervised approaches and to add the MISC category to the whole dataset. Additionally, we plan to calculate distributional similarity on larger corpora and to consider variations of the distributional similarity method used in this paper.

REFERENCES

Agić, Ž., and Bekavac, B. (2013): Domain Dependence of Statistical Named Entity Recognition and Classification in Croatian Texts. Proceedings of the 35th International Conference on Information Technology Interfaces (ITI 2013): 277-282. Cavtat.
Agić, Ž., Dovedan, Z., and Tadić, M. (2008): Improving Part-of-Speech Tagging Accuracy for Croatian by Morphological Analysis. Informatica, 32(4): 445-451.
Bekavac, B. (2005): Strojno prepoznavanje naziva u suvremenim hrvatskim tekstovima: Ph.D. Thesis. Zagreb: University of Zagreb.
Bošnjak, M. (2007): Strojno prepoznavanje naziva tehnikama strojnog učenja: Master's Thesis. Zagreb: University of Zagreb.
Clark, A. (2003): Combining Distributional and Morphological Information for Part of Speech Induction. Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics: 59-66. Budapest.
Erjavec, T., Fišer, D., Krek, S., and Ledinek, N. (2010): The JOS Linguistically Tagged Corpus of Slovene. International Conference on Language Resources and Evaluation: 1806-1809. Valletta.
Erjavec, T., Ignat, C., Pouliquen, B., and Steinberger, R. (2005): Massive Multilingual Corpus Compilation: Acquis Communautaire and ToTaLe. The 2nd Language & Technology Conference - Human Language Technologies as a Challenge for Computer Science and Linguistics: 32-36. Poznań.
Erjavec, T., Krstev, C., Petkevič, V., Simov, K., Tadić, M., and Vitas, D. (2003): The MULTEXT-East Morphosyntactic Specifications for Slavic Languages. MorphSlav '03: Proceedings of the 2003 EACL Workshop on Morphological Processing: 25-32. Stroudsburg.
Faruqui, M., and Padó, S. (2010): Training and Evaluating a German Named Entity Recognizer with Semantic Generalization. Proceedings of the 10th Conference on Natural Language Processing (KONVENS) 2010. Saarbrücken.
Finkel, J. R., Grenager, T., and Manning, C. (2005): Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005): 363-370. Stroudsburg.
Filipić, L., Jurić, T., and Stupar, M. (2012): Strojno prepoznavanje naziva u tekstovima pisanima hrvatskim jezikom. Zagreb: Sveučilište u Zagrebu.
Glavaš, G., Karan, M., Šarić, F., Šnajder, J., Mijić, J., Šilić, A., and Dalbelo Bašić, B. (2012): CroNER: A State-of-the-Art Named Entity Recognition and Classification for Croatian. Proceedings of the Eighth Language Technologies Conference: 73-78. Ljubljana.
Halácsy, P., Kornai, A., and Oravecz, C. (2007): HunPos: An Open Source Trigram Tagger. Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07: 209-212. Stroudsburg.
Ljubešić, N., and Erjavec, T. (2011): hrWaC and slWaC: Compiling Web Corpora for Croatian and Slovene. Text, Speech and Dialogue - 14th International Conference, TSD 2011: 395-402. Pilsen.
McCallum, A. K. (2002): Mallet: A Machine Learning for Language Toolkit. Available at: http://mallet.cs.umass.edu (5 June 2013).
Štajner, T., Erjavec, T., and Krek, S. (2012): Razpoznavanje imenskih entitet v slovenskem besedilu. Proceedings of the 15th International Multiconference on Information Society - Jezikovne tehnologije: 191-197. Ljubljana.
Vitas, D., and Pavlović-Lažetić, G. (2008): Resources and Methods for Named Entity Recognition in Serbian. INFOtheca - Journal of Informatics and Librarianship, 9(1-2): 35a.

BUILDING NAMED ENTITY RECOGNITION MODELS FOR CROATIAN AND SLOVENE

The paper presents the development of freely available models for the recognition and classification of named entities for the Croatian and Slovene languages. The experiments focus on the most informative linguistic features, taking into account the availability of language tools for both languages. Besides standard linguistic features, distributional features computed from large unannotated monolingual corpora are taken into account as well. Using distributional features improves the results by 7-8 points of F1, and using morphological information by an additional 3-4 points, for both languages. The best trained model, together with a test set for comparison with existing and future systems and a model for the morphological tagging of Croatian with HunPos, is available for download for academic and commercial use.

Keywords: named entity recognition, distributional features, Croatian language, Slovene language

This work is licensed under the Creative Commons Attribution-ShareAlike 2.5 Slovenia License. http://creativecommons.org/licenses/by-sa/2.5/si/