GEODETSKI VESTNIK | Vol. 60 | No. 3

NOVEL TOOL FOR EXAMINATION OF DATA COMPLETENESS BASED ON A COMPARATIVE STUDY OF VGI DATA AND OFFICIAL BUILDING DATASETS
(Slovenian title: PREUČEVANJE POPOLNOSTI PODATKOV NA PODLAGI PRIMERJALNE ANALIZE MED PODATKI VGI IN URADNIMI PODATKOVNIMI NIZI O STAVBAH)

Joanna Nowak Da Costa

UDK: 551.506:725.1
Classification according to COBISS.SI: 1.01 (SCIENTIFIC ARTICLE)
Received: 2. 3. 2016
Accepted: 9. 9. 2016
DOI: 10.15292/geodetski-vestnik.2016.03.495-508

ABSTRACT
The goal of this study was a better understanding of the quality of Volunteered Geographic Information (VGI) and, by extension, its utility. The research focused on the applicability of OpenStreetMap (OSM) building data for official spatial databases. In terms of feature completeness, the achieved results are in line with other similar studies. The study concluded that in town centres the completeness of OSM data is relatively high but decreases further away from urban centres. It demonstrated that attribute completeness also depends on the level of urbanization as well as on the nature of the attribute. Furthermore, a very high overall positional accuracy was determined for OSM building data, ranging between 0.6 m in urban areas and 1.7 m in rural areas. This result is more than five times better than the frequently cited OSM accuracy results obtained by Haklay in 2010. In this work, a novel tool is introduced to help assess the completeness of OSM building-tagged features. The proposed index, called the matching feature area-based completeness, estimates the completeness of any areal feature set. This index is also flexible because it is affected neither by discrepancies in the feature outline modelling nor by the degree of abstraction. In addition, the author proposes a simple method to update the official register using the large volume of OSM building data "over-completeness" together with the building data excess indicator.
KEY WORDS
quality, spatial data, data completeness, OpenStreetMap, volunteered geographic information

Joanna Nowak Da Costa | PREUČEVANJE POPOLNOSTI PODATKOV NA PODLAGI PRIMERJALNE ANALIZE MED PODATKI VGI IN URADNIMI PODATKOVNIMI NIZI O STAVBAH | NOVEL TOOL FOR EXAMINATION OF DATA COMPLETENESS BASED ON A COMPARATIVE STUDY OF VGI DATA AND OFFICIAL BUILDING DATASETS | 495-508 | GEODETSKI VESTNIK 60/3

1 INTRODUCTION

The development and spread of information and communication technologies, along with the growing ability of the public to use them, has not been without impact on geospatial mapping. It appears that virtually everyone, regardless of their education, knowledge and experience, can collect spatial information, for example while walking or cycling with a GPS-equipped mobile phone, and produce social network maps. This trend was defined as "neogeography" (Turner, 2006), "crowd sourcing" (in the Web 2.0 setting), or Volunteered Geographic Information (VGI). The latter term is used particularly in relation to spatial data collected voluntarily and free of charge by a large number of volunteers (Goodchild, 2007).

The OpenStreetMap (OSM) initiative, the most extensive VGI representative in terms of the number of involved users and the volume of data created, has already attracted academic and commercial interest. However, since the geospatial contributions, skill levels and motivations of OSM communities change over time, monitoring and updated data quality research are necessary to understand the applicability of this important dataset. Data quality is relative to the users' needs; it is neither independent nor absolute (Cooper et al., 2012; Bielecka et al., 2014).

The aim of the study was to understand the applicability of OpenStreetMap building data and to assess its quality, specifically with regard to its potential use as complementary or input data for official spatial databases.
It focuses in particular on the Polish Database of Topographic Objects for buildings in the Polish county of Siedlce. The study introduces a novel tool that helps to assess the completeness of OSM building-tagged features. The proposed index estimates the completeness of any areal feature set, and it is affected neither by discrepancies in the feature outline modelling nor by the degree of abstraction. Furthermore, the study proposes a simple method to improve the updating of official data based on the volume of building data missing from the official database, that is, OSM "over-completeness".

First, the paper reviews related research in Section 2. Next, it presents the method chosen for the OSM building data quality analysis in Section 3. In Section 4, the paper introduces the study area and the characteristics of the datasets. In Sections 5 through 7 the study focuses on thematic and positional accuracy as well as feature completeness. The paper presents its concluding remarks in Section 8.

2 RELATED WORK

The OpenStreetMap (OSM) project, whose mission is to create a free, digital, open and editable map of the world and to provide a ready-made map or geographic dataset to anyone who wants it, relies on contributions from volunteers. User-generated geographic information involves many forms of contribution, such as online mapping or georeferencing of existing data sources like aerial images, as well as the collection of data through the user's location-enabled smartphone. The OSM approach to creating and managing maps and geographic datasets was intuitive rather than calculated (Coote and Rackham, 2008), which caused concern among GI experts regarding the quality and usability of such data. As a result, the credibility and quality of OSM data, and of VGI in general, are being increasingly studied by researchers (Elwood et al., 2012; Flanagin and Metzger, 2008).

The road network is the most frequently analysed OSM data. In most cases, OSM road data was compared to official datasets.
However, the choice of reference data for quality control of data collected by non-professional surveyors is problematic because of its heterogeneity, as noted also by Goodchild and Li (2012), Haklay (2010), Goodchild and Glennon (2010), and Dorn et al. (2015). Studies of OSM road completeness concluded that it is heterogeneous: much higher in big cities, lower in towns, and lowest in rural areas (Haklay, 2010; Girres and Touya, 2010; Esmaili et al., 2013; Zielstra et al., 2013). In terms of the positional accuracy of OSM road data, they concluded that some areas are well mapped, although with a tight relation between completeness and urbanization. According to the first systematic study, which is also one of the most cited, conducted by Haklay in 2010, OpenStreetMap data was, on average, within about 3.2 to 4.8 metres of the position recorded by Ordnance Survey in the centre of London. However, the average in the peripheral districts dropped to 6.8 to 8.3 metres and the maximum deviations reached 20 metres.

Despite its name, OpenStreetMap is not just a road map; it provides topographic data including buildings. Recently, OSM building data quality has been tested using German and Austrian official data. OSM building completeness was found to be higher in urban areas than in rural ones, but still low (Hecht et al., 2013; Klonner et al., 2014). According to another study on the quality assessment of OSM building footprint data in Germany, the data was characterised by high completeness in terms of area covered, but with limited attributive information, such as building types.
With respect to shape, OSM building footprints are highly similar to those in the German administrative dataset, and there is an average offset of about four metres in terms of positional accuracy (Fan et al., 2014).

To sum up, many researchers agree that the main advantage of VGI data is its good geometric accuracy, while its patchwork geographic coverage and inconsistent semantics are its drawbacks (Goodchild, 2007; Ballatore et al., 2013; Mooney and Corcoran, 2012).

3 METHODOLOGY

The OSM quality analysis focused on three of the six data quality elements outlined in the current spatial data quality standard, ISO 19157 (2013), namely completeness, positional accuracy and thematic accuracy. The OSM data was compared with a third-party dataset (extrinsic approach), that is, the official topographic dataset administered by the Polish Mapping Agency.

The volume of attributive information, such as building types and their proper names, was calculated for all OSM building features to quantify attribute completeness, a data quality element of thematic accuracy. This automatic procedure was followed by an attribute accuracy evaluation based on a manual, arbitrary comparison of the attributes of corresponding features.

The positional accuracy analysis was based on manual measurement of the building corner points within the OSM dataset and their corresponding points within the reference dataset. The measurements were performed on a fair random sample of OSM buildings evenly distributed within the urban and rural test areas. Spatial accuracy was quantified using the Root Mean Square Error (RMSE). The resulting high compatibility between the positions of building footprints in OSM and in the reference set formed the basis for the choice of an automated matching algorithm. The feature matching step was a part of the feature completeness investigation.
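The RMSE computation mentioned above can be illustrated with a minimal Python sketch. It assumes that each measured OSM corner point has already been paired manually with its reference counterpart and that both are given in the same projected coordinate system in metres; the function names and the sample coordinates are hypothetical and are not taken from the study.

```python
import math

def corner_deviations(osm_corners, ref_corners):
    """Planar distance [m] between each OSM corner and its paired reference corner."""
    if len(osm_corners) != len(ref_corners):
        raise ValueError("corner lists must be paired one-to-one")
    return [math.hypot(xo - xr, yo - yr)
            for (xo, yo), (xr, yr) in zip(osm_corners, ref_corners)]

def rmse(deviations):
    """Root Mean Square Error of a list of point deviations [m]."""
    if not deviations:
        raise ValueError("at least one deviation is required")
    return math.sqrt(sum(d * d for d in deviations) / len(deviations))

# Hypothetical corner pairs (projected coordinates in metres).
osm = [(0.0, 0.0), (10.0, 0.0)]
ref = [(0.3, 0.4), (10.0, 1.0)]
devs = corner_deviations(osm, ref)
print(devs, rmse(devs))
```

Because the deviations are squared before averaging, the RMSE is never smaller than the mean deviation and reacts more strongly to a few large outliers.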
To achieve more reliable results in the OSM building completeness analysis, the logical and semantic heterogeneity between the two compared datasets was minimised in advance. Moreover, the official data was not considered the only legitimate reference. Consequently, a novel tool was introduced to help assess the completeness of any dataset of polygon features.

4 STUDY AREA AND DATASETS

The test area, situated in central-eastern Poland, consists of two sub-areas: Siedlce town (urban district), covering less than 32 km², and the fifty-fold larger Siedlce district (rural district) (see Table 1 for their basic characteristics). Siedlce is an average Polish town in terms of both demographics and economic development. The Siedlce district surrounding the town, on the other hand, might be described as a poorly urbanized and rather sparsely populated area composed of 13 rural communes.

Table 1: The general characteristics of the test sub-areas for the study area Siedleckie County.

| District name | District type | Total population | Population density [people per km²] | Total area [km²] | Area after agricultural land and forests deduction [km²] |
|---|---|---|---|---|---|
| Siedlce district | rural | 81,811 | 51 | 1603.3 | 129.2 |
| Siedlce town | urban | 76,603 | 2,404 | 31.8 | 31.8 |

The test data consists of OSM building-tagged features obtained from the OSM web service Geofabrik (www.geofabrik.de) in the ESRI shape format. It reflects the state as of May 28, 2015. The examined OSM dataset contains 24,000 objects represented by polygons, most of which lie in Siedlce town (21,434).

Table 2: An overview of the datasets used in this study.
| | OSM building data | BDOT10k building data |
|---|---|---|
| Definition | Missing. | Unambiguous definition (MSWiA, 2011). |
| Mapping rule | No strict rules, recommendations only: if possible, the outer edge of the building wall should be mapped. The outline of building blocks or other complex arrangements of properties is allowed. | Building footprint or maximal outline. |
| Data capture procedure | GPS-equipped cell phone or other handheld GPS device, aerial orthophoto vectorization, sketch drawing from street level, data import from available spatial data sources. | Land and Property Register or other state registers, professional land surveying or orthophoto vectorization. |
| Accuracy, level of detail | Heterogeneous accuracy and level of detail, depending on the data collection method, the generalization level of a building outline, and the contributor's skills and experience. | The level of detail and accuracy equivalent to the scale of 1:10,000. |
| Quality control | Respect for the consensus norms of the OSM community, e.g. code of conduct, good practices. Often: geometric and descriptive data verification by introducing a new measurement by any OSM contributor. Potential: intrinsic quality checks (OSM, 2015a, 2015c) using available tools. | Measurement rules and technical supervision over measurements, as well as a system to control data (topology and geometry checks, semantic, syntactic and attribute checks, etc.). |
| Up-to-datedness | Heterogeneous. Intended to be continuously up-to-date; depends on the contributors' activity. | Homogeneous; kept up-to-date (in practice, updating on a yearly basis). |

As reference data, the Polish Database of Topographic Objects (BDOT10k), maintained by the Head Office of Geodesy and Cartography in Poland, was used.
BDOT10k is a spatially continuous vector database with a thematic scope and a level of detail corresponding to contemporary civilian topographic maps at a scale of 1:10,000. The data subset was provided in the ESRI shape format; its last revision date was August 31, 2013.

OSM buildings were compared with the objects belonging to the BDOT10k group of object classes called 'buildings, building structures and facilities'. In particular, the areal features from the following classes were mainly involved: buildings (BUBD), sports facilities (BUSP), high technical building structures (BUWT), other technical facilities (BUIT), and several objects from the OIOR class of small building structures of topographical or landmark importance.

The two studied datasets differ considerably in their data collection and management approaches, as can be seen from Table 2, which summarizes their selected characteristics.

The pre-treatment of the datasets concerned the harmonization of the spatial reference by using a common coordinate system. The projected Cartesian Gauss-Kruger coordinate system ETRS 1989 UWPP 1992, which usually serves as a spatial reference for topographic mapping in Poland, was chosen.

5 THEMATIC ACCURACY

The ISO data quality standard defines thematic accuracy as the accuracy of quantitative attributes and the correctness of non-quantitative attributes, of the classifications of features and of their relationships (ISO, 2013). The author of this paper agrees with Koukoletsos (2012) that in the VGI context, thematic accuracy encompasses mostly attribute accuracy along with attribute completeness.
The latter needs to be examined here because features may lack their attributes. Classification correctness, by contrast, is barely applicable to OSM quality evaluation owing to the unlimited range of possible attribute values and their infrequent provision (Al Bakri and Fairbairn, 2010).

Attribute completeness was measured quantitatively as the ratio of the number of OSM buildings accompanied by the given attribute to the total number of OSM buildings, expressed as a percentage. Two attributes were studied, namely building type and building proper name. The results are presented in Table 3.

Table 3: The results of the attribute completeness study based on OSM buildings within the test sub-areas.

| Test sub-area | OSM buildings in total | Attribute completeness: building type | Attribute completeness: building name |
|---|---|---|---|
| rural district | 2,566 | 32.2% | 0.89% |
| urban district | 21,434 | 76.4% | 0.47% |

The results of the attribute completeness study confirm the heterogeneous OSM quality across the test site. As far as the building type attribute is concerned, more developed areas receive more than twice as many contributions as rural ones. This may be associated with a weaker need for knowledge of how buildings are used in rural areas, where there are generally few service buildings and their position is well known to local people (i.e. locals do not need a map to get there). The attribute that carries the proper name of the building, in turn, is practically not provided; its completeness is below 1% (Table 3).

Figure 1: The 'building' features tagged based on their type making up at least 0.2% of the total share as registered in OSM for the urban (blue) and rural district (red).
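The attribute completeness ratio of Table 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: buildings are represented as plain tag dictionaries, the helper name `attribute_completeness` and the sample records are hypothetical, and the generic value 'yes' is assumed to carry no building type information.

```python
def attribute_completeness(buildings, key, uninformative=("yes",)):
    """Percentage of buildings whose `key` attribute carries information.

    `buildings` is a list of tag dictionaries (an assumed representation);
    missing, empty, or `uninformative` values count as not provided.
    """
    if not buildings:
        raise ValueError("empty building list")
    provided = sum(
        1 for tags in buildings
        if tags.get(key) not in (None, "") and tags.get(key) not in uninformative
    )
    return 100.0 * provided / len(buildings)

# Hypothetical sample of four OSM building records.
sample = [
    {"building": "residential", "name": "Example Hall"},
    {"building": "yes"},
    {"building": "garage"},
    {"building": "yes"},
]
print(attribute_completeness(sample, "building"))                # 50.0
print(attribute_completeness(sample, "name", uninformative=()))  # 25.0
```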
The list of key values used to tag building type or use includes 10 and 14 items, each contributing at least 0.2% of the total share, for the Siedlce district and the town of Siedlce respectively (Figure 1). As many as 67.8% of the buildings located in the rural district and 23.6% in the urban district have the 'yes' tag, which means there is no information about their use or function.

The attribute deficiencies of buildings in the OpenStreetMap database may result from the OSM data collection methods. Often, the attributes cannot be determined from satellite or aerial images alone. Similarly, it is not easy to determine the function of a building by observing it from street level. A high-rise can serve as an apartment building, an office building or the seat of a museum of modern art. Moreover, diverse construction customs, resulting from history or mandated by law in different countries, may distort one's visual assessment, particularly in the case of non-local observers.

For the OSM attribute thematic accuracy analysis, the most frequently provided OSM building attribute was chosen. This attribute, referred to here as the building type attribute, currently reflects contributions on the typology of the mapped building (i.e. the physical nature of the building), its intended (or original) function, or its use (OSM, 2015b). Such a wide range of often contradictory roles confirms the previously mentioned problem of the vagueness and ambiguity of OSM building thematic data. Therefore, the attribute accuracy analysis is not straightforward and requires a better understanding of OSM semantics. Consequently, a semantic similarity analysis between OSM building data and the Polish Database of Topographic Objects is ongoing (the initial findings can be found in Nowak da Costa, 2016).
For the purpose of this work, the attribute thematic accuracy analysis was carried out manually and was therefore limited in scope, because it required the time-consuming creation of semantic correspondence rules for each attribute. The attribute accuracy was scrutinised for a small sample of 82 OSM buildings that had their type attribute provided. The accuracy of this attribute was defined as the percentage of OSM buildings whose attribute is equal or very similar to the respective attribute of the corresponding building feature within the reference dataset. For the studied OSM data sample, a medium level of type attribute accuracy, that is 78%, was obtained.

6 POSITIONAL ACCURACY

Positional accuracy, a component of geometric accuracy, can be defined as a measure of the difference between the position of a distinct object as recorded in the database and its true location on the ground (Goodchild and Hunter, 1997). Usually, this accuracy is assessed using a reference dataset of higher quality.

In the study, the positional accuracy analysis was based on manual measurement of the corner points of building footprints within the OSM dataset and their corresponding points within the BDOT10k dataset. The measurements were performed on a fair random sample of OSM buildings, evenly distributed within the urban and rural test areas. In total, 782 buildings were measured, with an average of 5 points measured per building.

Table 4: The accuracy results of the building positions.
| Test sub-area | Total number of OSM buildings matched with reference buildings | Number of measured buildings | Minimum/maximum deviation [m] | Mean position deviation [m] | RMSE [m] |
|---|---|---|---|---|---|
| rural district | 2484 | 371 | 0.2/9.4 | 1.3 | 1.69 |
| urban district | 17917 | 411 | 0.1/6.5 | 0.3 | 0.59 |

The set of obtained position differences is characterised by high discrepancies; the minimum deviation is practically equal to zero, while the maximum reaches almost 10 metres (Table 4). Such heterogeneity is attributed to the variety of methods used by VGI data collectors and to their skills and experience.

The positional accuracy was quantified using a traditional statistical measure, the Root Mean Square Error (RMSE). On average, the OpenStreetMap data is within 0.6 and 1.7 metres of the position recorded in the Polish Database of Topographic Objects, for the urban and rural test areas respectively. The fact that there is practically no positional mismatch between the buildings of the two tested data sources formed the basis for the choice of the automated matching method (see section 7.3).

Moreover, the study confirmed that the positional quality of OSM building data is related to the urbanization level. In the rural test area, the OSM data quality is, on average, three times worse than in the urban area.

The research reveals a surprisingly high positional accuracy of OSM building features, which technically exceeds the accuracy of common handheld GPS receivers or of amateur vectorization of available images. This may indicate that a part of the data was imported in digital form from other spatial databases characterized by a high level of detail and accuracy.

7 FEATURE COMPLETENESS

Data completeness refers to an indication of whether or not all the data, i.e. features, their attributes and their relationships, are available in the data resource. This chapter focuses on building feature completeness.
7.1 Data pre-treatment

The BDOT10k building data includes all residential and non-residential buildings with the exception of small objects with an area smaller than 40 m²; however, small but interconnected structures sharing the same function (e.g. detached garages) are aggregated and included in the dataset (MSWiA, 2011). To ensure comparability across the two studied datasets, the analogous logical constraint was applied to the OSM building data (excluding small features except the detached ones).

Furthermore, as mentioned in section 5, the author investigated the conceptual fuzziness of the OSM features tagged 'building' and their typologies (building:type=*). To minimize differences at the semantic level, the reference and the tested datasets were narrowed down so that they include mostly buildings related to human habitation, educational, healthcare and religious buildings, commercial and main industrial buildings, car garages and sports facilities.

7.2 Choice of measures

The extensive study by Hecht et al. (2013) presents two significant object-based approaches, the centroid method and the overlap method, for measuring building completeness extrinsically; the level of data completeness is determined as the proportion of the corresponding reference buildings to the total set of reference buildings. These methods are barely sensitive to disparities in object modelling between official and VGI data; however, the official German dataset was considered the only legitimate reference.
To avoid arbitrarily ranking one of the compared datasets above the other when assessing feature completeness, a novel rule is introduced here: neither dataset is treated as the benchmark. The assumption of a symmetrical relationship between two datasets, where only the presence or the absence of a specific property is considered, makes it possible to take advantage of the resemblance measures routinely used by mathematicians (e.g. Batagelj and Bren, 1995), statisticians (e.g. Czekanowski, 1913; Gower and Legendre, 1986), and environmentalists (e.g. Legendre and Legendre, 1998).

One of the simplest, and also most frequently selected, coefficients that determine the degree of similarity between two sets is the Jaccard Index (Jaccard, 1901). The Jaccard Index is expressed as the quotient of the cardinality of the intersection of the sets and the cardinality of their union. A Polish statistician, Czekanowski (1913), suggested a similar ratio. Czekanowski's coefficient (also referred to as Bray-Curtis), however, gives more weight (i.e. importance) to the intersection of the sets (here: the OSM features that have their counterparts in BDOT10k, and the other way round, based on the symmetry assumption). Both coefficients range between 0 and 1, which facilitates the comparison and interpretation of results.

The Jaccard and Czekanowski coefficients are defined on the cardinality of a set, which is equivalent to the number of elements in a set. Therefore, their direct adoption for the purpose of feature completeness assessment is greatly affected by the way the number of homologous objects is determined, and may yield confusing results (as depicted in Figure 2).
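The sensitivity of the count-based coefficients to the way homologous objects are counted can be reproduced with a short Python sketch. The counts are those of the Figure 2 example (21 OSM features, 3 reference features, and either 2 or 21 matched features, depending on which dataset the matched set is counted in); the function names are illustrative.

```python
def jaccard(n_a, n_b, n_common):
    """Jaccard Index from set cardinalities: |A and B| / |A or B|."""
    return n_common / (n_a + n_b - n_common)

def czekanowski(n_a, n_b, n_common):
    """Czekanowski coefficient from set cardinalities: 2|A and B| / (|A| + |B|)."""
    return 2 * n_common / (n_a + n_b)

n_osm, n_ref = 21, 3  # feature counts from the Figure 2 example

# Matched set counted in reference features (2) versus OSM features (21).
print(round(100 * jaccard(n_osm, n_ref, 2)))   # 9
print(round(100 * jaccard(n_osm, n_ref, 21)))  # 700
```

The second value exceeds 100% because 21 "common" elements cannot belong to a set of only 3 reference features, so the input no longer describes a valid intersection of two sets.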
Figure 2: An example of the application of the data completeness evaluation technique based on the Jaccard Index and the spatial objects' number. The figure shows the matching feature number-based completeness index based on the Jaccard Index ($C_{JI}$), computed for the same group of buildings with the matched set counted first in reference features and then in OSM features:

$$C_{JI} \cdot 100\% = \frac{2}{21 + 3 - 2} \cdot 100\% = 9\%, \qquad C_{JI} \cdot 100\% = \frac{21}{21 + 3 - 21} \cdot 100\% = 700\%$$

The key to the riddle of '700%' lies in the modelling differences between the two sets under examination and, in particular, in their significantly different levels of data generalization. In the provided example (Figure 2), which is based on real OSM and BDOT10k data from the test area, small adjoining garage buildings constitute 21 features in the OSM dataset, while in the reference set they are, in compliance with the technical guidelines, aggregated and constitute only two features. If the number of OSM objects is adopted as the cardinality of the set of homologous objects, its value greatly exceeds the overall number of objects in the reference set. This is an example of a failure to satisfy the axiomatic requirements for the application of resemblance measures. If, however, equal importance is assigned to the area unit (e.g. 1 m²) instead of the feature unit, the requirements are met.
Therefore, in the interest of universality, building area substitutes building number in determining OSM building completeness ($C_{JI}$) as follows:

$$C_{JI} = \frac{\sum Bld\_Area_{OSM\_match}}{\sum Bld\_Area_{OSM} + \sum Bld\_Area_{Ref} - \sum Bld\_Area_{OSM\_match}} \tag{1}$$

The respective adaptation of Czekanowski's coefficient for the purposes of determining OSM building completeness ($C_{CzI}$) is:

$$C_{CzI} = \frac{2 \sum Bld\_Area_{OSM\_match}}{\sum Bld\_Area_{OSM} + \sum Bld\_Area_{Ref}} \tag{2}$$

where $\sum Bld\_Area_{OSM\_match}$ stands for the area of OSM buildings that fulfil the matching criterion, $\sum Bld\_Area_{OSM}$ is the total area of OSM buildings, and $\sum Bld\_Area_{Ref}$ is the total area of BDOT10k buildings.

We call the $C_{JI}$ and $C_{CzI}$ indexes the matching feature area-based completeness indexes based on the Jaccard and Czekanowski index, respectively. They are barely sensitive to the building outline modelling and to its degree of abstraction. The completeness index values for the example depicted in Figure 2 are $C_{JI} \cdot 100\% = 86\%$ and $C_{CzI} \cdot 100\% = 92\%$. Evidently, coefficients (1) and (2) yield similar results (compare also Figure 3). The author is more inclined towards the $C_{CzI}$ index based on Czekanowski's idea, since the calculated $C_{CzI}$ values are closer to the intuitive (visual) assessment of completeness.

7.3 Matching method

To determine homologous objects in both datasets, an automated matching of buildings was carried out. Since the results of the positional accuracy analysis did not reveal significant spatial shifting between manually matched objects, a spatial selection method based on centroid position can be applied as follows.
The matching criterion is met if an OSM feature has its centroid in a reference polygon or a reference feature's centroid lies within an OSM polygon.

The main reason for choosing this algorithm was its simplicity and immediate availability; it uses the common and simple GIS selection method based on spatial location. Moreover, this empirical matching method proved to be barely sensitive to discrepancies in the building outline modelling and, in particular, to the degree of abstraction. Its performance was tested on several data samples, extracted randomly from the given OSM building dataset, and compared with manually matched data samples. Only the remaining 7% of the building data required manual intervention, mainly buildings whose geometric centre lies outside their footprint polygon (non-convex polygons). Its computational performance is also high, although dividing the dataset into smaller sets is recommended.

7.4 Completeness and over-completeness

The OSM completeness was analysed within the administrative borders of the 13 communes of Siedlce district and of Siedlce town. The OSM completeness ratio, defined by $C_{JI}$ or $C_{CzI}$, is reported in the form of a choropleth map (Figure 3).

The analysis of the obtained OSM building completeness ratios leads to the following findings. The OSM building feature completeness is relatively high, that is a $C_{CzI}$ of 95%, in the town centres (labelled 'M.Siedlce' in Figure 3), and its value decreases rapidly with distance from urban centres. The least complete regions, with a $C_{CzI}$ of less than 6%, correspond to rural areas with dispersed settlement.

Furthermore, the author investigated the OSM buildings that do not meet the matching criterion. Within the rural test area, those buildings were visually inspected against both satellite images shared on the Internet (e.g. Google Earth) and the BDOT10k dataset. The results are summarised in Table 5.
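The centroid-based matching criterion of section 7.3 and the area-based indexes (1) and (2) can be sketched together in plain Python. This is a minimal illustration under stated assumptions rather than the implementation used in the study: polygons are simple, non-degenerate rings given as vertex lists in a projected coordinate system, all function names are hypothetical, and the sample geometries are invented.

```python
def area(poly):
    """Unsigned area of a simple polygon (shoelace formula)."""
    n = len(poly)
    s = sum(poly[i][0] * poly[(i + 1) % n][1] - poly[(i + 1) % n][0] * poly[i][1]
            for i in range(n))
    return abs(s) / 2.0

def centroid(poly):
    """Area-weighted centroid; it may fall outside a non-convex polygon."""
    a = cx = cy = 0.0
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        cross = x1 * y2 - x2 * y1
        a += cross
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    return cx / (3.0 * a), cy / (3.0 * a)  # a holds twice the signed area

def contains(poly, pt):
    """Ray-casting point-in-polygon test."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside

def is_matched(osm_poly, ref_polys):
    """OSM centroid in a reference polygon, or a reference centroid in the OSM polygon."""
    c = centroid(osm_poly)
    return any(contains(r, c) or contains(osm_poly, centroid(r)) for r in ref_polys)

def area_completeness(osm_polys, ref_polys):
    """Return (C_JI, C_CzI) following equations (1) and (2)."""
    matched = sum(area(p) for p in osm_polys if is_matched(p, ref_polys))
    total = sum(map(area, osm_polys)) + sum(map(area, ref_polys))
    return matched / (total - matched), 2.0 * matched / total

def rect(x0, y0, x1, y1):
    """Axis-aligned rectangle as a counter-clockwise vertex list."""
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]

# Invented example: four OSM garages aggregated into one reference feature,
# plus one reference building that is missing from OSM.
garages = [rect(x, 0, x + 5, 5) for x in (0, 5, 10, 15)]
reference = [rect(0, 0, 20, 5), rect(50, 0, 60, 10)]
print(area_completeness(garages, reference))
```

In this example the four garages count as fully matched although the feature counts differ (4 against 1), which is the behaviour the area-based formulation is designed to provide.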
Figure 3: A map of the OSM buildings completeness analysis results for the Siedleckie County as of May, 2015.

Table 5: Typology of OSM buildings in the tested rural district that do not meet the matching criterion.

| Type | Number of instances | Percentage share [%] |
|---|---|---|
| OSM buildings non-existing in the BDOT10k dataset | 32 | 61 |
| OSM buildings belonging to other than the BDOT10k building classes, e.g. roadside shrine, greenhouse | 7 | 13 |
| OSM incorrect building features (measurement error or data entry error) | 14 | 26 |

It is worth noting that 61% of the objects included in the commission set are correctly registered as building features in the OSM database. They physically exist and they satisfy the technical criteria for BDOT10k building features; nonetheless, they are not included in BDOT10k. This demonstrates the utility of the OSM building dataset for updating BDOT10k.

To estimate the size of the OSM building excess (sometimes referred to as VGI over-completeness or data commission), a coefficient analogous to the $C_{CzI}$ index is proposed, as follows:

$$C_{EI} = \frac{2 \sum Bld\_Area_{OSM\_NOmatch}}{\sum Bld\_Area_{OSM} + \sum Bld\_Area_{Ref}} \tag{3}$$

where $\sum Bld\_Area_{OSM\_NOmatch}$ stands for the area of OSM buildings that do not fulfil the matching criterion, $\sum Bld\_Area_{OSM}$ is the total area of OSM buildings, and Σ