REPORT

Project Acronym: EODOPEN
Grant Agreement Number: 607666-CREA-1-2019-1-AT-CULT-COOP2
Project Title: EODOPEN | eBooks-On-Demand-Network Opening Publications for European Netizens
Project Website: https://eodopen.eu/

A12 Evaluating Delivery Formats and Solutions
D12b&c Report on Trial Implementations for Mobile Devices and Print-Disabled Users

Author(s): Andreja Hari, Alenka Kavčič Čolić

Project Co-funded by the Creative Europe Programme of the European Union

Dissemination Level
P Public X
C Confidential, only for members of the project partners and the Commission Services

DOCUMENT INFORMATION

Activity number: A12
Activity title: Evaluating Delivery Formats and Solutions
Contractual date of activity: 31.10.23
Actual date of activity: 01.01.22–31.03.23
Author(s): Andreja Hari, Alenka Kavčič Čolić (NUK)
Contributor(s): Constantin Lehenmeier (UREG), Tina Glavič (NUK)
Participant(s): EODOPEN-project members
Working group: WG4
Working group title: Delivery Formats of Digitised Material for Special Needs
Working group leader: Alenka Kavčič Čolić
Dissemination Level: P

HISTORY OF VERSIONS

Version 1, 27.07.22–09.08.22, Draft. Author (organisation): Andreja Hari (NUK), Alenka Kavčič Čolić (NUK) [Tina Glavič (NUK) co-author of the methodological approach]. Description: First draft.
Version 2, 21.11.22, Draft. Author (organisation): Andreja Hari (NUK), Alenka Kavčič Čolić (NUK). Description: Sent to partners for additional changes. Corrections and suggestions received from Constantin Lehenmeier (UREG).
Version 3, 09.03.23, Draft. Author (organisation): Andreja Hari (NUK), Alenka Kavčič Čolić (NUK). Description: Final draft.
Version 4, 07.06.23, Draft. Description: Proofreading completed. Final draft.
Version 5, 19.06.23, Draft. Description: Delivered for peer review to project partners. Final draft.
Version 6, 29.09.23, Final. Description: Edited after review and finalised. Completed.

EODOPEN PROJECT SUMMARY

Libraries all over Europe face the difficult challenge of managing tremendous amounts of 20th and 21st century textual material that has not yet been digitised due to the complex copyright situation.
These works cannot be accessed by the general public and are hidden deep in library stacks, as they are often out of print or have never even been printed at all, while reprints or facsimiles are out of sight. The EODOPEN project focuses on making 20th and 21st century library collections digitally visible by directly engaging with communities in the selection, digitisation and dissemination processes. As a leading partner, the University Library of Innsbruck, joined by 14 European libraries from 11 nations, has set itself the goal of making 15,000 pieces of textual material digitally available, and of reaching more than one million people in Europe by 2024. Among other goals, such as building a common portal to display the project outcomes, EODOPEN aims to stimulate interest in and improve access to 20th and 21st century textual material, including grey and scientific literature. EODOPEN continuously carries out social media campaigns in order to attract new audiences. Furthermore, the participating libraries establish contact with commemorative institutions all over Europe, as well as with researchers and doctoral study boards, history associations and local publishing houses, in order to obtain suggestions from a broad audience. In collaboration with local institutions, all of the project partners select hidden library treasures, deal with rights clearance questions and put new content online, while dissemination activities display the digital content via international channels. In addition, EODOPEN aims to provide alternative delivery formats suitable for blind or visually impaired users. An international survey gathers data from a broad European public about the use of e-books. By evaluating this data, the project broadens its scope to alternative delivery formats in order to fulfil the needs of blind or visually impaired users. 
In order to promote best practice in rights clearance among the library community, EODOPEN provides handouts and tools to make 20th and 21st century books available beyond the project’s lifetime. In this regard, the project partners cooperate closely to develop an online tool for the documentation of rights clearance, especially suited for out-of-print and orphan works. Interactive workshops investigate needs related to dealing with rights clearance questions in order to implement the requirements of the international community in establishing the online tool.

ABSTRACT

The aim of the Report on Trial Implementations for Mobile Devices and Print-Disabled Users (hereinafter: the Report) is to help libraries and other cultural organisations to make digitised content available to a broader community. The Report is based on EODOPEN partners’ digitisation experiences at their organisations and complements the EODOPEN Project Deliverable 11: Guidelines and Recommendations for the Provision of Alternative and Special Formats, which addresses delivery formats and criteria for increasing the quality of digitisation results for users of mobile devices and blind and partially sighted users. The Report presents the results of a trial implementation among EODOPEN partners on their digitisation workflows, the delivery file formats used and, consequently, the quality of optical character recognition (OCR) results, depending on file format type and accessibility criteria.

Statement of originality: This report contains original unpublished work except where clearly indicated otherwise. Acknowledgement of previously published material and of the work of others has been made through appropriate citation, quotation or both.

TABLE OF CONTENTS

DOCUMENT INFORMATION
HISTORY OF VERSIONS
EODOPEN PROJECT SUMMARY
ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
1. Introduction
  1.1. Purpose
  1.2 Description of the Report
  1.3 Explanation of the key concepts
  1.4 Context description: User needs for delivery formats
2 Evaluation of delivery formats: Trial implementation
  2.1 Background
  2.2 Methodological approach
3 Test results
  3.1 Results by sample scan number
  3.2 Results according to criteria
4 Test findings
5 Possible solutions and recommendations
  5.1 Solutions for mobile devices
  5.2 Solutions for print-disabled users
11. Summary
12. References
13. Vocabulary
14. Used acronyms
15. Annexes
  Annex 1. Testing samples
    Scan 1
    Scan 2
    Scan 3
    Scan 4
    Scan 5
    Scan 6
    Scan 7
    Scan 8
    Scan 9
    Scan 10
    Scan 11
    Scan 12
    Scan 13
    Scan 14
    Scan 15
    Scan 16
  Annex 2: Testing report questionnaire
  Annex 3. Testing report questionnaires by partners
    P1 - UIBK, University of Innsbruck – RTF format
    P1 - UIBK, University of Innsbruck – ODM workflow – PDF and RTF format
    P2 - UT, University of Tartu
    P3 - NUK, National and University Library – usual workflow
    P3 - NUK, National and University Library – PDF edited with Adobe Acrobat Pro
    P3 - NUK, National and University Library – PDF made with Abbyy FineReader 15
    P3 - NUK, National and University Library – PDF and ePUB made from Word
    P4 - MZK, Moravian Library – small edited
    P4 - MZK, Moravian Library – edited
    P5 - UG, University of Greifswald
    P6 - NLS, National Library of Sweden
    P7 - NCU, Nicolaus Copernicus University in Torun
    P9 - VKOL, Research Library Olomouc
    P10 - BNP, National Library of Portugal – PDF
    P10 - BNP, National Library of Portugal – docx
    P11 - NLE, National Library of Estonia
    P12 - OSZK, National Széchényi Library
    P13 - CVTI SR, Slovak Centre of Scientific and Technical Information
    P14 - UREG, University of Regensburg

LIST OF TABLES

Table 1: The appearance of the criteria in the 16 scans of the test sample.
Table 2: Software overview for automatically generated outputs
Table 3: Software overview for outputs containing additional manual corrections
Table 4: Results of Scan 1
Table 5: Results of Scan 2
Table 6: Results of Scan 3
Table 7: Results of Scan 4
Table 8: Results of Scan 5
Table 9: Results of Scan 6
Table 10: Results of Scan 7
Table 11: Results of Scan 8
Table 12: Results of Scan 9
Table 13: Results of Scan 10
Table 14: Results of Scan 11
Table 15: Results of Scan 12
Table 16: Results of Scan 13
Table 17: Results of Scan 14
Table 18: Results of Scan 15
Table 19: Results of Scan 16
Table 20: Results according to criteria for the automatically generated outputs
Table 21: Results according to criteria for the outputs with additional manual corrections

1. Introduction

1.1. Purpose

The aim of the Report on Trial Implementations for Mobile Devices and Print-Disabled Users (hereinafter: the Report) is to help libraries and other cultural organisations to make digitised content available to broader communities. The Report is based on EODOPEN partners’ digitisation experiences at their organisations and complements the EODOPEN Project Deliverable 11: Guidelines and Recommendations for the Provision of Alternative and Special Formats, which is based on a survey on the special needs of users and technical requirements regarding delivery formats and criteria for increasing the quality of digitisation results for users of mobile devices, as well as for blind and partially sighted users. The Report gathers experiences from all EODOPEN consortium partners.

1.2 Description of the Report

The Report presents the results of a trial implementation among EODOPEN partners. It comprises a brief introduction followed by a description of the methodology and the test results, and it concludes with the findings and some recommendations. The introductory chapter (Chapter 1) describes the Report’s purpose, scope and key concepts used in the text, and defines the user needs for mobile users, as well as for blind and partially sighted users. Chapter 2 provides the background, the methodological approach, a description of the test sample, and a definition of the evaluation criteria of delivery formats. Chapter 3 presents partners’ test results.
This is followed by a discussion of the findings (Chapter 4) and some recommendations (Chapter 5). The Report is accompanied by a list of literature sources and recommended references, as well as a definition of the terms used and a list of acronyms. All of the samples and test report questionnaires are attached to this document in the annexes.

1.3 Explanation of the key concepts

In the Report, mobile devices are defined as smartphones, notebooks and tablet computers, as well as e-readers. In accordance with the recommendation of the European Blind Union (EBU), the term blind and partially sighted users is used instead of the term blind and visually impaired users. Although the term print disability covers a wide range of disabilities or problems related to reading text, the Report focuses solely on the blind and partially sighted, which is one of the project’s primary groups. Digitisation means digital conversion of information on analogue carriers. Target communities are people who access digitised content in libraries and other cultural organisations. The term eBook usually refers to born-digital publications, but in this document it is used to refer to digital publications produced as a result of digital conversion, including formats for special needs (audiobooks), which is one of the objectives of the EODOPEN project. However, this term does not exclude born-digital publications, as the delivery format is the same or has the same purpose or functions. eBooks can be accessed through e-readers or can simply be read on personal computers (PCs) or mobile devices such as smartphones, tablets or notebooks.

1.4 Context description: User needs for delivery formats

Mobile devices are integral tools of the global information society and smartphone technology is already part of our everyday life, enabling constant interconnection with other tools and people through various networks and social media.
The development of mobile devices also has a strong impact on the development of their operating systems and tools, which is something that the service sector, including libraries and cultural organisations, should always bear in mind. It is therefore very important to plan and publish content in file formats that are and will continue to be supported by these devices.

eBooks can be read on different kinds of mobile devices, such as e-readers (Kindle, Kobo, Midia Inkbook, NOOK, etc.), smartphones, tablets and portable computers (notebooks). The selection of the file format for delivery and/or access depends on the type of device (size of screen, visual presentation) and the existing platform (Microsoft, Android and iOS are the most commonly used). Although there are no problems associated with accessing PDF files on devices with bigger screens, PDF is not recommended for smaller devices like smartphones or e-readers because it is not a responsive file format. The most recommended formats for smaller devices are ePUB 3, AZW/MOBI, HTML, Microsoft Office Word documents (RTF, docx, etc.) or audiobooks (mp3, DAISY). Some platforms only support certain types of formats. For instance, Kindle e-readers did not originally support ePUB format, so users had to convert ePUB files to AZW/MOBI format before uploading them to their devices. Recently, however, due to new developments at Amazon, file format conversion to ePUB has been made available to users, while the obsolete MOBI format is no longer supported (Amazon, s.a.).

With regard to blind and partially sighted users, as well as other print-disabled users, it is important to consider the degree to which users can use their sight and how it varies from day to day or due to light conditions, tiredness, stress levels, etc. It is therefore important to consider how to provide users with the possibility to adapt the visual presentation of the text to fit their needs.
Some of the common challenges faced primarily by partially sighted people are difficulty in focusing on the text, reduced contrast sensitivity, reduced field of vision, sensitivity to movement, visual fatigue and similar problems. For these users, the most useful adjustments are those to font size, font type, colour themes, margins and spacing. The option to access the full text (where optical character recognition (OCR) is preferably checked) is also important, as it enables the use of assistive technologies (such as braille displays or screen readers). Although blind and partially sighted users mostly access documents through bigger screens, they are also avid users of smaller mobile devices. The most recommended formats for this group of users are Microsoft Office Word documents (RTF, docx, etc.), audiobooks (mp3, DAISY), HTML, ePUB 3 and AZW/MOBI, but tagged PDF format is also suitable:

“PDF tags are the key to accessing a PDF document’s content with assistive technologies such as screen readers. When a tagged PDF is created, each page element in the document is ‘tagged’. Each tag identifies the type of content and stores some attributes about it. They also arrange the document content into a hierarchical architecture (or a ‘tag tree’). The tag tree forms the logical structure of the document (reading order).” (Accessible document solutions, n.d.)

For a more comprehensive overview of user needs for delivery formats, see Deliverable 11: Guidelines and Recommendations for the Provision of Alternative and Special Formats, which is based on a survey on the special needs of users and technical requirements.

2 Evaluation of delivery formats: Trial implementation

2.1 Background

A questionnaire survey conducted among EODOPEN project partners in 2020 revealed that libraries use various devices and software tools in the digitisation process.
Digital conversion is automated or carried out in different phases and depends on financial resources as well as on adequately trained personnel. To ensure the best possible results, it combines a variety of technological and software solutions, resulting in a diverse range of digitised material. This material is available to users via digital libraries in various delivery formats, which also differ from each other in terms of the functionalities provided.

In the Guidelines and Recommendations for the Provision of Alternative and Special Formats (Deliverable D11), which were prepared within the framework of the project’s working group 4,[1] special emphasis is placed on the possibilities and ways of adapting digitised material to make it available in formats that ensure accessibility to blind and partially sighted users. According to research, the appropriate structuring of a text and its elements is crucial for reading digital material, as it enables navigation through the text. Text navigation, recognition of text and graphic elements, and the ability to personalise settings are even more important for blind and partially sighted people, who use assistive technology and dedicated software in order to read.

Optical character recognition (OCR) tools and their software modifications enable optical recognition of characters – letters, numbers, punctuation marks – as well as text structures. Machine learning technology has advanced to the point where OCR errors in ordinary running text are negligible. The remaining recognition errors mainly occur when reading special content, such as chemical formulas, mathematical operations and equations, although errors also occur in identifying headings, sub-headings and graphical elements in the text (images, graphs). Furthermore, where a page has more than one text column, the text flow is often not recognised correctly, and the text is output as a single linear sequence instead.
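The column problem described above can be illustrated with a minimal Python sketch that contrasts a naive linear ordering of recognised text blocks with a column-aware ordering. The `Block` structure, the coordinates and the split at the page midline are hypothetical simplifications chosen for this example only; they are not taken from any OCR tool used by the partners, whose layout analysis is considerably more sophisticated.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A recognised text block with the top-left corner of its bounding box."""
    text: str
    x: float  # horizontal position of the left edge
    y: float  # vertical position of the top edge


def linear_order(blocks):
    """Naive order: top to bottom, then left to right (interleaves columns)."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]


def column_order(blocks, page_width):
    """Column-aware order: the whole left column first, then the right one.

    Assumes exactly two columns separated at the page midline (hypothetical).
    """
    middle = page_width / 2
    left = [b for b in blocks if b.x < middle]
    right = [b for b in blocks if b.x >= middle]
    return [b.text for b in sorted(left, key=lambda b: b.y)
            + sorted(right, key=lambda b: b.y)]


# Two columns with two lines each on a page that is 600 units wide.
blocks = [
    Block("L1", x=50, y=100),
    Block("R1", x=350, y=100),
    Block("L2", x=50, y=130),
    Block("R2", x=350, y=130),
]

print(linear_order(blocks))       # ['L1', 'R1', 'L2', 'R2']
print(column_order(blocks, 600))  # ['L1', 'L2', 'R1', 'R2']
```

In the linear result, a screen reader would jump between the two columns after every line, whereas the column-aware result preserves the intended reading order.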
If we assume the position of a blind person who uses speech synthesis to read, a text is unreadable without the proper interpretation of special characters, the specific sequence of the text, and the graphic elements with their corresponding descriptions. Perceivability, operability, understandability and robustness – defined by the World Wide Web Consortium (W3C) through the Web Accessibility Initiative (WAI), as part of the Web Content Accessibility Guidelines (WCAG) – are the umbrella criteria for making websites and digital material accessible to blind and partially sighted users, as well as other groups of users with disabilities. Within the framework of the aforementioned working group, we sought to approach these accessibility criteria, which also apply to born-digital material or e-books.

[1] The working group is led by the National and University Library (Slovenia).

The objectives of the working group were:
• to develop a common test sample (a selection of scans), including as many different textual and graphic elements as possible;
• to have all partner institutions test the sample in their further digitisation process;
• to create representative samples based on the test sample, using various tools and attempting to meet the criteria of the Web Content Accessibility Guidelines (WCAG) in one case;
• to compare all of the received results based on set accessibility criteria and thus identify the most appropriate solutions;
• to obtain more detailed information on digitisation workflows in partner institutions;
• to identify workflows and digitisation phases that allow segmentation and identification of textual and graphical elements with all of their properties.

The purpose of the testing was to identify the best solutions in the digitisation process, and to determine whether there is any further room for improvement in the provision of digitised materials that meet accessibility criteria.
This would allow libraries to review and, depending on the resources provided, improve digitisation workflows and user services. As mentioned above, sighted people do not need such precise processing of texts to be able to access the content of digitised works. Nonetheless, responsive technologies are also based on accurate OCR and sighted people also use screen readers that enable text to be read aloud to them. The entire testing was therefore based on criteria that are essential for the blind and partially sighted, thus following the principle of universal design (for everyone). In Chapter 5 of this document, possible solutions and recommendations are presented both for mobile device users and print-disabled users, in case libraries want to focus on just one group of users. The use of solutions for print-disabled users is, however, recommended.

2.2 Methodological approach

The test phase was conducted between February and July 2022 at all of the partner institutions. EODOPEN partners received a test sample (see Annex 1) and a blank test report questionnaire (see Annex 2) on which they reported the work done with the test sample. The aim of the testing was to find out which scanning and recognition workflows are optimal for achieving the best results in OCR, as well as to determine which file formats can be generated, as different file formats can provide users with different user experiences. For this purpose, it was decided that all of the partners would test the same samples containing English text, as none of the EODOPEN partners are located in regions where English is the native language. Using the same scans with English text would facilitate the comparative analysis of the results. In addition, some OCR tools are better adapted to majority languages (e.g., German), and we wanted to avoid discrimination against minority languages such as Slovenian, Estonian or Slovak.
Partners could subsequently conduct the same analysis on scans in their own language for additional testing of their systems.

The test sample consisted of 16 scans in TIFF format (see Annex 1), comprising both textual and non-textual elements, such as plain text, chapters and sub-chapters, columns, tables, footnotes, flowcharts, images and text accompanying images (captions). The special examples in the texts were chemical formulas, mathematical equations and special characters (£, °C, etc.). Two of the scans showed two pages in a single image: in the first, the title and the name of the author were spread across both pages, while the second contained the periodic table of chemical elements spread across both pages. Most of the scans had a complicated structure with elements that could disturb the text order (e.g., captions) or create problems with element recognition (e.g., tables). Only three of the scans (8, 9, 13) had a basic layout with text in one column and a picture, which would not be expected to cause difficulties with regard to reading order.

The partners used the test sample in their usual digitisation workflows, conducting the process from scan processing to the creation of the most common delivery format available in their digital library. The results were returned to the testing team at the National and University Library (Slovenia).

The test report questionnaire (see Annex 2) consisted of 14 questions enabling the project partners to record the work processes, software tools and solutions used when testing the sample. Reviewing the reports enabled us to learn more about the different stages of the digitisation workflow: scan import, image processing, OCR options (multilevel document analysis and recognition of elements), any additional processing, and exporting the final delivery format.
As digitisation processes are diverse, the questionnaire provides a framework that gives us an insight into the workflows of the individual institutions, especially with regard to the stages and levels of the digitisation process that lead to meeting the WCAG criteria or that produce better digitisation outputs for end users.

For the evaluation of the outputs, 24 criteria were prepared, based on WCAG for the optimal accessibility of documents and on other best practice guidelines, focusing primarily on accessibility for blind and partially sighted users. The criteria were established separately for each scan, as not all of them were applicable to all of the scans. Moreover, some of the criteria were specific to individual scans, as they can produce different results during the OCR process (e.g., page rotation and pagination–double). The criteria used to evaluate each output of the digitisation process were:

1. ALT-TEXT PICTURE – Alt-text or alternative text for pictures provides a textual description of non-text content (pictures, graphics, diagrams, etc.). It enables mostly blind users, but also partially sighted users, to know the content of the graphic material, so that they do not miss any information that the graphic material may be trying to convey. This criterion is primarily important to the blind and partially sighted, but could also be useful to sighted users using speech synthesis.
2. ALT-TEXT PICTURE (CHEMICAL FORMULA) – Same as the criterion Alt-text Picture, but used for the two special images in the test sample that presented molecular reactions (see Scan 2 in Annex 1).
3. CAPTION – Some of the images and tables in the scans contained captions. The document should indicate that the text is a caption associated with a picture and not general paragraph text.[2] This criterion is primarily important for the blind and partially sighted.
   [2] Captions can be inserted technically. In tagged PDFs, for example, a specific tag can be added in Adobe Acrobat Pro. When working in Microsoft Word, the “insert caption” option can be used.
4. FOOTNOTES – Footnotes are elements in a document that provide additional information related to the main text. They should be technically separated from the main text, thus giving readers the option of skipping them. When footnotes are created or edited, the result should enable the reader to jump from the main text to the footnote and then back to the same place in the text.[3] This criterion is mainly important for the blind and partially sighted.
   [3] Good results can be achieved, for example, in Microsoft Word, HTML or ePUB by providing two-way hyperlinks.
5. HEADING 1 – Mainly for navigational purposes, the headings of the chapters should be marked and structured in depth (Heading 1, 2, 3, etc.). Headings can also be used to form a table of contents. This enables users of assistive technologies to skip from chapter to chapter more easily, and thus to navigate within the document instead of reading the whole document. This criterion is important for all users.
6. HEADING 2 – See the criterion Heading 1.
7. HEADING 3 – See the criterion Heading 1.
8. INITIAL – A larger first letter at the beginning of a chapter is often not recognised or not recognised correctly (see Scan 7 in Annex 1). This criterion is important for all users.
9. LANGUAGE SEGMENTS – See the criterion Primary Language. The Language Segments criterion was used on six different occasions in the test sample (Italian + Latin, Italian, French twice and German twice) where text appeared in a language other than English, which was the primary language. The language is important for users of screen reading technologies, in which the voice settings can be switched to the correct language to provide proper pronunciation.[4] This criterion is primarily important for the blind and partially sighted, but could also be useful to sighted users using speech synthesis.
10. MATH (SIMPLE) – The recognition of mathematical and chemical elements was divided into two criteria, because simple mathematical elements that fit on a single line (example from the test sample: $e = e' - AB(t - t')$) cause fewer problems for OCR than advanced maths that spans more than one line. This criterion is primarily important for the blind and partially sighted.
11. MATH (ADVANCED) – The second criterion for mathematical and chemical elements covers all expressions that appear in two or more lines. These elements are not usually recognised correctly during OCR. This criterion includes all elements with subscripts or superscripts (examples from the test sample: $x^2$, $2H_2O$, $10^{-4}$, $C_6H_{12}O_6$), fractions (example from the test sample: $\frac{x}{3}$) or even more complicated expressions (examples from the test sample: $\Delta p = \rho_v g h$ or $\Delta p = \frac{2T\rho_v}{R(\rho_w - \rho_v)}$). The examples from the test sample contain various problematic elements (e.g., subscripts, superscripts, Greek letters and fractions). This criterion is primarily important for the blind and partially sighted.
12. OCR ERRORS (TEXT IN PICTURE 4) – One image showed text written on a tombstone (see Scan 7 in Annex 1). Ideally, text of this kind would not be recognised, but the goal was to see what kind of results would be obtained. This criterion is important for all users.
13. PAGE ROTATION – This criterion was only used in one case, where a table appeared horizontally on a page. For better OCR and structure results, the page could be turned so that the table faces the reader correctly. This criterion is important for all users.
14. PAGINATION – This criterion was created for the purposes of the blind and partially sighted.
Practice shows that blind and partially sighted users prefer the page number to be the first information they receive when entering a page. When working on text order, pagination should therefore come first, even if it actually appears at the bottom of the page. This criterion is primarily important for the blind and partially sighted, but could also be useful to sighted users using speech synthesis, or for easier navigation to a specific page in the document.
15. PAGINATION–DOUBLE – This criterion was used in two different cases in which content was stretched across two pages. The first case involved an image of the periodic table of elements, while the second case concerned the title and author of the article, which were stretched across two pages. In both cases, better results would be obtained if the pages were not split. This criterion is important for all users.
16. PICTURE – A graphic element that should be marked as a separate element and contain alt-text for users of assistive technologies. This criterion is primarily important for the blind and partially sighted.
   [4] For example, a German text that is read aloud with an English voice sounds strange.
17. PICTURE (CHEM. FORMULA) – Same as the criterion Picture. This was a separate criterion for two images that presented molecular reactions, which should also contain alt-text. This criterion is primarily important for the blind and partially sighted.
18. PRIMARY LANGUAGE – The primary language should be set for each document. This is important for users of screen reading technologies, which provide sound in the correct language. The text in the test sample was in English, so the primary language should be set to English. This criterion is mainly important for the blind and partially sighted, but could also be useful for sighted users using speech synthesis.
19. SPECIAL CHARACTER – This criterion appeared in three different cases (°C, £ and decimal numbers). The goal was to determine in how many outputs there would be problems recognising the first two characters. In the scan with decimal numbers, the numbers are written with an apostrophe (‘), which is not recognised with English language settings, because full stops (.) are normally used for decimal numbers in English. The scan was tested to see whether we would receive any correct results. This criterion is important for all users.
20. STAMP REMOVAL – Library stamps in books can affect the recognition of nearby characters. The goal was to determine whether removing the stamp from the scan would ensure clearer OCR in that area. In our example, the stamp was directly over the text, and we assumed that it would cause bad OCR results. This criterion is important for all users.
21. TABLE – This is a structural element that should be technically marked and should not appear as an image only. Following the structure, the table header and table rows should also be present.[5] This criterion is primarily important for the blind and partially sighted, but could also be useful to sighted users using speech synthesis.
22. TABLE HEADER – This is an element of a table that usually appears at the top of the table, but can also be in the first column of the table. It provides the main information about the data in the rows following it, and it helps users of assistive technologies to navigate and understand the table. This criterion is primarily important for the blind and partially sighted, but could also be useful to sighted users using speech synthesis.
23. TABLE ROWS – These are structural elements following the table header. For the test sample, which did not contain a grid to mark the lines in the table, it was interesting to see whether the rows had technical data inserted and how well the OCR tool could recognise the number of rows.
This criterion is primarily important for the blind and partially sighted, but could also be useful to sighted users using speech synthesis.
   [5] The structure of the table can be created technically. For example, in tagged PDFs, tags appear for table, table header, table rows and table data, much like in HTML formatting. Microsoft Word, for instance, also has the option to set a table header.
24. TEXT ORDER – This criterion establishes the flow of the text, especially when the structure on the page is more complicated (e.g., columns and additional graphical elements). When users copy text, convert the format or use assistive technology, it is important that the text is presented in the right order so as to prevent confusion (e.g., if a caption appears in the middle of a paragraph) and to avoid burdening users with the additional work of editing the content themselves. Some software tools for OCR also enable the order of the recognised elements to be corrected.[6] Furthermore, assistive technologies provide users with text linearly from top to bottom, so the text order is crucial for understanding and navigating the content. This criterion is important for all users.
   [6] The most frequently used software for OCR – Abbyy FineReader desktop version – has this option during the processing of the digitised content. For post-processing, an example of software of this kind is Adobe Acrobat Pro.

Table 1 provides an overview of the appearance of these 24 criteria in the test samples by scan number.

Table 1: The appearance of the criteria in the 16 scans of the test sample.

| Criteria\Scan | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | = |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| alt-text picture | 1 | 1 | 1 | 1 | 3 | 2 | 4 |   |   |   |   | 1 | 1 | 3 |   |   | 18 |
| alt-text picture (chem. formula) |   | 2 |   |   |   |   |   |   |   |   |   |   |   |   |   |   | 2 |
| caption | 1 | 1 | 1 | 1 | 3 | 2 | 4 |   |   | 1 | 2 | 1 | 1 |   |   | 1 | 19 |
| footnotes |   |   |   |   |   |   |   |   |   |   |   | 1 |   |   |   |   | 1 |
| heading 1 | 1 |   |   |   | 1 | 1 | 1 |   | 1 |   |   |   | 1 |   | 1 |   | 7 |
| heading 2 | 1 | 1 | 2 |   |   | 4 |   |   | 1 |   |   |   |   |   |   | 1 | 10 |
| heading 3 |   |   |   |   |   |   |   |   | 1 |   |   |   |   |   |   |   | 1 |
| initial |   |   |   |   |   |   | 1 |   |   |   |   |   |   |   |   |   | 1 |
| language segments |   |   |   |   |   | 1 | 1 |   |   | 1 | 1 | 1 | 1 |   |   |   | 6 |
| math (simple) | 1 | 1 |   |   |   |   |   |   |   |   |   | 1 |   |   |   |   | 3 |
| math (advanced) | 1 | 1 |   |   | 1 |   |   |   |   |   |   | 1 |   |   |   |   | 4 |
| OCR errors (text in picture 4) |   |   |   |   |   |   | 1 |   |   |   |   |   |   |   |   |   | 1 |
| page rotation |   |   |   |   |   |   |   |   |   | 1 |   |   |   |   |   |   | 1 |
| pagination | 1 | 1 |   | 1 | 1 |   | 1 | 1 | 1 | 1 | 1 | 1 | 1 |   |   | 1 | 12 |
| pagination–double |   |   |   | 1 |   |   | 1 |   |   |   |   |   |   |   |   |   | 2 |
| picture | 1 | 1 | 1 | 1 | 3 | 2 | 4 |   |   |   |   | 1 | 1 | 3 |   |   | 18 |
| picture (chem. formula) |   | 2 |   |   |   |   |   |   |   |   |   |   |   |   |   |   | 2 |
| primary language | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 16 |
| special character | 1 |   |   |   |   |   |   |   |   |   |   | 1 | 1 |   |   |   | 3 |
| stamp removal |   |   | 1 |   |   |   |   |   |   |   |   |   |   |   |   |   | 1 |
| table |   |   |   |   |   |   |   |   |   | 1 | 2 |   |   |   |   | 1 | 4 |
| table header |   |   |   |   |   |   |   |   |   | 1 | 2 |   |   |   |   | 1 | 4 |
| table rows |   |   |   |   |   |   |   |   |   | 1 | 2 |   |   |   |   | 1 | 4 |
| text order | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 16 |
| = | 11 | 13 | 8 | 7 | 14 | 14 | 20 | 3 | 6 | 9 | 12 | 11 | 9 | 8 | 3 | 8 |   |

Three levels were used to evaluate the set criteria:
• criterion was fully achieved (A): used if both the technical and the content part of the criterion were achieved. Example: table rows were technically correct (the table contained the right number of rows and each row contained the correct content).
• criterion was partly achieved (B): used if either the technical or the content part of the criterion was achieved, but not both, or if there was a very minor mistake in the criterion. Example 1: alt-text is technically correct, but the content is either the text of the caption or other surrounding text. Example 2: there was a minor mistake in the text order.
• criterion was not achieved (blank cell): used if neither the technical nor the content part of the criterion was achieved. Example: pagination was present, but was not the first element on the page.
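The three evaluation levels described above can be expressed as a small decision rule. The following Python sketch is only an illustration of that rule, not the tool used by the testing team; the names `Level` and `rate` and the `minor_mistake` flag are hypothetical and were introduced for this example.

```python
from enum import Enum


class Level(Enum):
    """The three evaluation levels; the values mirror the report's notation."""
    FULLY = "A"
    PARTLY = "B"
    NOT_ACHIEVED = ""  # shown as a blank cell in the result tables


def rate(technical_ok: bool, content_ok: bool, minor_mistake: bool = False) -> Level:
    """Map the technical and content parts of one criterion to a level."""
    if technical_ok and content_ok:
        # Both parts are present; a very minor mistake downgrades A to B.
        return Level.PARTLY if minor_mistake else Level.FULLY
    if technical_ok or content_ok:
        # Only one of the two parts was achieved.
        return Level.PARTLY
    return Level.NOT_ACHIEVED


# Alt-text tag exists (technical part) but holds the caption text (content part).
print(rate(technical_ok=True, content_ok=False).value)  # B
```

In practice the evaluators also applied judgement that such a rule cannot capture, so the sketch should be read as a summary of the definitions rather than as a replacement for manual review.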
The evaluation was undertaken using various software tools, depending on the output format:
• DROID – used to determine the version of the format (for example, PDF 1.4 or RTF 1.9);
• Adobe Acrobat Reader Pro – used to check the content of PDFs, as well as the reading order and structure (tags) when the PDF was tagged;
• PDF Accessibility Checker 2021 – used to check which errors were found in the PDF file according to the standards and whether the language of the document was set; when the PDF was tagged, the reading order and structure (tags) were also checked;
• Thorium Reader – used to check the content of an ePUB file and to determine which options it enables with regard to visual adjustments and navigation (if a table of contents was available within the software);
• Sigil – used to check the content of an ePUB file, as well as the reading order and structure (HTML tags);
• Microsoft Office Word – used to check the content of docx and RTF files;
• Notepad – used to check the content of TXT files;
• Windows Narrator – this speech synthesis software was used only in special cases to check how the content is provided to the user.

3 Test results

A total of 23 test outputs were received from 13 partner institutions. These include automatically generated outputs (17), as well as outputs containing additional manual corrections (6). The software packages used for testing the samples were: ABBYY FineReader, ABBYY FineReader 11, ABBYY Recognition server 4, ABBYY Recognition server 14, ScanGate by Treventus Mechatronics, ABBYY FineReader PDF 15 Standard, Abbyy Finereader 15 desktop version, Adobe Acrobat Pro, IRIS OCR, LIMB processing, Microsoft Office Word, Scan Tailor Advanced v1.01.16, Tesseract 5.0.0-beta-20210815-22-g386dd, Photoshop 23.2.2., Project PERO OCR and WordToEpub (refer to Table 2).
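Some of the structural checks that were carried out manually on ePUB content (document language, alt-text, heading levels) can also be approximated with a script. The sketch below uses only the Python standard library and runs on a single (X)HTML content file; the class name `AccessibilityScan`, the helper `scan` and the sample markup are assumptions made for this illustration and were not part of the evaluation described in this report.

```python
from html.parser import HTMLParser


class AccessibilityScan(HTMLParser):
    """Collects a few structural facts from one (X)HTML content file."""

    def __init__(self):
        super().__init__()
        self.language = None          # value of lang / xml:lang on <html>
        self.images_without_alt = 0   # <img> with missing or empty alt
        self.heading_levels = []      # e.g. [1, 2, 2, 3] in document order

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "html":
            self.language = attributes.get("lang") or attributes.get("xml:lang")
        elif tag == "img":
            if not (attributes.get("alt") or "").strip():
                self.images_without_alt += 1
        elif len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.heading_levels.append(int(tag[1]))


def scan(markup: str) -> AccessibilityScan:
    """Parse the markup and return the collected facts."""
    parser = AccessibilityScan()
    parser.feed(markup)
    parser.close()
    return parser


# Hypothetical sample page: one image has alt-text, the other does not.
sample = """<html lang="en"><body>
<h1>Chapter</h1><h2>Section</h2>
<img src="a.png" alt="Periodic table of the elements"/>
<img src="b.png"/>
</body></html>"""

result = scan(sample)
print(result.language, result.images_without_alt, result.heading_levels)
# en 1 [1, 2]
```

A script of this kind can only confirm that an attribute or tag is present; whether the alt-text actually describes the picture, or whether a heading has the right level, still has to be judged by a person, which is why the criteria distinguish between a technical and a content part.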
Table 2: Software overview for automatically generated outputs

| No. | PARTNER | SOFTWARE USED | GENERATED FORMATS |
|---|---|---|---|
| 1 | UIBK | ODM - Abbyy FineReader recognition server 4 | PDF |
| 2 | UIBK | ODM - Abbyy FineReader recognition server 4 | PDF/A |
| 3 | UIBK | ODM - Abbyy FineReader recognition server 4 | RTF |
| 4 | UT | ABBYY FineReader PDF 15 Standard; ABBYY FineReader Server 14 | PDF |
| 5 | NUK | Abbyy FineReader | PDF |
| 6 | MZK – small edited | Project Pero OCR | Page and Alto format (+TXT with plain text) – TXT tested |
| 7 | MZK - edited | Project Pero OCR | Page and Alto format (+TXT with plain text) – TXT tested |
| 8 | UG | Abbyy FineReader | PDF |
| 9 | UG | Abbyy FineReader | EPUB |
| 10 | NLS | Abbyy FineReader 11, Limb Processing | PDF |
| 11 | NCU | Abbyy FineReader Server 14.0 | PDF/UA |
| 12 | VKOL | ScanTailor Advanced v1.01.16, Tesseract 5.0.0-beta-20210815-22-g386dd | xml + PDF (no OCR) and txt; link shared to digital library – tested TXT |
| 13 | BNP | LIMB Processing, IRIS OCR | PDF |
| 14 | NLE | For books files: Abbyy FineReader 11, Abbyy Recognition Server 4. For newspaper/periodicals: Abbyy FineReader 11, CCS docWorks 7.1.0.90, Abbyy FineReader 12 OCR-engine | PDF/A |
| 15 | OSZK | Scans 1-6, 8-11, 14-15: ScanTailor Advanced (1.0.16), Photoshop (v 23.2.2), Abbyy Recognition Server 4.0. Scans 7, 13, 16: Photoshop (v 23.2.2), Abbyy Recognition Server 4.0. Scan 12: ScanTailor Advanced (1.0.16), Abbyy Recognition Server 4.0 | PDF |
| 16 | CVTI SR | ScanGate by Treventus Mechatronics, Abbyy Recognition Server 4.0 | PDF |
| 17 | UREG | Abbyy Recognition Server 4.0 | PDF |

In the outputs with additional manual corrections, Microsoft Office Word (PDF 1.7, RTF 1.9, docx and ePUB 3.0) was mostly used for editing OCR errors and adding structural elements. In one test output, Adobe InDesign was used to edit headers, captions, original page numbers and footnotes. In another test output, the automatically generated PDF was additionally manually processed with Adobe Acrobat Pro, which tagged the content and edited the document’s reading order (see Picture 1).
In a received output processed with the latest desktop version of Abbyy FineReader 15, the page elements were additionally manually edited and the reading order was corrected (see Picture 2). Another received output used the WordToEpub tool to convert a manually edited Word file to an ePUB file (refer to Table 3).

Table 3: Software overview for outputs containing additional manual corrections

| No. | PARTNER | SOFTWARE USED | GENERATED FORMATS |
|---|---|---|---|
| 1 | UIBK | Abbyy FineReader 14, Adobe Indesign | RTF |
| 2 | NUK | Adobe Acrobat Pro | PDF |
| 3 | NUK | Abbyy FineReader 15, Adobe Acrobat Pro | PDF |
| 4 | NUK | Microsoft Office Word, Adobe Acrobat Pro | PDF |
| 5 | NUK | Microsoft Office Word, WordToEpub, Sigil | EPUB |
| 6 | BNP | LIMB Processing, IRIS OCR | DOCX |

Picture 1: Screenshot of only automatically tagged content in Adobe Acrobat Pro before the tags were edited. On the left side, all of the tags are visible in the order they appear on the page, with each tag representing a specific box on the right side. At this point, heading levels are not yet fixed and the order has not been checked.

Picture 2: Screenshot of edited elements on the page and fixed reading order in Abbyy FineReader 15. Elements are presented in colours: green for text and red for picture. The order numbers are visible on the top left of each element.

The formats of the provided outputs were:
• PDF (15):
  - Automatically generated outputs (12): 6 outputs were in version 1.4, 4 outputs were in version 1.5 (of which 1 was according to the PDF/UA standard), 1 output was in version 1.6 and 1 output was in version 1.7. Of these 12 PDF outputs, 5 were tagged PDFs.
  - Outputs with additional manual corrections (3): 1 output was in version 1.5 and was according to the PDF/UA standard, 1 output was in version 1.6 and 1 output was in version 1.7. All three of these outputs were tagged PDFs.
• XMLs with TXT (3): all 3 outputs were automatically generated. Evaluation was later done on TXT only.
• ePUB (2):
  - Automatically generated outputs (1): the output was in version 2.0.
  - Outputs with additional manual corrections (1): the output was in version 3.0.
• RTF (2):
  - Automatically generated output (1): the output was in version 1.5-1.6.
  - Outputs with additional manual corrections (1): the output was in version 1.9.
• DOCX (1) – the output was in the 2007 onwards version.

The following summary is based on a review of the completed test report questionnaires:
• For 8 outputs, partners reported that they made some changes before importing the scans into their system. These changes concerned changing the resolution to 300 dpi, converting 3 files because there were some problems with uploading, rotating and cropping certain images, and using Image frames and JPEG-Compression.
• When asked which image processing steps were used when working with the sample, partners replied that they used: automatic deskewing (in 9 examples), manual deskewing (in 2 examples), automatic and manual deskewing (in 4 examples), automatic cropping (in 3 examples), manual cropping (in 7 examples), automatic and manual cropping (in 4 examples), line straightening (in 2 examples), noise removal (in 1 example), contrast enhancement (in 2 examples), correction of geometric distortion (in 0 examples), binarization (in 1 example), removal of stamps and written notes (in 1 example), and equalising the dimensions (in 3 examples).
• With regard to OCR, partners mainly used the English language (13 examples), but they also used Latin (3 examples) and more than one language (4 examples). It was reported that machine learning was used for OCR in only 2 examples. For 14 examples, only automatic OCR recognition was used, while manual corrections were used for 5 examples.
• Regarding layout analysis, partners reported that they marked paragraphs (5 examples), columns (4 examples), headers (2 examples), images (3 examples), background images (2 examples) and tables (4 examples).
• Regarding reading order, partners reported that they worked on reading order for 4 examples, while no work was done on reading order in 15 examples.
• Regarding fixing OCR mistakes, it was reported that mistakes were corrected for 3 examples and were not corrected for 17 examples.

3.1 Results by sample scan number

The tables below show the test results by all partners regarding the different criteria in each of the sample scans. The test outputs have been classified into automatically generated outputs (17) and outputs that were additionally manually corrected (6). The results were fully achieved, partially achieved or not achieved. The additionally manually corrected test outputs were delivered in addition to the automatically generated test outputs.

Table 4: Results of Scan 1
CRITERIA | AUTOMATICALLY generated outputs: FULLY achieved, PARTIALLY achieved, NOT achieved; Additional MANUAL correction: FULLY achieved, PARTIALLY achieved
Pagination | 10 0 7 6
Text order | 7 2 8 6
Heading 1 | 2 0 16 5
Heading 2 | 1 0 18 4
Picture | 4 0 14 5
Alt-text picture | 0 1 18 3 1
Caption | 0 0 20 3
Math (simple) | 10 0 7 6
Math (advanced) | 0 0 19 4
Special character | 11 0 6
Additional observations:
- BNP (PDF) text order – pagination disturbs flow of text
- MZK (edited XML, TXT) text order – captions appear at the end of the whole text
- NLE (PDF) text order – recognised text from right to left, top to bottom
- OSZK (PDF) text order – trouble with recognition of columns – order from right to left, top to bottom
- UG (PDF) math simple – only one + is wrongly recognised
- VKOL (XML, TXT) text order – pagination interrupts the text order (it is placed before captions)

Table 5: Results of Scan 2
CRITERIA | AUTOMATICALLY generated outputs: FULLY achieved, PARTIALLY achieved, NOT achieved; Additional MANUAL correction: FULLY achieved, PARTIALLY achieved
Pagination | 16 1 6
Text order | 11 3 3 6
Heading 2 | 1 16 4
Picture | 4 13 5
Alt-text picture | 1 16 3 1
Alt-text picture (chem. formula) | 17 3 1
Alt-text picture (chem. formula) | 17 3 1
Picture (chem. formula) | 5 15 5
Picture (chem. formula) | 17 5
Caption | 3 1 3
Math (simple) | 15 2 6
Math (advanced) | 1 16 4
Special character |
Additional observations:
- NLE (PDF) math – 0 appears instead of O
- OSZK (PDF) text order – trouble with recognition of columns – order from right to left, top to bottom
- OSZK (PDF) math – 0 appears instead of O
- UG (PDF) picture chem. – one picture is not recognised
- UIBK (ODM PDF) text order – not all text is recognised
- UIBK (ODM RTF) math advanced – some examples are done correctly, but not all
- UT (PDF) picture chem. – neither of the two chemistry pictures are recognised

Table 6: Results of Scan 3
CRITERIA | AUTOMATICALLY generated outputs: FULLY achieved, PARTIALLY achieved, NOT achieved; Additional MANUAL correction: FULLY achieved, PARTIALLY achieved
Text order | 7 2 8 6
Heading 2 | 1 16 4
Heading 2 | 1 16 4
Picture | 5 12 5
Alt-text picture | 1 15 3 1
Caption | 1 3
Stamp removal | 3 14 4
Math (simple) |
Additional observations:
- MZK (edited XML, TXT) text order – captions appear at the end of the whole text
- MZK (small edited XML, TXT) text order – the first paragraph appears at the end
- OSZK (PDF) text order – columns are not recognised, so the text flows in rows from left to right
- OSZK (PDF) text order – trouble with recognition of columns – order from right to left, top to bottom
- UG (PDF) caption – a tag is created, but it does not contain the right text
- UIBK (RTF) pagination – the original does not have pagination here
- UIBK (ODM PDF) text order – the order in the PDF is not correct – it flows from right to left
- UIBK (ODM RTF) text order – not all of the text is recognised
- VKOL (XML, TXT) text order – the caption columns are switched (the right column appears before the left one)

Table 7: Results of Scan 4
CRITERIA | AUTOMATICALLY generated outputs: FULLY achieved, PARTIALLY achieved, NOT achieved; Additional MANUAL correction: FULLY achieved, PARTIALLY achieved
Pagination double | 13 4 6
Pagination | 10 6 5 1
Text order | 6 10 3 2
Picture | 17 5
Alt-text picture | 17 3
Caption | 17 2
Additional observations:
- OSZK (PDF) text order – trouble with recognition of the columns – order from right to left, top to bottom
- UIBK (RTF) – it is unclear how the chemical elements were presented in the table

Table 8: Results of Scan 5
CRITERIA | AUTOMATICALLY generated outputs: FULLY achieved, PARTIALLY achieved, NOT achieved; Additional MANUAL correction: FULLY achieved, PARTIALLY achieved
Pagination | 12 5 6
Text order | 6 5 6 6
Heading 1 | 1 1 15 5
Picture | 6 11 5
Picture | 4 2 11 5
Picture | 6 11 5
Alt-text picture | 1 3 1
Caption | 1 2 14 1 3
Caption | 1 2 14 1 3
Caption | 1 3
Math (advanced) | 17 4
Additional observations:
- NCU (PDF) heading 1 – the heading is marked, but as a level 3 heading
- NCU (PDF) caption – one caption is missing (it is marked as a paragraph)
- NLS (PDF) text order – minor mistake in text order
- OSZK (PDF) text order – trouble with recognition of the columns – order from right to left, top to bottom
- UG (PDF) text order – the first two captions are switched (the second caption appears before the first one)
- UG (PDF) caption – a tag is created, but the content is switched between the first two captions; the third caption is tagged, but the content is not correct
- UG (PDF) picture – the scheme is divided into five pictures
- UIBK (ODM PDF) text order – minor mistakes
- UIBK (ODM RTF) text order – missing text
- UT (PDF) text order – the first two captions are switched (the second caption appears before the first one)
- UT (PDF) caption – a tag is created, but the content is switched between the first two captions; the third caption is missing
- VKOL (XML, TXT) text order – the first two captions are switched (the second caption appears before the first one)

Table 9: Results of Scan 6
CRITERIA | AUTOMATICALLY generated outputs: FULLY achieved, PARTIALLY achieved, NOT achieved; Additional MANUAL correction: FULLY achieved, PARTIALLY achieved
Text order | 1 5 11 5 1
Heading 1 | 2 15 5
Heading 2 | 1 16 4
Heading 2 | 17 4
Heading 2 | 17 3
Heading 2 | 17 3
Picture | 3 14 5
Picture | 3 14 5
Alt-text picture | 1 16 3
Alt-text picture | 1 16 3
Caption | 2 15 3
Caption | 1 1 15 3
Language segments | 17 2 1
Additional observations:
- BNP (PDF) text order – caption interrupts the flow of the text
- BNP (DOCX) text order – one picture is misplaced
- CVTI SR (PDF) text order – minor mistake in text order (the caption before the last line)
- CVTI SR (PDF) – the scan has better contrast due to the white background, which is better for OCR as well as for users
- MZK (edited XML, TXT) text order – minor mistake – the first caption is at the end of the whole text
- MZK (small edited XML, TXT) text order – the columns are not detected, the text follows in one straight line
- NCU (PDF) heading 2 – there are four occurrences of heading 2, but only one is marked (the first one)
- NLE (PDF) text order – minor mistake in text order (caption before the last line) and the author appears after the title
- OSZK (PDF) text order – mixture of text order, horizontal and vertical
- UG (PDF) heading 2 – the heading is wrongly tagged
- UG (PDF) caption – both captions are tagged, but one does not have the right text
- UG (EPUB) text order – incorrect text order (there is one column from the top to the bottom of the whole page, then a second column and a third column, etc.)
- UREG (PDF) text order – minor mistake in text order (the caption before the last line)
- VKOL (XML, TXT) text order – the caption interrupts the text order (it is placed before the third column)

Table 10: Results of Scan 7
CRITERIA | AUTOMATICALLY generated outputs: FULLY achieved, PARTIALLY achieved, NOT achieved; Additional MANUAL correction: FULLY achieved, PARTIALLY achieved
Pagination double | 10 7 2
Pagination | 17 5
Text order | 1 16 5
Heading 1 | 1 16 4
Picture | 4 13 5
Picture | 4 13 5
Picture | 4 13 5
Picture | 1 2 14 5
Alt-text picture | 1 16 3 1
Alt-text picture | 1 16 3 1
Alt-text picture | 1 16 3 1
Alt-text picture | 1 16 3 1
Caption | 2 1 14 1 3
Caption | 2 14 1 3
Caption | 1 2 14 1 3
Caption | 2 14 1 3
Initial | 8 9 6
Language segm. | 17 3
OCR errors (text in picture 4) | 4 13 6
Additional observations:
- MZK (small edited XML, TXT) text order – the main text is correct, but the picture captions interrupt the flow of the text and the text from the image is also captured
- NCU (PDF) caption – the fourth caption has the wrong text (the text is taken from the picture)
- NCU (PDF) heading 1 – the title and author are marked as heading 3 and heading 4, respectively
- NCU (PDF) picture – the fourth picture is only half recognised (probably because of the text in the picture)
- NCU (PDF) initial – the initial is marked as a picture with alt-text, which is “Figure without the caption”
- NLS (PDF) text order – not all of the text is OCR recognised (the text in the last two captions and the text in the last column is omitted)
- UG (PDF) caption – all four captions are tagged, but two do not have the right text
- UG (PDF) picture – the fourth picture is only half recognised (probably because of the text in the picture)
- UT (PDF) caption – 2 of 4 captions are tagged (and have the right text)
- VKOL (XML, TXT) text order – captions interrupt the text order, the main title is placed within the text

Table 11: Results of Scan 8
CRITERIA | AUTOMATICALLY generated outputs: FULLY achieved, PARTIALLY achieved, NOT achieved; Additional MANUAL correction: FULLY achieved, PARTIALLY achieved
Pagination | 15 2 5
Text order | 16 1 6
Additional observations:
- NCU (PDF) heading 1 – the heading is marked as heading 5
- NLE (PDF) text order – the page number and heading are not included

Table 12: Results of Scan 9
CRITERIA | AUTOMATICALLY generated outputs: FULLY achieved, PARTIALLY achieved, NOT achieved; Additional MANUAL correction: FULLY achieved, PARTIALLY achieved
Pagination | 17
Text order | 13 4 6
Heading 1 | 1 1 15 4
Heading 2 | 1 16 4
Heading 3 | 1 16 3
Additional observations:
- NCU (PDF) heading 1, 2, 3 – the parallel title is marked as heading 5
- UG (PDF) heading 1, 2, 3 – all three headings are tagged correctly, but there is an error due to a parallel title that should also be heading 1

Table 13: Results of Scan 10
CRITERIA | AUTOMATICALLY generated outputs: FULLY achieved, PARTIALLY achieved, NOT achieved; Additional MANUAL correction: FULLY achieved, PARTIALLY achieved
Pagination | 5 12 5 1
Text order | 17 5
Caption | 2 15 2
Table | 2 1 14 5
Table header | 1 16 4
Rows | 1 16 5
Language segments | 17 2
Page rotation | 4 1 12 5
Additional observations:
- BNP (PDF) page rotation – there is a remark that the page is not rotated visually, but OCR is rotated and correctly recognised
- MZK (edited XML, TXT) text order – heading 1 and page number are not recognised
- MZK (small edited XML, TXT) text order – the text is not correctly recognised (columns instead of rows)
- NCU (PDF) caption – the caption is tagged, but it does not have the right text
- NCU (PDF) table – best result without manual corrections!
- NLS (PDF) text order – only the table title is OCR recognised
- OSZK (PDF) table – a table is created, but without content
- UG (PDF) caption – the caption is tagged, but it does not have the right text

Table 14: Results of Scan 11
CRITERIA | AUTOMATICALLY generated outputs: FULLY achieved, PARTIALLY achieved, NOT achieved; Additional MANUAL correction: FULLY achieved, PARTIALLY achieved
Pagination | 16 1 6
Text order | 5 6 6 5
Caption | 17 1 2
Table | 8 9 5
Table header | 1 16 4
Rows | 3 4 10 5
Caption | 17 1 2
Table | 8 9 5
Table header | 1 16 4
Rows | 3 4 10 5
Language segments | 17 1 1
Additional observations:
- CVTI SR (PDF) table rows – some minor errors in the recognised table rows (there is a problem with two or three lines in one row)
- MZK (edited XML, TXT) text order – the titles of the rows are recognised first, followed by the columns from left to right (not the rows)
- MZK (small edited XML, TXT) text order – the titles of the rows are recognised first, followed by the columns from left to right (not the rows)
- NCU (PDF) table rows – minimal error in row recognition
- UG (PDF) table rows – rows are tagged, but incorrectly (should be 13 rows but only 4 are tagged)
- UG (PDF) caption – the caption is tagged as heading 3
- UG (EPUB) table rows – table rows are incorrectly formulated/recognised
- UIBK (ODM PDF) text order – some minor errors in the recognised table rows (problem with two or three lines in one row)
- UT (PDF) table rows – minimal errors in row recognition
- VKOL (XML, TXT) table rows – some minor errors in the recognised table rows (there is a problem with two or three lines in one row); in the second table, the first row is placed at the end

Table 15: Results of Scan 12
CRITERIA | AUTOMATICALLY generated outputs: FULLY achieved, PARTIALLY achieved, NOT achieved; Additional MANUAL correction: FULLY achieved, PARTIALLY achieved
Pagination | 15 2 6
Text order | 3 1 13 6
Picture | 6 11 5
Alt-text picture | 1 16 3
Caption | 2 15 1 3
Math (simple) | 8 9 6
Math (advanced) | 17 4
Special character | 17 6
Footnotes | 17 2 2
Language segments | 17 2
Additional observations:
- MZK (edited XML, TXT) text order – the captions appear at the end of the whole text
- NCU (PDF) picture – the picture is divided into two parts
- UIBK (ODM RTF) pagination – the page number is in a text block that is not detectable by assistive technologies

Table 16: Results of Scan 13
CRITERIA | AUTOMATICALLY generated outputs: FULLY achieved, PARTIALLY achieved, NOT achieved; Additional MANUAL correction: FULLY achieved, PARTIALLY achieved
Pagination | 17 3
Text order | 17 6
Heading 1 | 17 4
Picture | 6 11 5
Alt-text picture | 1 16 3
Caption | 1 16 1 3
Language segments | 17 3
Special character | 2 15 2 1
Additional observations:
- BNP (PDF) text order – a figure interrupts the flow of the text
- CVTI SR (PDF) pagination – page number not recognised
- UT (PDF) caption – the caption is tagged, but the page number is also included in the text

Table 17: Results of Scan 14
CRITERIA | AUTOMATICALLY generated outputs: FULLY achieved, PARTIALLY achieved, NOT achieved; Additional MANUAL correction: FULLY achieved, PARTIALLY achieved
Text order | 2 15 6
Picture | 4 13 5
Picture | 3 1 13 5
Picture | 3 1 13 5
Alt-text picture | 1 16 3 1
Alt-text picture | 1 16 3 1
Alt-text picture | 1 16 3 1
Additional observations:
- UG (EPUB) – the page is doubled
- UT (PDF) picture – two of three pictures are tagged; the second and third pictures are merged into one
- UT (PDF) text order – the third paragraph is marked as a caption

Table 18: Results of Scan 15
CRITERIA | AUTOMATICALLY generated outputs: FULLY achieved, PARTIALLY achieved, NOT achieved; Additional MANUAL correction: FULLY achieved, PARTIALLY achieved
Text order | 5 1 11 4 1
Heading 1 | 17 4
Additional observations:
- BNP (PDF) text order – minor mistakes in text order
- NCU (PDF) text order – some chapters are marked as a list
- UIBK (RTF) text order – we do not think this page should be represented as a table
- UT (PDF) text order – the chapters are marked as a list

Table 19: Results of Scan 16
AUTOMATICALLY generated outputs Additional MANUAL correction CRITERIA PARTIALLY PARTIALLY FULLY achieved NOT achieved FULLY achieved achieved achieved Pagination 13 4 6 Text order 12 3 2 6 Heading 2 17 4 Table 5 1 11 5 Table header 1 16 4 Rows 2 4 11 5 Caption 3 14 3
Additional observations:
- MZK (edited XML, TXT) text order – most of the numbers in the table are missing
- MZK (small edited XML, TXT) text order – some text from the table is missing
- NCU (PDF) table header – the table header is marked but has the wrong text
- NCU (PDF) caption – the caption is tagged but has the wrong text
- NLS (PDF) text order – numbers in the table and the table caption are not recognised
- UG (PDF) caption – the caption is tagged but does not have the right text
- UG (EPUB) text order – the top cells of the table are missing
- UIBK (ODM PDF) text order – the top rows of the table are missing
- UT (PDF) caption – the caption is tagged but does not have the right text
General observations:
- NCU (PDF) – all of the pictures have alt-text, but the content is not correct (the text is the content of the caption)
- NCU (PDF) – most of the headings are marked, but the levels are not correct in some cases
- NCU (PDF) – Scans 13 and 15 are doubled: OCR and whole page layout picture
- NLE (PDF) – mixed text order in most scans
- NLE (PDF) – OCR works much better for newspapers than for monographs
- UG (EPUB) – the file for Scans 10–12 has a table of contents
- UIBK (RTF) – heading 1 and 2 should be used; “titel mit abstand” was used as well as its copy for heading 2
- UIBK (RTF) – no pictures were included
- UIBK (RTF) – alt-text is included with the text instead of behind a picture (no pictures)
- UIBK (RTF) – the caption is marked, but not with the function (“insert caption”)
- UT (PDF) – none of the scans are cropped, but we noticed that OCR recognised some characters from the next page
- VKOL (XML, TXT) – the original pagination is marked in the top right corner of each scan in the digital library portals

3.2 Results according to criteria

Tables 20 and 21 show the results according to each of the established criteria. For easier understanding, the top results for each criterion (shown in bold) are further described.

Table 20: Results according to criteria for the automatically generated outputs RTF Ref. XML XML XML PDF PDF PDF PDF PDF PDF PDF PDF PDF PDF PDF PDF 1.5- ePUB File formats no. AND AND AND 1.4 1.4 1.4 1.4 1.4 1.4 1.5 1.5 1.5 1.5/UA 1.6 1.7 1.6 2.0 TXT TXT TXT (ODM) EODOPEM CVTI MZK MZK NUK NLE NLS UREG UIBK* OSZK UIBK* UG* NCU* UT* BNP VKOL UIBK UG Partners SR ED. ed. Alt-text 17B 18 picture Alt-text chemical 2 formula 5A 9A 1A Caption 19 9B 3B 6B Footnotes 1 Heading 1 7 4A 2A 3B Heading 2 10 1A 5A Heading 3 1 1A Initial 1 1A 1A 1A 1A 1A 1A 1A 1A Language 6 segment Math. 3 3A 2A 3A 2A 3A 2A 2A 2A 2A 3A 2A 3A 1A 1A 2A (simple) Math. (adv.) 4 1B OCR errors 1 1A 1A 1A 1A Page 1 1A 1A 1A 1B 1A rotation Pagination 12 9A 6A 5A 8A 8A 7A 6A 7A 9A 8A 8A 6A 7A 7A 8A 2A 1A Pagination 2 1A 2A 2A 1A 2A 2A 2A 2A 1A 2A 2A 2A 2A double 15A 13A 14A Picture 18 6A 6A 16A 1B 2B 2B 1B Picture chem.
2 1A 1A formula Primary 1 1A 1A language Special 3 2A 2A 2A 2A 2A 1A 2A 1A 2A 2A 2A 1A 2A 2A 2A 1A 2A character Stamp 1 1A 1A 1A removal 2A 2A Table 4 3A 3A 3A 4A 3A 3A 1B 1B Table 4 3A 1B header 1A 2A 2A Table rows 4 2A 1B 2A 1B 2B 3B 2B 2B 1B 2A 5A 10A 9A 8A 6A 7A 6A 3A 6A Text order 16 7A 9A 1B 3A 3B 5A 3A 3B 12A 10A 2B 3B 1B 1B 2B 5B 3B 2B 3B 1B Used codes: A = fully achieved criterion; B = partly achieved criterion; empty cell = criterion was not achieved; * = tagged PDF 36 Table 21: Results according to criteria for the outputs with additional manual corrections PDF 1.6 PDF Ref. PDF 1.7 DOCX ePUB 3.0 File formats ADOBE ACROBAT 1.5/UA RTF 1.9 no. WORD 2007- WORD PRO ABBYY 15 EODOPEM NUK* NUK* NUK* UIBK BNP NUK Partners Alt-text 18 18A 13B 18A 18A picture Alt-text chemical 2 2A 2B 2A 2A formula Caption 19 11A 1B 19B 14B 19B Footnotes 1 1B 1B 1A 1A Heading 1 7 7A 3A 7A 7A 7A Heading 2 10 10A 1A 10A 8A 10A Heading 3 1 1A 1A 1A Initial 1 1A 1A 1A 1A 1A 1A Language 2A 5A 6 6A segment 1B 1B Math. 3 3A 3A 3A 3A 3A 3A (simple) Math. (adv.) 4 4A 4A 4A 4A OCR errors 1 1A 1A 1A 1A 1A 1A Page rotation 1 1A 1A 1A 1A 1A Pagination 12 11A 8A 1B 12A 10A 1B 9A 12A Pagination 2 1A 1A 2A 2A 1A 1A double Picture 18 18A 18A 18A 18A 18A Picture chem. 2 2A 2A 2A 2A 2A formula Primary 1 1A 1A 1A 1A 1A 1A language Special 3 2A 2A 2A 2A 2A 2A character Stamp 1 1A 1A 1A 1A removal Table 4 4A 4A 4A 4A 4A Table header 4 4A 4A 4A 4A Table rows 4 4A 4A 4A 4A 4A 13A Text order 16 14A 15A 15A 1B 15A 1B 15A 1B 1B Used codes: A = fully achieved criterion; B = partly achieved criterion; empty cell = criterion was not achieved; * = tagged PDF • ALT-TEXT PICTURE (18) The best result among the automatically generated outputs was achieved by the PDF/UA format from the Nicolaus Copernicus University in Torun (17B). The best result among outputs with additional manual corrections was achieved by the docx format from the National Library of Portugal (18A). 
The same result was also achieved by the PDF and ePUB formats created by Microsoft Word with manual corrections by the National and University Library (Slovenia).
• ALT-TEXT PICTURE (CHEM. FORMULA) (2)
None of the automatically generated outputs achieved this criterion, but the best result among the outputs with additional manual corrections was achieved by the docx format from the National Library of Portugal (2A). The same result was also achieved by the PDF and ePUB formats created by Microsoft Word with manual corrections by the National and University Library (Slovenia).
• CAPTION (19)
The best result among the automatically generated outputs was achieved by the PDF/UA format from the Nicolaus Copernicus University in Torun (9A 3B) and the PDF format from the University of Greifswald (5A 9B). The best result among the outputs with additional manual corrections was achieved by the PDF/UA format created by the latest desktop version of ABBYY FineReader from the National and University Library (Slovenia) (11A 1B).
• FOOTNOTES (1)
None of the automatically generated outputs achieved this criterion, but the best result among the outputs with additional manual corrections was achieved by the RTF format from the University of Innsbruck (1A). The same result was also achieved by the ePUB format created by Microsoft Word with manual corrections by the National and University Library (Slovenia).
• HEADING 1 (7)
The best result among the automatically generated outputs was achieved by the PDF format from the University of Greifswald (4A). The best result among the outputs with additional manual corrections was achieved by the docx format from the National Library of Portugal (7A). The same result was also achieved by the PDF format with manually corrected tags in Adobe Acrobat Pro, as well as by the PDF and ePUB formats created by Microsoft Word with manual corrections, all three of which were from the National and University Library (Slovenia).
• HEADING 2 (10)
The best result among the automatically generated outputs was achieved by the PDF/UA format from the Nicolaus Copernicus University in Torun (5A). The best result among the outputs with additional manual corrections was achieved by the PDF with manually corrected tags in Adobe Acrobat Pro, as well as by the PDF and ePUB formats created by Microsoft Word with manual corrections, all three of which were from the National and University Library (Slovenia) (10A).
• HEADING 3 (1)
The best result among the automatically generated outputs was achieved by the PDF format from the University of Greifswald (1A). The best result among the outputs with additional manual corrections was achieved by the PDF with manually corrected tags in Adobe Acrobat Pro, as well as by the PDF and ePUB formats created by Microsoft Word with manual corrections, all three of which were from the National and University Library (Slovenia) (1A).
• INITIAL (1)
Among the automatically generated outputs, 8 PDF outputs fully achieved this criterion: the National Library of Estonia, the National and University Library (Slovenia), the National Library of Sweden, the Slovak Centre of Scientific and Technical Information, the University Library Regensburg, the National Széchényi Library, the National Library of Portugal and the University of Tartu Library. All six outputs with additional manual corrections fully achieved this criterion.
• LANGUAGE SEGMENTS (6)
None of the automatically generated outputs achieved this criterion, but the best result among those with additional manual corrections was achieved by the ePUB format created by Microsoft Word with manual corrections from the National and University Library (Slovenia) (6A), closely followed by the PDF format created by Microsoft Word with manual corrections, also from the National and University Library (Slovenia) (5A 1B).
• MATH (SIMPLE) (3)
The best possible result (3A) among the automatically generated outputs was achieved by five PDF outputs: the National Library of Sweden, the National and University Library (Slovenia), the University Library Regensburg, the National Library of Portugal and the Nicolaus Copernicus University in Torun. All six of the outputs with additional manual corrections fully achieved this criterion.
• MATH (ADVANCED) (4)
The best result among the automatically generated outputs was achieved by the RTF format from the University of Innsbruck (1B). The best result among the outputs with additional manual corrections was achieved by the docx format from the National Library of Portugal (4A). The same result was also achieved by the RTF format from the University of Innsbruck, as well as by the PDF and ePUB formats created by Microsoft Word with manual corrections from the National and University Library (Slovenia).
• OCR ERRORS (1)
The best possible result (1A) among the automatically generated outputs was achieved by four PDF formats from the National Library of Sweden, the National Széchényi Library, the National Library of Portugal and the University of Tartu Library. All six of the outputs with additional manual corrections fully achieved this criterion.
• PAGE ROTATION (1)
Four of the automatically generated outputs used page rotation. Three were PDF outputs, from the National Library of Estonia, the Slovak Centre of Scientific and Technical Information and the University Library Regensburg; the fourth was the edited XML and TXT format from the Moravian Library. All of the outputs with additional manual corrections fully achieved this criterion, except for the PDF format with manually corrected tags in Adobe Acrobat Pro from the National and University Library (Slovenia).
On verifying whether page rotation influenced any of the other criteria, it was observed that, at least in the presented outputs, this criterion did not influence the recognised table elements, as these PDF outputs were tagged PDFs. Nor did it influence the text order. It did, however, influence OCR recognition, as all four automatically generated outputs had good OCR, whereas the other examples were not always good (see Picture 3).
Picture 3: Comparison of two OCR outputs. The first example (left) is the output when the page was rotated, and the second example (right) is the output when the page was not rotated (poor OCR output).
• PAGINATION (12)
The best result among the automatically generated outputs was achieved by the PDF formats from the University of Greifswald and the National and University Library (Slovenia) (9A). The best result among the outputs with additional manual corrections was achieved by the PDF and ePUB formats created by Microsoft Word with manual corrections by the National and University Library (Slovenia).
• PAGINATION–DOUBLE (2)
Ten of the automatically generated outputs did not split double pages. Seven were PDF outputs, from the National Library of Estonia, the National Library of Sweden, the National Széchényi Library, the University Library Regensburg, the University of Greifswald, the Nicolaus Copernicus University in Torun and the National Library of Portugal. The same result was achieved by the edited and small edited XML and TXT format from the Moravian Library and the ePUB format from the University of Greifswald. Two of the outputs with additional manual corrections also failed to split double pages: the RTF format from the University of Innsbruck and the PDF format created by Microsoft Word from the National and University Library (Slovenia).
We verified whether the pagination-double criterion influenced any of the other criteria and observed that, at least in the presented outputs, this criterion did not influence any other criteria. It did, however, influence OCR recognition, as some of the automatically generated outputs had better OCR with regard to the full title and author that were spread over a double page, but this did not work on all of the examples (see Picture 4).
Picture 4: Comparison of two OCR outputs. The first example is the output when the double page was not split in two, resulting in the entire title and author at the top. The second example is the output when the page was split in two, resulting in only part of the title and author appearing. Some other OCR differences are visible, but they are not related to the criterion pagination-double.
• PICTURE (18)
The best result among the automatically generated outputs was achieved by the PDF format from the Nicolaus Copernicus University in Torun (16A 1B) and the PDF format from the University of Greifswald (15A 2B). All of the outputs with additional manual corrections achieved the best result (18A), except for the RTF format from the University of Innsbruck.
• PICTURE (CHEM. FORMULA) (2)
The best result among the automatically generated outputs was achieved by the PDF and ePUB formats from the University of Greifswald (1A). All of the outputs with additional manual corrections achieved the best result (2A), except for the RTF format from the University of Innsbruck.
• PRIMARY LANGUAGE (1)
Two of the automatically generated outputs had primary language added to the document: the PDF format from the Nicolaus Copernicus University in Torun and the RTF format from the University of Innsbruck. All six of the outputs with additional manual corrections fully achieved this criterion.
• SPECIAL CHARACTER (3)
The best result (2A) among the automatically generated outputs was achieved by the PDF formats from the National and University Library (Slovenia), the National Library of Estonia, the National Library of Sweden, the Slovak Centre of Scientific and Technical Information, the University Library Regensburg, the National Széchényi Library, the University of Greifswald, the Nicolaus Copernicus University in Torun and the University of Tartu Library. The same result was achieved by the edited and small edited XML and TXT format from the Moravian Library, the XML and TXT format from the Olomouc Research Library and the ePUB format from the University of Greifswald. All six outputs with additional manual corrections achieved the same result for this criterion (2A).
• STAMP REMOVAL (1)
Three of the automatically generated outputs removed the stamp on a scan: the PDF format from the National Széchényi Library and the edited and small edited XML and TXT format from the Moravian Library. Four of the outputs with additional manual corrections also removed the stamp: the RTF format from the University of Innsbruck, the docx format from the National Library of Portugal, and the PDF and ePUB formats created by Microsoft Word from the National and University Library (Slovenia).
On verifying whether the removed stamp influenced OCR recognition in the area of the stamp, it was found that all three of the automatically generated outputs, as well as the outputs with manual corrections, had clean OCR with no mistakes in the paragraph concerned, in comparison to outputs in which the stamp had not been removed (see Pictures 5 and 6).
Picture 5: Comparison of two PDF outputs. The first example is the output when the stamp was removed, resulting in a clean text. The second example is the output when the stamp was not removed, which creates reading difficulties.
Picture 6: Additional comparison of the two OCR outputs.
The first example is the output when the stamp was removed, resulting in a correct text with no mistakes. The second example is the output when the stamp was not removed, resulting in mistakes in the text that cause reading difficulties (especially with speech synthesis).
• TABLE (4)
The best result among the automatically generated outputs was achieved by the PDF format from the Nicolaus Copernicus University in Torun (4A), closely followed by two examples of the PDF format from the University of Innsbruck, the PDF format from the University of Greifswald and the University of Tartu Library, and the RTF format from the University of Innsbruck, all of which achieved the result 3A. All of the outputs with additional manual corrections achieved the best result, except for the PDF format with manually corrected tags in Adobe Acrobat Pro from the National and University Library (Slovenia).
• TABLE HEADER (4)
The best result among the automatically generated outputs was achieved by the PDF format from the Nicolaus Copernicus University in Torun (3A 1B). The best result among the outputs with additional manual corrections was achieved by the docx format from the National Library of Portugal (4A). The same result was also achieved by the PDF and ePUB formats created by Microsoft Word with manual corrections and the PDF/UA format created by the latest desktop version of ABBYY FineReader from the National and University Library (Slovenia).
• TABLE ROWS (4)
The best result among the automatically generated outputs was achieved by the PDF format from the Nicolaus Copernicus University in Torun (2A 2B). All of the outputs with additional manual corrections achieved the best result, except for the PDF format with manually corrected tags in Adobe Acrobat Pro from the National and University Library (Slovenia).
• TEXT ORDER (16)
The best result among the automatically generated outputs was achieved by the PDF format from the Nicolaus Copernicus University in Torun (12A).
The best result among the outputs with additional manual corrections was achieved by the docx format from the National Library of Portugal (15A 1B). The same result was also achieved by the PDF and ePUB formats created by Microsoft Word with manual corrections from the National and University Library (Slovenia).

4 Test findings

Although manual corrections can significantly improve OCR quality, some exceptions were found, as described below. Since manual correction is time consuming, the objective is to achieve the best automatic outputs before any manual corrections are needed.
The test results show that the best outputs were achieved using PDF/UA as a delivery format or tagged PDF. These file formats dealt with all of the criteria better than the other file formats. The only shortcoming is the inability to visually adapt the content to specific needs, as described in section 1.2 Description of the report. However, some criteria were only partially met: either they were almost met or they were technically adequate but the content was not related (e.g., the alt-text field was assigned, but the content was not correct because the text corresponded to the caption, or the caption tag was assigned, but the text inside the element was not the text corresponding to the image). The alt-text criterion exemplified the most problems, as it is an element that currently requires human input.
For the blind and partially sighted, the most acceptable delivery formats are Microsoft Word files (RTF and doc), ePUBs and annotated (tagged) PDFs. There are few criteria that are as important for mobile devices as they are for the blind and partially sighted, but output that is well adapted for the blind and partially sighted is also friendlier for mobile devices. The most useful delivery file formats for mobile devices are EPUBs and Microsoft Word files (RTF and docx).
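The partially met alt-text case described above (an alt-text field that is assigned but merely repeats the caption) is one of the few content problems that can be screened for automatically. The following is a minimal sketch, not part of the EODOPEN test procedure: it assumes the delivery file is an HTML or EPUB content document that marks images with `<figure>`, `<img alt="…">` and `<figcaption>`, and the names `FigureAudit`, `audit` and `SAMPLE` are hypothetical. It only flags candidates; writing a meaningful description still requires human input.

```python
from html.parser import HTMLParser


class FigureAudit(HTMLParser):
    """Collects the alt-text and caption of every <figure> in a page."""

    def __init__(self):
        super().__init__()
        self.figures = []         # finished figures: {"alt": ..., "caption": ...}
        self._current = None      # figure currently being parsed, else None
        self._in_caption = False  # True while inside <figcaption>

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self._current = {"alt": None, "caption": ""}
        elif tag == "img" and self._current is not None:
            self._current["alt"] = dict(attrs).get("alt")
        elif tag == "figcaption" and self._current is not None:
            self._in_caption = True

    def handle_data(self, data):
        if self._in_caption and self._current is not None:
            self._current["caption"] += data

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_caption = False
        elif tag == "figure" and self._current is not None:
            self.figures.append(self._current)
            self._current = None


def normalise(text):
    """Collapse whitespace and ignore case so trivial differences do not hide a match."""
    return " ".join((text or "").split()).casefold()


def audit(markup):
    """Return (figure number, problem) pairs for figures that need a human check."""
    parser = FigureAudit()
    parser.feed(markup)
    parser.close()
    problems = []
    for number, fig in enumerate(parser.figures, start=1):
        alt = normalise(fig["alt"])
        caption = normalise(fig["caption"])
        if not alt:
            # Assumption: an empty alt is treated as missing, although it can be
            # legitimate for purely decorative images.
            problems.append((number, "missing alt-text"))
        elif alt == caption:
            problems.append((number, "alt-text repeats caption"))
    return problems


# Illustrative markup only; the figures are invented for this sketch.
SAMPLE = """
<figure><img src="a.png" alt="Figure 1: Printing press"/>
  <figcaption>Figure 1: Printing press</figcaption></figure>
<figure><img src="b.png" alt="Woodcut of a hand press with two operators"/>
  <figcaption>Figure 2: Hand press</figcaption></figure>
<figure><img src="c.png"/><figcaption>Figure 3: Type case</figcaption></figure>
"""

print(audit(SAMPLE))
# [(1, 'alt-text repeats caption'), (3, 'missing alt-text')]
```

A check of this kind would not replace the manual evaluation used in the test; it could only shorten the list of images that a person needs to review.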
Average scan qualities were deliberately used in order to test the partners’ OCR tools to the greatest possible extent. Among other factors, the structure of the scan is very important. The results may also have been different if the test sample had comprised texts from a single publication. In this case, the texts would have had the same structure, or at least a similar one (e.g., the allocation of title tags is based on the size and font of the titles).
Different scans also caused problems for some partners, as the system did not accept scans of different sizes or different systems were used for monographs and newspapers. Some partners solved this problem by importing each scan separately.
Some problems were due to specific elements in the scan, such as the table of contents in Scan 15. The OCR output was plain text, although in the case of a whole book, internal links to individual chapters would be desirable.
The results were also influenced by the complexity of the structure of the elements in the scans. Scans with a simple (one column) structure had fewer errors than those with a complex structure (multiple columns and other elements).
In addition, the testing gave rise to the following findings:
• Except for the PDF/UA format, no links are evident between format versions or standards.
• Page rotation enabled better recognition of tables.
• In the case of double pages, scans with joint title and author’s name spread over both pages were not recognised as a joint element: the texts of the title and author’s name were affected when the double pages were split in two.
• There is a need to find a method for analysing partners’ workflows and assessing the potential impact on results, and for determining how additional manual work affects the results (BNP, UIBK, NUK).
We should take into consideration the fact that delivery formats that meet the needs of the blind and partially sighted also enable a better experience for users of mobile devices.
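As a starting point for comparing automatic outputs with manually corrected ones, the result codes used in Tables 20 and 21 (A = fully achieved, B = partly achieved, nothing = not achieved) can be tallied per criterion. The sketch below is only an illustration under stated assumptions: the function name `tally` and the returned field names are hypothetical, and the "share fully achieved" figure is a convenience measure introduced here, not a metric defined in this report. The two example cells are the CAPTION (19) results quoted in section 3.2.

```python
import re

# A result cell such as "9A 3B" or "15 A 2 B": a count followed by a code letter.
CODE = re.compile(r"(\d+)\s*([AB])")


def tally(cell, total):
    """Split a result cell into fully / partly / not achieved counts.

    cell  -- text of a table cell, e.g. "9A 3B" (A = fully, B = partly achieved)
    total -- number of occurrences of the criterion in the test sample
    """
    counts = {"A": 0, "B": 0}
    for number, letter in CODE.findall(cell):
        counts[letter] += int(number)
    not_achieved = total - counts["A"] - counts["B"]
    if not_achieved < 0:
        raise ValueError(f"cell {cell!r} exceeds the criterion total {total}")
    return {
        "full": counts["A"],
        "partial": counts["B"],
        "none": not_achieved,
        # Hypothetical summary measure: fraction of occurrences fully achieved.
        "share_full": counts["A"] / total,
    }


# CAPTION (19): best automatic output (9A 3B) and best manually corrected
# output (11A 1B), as reported in section 3.2.
automatic = tally("9A 3B", 19)
corrected = tally("11A 1B", 19)

for label, result in (("automatic", automatic), ("corrected", corrected)):
    print(label, result["full"], result["partial"], result["none"])
```

For this criterion both outputs leave 7 of the 19 captions unachieved; the manual correction mainly turned partly achieved captions into fully achieved ones. Whether partly achieved results should carry any weight in a combined score is a design decision that the sketch deliberately leaves open.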
None of the EODOPEN partners produce audiobooks, so we were unable to analyse this aspect.

5 Possible solutions and recommendations

Some recommendations and solutions concerning delivery formats for mobile devices as well as for print-disabled users are presented below. There are two possible paths that libraries and other institutions can choose, depending on the users’ needs. In order to achieve the best possible user experience, we also offer a third option, i.e., the integration of both models, thereby increasing accessibility for everyone.

5.1 Solutions for mobile devices

The biggest problem with mobile devices, especially mobile phones, is the small screen size, for which non-responsive delivery file formats are not recommended. In addition, it is more difficult to search for parts of a publication on mobile devices. It is therefore recommended to enable easier navigation through the work, at least by the main chapters or by the original page numbers or other specified landmarks. It is also important to allow a format that enables at least basic visual adaptations of the text to the personal needs of the users (text background, font, text size, etc.). For better use of publications on mobile devices, we suggest minimal manual interventions in the publications themselves.
Based on the analysis of the test results, a survey among users and the reviewed literature in D11: Guidelines and Recommendations…, we recommend the following:
1. Delivery file formats that are adaptable to screens should be used (EPUB, MOBI, AZW, HTML and variations of Microsoft Word documents). These formats also enable additional functionalities, such as adding bookmarks, changing the visual appearance, etc.
2. Among the above-mentioned formats, Microsoft Word variations, EPUB and HTML are open, non-proprietary file formats, unlike MOBI and AZW. The proprietary file formats should be used only when it is known that the user has appropriate software to access the content.
3. When using PDF as a delivery file format, we recommend selecting tagged PDF or PDF/UA.
4. We suggest enabling a table of contents or structural tags that mark the headings in the publication, thus allowing navigation within the applications/programs for reading on the devices themselves. Alternatively, a page with a table of contents can be added.
5. Particular attention should be paid to “page rotation”, “pagination double” and “stamp removal”, as these criteria have been shown to improve visual appearance, as well as OCR. However, we suggest deciding on this on a case-by-case basis.
For the non-proprietary file formats PDF, PDF/UA, DOCX, RTF and EPUB, the following software was mostly used to generate the formats: ABBYY FineReader, Microsoft Office Word, Adobe InDesign, Adobe Acrobat Pro and WordToEpub. The results vary with the software and with the amount of manual work put into creating the format, so we cannot give any specific recommendation.
We should consider the fact that users of mobile devices can also be users of assistive technologies such as speech synthesis, as more and more sighted people enjoy listening to audio publications. In this case, solutions for print-disabled users should be applied in order to ensure access to the widest possible group of users.

5.2 Solutions for print-disabled users

Solutions for the blind and partially sighted, as well as other people who have problems accessing conventional print or electronic publications, are more complex and require more work, time and specialised knowledge. It is important to have access not only to publications, but also to assistive technology and, through this, to achieve a fluid flow of the text, despite complex elements and demanding page structure. In this regard, the most important criteria are: text order, OCR clean-up, primary language and language segments.
Due to linear reading, it is necessary to allow navigation to different locations in the publication (via chapters or original pages of the publication or other landmarks). There is also a need to facilitate the understanding of visual elements. Special elements (tables, table headers, captions, footnotes, hyperlinks, complex mathematical notations, etc.) should be adapted to function technically with the help of assistive technologies. Last but not least, it is also important to enable people with residual vision to visually adapt the appearance of the publication to their personal needs (enlargement of the text, background of the text, change of font, etc.).
We recommend:
1. Undertaking an OCR clean-up and fixing the text order.
2. Paying attention to text contrast – scanned PDFs usually have low contrast between text and background due to the colour of the paper.
3. Adding document language and marking segments that are in a different language.
4. Adding navigation segments for chapters, subchapters, original page number, captions, footnotes, hyperlinks, etc.
5. Adding descriptions for visual elements that contribute additional value to the surrounding text, e.g., alt-text for images, graphs, etc.
6. Fixing the structure of tables: table headers, table rows and table cells.
7. Devoting special attention to mathematical expressions. If possible, MathML and/or LaTeX should be used.
8. Specifically for the blind: using formats that do not contain visual elements and that support assistive technology, e.g., TXT or variations of Microsoft Word documents. In this case, the visual elements are not needed, but alt-text is even more crucial.
9. Specifically for the visually impaired: using formats that are adaptable to screens and that also enable other modern functions for working with the material, e.g., adding bookmarks, changing the visual appearance, etc. (variations of Microsoft Word documents, EPUB, HTML, MOBI, AZW). Consider open and non-proprietary file formats.
10. Using tagged PDF or PDF/UA when using the PDF format.
11. Testing at least one assistive technology or using a test group of blind and partially sighted people and implementing their observations in future workflows.

6 Summary

The aim of the Report on Trial Implementations for Mobile Devices and Print-Disabled Users is to help libraries and other cultural organisations to make digitised content available to a broader community. The Report is based on EODOPEN partners’ digitisation experiences at their organisations and complements the EODOPEN Project Deliverable 11: Guidelines and Recommendations for the Provision of Alternative and Special Formats, which addresses delivery formats and criteria for increasing the quality of digitisation results for users of mobile devices as well as blind and partially sighted users.
In order to find out which scanning and recognition workflows are optimal for achieving the best results in OCR, a trial implementation among EODOPEN partners was undertaken. One of the goals was to determine which file formats could be generated, as different file formats can give users different user experiences.
The test sample consisted of 16 scans in the TIFF format (see Annex 1), comprising both textual and non-textual elements, such as plain text, chapters and sub-chapters, columns, tables, footnotes, flowcharts, images and text accompanying images (captions). In order to obtain comparable results, it was decided to choose text samples in English and distribute them to all of the project partners. In addition to the scan samples, each partner received a test report questionnaire in which they described the different stages in their digitisation workflows.
For the evaluation of the results, 24 criteria were prepared. These criteria were based on WCAG and other best practice guidelines to ensure the optimal accessibility of the documents.
The criteria are: alt-text picture, alt-text picture (chemical formula), caption, footnotes, heading 1, heading 2, heading 3, initial, different language segments, mathematical formulas (simple), mathematical formulas (advanced), OCR errors (text in Picture 4 on Scan 7), page rotation, pagination, pagination-double, picture, picture (chemical formula), primary language setting, special character, stamp removal, table, table header, table rows and text order.
A total of 23 test results from 13 partner institutions were received and analysed. These included results of automatically generated outputs (17), as well as outputs that contained additional manual corrections (6). The software packages used for testing the samples were: ABBYY FineReader, ABBYY FineReader 11, ABBYY Recognition Server 4, ABBYY Recognition Server 14, ScanGate by Treventus Mechatronics, ABBYY FineReader PDF 15 Standard, ABBYY FineReader 15 desktop version, Adobe Acrobat Pro, IRIS OCR, LIMB processing, Microsoft Office Word, Scan Tailor Advanced v1.01.16, Tesseract 5.0.0-beta-20210815-22-g386dd, Photoshop 23.2.2, Project PERO OCR and WordToEpub.
The findings showed that manual corrections could significantly improve OCR quality. However, such corrections are time consuming and the focus was therefore on automatic processing. The test results showed that the best outputs were achieved using PDF/UA as the delivery format or tagged PDF. These file formats dealt with all of the criteria better than the other file formats. However, some criteria were only partially met: either they were almost met or they were technically adequate but the content was not related. The alt-text criterion exemplified the most problems, as it is an element that currently requires human input.
Formats that do not contain visual elements and support assistive technology are most suitable for the blind, such as TXT or variations of Microsoft Word documents.
For the partially sighted, formats that are adaptable to screens and enable other modern functions for working with material are most suitable, such as variations of Microsoft Word documents, ePUB, HTML, MOBI or AZW.
Another factor that should be taken into consideration is that delivery formats that meet the needs of the blind and partially sighted also enable a better experience for users of mobile devices. The most useful delivery file formats for mobile devices are ePUB, MOBI, AZW, HTML and variations of Microsoft Word documents.
Since none of the EODOPEN partners produce audiobooks, this aspect was not part of the analysis.
At the end of the report, some recommendations and solutions concerning delivery formats for mobile devices and for print-disabled users are presented.

7 Reference

Accessible document solutions. (s.a.). An Introduction to PDF Tags: The key ingredients in an accessible tagged PDF. Available at: https://accessible-docs.com/tagging-accessible-pdf/
Learn about sending documents to your Kindle library. (s.a.). Amazon. Available at: https://www.amazon.com/gp/help/customer/display.html?ref_=hp_left_v4_sib&nodeId=G5WYD9SAF7PGXRNA
Guidelines and recommendations for the provision of alternative and special formats based on the survey on special needs of users and technical requirements. (2022). EODOPEN Project Deliverable D11. Available at: https://eodopen.eu/outputs.

8 Vocabulary

ALTERNATIVE TEXT (ALT-TEXT) – Alternative text provides a textual description for non-text content (pictures, graphics, diagrams …).
ASSISTIVE TECHNOLOGIES – “… any item, piece of equipment, software program, or product system that is used to increase, maintain, or improve the functional capabilities of persons with disabilities.” (Source: ATIA, https://www.atia.org/home/at-resources/what-is-at/)
DELIVERY FILE FORMAT – the final file formats accessed by the users.
DIGITAL CONVERSION – digitisation.
DIGITISATION WORKFLOW – all the processes implemented during digitisation, from image capturing, image processing, OCR production, etc. to the conversion of the scanning file format to archival and access file formats.
EBOOK – the term eBook usually refers to born-digital publications. However, we use the term eBook especially to refer to digital publications produced as a result of digital conversion, including formats for special needs (audiobooks), which is also the aim of the EODOPEN project.
IMAGE CAPTURING – scanning.
IMAGE PROCESSING – “Image processing is a method to perform some operations on an image, in order to get an enhanced image or to extract some useful information from it.” (Source: Digital Image Processing, University of Tartu, https://sisu.ut.ee/imageprocessing/book/1).
MOBILE DEVICES – mobile phones or smartphones, laptops, and tablet computers.
PARTIALLY SIGHTED – “People who are partially sighted are not completely blind but are able to see very little.” (Source: Cambridge Dictionary, https://dictionary.cambridge.org/dictionary/english/partially-sighted). Used for: visually impaired.
PRINT DISABLED – “The term “print disabled” was coined by George Kerscher, Ph.D. around 1989 to describe persons who could not access print. He used it to refer to: A person who cannot effectively read print because of a visual, physical, perceptual, developmental, cognitive, or learning disability.” (Source: https://myblindspot.org/mbs-accessibility-defined/).
PROPRIETARY FILE FORMATS – formats that rely on specific software: the content of the file cannot be read without that software, e.g. MOBI, AZW.
RESPONSIVE FILE FORMAT – a format that enables the text to adjust to any screen size.
SCREEN READER – “Screen readers perform a text to speech role, but also allow audio-only access to the menus and other features of the delivery platform” (McNaught and Alexander, 2014).
TAGGED PDF – a PDF which contains tags for each page element and enables easier access to the document’s content with assistive technologies.
TEXT TO SPEECH – “Text to speech is a mature technology that allows text on screen to be voiced by software.” (McNaught and Alexander, 2014)
VISUALLY IMPAIRED – see partially sighted.

9 Used acronyms

EBU – European Blind Union
EODOPEN – eBooks-On-Demand-Network Opening Publications for European Netizens, a European project co-financed under the Creative Europe programme from 2019 to 2023

EODOPEN partner acronyms:
BNP – National Library of Portugal
CVTI SR – Slovak Centre of Scientific and Technical Information
MZK – Moravian Library
NCU – Nicolaus Copernicus University in Toruń
NLE – National Library of Estonia
NLS – National Library of Sweden
NUK – National and University Library
OSZK – National Széchényi Library
UG – University of Greifswald
UIBK – University of Innsbruck
UREG – University of Regensburg
UT – University of Tartu
VKOL – Research Library Olomouc

OCR – Optical Character Recognition
WCAG – Web Content Accessibility Guidelines

Acronyms for file formats:
AZW – Amazon Word
docx – Microsoft Word Open XML Format
ePUB – electronic publication
HTML – Hyper Text Markup Language
MOBI – MOBI file format (Mobipocket eBook format)
PDF – Portable Document Format
RTF – Rich Text Format

10 Annexes

Annex 1. Testing samples

Scans 1–16 (images of the sample pages)

Annex 2. Testing report questionnaire

A12 TEST REPORT (SAMPLE)
Please, add detailed information! You can also add screenshots or record the testing process.
Partner organisation: _______
Which software for image processing and OCR did you use for this sample? _____
1.
IMPORT OF SCANS IN TIFF FORMAT
Before uploading the sample files to your system, did you change anything, for instance resolution, scanning format etc.? • yes • no
If yes, what did you change? _______
2. IMAGE PROCESSING
Mark which image processing steps you used when working with the sample.
Deskewing: • automatic • manual • automatic and manual
Cropping: • automatic • manual • automatic and manual
Additional steps: • lines straightening (dewarping) • noise removal (denoising) • contrast enhancement • correction of geometric distortion • binarisation • removal of stamps, written notes • equalising the dimensions of the scans (all same size after cropping)
Other notes: _______
3. MULTILEVEL DOCUMENT ANALYSIS AND RECOGNITION OF ELEMENTS
OCR: character recognition: • English • other (add): _______
Does OCR software use machine learning? • yes • no
OCR: page segmentation – recognition of different elements. Layout segments are classified, either coarse (text, separator, image, table, …) or fine-grained (paragraph, heading, …). • only automatic recognition • additional manual corrections • we do not use it
Other notes: _______
OCR: additional work on page segmentation – layout elements. We mark: • marking paragraphs • marking columns • marking headers • marking images • marking background images • marking table
Other notes: _______
OCR: editing reading order of recognized layout elements • yes • no
Other notes: _______
OCR: additional work on recognised text • fixing OCR mistakes (wrongly recognized characters, words, decorative initial or any other mistakes) • no
Other notes: _______
4. ADDITIONAL PROCESSING
Did you use any other tools, software to enhance the quality of the results? For example: marking the final PDF with semantic tags or any other solutions. _______
5. EXPORT
Any additional comments? _______

Annex 3. Testing report questionnaires by partners

P1 - UIBK, University of Innsbruck – RTF format
A12 TEST REPORT
Please, add detailed information!
You can also add screenshots or record the testing process.
Partner organisation: University of Innsbruck
Which software for image processing and OCR did you use for this sample? ABBYY FineReader 14, Adobe InDesign
1. IMPORT OF SCANS IN TIFF FORMAT
Before uploading the sample files to your system, did you change anything, for instance resolution, scanning format etc.? • yes • no
If yes, what did you change? Converted from TIFF to JPEG with IrfanView because of error messages in ABBYY FineReader 14 for 3 files (“Möglicherweise ist die Datei beschädigt”, i.e. “The file may be damaged”)
2. IMAGE PROCESSING
Mark which image processing steps you used when working with the sample.
Deskewing: • automatic • manual • automatic and manual
Cropping: • automatic • manual • automatic and manual
Additional steps: • lines straightening (dewarping) • noise removal (denoising) • contrast enhancement • correction of geometric distortion • binarisation • removal of stamps, written notes • equalising the dimensions of the scans (all same size after cropping)
Other notes: _______
3. MULTILEVEL DOCUMENT ANALYSIS AND RECOGNITION OF ELEMENTS
OCR: character recognition: • English • other (add): _______
Does OCR software use machine learning? • yes • no
OCR: page segmentation – recognition of different elements. Layout segments are classified, either coarse (text, separator, image, table, …) or fine-grained (paragraph, heading, …). • only automatic recognition • additional manual corrections • we do not use it
Other notes: _______
OCR: additional work on page segmentation – layout elements.
We mark: • marking paragraphs • marking columns • marking headers • marking images • marking background images • marking table
Other notes: _______
OCR: editing reading order of recognized layout elements • yes • no
Other notes: _______
OCR: additional work on recognised text • fixing OCR mistakes (wrongly recognized characters, words, decorative initial or any other mistakes) • no
Other notes: additional work: add origpage, caption, alt-text
4. ADDITIONAL PROCESSING
Did you use any other tools, software to enhance the quality of the results? For example: marking the final PDF with semantic tags or any other solutions. We use ABBYY FineReader only for layout analysis, text recognition and correction. All other processes (markup of headers, adding elements such as origpage, caption, footnotes, etc.) are then carried out in Adobe InDesign. Finally, the table of contents is created in Microsoft Word.
5. EXPORT
Any additional comments? We export and deliver the file as RTF.

P1 - UIBK, University of Innsbruck – ODM workflow – PDF and RTF format
A12 TEST REPORT
Please, add detailed information! You can also add screenshots or record the testing process.
Partner organisation: University of Innsbruck
Which software for image processing and OCR did you use for this sample? ABBYY FineReader Recognition Server 4 (testing the ODM workflow)
1. IMPORT OF SCANS IN TIFF FORMAT
Before uploading the sample files to your system, did you change anything, for instance resolution, scanning format etc.? • yes • no
If yes, what did you change? Converted 3 files because there were some problems with uploading
2. IMAGE PROCESSING
Mark which image processing steps you used when working with the sample.
Deskewing: • automatic • manual • automatic and manual
Cropping: • automatic • manual • automatic and manual
Additional steps: • lines straightening (dewarping) • noise removal (denoising) • contrast enhancement • correction of geometric distortion • binarisation • removal of stamps, written notes • equalising the dimensions of the scans (all same size after cropping)
Other notes: _______
3. MULTILEVEL DOCUMENT ANALYSIS AND RECOGNITION OF ELEMENTS
OCR: character recognition: • English • other (add): _______
Does OCR software use machine learning? • yes • no
OCR: page segmentation – recognition of different elements. Layout segments are classified, either coarse (text, separator, image, table, …) or fine-grained (paragraph, heading, …). • only automatic recognition • additional manual corrections • we do not use it
Other notes: _______
OCR: additional work on page segmentation – layout elements. We mark: • marking paragraphs • marking columns • marking headers • marking images • marking background images • marking table
Other notes: _______
OCR: editing reading order of recognized layout elements • yes • no
Other notes: _______
OCR: additional work on recognised text • fixing OCR mistakes (wrongly recognized characters, words, decorative initial or any other mistakes) • no
Other notes: _______
4. ADDITIONAL PROCESSING
Did you use any other tools, software to enhance the quality of the results? For example: marking the final PDF with semantic tags or any other solutions. _______
5. EXPORT
Any additional comments? We exported the files in PDF, PDF/A, ALTO, RTF and XML.

P2 - UT, University of Tartu
A12 TEST REPORT
Please, add detailed information! You can also add screenshots or record the testing process.
Partner organisation: University of Tartu Library
Which software for image processing and OCR did you use for this sample? ABBYY FineReader PDF 15 Standard; ABBYY FineReader Server 14
1.
IMPORT OF SCANS IN TIFF FORMAT
Before uploading the sample files to your system, did you change anything, for instance resolution, scanning format etc.? • yes • no
If yes, what did you change? _______
2. IMAGE PROCESSING
Mark which image processing steps you used when working with the sample.
Deskewing: • automatic • manual • automatic and manual
Cropping: • automatic • manual • automatic and manual
Additional steps: • lines straightening (dewarping) • noise removal (denoising) • contrast enhancement • correction of geometric distortion • binarisation • removal of stamps, written notes • equalising the dimensions of the scans (all same size after cropping)
Other notes: rotated one image
3. MULTILEVEL DOCUMENT ANALYSIS AND RECOGNITION OF ELEMENTS
OCR: character recognition: • English • other (add): _______
Does OCR software use machine learning? • yes • no
OCR: page segmentation – recognition of different elements. Layout segments are classified, either coarse (text, separator, image, table, …) or fine-grained (paragraph, heading, …). • only automatic recognition • additional manual corrections • we do not use it
Other notes: _______
OCR: additional work on page segmentation – layout elements. We mark: • marking paragraphs • marking columns • marking headers • marking images • marking background images • marking table
Other notes: _______
OCR: editing reading order of recognized layout elements • yes • no
Other notes: _______
OCR: additional work on recognised text • fixing OCR mistakes (wrongly recognized characters, words, decorative initial or any other mistakes) • no
Other notes: Only done for special cases, projects – rarely.
4. ADDITIONAL PROCESSING
Did you use any other tools, software to enhance the quality of the results? For example: marking the final PDF with semantic tags or any other solutions. no
5. EXPORT
Any additional comments? _______

P3 - NUK, National and University Library – usual workflow
A12 TEST REPORT
Please, add detailed information!
You can also add screenshots or record the testing process.
Partner organisation: National and University Library (Slovenia)
Which software for image processing and OCR did you use for this sample? We use internally developed workflow software which uses the ABBYY FineReader engine for image processing and OCR.
1. IMPORT OF SCANS IN TIFF FORMAT
Before uploading the sample files to your system, did you change anything, for instance resolution, scanning format etc.? • yes • no
If yes, what did you change? _______
2. IMAGE PROCESSING
Mark which image processing steps you used when working with the sample.
Deskewing: • automatic • manual • automatic and manual
Cropping: • automatic • manual • automatic and manual
Additional steps: • lines straightening (dewarping) • noise removal (denoising) • contrast enhancement • correction of geometric distortion • binarisation • removal of stamps, written notes • equalising the dimensions of the scans (all same size after cropping)
Other notes: We usually equalise the dimensions of the scans, but in this example we did not, because the scan sizes were too diverse. Because of this, OCR didn’t function correctly.
3. MULTILEVEL DOCUMENT ANALYSIS AND RECOGNITION OF ELEMENTS
OCR: character recognition: • English • other (add): Latin
Does OCR software use machine learning? • yes • no
OCR: page segmentation – recognition of different elements. Layout segments are classified, either coarse (text, separator, image, table, …) or fine-grained (paragraph, heading, …). • only automatic recognition • additional manual corrections • we do not use it
Other notes: Automatic recognition recognizes just the columns (e.g. newspapers).
OCR: additional work on page segmentation – layout elements.
We mark: • marking paragraphs • marking columns • marking headers • marking images • marking background images • marking table
Other notes: _______
OCR: editing reading order of recognized layout elements • yes • no
Other notes: _______
OCR: additional work on recognised text • fixing OCR mistakes (wrongly recognized characters, words, decorative initial or any other mistakes) • no
Other notes: _______
4. ADDITIONAL PROCESSING
Did you use any other tools, software to enhance the quality of the results? For example: marking the final PDF with semantic tags or any other solutions. no
5. EXPORT
Any additional comments? _______

P3 - NUK, National and University Library – PDF edited with Adobe Acrobat Pro
A12 TEST REPORT
Please, add detailed information! You can also add screenshots or record the testing process.
Partner organisation: National and University Library (Slovenia)
Which software for image processing and OCR did you use for this sample? We use internally developed workflow software which uses the ABBYY FineReader engine for image processing and OCR.
1. IMPORT OF SCANS IN TIFF FORMAT
Before uploading the sample files to your system, did you change anything, for instance resolution, scanning format etc.? • yes • no
If yes, what did you change? _______
2. IMAGE PROCESSING
Mark which image processing steps you used when working with the sample.
Deskewing: • automatic • manual • automatic and manual
Cropping: • automatic • manual • automatic and manual
Additional steps: • lines straightening (dewarping) • noise removal (denoising) • contrast enhancement • correction of geometric distortion • binarisation • removal of stamps, written notes • equalising the dimensions of the scans (all same size after cropping)
Other notes: We usually equalise the dimensions of the scans, but in this example we did not, because the scan sizes were too diverse. Because of this, OCR didn’t function correctly.
3.
MULTILEVEL DOCUMENT ANALYSIS AND RECOGNITION OF ELEMENTS
OCR: character recognition: • English • other (add): Latin
Does OCR software use machine learning? • yes • no
OCR: page segmentation – recognition of different elements. Layout segments are classified, either coarse (text, separator, image, table, …) or fine-grained (paragraph, heading, …). • only automatic recognition • additional manual corrections • we do not use it
Other notes: Automatic recognition recognizes just the columns (e.g. newspapers).
OCR: additional work on page segmentation – layout elements. We mark: • marking paragraphs • marking columns • marking headers • marking images • marking background images • marking table
Other notes: _______
OCR: editing reading order of recognized layout elements • yes • no
Other notes: _______
OCR: additional work on recognised text • fixing OCR mistakes (wrongly recognized characters, words, decorative initial or any other mistakes) • no
Other notes: _______
4. ADDITIONAL PROCESSING
Did you use any other tools, software to enhance the quality of the results? For example: marking the final PDF with semantic tags or any other solutions. _______
5. EXPORT
Any additional comments? For testing purposes, the final PDF was edited using Adobe Acrobat Pro: images were added manually, the document was autotagged and the tags edited, the reading order was fixed manually, language segments were added, and the TOC on Scan 15 and the footnotes on Scan 12 were nested. Tables were turned into images, although we know this is not the correct way. We used Adobe’s accessibility check to fix any other problems (for example, the name and language of the document).

P3 - NUK, National and University Library – PDF made with ABBYY FineReader 15
A12 TEST REPORT
Please, add detailed information! You can also add screenshots or record the testing process.
Partner organisation: National and University Library (Slovenia)
Which software for image processing and OCR did you use for this sample?
For testing purposes, the ABBYY FineReader 15 desktop version was used.
1. IMPORT OF SCANS IN TIFF FORMAT
Before uploading the sample files to your system, did you change anything, for instance resolution, scanning format etc.? • yes • no
If yes, what did you change? _______
2. IMAGE PROCESSING
Mark which image processing steps you used when working with the sample.
Deskewing: • automatic • manual • automatic and manual
Cropping: • automatic • manual • automatic and manual
Additional steps: • lines straightening (dewarping) • noise removal (denoising) • contrast enhancement • correction of geometric distortion • binarisation • removal of stamps, written notes • equalising the dimensions of the scans (all same size after cropping)
Other notes: _______
3. MULTILEVEL DOCUMENT ANALYSIS AND RECOGNITION OF ELEMENTS
OCR: character recognition: • English • other (add): _______
Does OCR software use machine learning? • yes • no
OCR: page segmentation – recognition of different elements. Layout segments are classified, either coarse (text, separator, image, table, …) or fine-grained (paragraph, heading, …). • only automatic recognition • additional manual corrections • we do not use it
Other notes: _______
OCR: additional work on page segmentation – layout elements. We mark: • marking paragraphs • marking columns • marking headers • marking images • marking background images • marking table
Other notes: _______
OCR: editing reading order of recognized layout elements • yes • no
Other notes: _______
OCR: additional work on recognised text • fixing OCR mistakes (wrongly recognized characters, words, decorative initial or any other mistakes) • no
Other notes: _______
4. ADDITIONAL PROCESSING
Did you use any other tools, software to enhance the quality of the results? For example: marking the final PDF with semantic tags or any other solutions. no
5. EXPORT
Any additional comments? At export from ABBYY, we chose compliance with the PDF/A and PDF/UA standards.
Additionally, we used Adobe Acrobat Pro for autotagging, but the tags were not checked.

P3 - NUK, National and University Library – PDF and ePUB made from Word
A12 TEST REPORT
Please, add detailed information! You can also add screenshots or record the testing process.
Partner organisation: National and University Library (Slovenia)
Which software for image processing and OCR did you use for this sample? We use internally developed workflow software which uses the ABBYY FineReader engine for image processing and OCR.
1. IMPORT OF SCANS IN TIFF FORMAT
Before uploading the sample files to your system, did you change anything, for instance resolution, scanning format etc.? • yes • no
If yes, what did you change? _______
2. IMAGE PROCESSING
Mark which image processing steps you used when working with the sample.
Deskewing: • automatic • manual • automatic and manual
Cropping: • automatic • manual • automatic and manual
Additional steps: • lines straightening (dewarping) • noise removal (denoising) • contrast enhancement • correction of geometric distortion • binarisation • removal of stamps, written notes • equalising the dimensions of the scans (all same size after cropping)
Other notes: We usually equalise the dimensions of the scans, but in this example we did not, because the scan sizes were too diverse. Because of this, OCR didn’t function correctly.
3. MULTILEVEL DOCUMENT ANALYSIS AND RECOGNITION OF ELEMENTS
OCR: character recognition: • English • other (add): Latin
Does OCR software use machine learning? • yes • no
OCR: page segmentation – recognition of different elements. Layout segments are classified, either coarse (text, separator, image, table, …) or fine-grained (paragraph, heading, …). • only automatic recognition • additional manual corrections • we do not use it
Other notes: _______
OCR: additional work on page segmentation – layout elements.
We mark: • marking paragraphs • marking columns • marking headers • marking images • marking background images • marking table
Other notes: _______
OCR: editing reading order of recognized layout elements • yes • no
Other notes: _______
OCR: additional work on recognised text • fixing OCR mistakes (wrongly recognized characters, words, decorative initial or any other mistakes) • no
Other notes: _______
4. ADDITIONAL PROCESSING
Did you use any other tools, software to enhance the quality of the results? For example: marking the final PDF with semantic tags or any other solutions. no
5. EXPORT
Any additional comments? For testing purposes, the final TXT file was taken into Microsoft Office Word for further work. We added images, did a full OCR clean-up and fixed the reading order. We added structure, page numbers, footnotes, hyperlinks, and captions to images and tables. We exported as PDF/A-3a. Using Adobe Acrobat Pro, we ran an accessibility check, checked the reading order and fixed the title of the document. For ePUB, we used the workflow we use in EODOPEN for ePUB production: the clean Microsoft Word file described above was converted with the tool WordToEpub, then we used Sigil to fix mistakes and ran accessibility checks with EpubCheck and Ace by DAISY.

P4 - MZK, Moravian Library – small edited
A12 TEST REPORT
Please, add detailed information! You can also add screenshots or record the testing process.
Partner organisation: Moravian Library
Which software for image processing and OCR did you use for this sample? PROJECT PERO OCR https://pero.fit.vutbr.cz/ https://pero-ocr.fit.vutbr.cz/ https://github.com/DCGM/pero-ocr
This report is for documents in folder 01_MZK_Small edited (SE)
1. IMPORT OF SCANS IN TIFF FORMAT
Before uploading the sample files to your system, did you change anything, for instance resolution, scanning format etc.? • yes • no
If yes, what did you change? Crop images (only selected); rotate the images (only selected); resize the images (this tool accepts a maximum of
8 MB).
2. IMAGE PROCESSING
Mark which image processing steps you used when working with the sample.
Deskewing: • automatic • manual • automatic and manual
Cropping: • automatic • manual • automatic and manual
Additional steps: • lines straightening (dewarping) • noise removal (denoising) • contrast enhancement • correction of geometric distortion • binarisation • removal of stamps, written notes • equalising the dimensions of the scans (all same size after cropping)
Other notes: no further adjustments
3. MULTILEVEL DOCUMENT ANALYSIS AND RECOGNITION OF ELEMENTS
OCR: character recognition: • English • other (add): Czech Printed Model + Language Model - English Wikipedia
Does OCR software use machine learning? • yes • no
OCR: page segmentation – recognition of different elements. Layout segments are classified, either coarse (text, separator, image, table, …) or fine-grained (paragraph, heading, …). • only automatic recognition • additional manual corrections • we do not use it
Other notes: _______
OCR: additional work on page segmentation – layout elements. We mark: • marking paragraphs • marking columns • marking headers • marking images • marking background images • marking table
Other notes: _______
OCR: editing reading order of recognized layout elements • yes • no
Other notes: _______
OCR: additional work on recognised text • fixing OCR mistakes (wrongly recognized characters, words, decorative initial or any other mistakes) • no
Other notes: _______
4. ADDITIONAL PROCESSING
Did you use any other tools, software to enhance the quality of the results? For example: marking the final PDF with semantic tags or any other solutions. no
5. EXPORT
Any additional comments? Export is in PAGE and ALTO formats (plus TXT with plain text).

P4 - MZK, Moravian Library – edited
A12 TEST REPORT
Please, add detailed information! You can also add screenshots or record the testing process.
Partner organisation: Moravian Library
Which software for image processing and OCR did you use for this sample? PROJECT PERO OCR https://pero.fit.vutbr.cz/ https://pero-ocr.fit.vutbr.cz/ https://github.com/DCGM/pero-ocr
This report is for documents in folder 01_MZK_Edited (E)
1. IMPORT OF SCANS IN TIFF FORMAT
Before uploading the sample files to your system, did you change anything, for instance resolution, scanning format etc.? • yes • no
If yes, what did you change? Rotate the images (only selected); resize the images (this tool accepts a maximum of 8 MB).
2. IMAGE PROCESSING
Mark which image processing steps you used when working with the sample.
Deskewing: • automatic • manual • automatic and manual
Cropping: • automatic • manual • automatic and manual
Additional steps: • lines straightening (dewarping) • noise removal (denoising) • contrast enhancement • correction of geometric distortion • binarisation • removal of stamps, written notes • equalising the dimensions of the scans (all same size after cropping)
Other notes: no further adjustments
3. MULTILEVEL DOCUMENT ANALYSIS AND RECOGNITION OF ELEMENTS
OCR: character recognition: • English • other (add): Czech Printed Model + Language Model - English Wikipedia
Does OCR software use machine learning? • yes • no
OCR: page segmentation – recognition of different elements. Layout segments are classified, either coarse (text, separator, image, table, …) or fine-grained (paragraph, heading, …). • only automatic recognition • additional manual corrections • we do not use it
Other notes: _______
OCR: additional work on page segmentation – layout elements.
We mark: • marking paragraphs • marking columns • marking headers • marking images • marking background images • marking table
Other notes: _______
OCR: editing reading order of recognized layout elements • yes • no
Other notes: _______
OCR: additional work on recognised text • fixing OCR mistakes (wrongly recognized characters, words, decorative initial or any other mistakes) • no
Other notes: _______
4. ADDITIONAL PROCESSING
Did you use any other tools, software to enhance the quality of the results? For example: marking the final PDF with semantic tags or any other solutions. no
5. EXPORT
Any additional comments? Export is in PAGE and ALTO formats (plus TXT with plain text).

P5 - UG, University of Greifswald
A12 TEST REPORT
Please, add detailed information! You can also add screenshots or record the testing process.
Partner organisation: University of Greifswald
Which software for image processing and OCR did you use for this sample? ABBYY
1. IMPORT OF SCANS IN TIFF FORMAT
Before uploading the sample files to your system, did you change anything, for instance resolution, scanning format etc.? • yes • no
If yes, what did you change? Image frames and JPEG compression (usually done automatically by our workflow system)
2. IMAGE PROCESSING
Mark which image processing steps you used when working with the sample.
Deskewing: • automatic • manual • automatic and manual
Cropping: • automatic • manual • automatic and manual
Additional steps: • lines straightening (dewarping) • noise removal (denoising) • contrast enhancement • correction of geometric distortion • binarisation • removal of stamps, written notes • equalising the dimensions of the scans (all same size after cropping)
Other notes: no
3. MULTILEVEL DOCUMENT ANALYSIS AND RECOGNITION OF ELEMENTS
OCR: character recognition: • English • other (add): _______
Does OCR software use machine learning? • yes • no
OCR: page segmentation – recognition of different elements.
Layout segments are classified, either coarse (text, separator, image, table, …) or fine-grained (paragraph, heading, …). • only automatic recognition • additional manual corrections • we do not use it
Other notes: _______
OCR: additional work on page segmentation – layout elements. We mark: • marking paragraphs • marking columns • marking headers • marking images • marking background images • marking table
Other notes: no
OCR: editing reading order of recognized layout elements • yes • no
Other notes: _______
OCR: additional work on recognised text • fixing OCR mistakes (wrongly recognized characters, words, decorative initial or any other mistakes) • no
Other notes: _______
4. ADDITIONAL PROCESSING
Did you use any other tools, software to enhance the quality of the results? For example: marking the final PDF with semantic tags or any other solutions. Yes, in regular workflows we use the Intranda Layout Wizard: automatic deskewing, page cropping, semi-automatic separation of pages, strict quality management (no bright pictures, good contrast), but no post-processing after OCR.
5. EXPORT
Any additional comments?
- Good scans are half the battle
- Preliminary work would have to be done by marking the image areas
- Reworking the images and wrongly transcribed image areas for accessibility
- A workflow without preliminary work or reworking

UG notes about A12 (sample: notes)
ArchEmig: Lines and captions are not identified. Stains on the paper (newspaper) are identified as punctuation marks (ePUB). Columns are not recognised throughout. The continuous text jumps (TXT).
Chemestry: Formulas were not detected. Variables (Greek alphabet) are not identified. Lines and captions are not identified.
EOD-Open: Headers are missing (ePUB). Headers and pagination are not in the right positions (TXT).
Gromdzenie: Table structures are not identified (ePUB and TXT). Different fonts are identified (ePUB), with different grades of transcription. Font of recitation could not be transcribed (TXT). Problems with Fraktur and Antiqua in cursive.
Magazyn: Variables (Greek alphabet) identified as special characters. Formulas were not detected (in every format).
Eegs: Problems with Fraktur and Antiqua in cursive. Some normal characters were not detected and transcribed.
Internat-ariculture: Layout and framing are not compatible. Transcription is right in ePUB; TXT and PDF are all right.
Narrative: Upper and lower case wrong (PDF and ePUB). Lines and captions are not identified. Probably too pale a scan. Text not always translated correctly (TXT and ePUB).
Report: Translation of the tabular display distorted. Probably too pale a scan. Text not always transcribed correctly (TXT and ePUB). Upper and lower case wrong (PDF and ePUB).
UG: To be fair, it is an addendum and it is a relatively recent publication. The page is clean and the typesetting regular. The cursive font has enough spacing between letters.

- Problems with the TXT and PDF are pretty much the same
- Good scans are half the battle
- Preliminary work would have to be done by marking the image areas
- Reworking the images and wrongly transcribed image areas for accessibility

P6 - NLS, National Library of Sweden
A12 TEST REPORT
Please, add detailed information! You can also add screenshots or record the testing process.
Partner organisation: National Library of Sweden
Which software for image processing and OCR did you use for this sample? ABBYY FineReader 11
1.
IMPORT OF SCANS IN TIFF FORMAT Before uploading the sample files to your system, did you change anything, for instance resolution, scanning format etc.? • yes • no If yes, what did you change? _______ 2. IMAGE PROCESSING Mark which image processing steps you used when working with the sample. Deskewing: • automatic • manual • automatic and manual Cropping: • automatic • manual • automatic and manual Additional steps: • lines straightening (dewarping) • noise removal (denoising) • contrast enhancement • correction of geometric distortion • binarisation • removal of stamps, written notes • equalising the dimensions of the scans (all same size after cropping) Other notes: None of the above 3. MULTILEVEL DOCUMENT ANALYSIS AND RECOGNITION OF ELEMENTS OCR: character recognition: 96 • English • other (add): _______ Does OCR software use machine learning? • yes • no OCR: page segmentation – recognition of different elements. Layout segments are classified, either coarse (text, separator, image, table, …) or fine-grained (paragraph, heading, …). • only automatic recognition • additional manual corrections • we do not use it Other notes: no OCR: additional work on page segmentation – layout elements. We mark: • marking paragraphs • marking columns • marking headers • marking images • marking background images • marking table Other notes: no OCR: editing reading order of recognized layout elements • yes • no Other notes: _______ OCR: additional work on recognised text • fixing OCR mistakes (wrongly recognized characters, words, decorative initial or any other mistakes) • no Other notes: _______ 4. ADDITIONAL PROCESSING Did you use any other tools, software to enhance the quality of the results? For example: marking the final PDF with semantic tags or any other solutions. no 5. EXPORT Any additional comments? Up until now we have used ODM when digitising orders for the EODOPEN project. 
We are just about to start using a new system (LIMB Processing), and the files we uploaded were made with this system.

P7 - NCU, Nicolaus Copernicus University in Toruń
A12 TEST REPORT
Please, add detailed information! You can also add screenshots or record the testing process.
Partner organisation: Nicolaus Copernicus University in Toruń
Which software for image processing and OCR did you use for this sample?
ABBYY FineReader Server 14.0
1. IMPORT OF SCANS IN TIFF FORMAT
Before uploading the sample files to your system, did you change anything, for instance resolution, scanning format etc.?
• yes • no
If yes, what did you change? Resolution to 300 dpi
2. IMAGE PROCESSING
Mark which image processing steps you used when working with the sample.
Deskewing: • automatic • manual • automatic and manual
Cropping: • automatic • manual • automatic and manual
Additional steps: • lines straightening (dewarping) • noise removal (denoising) • contrast enhancement • correction of geometric distortion • binarisation • removal of stamps, written notes • equalising the dimensions of the scans (all same size after cropping)
Other notes: Resolution of all scans is set to 300 dpi
3. MULTILEVEL DOCUMENT ANALYSIS AND RECOGNITION OF ELEMENTS
OCR: character recognition: • English • other (add): _______
Does OCR software use machine learning? • yes • no
OCR: page segmentation – recognition of different elements.
Layout segments are classified, either coarse (text, separator, image, table, …) or fine-grained (paragraph, heading, …).
• only automatic recognition • additional manual corrections • we do not use it
Other notes: _______
OCR: additional work on page segmentation – layout elements.
We mark: • marking paragraphs • marking columns • marking headers • marking images • marking background images • marking table
Other notes: _______
OCR: editing reading order of recognized layout elements • yes • no
Other notes: _______
OCR: additional work on recognised text • fixing OCR mistakes (wrongly recognized characters, words, decorative initial or any other mistakes) • no
Other notes: _______
4. ADDITIONAL PROCESSING
Did you use any other tools, software to enhance the quality of the results? For example: marking the final PDF with semantic tags or any other solutions.
no
5. EXPORT
Any additional comments?
We have had some problems with the software since the last update. PDFs are linearized before publication.

P9 - VKOL, Research Library Olomouc
A12 TEST REPORT
Please, add detailed information! You can also add screenshots or record the testing process.
Partner organisation: Research Library Olomouc
Which software for image processing and OCR did you use for this sample?
ScanTailor Advanced v1.01.16
Tesseract 5.0.0-beta-20210815-22-g386dd
1. IMPORT OF SCANS IN TIFF FORMAT
Before uploading the sample files to your system, did you change anything, for instance resolution, scanning format etc.?
• yes • no
If yes, what did you change? _______
2. IMAGE PROCESSING
Mark which image processing steps you used when working with the sample.
Deskewing: • automatic • manual • automatic and manual
Cropping: • automatic • manual • automatic and manual
Additional steps: • lines straightening (dewarping) • noise removal (denoising) • contrast enhancement • correction of geometric distortion • binarisation • removal of stamps, written notes • equalising the dimensions of the scans (all same size after cropping)
Other notes: _______
3. MULTILEVEL DOCUMENT ANALYSIS AND RECOGNITION OF ELEMENTS
OCR: character recognition: • English • other (add): _______
Does OCR software use machine learning? • yes • no
OCR: page segmentation – recognition of different elements.
Layout segments are classified, either coarse (text, separator, image, table, …) or fine-grained (paragraph, heading, …).
• only automatic recognition • additional manual corrections • we do not use it
Other notes: _______
OCR: additional work on page segmentation – layout elements.
We mark: • marking paragraphs • marking columns • marking headers • marking images • marking background images • marking table
Other notes: _______
OCR: editing reading order of recognized layout elements • yes • no
Other notes: _______
OCR: additional work on recognised text • fixing OCR mistakes (wrongly recognized characters, words, decorative initial or any other mistakes) • no
Other notes: _______
4. ADDITIONAL PROCESSING
Did you use any other tools, software to enhance the quality of the results? For example: marking the final PDF with semantic tags or any other solutions.
_______
5. EXPORT
Any additional comments?
_______

P10 - BNP, National Library of Portugal – PDF
A12 TEST REPORT
Please, add detailed information! You can also add screenshots or record the testing process.
Partner organisation: National Library of Portugal
Which software for image processing and OCR did you use for this sample?
LIMB Processing and IRIS OCR
1. IMPORT OF SCANS IN TIFF FORMAT
Before uploading the sample files to your system, did you change anything, for instance resolution, scanning format etc.?
• yes • no
If yes, what did you change? _______
2. IMAGE PROCESSING
Mark which image processing steps you used when working with the sample.
Deskewing: • automatic • manual • automatic and manual
Cropping: • automatic • manual • automatic and manual
Additional steps: • lines straightening (dewarping) • noise removal (denoising) • contrast enhancement • correction of geometric distortion • binarisation • removal of stamps, written notes • equalising the dimensions of the scans (all same size after cropping)
Other notes: _______
3. MULTILEVEL DOCUMENT ANALYSIS AND RECOGNITION OF ELEMENTS
OCR: character recognition: • English • other (add): _______
Does OCR software use machine learning? • yes • no
OCR: page segmentation – recognition of different elements.
Layout segments are classified, either coarse (text, separator, image, table, …) or fine-grained (paragraph, heading, …).
• only automatic recognition • additional manual corrections • we do not use it
Other notes: _______
OCR: additional work on page segmentation – layout elements.
We mark: • marking paragraphs • marking columns • marking headers • marking images • marking background images • marking table
Other notes: _______
OCR: editing reading order of recognized layout elements • yes • no
Other notes: _______
OCR: additional work on recognised text • fixing OCR mistakes (wrongly recognized characters, words, decorative initial or any other mistakes) • no
Other notes: _______
4. ADDITIONAL PROCESSING
Did you use any other tools, software to enhance the quality of the results? For example: marking the final PDF with semantic tags or any other solutions.
no
5. EXPORT
Any additional comments?
_______

P10 - BNP, National Library of Portugal – docx
A12 TEST REPORT
Please, add detailed information! You can also add screenshots or record the testing process.
Partner organisation: National Library of Portugal
Which software for image processing and OCR did you use for this sample?
LIMB Processing and IRIS OCR
1. IMPORT OF SCANS IN TIFF FORMAT
Before uploading the sample files to your system, did you change anything, for instance resolution, scanning format etc.?
• yes • no
If yes, what did you change? _______
2. IMAGE PROCESSING
Mark which image processing steps you used when working with the sample.
Deskewing: • automatic • manual • automatic and manual
Cropping: • automatic • manual • automatic and manual
Additional steps: • lines straightening (dewarping) • noise removal (denoising) • contrast enhancement • correction of geometric distortion • binarisation • removal of stamps, written notes • equalising the dimensions of the scans (all same size after cropping)
Other notes: _______
3. MULTILEVEL DOCUMENT ANALYSIS AND RECOGNITION OF ELEMENTS
OCR: character recognition: • English • other (add): _______
Does OCR software use machine learning? • yes • no
OCR: page segmentation – recognition of different elements.
Layout segments are classified, either coarse (text, separator, image, table, …) or fine-grained (paragraph, heading, …).
• only automatic recognition • additional manual corrections • we do not use it
Other notes: _______
OCR: additional work on page segmentation – layout elements.
We mark: • marking paragraphs • marking columns • marking headers • marking images • marking background images • marking table
Other notes: _______
OCR: editing reading order of recognized layout elements • yes • no
Other notes: _______
OCR: additional work on recognised text • fixing OCR mistakes (wrongly recognized characters, words, decorative initial or any other mistakes) • no
Other notes: _______
4. ADDITIONAL PROCESSING
Did you use any other tools, software to enhance the quality of the results? For example: marking the final PDF with semantic tags or any other solutions.
_______
5. EXPORT
Any additional comments?
_______

P11 - NLE, National Library of Estonia
A12 TEST REPORT
Please, add detailed information! You can also add screenshots or record the testing process.
Partner organisation: National Library of Estonia
Which software for image processing and OCR did you use for this sample?
For book files: ABBYY FineReader 11 for image processing and ABBYY Recognition Server 4 for OCR.
For newspapers/periodicals: ABBYY FineReader 11 and CCS docWorks 7.1.0.90 for image processing and the ABBYY 12 OCR engine.
1. IMPORT OF SCANS IN TIFF FORMAT
Before uploading the sample files to your system, did you change anything, for instance resolution, scanning format etc.?
• yes • no
If yes, what did you change? _______
2. IMAGE PROCESSING
Mark which image processing steps you used when working with the sample.
Deskewing: • automatic • manual • automatic and manual
Cropping: • automatic • manual • automatic and manual
Additional steps: • lines straightening (dewarping) • noise removal (denoising) • contrast enhancement • correction of geometric distortion • binarisation • removal of stamps, written notes • equalising the dimensions of the scans (all same size after cropping)
Other notes: _______
3. MULTILEVEL DOCUMENT ANALYSIS AND RECOGNITION OF ELEMENTS
OCR: character recognition: • English • other (add): _______
Does OCR software use machine learning? • yes • no
OCR: page segmentation – recognition of different elements.
Layout segments are classified, either coarse (text, separator, image, table, …) or fine-grained (paragraph, heading, …).
• only automatic recognition • additional manual corrections • we do not use it
Other notes: At the moment we have two different workflows, for 1) books and 2) newspapers/periodicals. We currently do segmentation (with the CCS docWorks software) only on periodicals; book files, as most of the sample files were, are just deskewed, cropped and OCRed. There is a plan to switch books to the segmentation workflow as well within the next 1-2 years.
OCR: additional work on page segmentation – layout elements.
We mark: • marking paragraphs • marking columns • marking headers • marking images • marking background images • marking table
Other notes: _______
OCR: editing reading order of recognized layout elements • yes • no
Other notes: _______
OCR: additional work on recognised text • fixing OCR mistakes (wrongly recognized characters, words, decorative initial or any other mistakes) • no
Other notes: We fix OCR mistakes only in periodicals' headlines, captions and author names, and seldom in text blocks.
4. ADDITIONAL PROCESSING
Did you use any other tools, software to enhance the quality of the results? For example: marking the final PDF with semantic tags or any other solutions.
no
5. EXPORT
Any additional comments?
Export files also differ at the moment depending on the type of item: book files (as most of the testing samples here were) are OCRed PDFs for the user, while segmented newspapers/periodicals are JPEG 2000 and PDF (1 sample file). As we have two different workflows for books and periodicals, we also have two different portals for them. This is going to change in the near future, as we will soon start implementing a new archival system; all materials will then go through segmentation and, hopefully, into the same portal. There are a lot of changes ahead of us in this field :)

P12 - OSZK, National Széchényi Library
A12 TEST REPORT
Please, add detailed information! You can also add screenshots or record the testing process.
Partner organisation: National Széchényi Library
Which software for image processing and OCR did you use for this sample?
Samples 1-6, 8-11, 14-15: ScanTailor Advanced (1.0.16), Photoshop (v 23.2.2), ABBYY Recognition Server 4.0
Samples 7, 13, 16: Photoshop (v 23.2.2), ABBYY Recognition Server 4.0
Sample 12: ScanTailor Advanced (1.0.16), ABBYY Recognition Server 4.0
1. IMPORT OF SCANS IN TIFF FORMAT
Before uploading the sample files to your system, did you change anything, for instance resolution, scanning format etc.?
• yes (15) • no (others)
If yes, what did you change? _______
2. IMAGE PROCESSING
Mark which image processing steps you used when working with the sample.
Deskewing: • automatic (1-5, 8-12, 14-15) • manual (6-7, 13, 16) • automatic and manual
Cropping: • automatic • manual • automatic and manual
Additional steps: • lines straightening (dewarping) (12) • noise removal (denoising) • contrast enhancement • correction of geometric distortion • binarisation (1-6, 8, 12, 15) • removal of stamps, written notes (1-5, 8) • equalising the dimensions of the scans (all same size after cropping)
Other notes:
1-5: “Mixed” output was chosen while processing the samples with Scan Tailor: the textual content was selected and binarised, while the illustrations remained in colour mode. During binarisation we thickened the letters to make it easier for the OCR algorithm to recognise the characters.
6: Converting to grayscale, increasing contrast and adjusting levels using Photoshop.
7: “Smart” sharpening, adjusting levels.
9: “Smart” sharpening, Neural Filters, removal of negative visual effects caused by JPEG compression (middle), adjusting levels.
10-11: “Smart” sharpening, Neural Filters, removal of negative visual effects caused by JPEG compression (middle), adjusting levels.
12: Automatic dewarping.
13: “Smart” sharpening (with noise removal), Neural Filters, removal of negative visual effects caused by JPEG compression (middle), adjusting levels.
14: “Mixed” output was chosen while processing the samples with Scan Tailor: the textual content was selected and binarised, while the illustrations remained in colour mode. After Scan Tailor processing we adjusted the levels using Photoshop.
16: “Smart” sharpening, Neural Filters, removal of negative visual effects caused by JPEG compression (middle), converting to grayscale, adjusting levels.
3. MULTILEVEL DOCUMENT ANALYSIS AND RECOGNITION OF ELEMENTS
OCR: character recognition: • English (1-5, 7-16) • other (add): Polish (6)
Does OCR software use machine learning? • yes • no
OCR: page segmentation – recognition of different elements.
Layout segments are classified, either coarse (text, separator, image, table, …) or fine-grained (paragraph, heading, …).
• only automatic recognition (1-6, 8, 12-16) • additional manual corrections (7) • we do not use it (9-11)
Other notes: 7: The automatic recognition skipped the text on page 5, so we selected it manually after the automatic processing.
OCR: additional work on page segmentation – layout elements.
We mark: • marking paragraphs • marking columns • marking headers • marking images • marking background images • marking table
Other notes: _______
OCR: editing reading order of recognized layout elements • yes • no
Other notes: _______
OCR: additional work on recognised text • fixing OCR mistakes (wrongly recognized characters, words, decorative initial or any other mistakes) • no
Other notes: _______
4. ADDITIONAL PROCESSING
Did you use any other tools, software to enhance the quality of the results? For example: marking the final PDF with semantic tags or any other solutions.
no
5. EXPORT
Any additional comments?
_______

P13 - CVTI SR, Slovak Centre of Scientific and Technical Information
A12 TEST REPORT
Please, add detailed information! You can also add screenshots or record the testing process.
Partner organisation: Slovak Centre of Scientific and Technical Information
Which software for image processing and OCR did you use for this sample?
ScanGate by Treventus Mechatronics for image post-processing
ABBYY Recognition Server 4.0 for OCR
1. IMPORT OF SCANS IN TIFF FORMAT
Before uploading the sample files to your system, did you change anything, for instance resolution, scanning format etc.?
• yes • no
If yes, what did you change? _______
2. IMAGE PROCESSING
Mark which image processing steps you used when working with the sample.
Deskewing: • automatic • manual • automatic and manual
Cropping: • automatic • manual • automatic and manual
Additional steps: • lines straightening (dewarping) • noise removal (denoising) • contrast enhancement • correction of geometric distortion • binarisation • removal of stamps, written notes • equalising the dimensions of the scans (all same size after cropping)
Other notes: For some pictures: background normalisation, unsharp masking. We normally equalise the dimensions of the scans, but we did not do so for this sample because the image sizes differed too much.
3. MULTILEVEL DOCUMENT ANALYSIS AND RECOGNITION OF ELEMENTS
OCR: character recognition: • English • other (add): German, German (new spelling), Slovak, Czech
Does OCR software use machine learning? • yes • no
OCR: page segmentation – recognition of different elements.
Layout segments are classified, either coarse (text, separator, image, table, …) or fine-grained (paragraph, heading, …).
• only automatic recognition • additional manual corrections • we do not use it
Other notes: _______
OCR: additional work on page segmentation – layout elements.
We mark: • marking paragraphs • marking columns • marking headers • marking images • marking background images • marking table
Other notes: _______
OCR: editing reading order of recognized layout elements • yes • no
Other notes: _______
OCR: additional work on recognised text • fixing OCR mistakes (wrongly recognized characters, words, decorative initial or any other mistakes) • no
Other notes: _______
4. ADDITIONAL PROCESSING
Did you use any other tools, software to enhance the quality of the results? For example: marking the final PDF with semantic tags or any other solutions.
no
5. EXPORT
Any additional comments?
We produce two PDF outputs. One is only for the long-term archive.
The second PDF output is for use in the digital library (mostly with a smaller file size, after some conversion).

P14 - UREG, University of Regensburg
A12 TEST REPORT
Please, add detailed information! You can also add screenshots or record the testing process.
Partner organisation: University Library of Regensburg
Which software for image processing and OCR did you use for this sample?
ABBYY Recognition Server 4.0
1. IMPORT OF SCANS IN TIFF FORMAT
Before uploading the sample files to your system, did you change anything, for instance resolution, scanning format etc.?
• yes • no
If yes, what did you change? _______
2. IMAGE PROCESSING
Mark which image processing steps you used when working with the sample.
Deskewing: • automatic • manual • automatic and manual
Cropping: • automatic • manual • automatic and manual
Additional steps: • lines straightening (dewarping) • noise removal (denoising) • contrast enhancement • correction of geometric distortion • binarisation • removal of stamps, written notes • equalising the dimensions of the scans (all same size after cropping)
Other notes: _______
3. MULTILEVEL DOCUMENT ANALYSIS AND RECOGNITION OF ELEMENTS
OCR: character recognition: • English • other (add): _______
Does OCR software use machine learning? • yes • no
OCR: page segmentation – recognition of different elements.
Layout segments are classified, either coarse (text, separator, image, table, …) or fine-grained (paragraph, heading, …).
• only automatic recognition • additional manual corrections • we do not use it
Other notes: _______
OCR: additional work on page segmentation – layout elements.
We mark: • marking paragraphs • marking columns • marking headers • marking images • marking background images • marking table
Other notes: _______
OCR: editing reading order of recognized layout elements • yes • no
Other notes: _______
OCR: additional work on recognised text • fixing OCR mistakes (wrongly recognized characters, words, decorative initial or any other mistakes) • no
Other notes: _______
4. ADDITIONAL PROCESSING
Did you use any other tools, software to enhance the quality of the results? For example: marking the final PDF with semantic tags or any other solutions.
_______
5. EXPORT
Any additional comments?
The exported formats are XML, Text and PDF containing the recognized text. The last is served to the users.