Miroslav Milovanović1

ZERO SHOT CLASSIFICATION FOR UNSTRUCTURED TEXT OF ARCHIVAL VALUE

Abstract

Purpose: The purpose of the article is to investigate whether artificial intelligence and, more specifically, machine learning can provide solutions that ease some of the archival tasks involved in the classification of unstructured texts of archival value. The research focused on how to approach one specific archival task, the content classification of unstructured texts.

Method/approach: The methods of content analysis and experiment were used. Different approaches to classifying unstructured text with machine learning were investigated, and an experiment was conducted to test some of the most prominent technological solutions currently available.

Results: The research showed that using machine learning to classify unstructured text of archival value is achievable and effective.

Conclusion: The approach used in the research, with its method and technology, is mature, manageable and available for carrying out the archival task of classifying unstructured text where needed. Zero shot classification provides a suitable path for solving classification problems for unstructured texts of archival value where pre-labelled data for the supervised approach to creating a classification model is not available.

Key words: Machine learning, unstructured text, classification, zero shot classification, description.

1 Miroslav Milovanović, PhD student of Archival Sciences at Alma Mater Europaea University, Slovenia, e-mail: miroslav.milovanovic1@almamater.si.

INTRODUCTION

One of the biggest problems in dealing with unstructured texts which could have archival value is how to handle the ever-growing amount of individual unstructured records using archival practice.
While certain technologies and methods exist for executing some of the individual tasks involved in dealing with unstructured texts, it is still hard to find or develop an organised approach, in the form of a model or guideline, for completing archival tasks such as arranging unstructured texts or providing an archival description for individual unstructured texts. Novak (2019) points out that using modern information technology in professional archival work requires many ad hoc skills that are currently hard to acquire through formally established education and training.

Unstructured text has no predefined structure, such as a format or data model, and can therefore represent any form of record containing any kind of information. Because this type of data is not organised in a predetermined way, it is more difficult to process and analyse with traditional methods (OpenText Corporation, 2024).

In recent years many organisations have started digital transformation processes, resulting in the creation of a large number of digital records. The exponential growth in the number of digital records exposes numerous problems in dealing with those records, such as the aforementioned arrangement of unstructured texts or the provision of an archival description for an individual unstructured text. Burgener and Rydning (2022) project that unstructured data will account for up to 90% of the digital records created each year, and that this proportion will only keep increasing.

There are several approaches available for handling unstructured data, but certain challenges should be addressed and taken into consideration when applying them (OpenText Corporation, 2024; Baig, 2023):

- Accessibility and usability of unstructured data: The rapid evolution of information technologies and diverse formats may impact the readability of data, posing a challenge in maintaining its usefulness for subsequent processing.
- Efficient handling of vast data volumes: Managing the exponential growth of unstructured data poses the challenge of processing and capturing information promptly to prevent potential losses.
- Complex indexing and classification: The diverse forms and unknown contents of records make indexing and classification a demanding process prone to errors, significantly affecting the quality of the obtained results.
- Security challenges: Safeguarding confidential data during processing becomes intricate, as this information can swiftly proliferate across diverse record formats and storage locations, leading to difficulties in identifying sensitive content.
- Support for diverse record formats: Unstructured data lacks predetermined standard record formats, complicating data processing by requiring versatile solutions to handle various types of formats effectively.
- Requirement for specialised resources and expertise: Unstructured data constitutes the majority of material created today, necessitating robust hardware for efficient processing and skilled personnel, often referred to as "data scientists", capable of devising appropriate solutions for handling unstructured data.
- Considerable expense in establishing unstructured data processing systems: Beyond hardware and human resources, the cost of additional components, including specialised software, data storage equipment, and measures related to information security, must be considered when setting up systems for processing unstructured data.

ARTIFICIAL INTELLIGENCE AND ARCHIVING

There are several definitions of the concept of artificial intelligence but, generally, artificial intelligence can be defined as "a science whose goal is to make a machine that will do things which require human intelligence" (Balič, 2004) and as a system that can design or execute independently, without human intervention (Barredo et al., 2020). Klasinc (2023) defines the use of artificial intelligence in archival science as collective solutions "that help in generating and managing archival content, context and other relations established in archival material". There are several advantages and disadvantages to using artificial intelligence. Some of the advantages that can help in the field of archiving are "task automation", where faster task execution provides a solution for mass data handling; no overload or stress when executing tasks; the ability to perform several tasks at the same time; low costs in relation to the work undertaken; and the possibility of discovering relations, connections and patterns in previously unknown content (Khanzode & Sarode, 2020; Bhosale, 2020). While there are some clear advantages to using artificial intelligence, there are also disadvantages that need to be addressed when deciding whether such a system is appropriate: potentially significant inaccuracy owing to errors when executing tasks; dependency on, and the subjectivity of, the rules designed by the system architect (creativity and vision); potentially high development and implementation costs; dependence on specific technology; impact on the need for human resources; and potential abuse and unethical use (Khanzode & Sarode, 2020; Bhosale, 2020).

ARTIFICIAL INTELLIGENCE AND ETHICS

Ethics within the field of artificial intelligence primarily involves assessing the implications and potential outcomes associated with the development and deployment of artificial intelligence (Boddington, 2023). One significant consideration is how the advancement of artificial intelligence may affect the demand for human labour and influence the nature of work (Kumar et al., 2021). The archivist's discretionary judgement, as the person who appraises archival material and subsequently carries out the archival tasks, is one of the most important guarantees of quality in the long-term preservation of archival material. Even if artificial intelligence is used for such tasks, the human factor (the archivist) is still needed to develop the expert systems initially and later to evaluate their execution and the results gathered.

Several topics should be taken into consideration when developing solutions that include artificial intelligence for the execution of archival tasks, including how to handle or manage privacy, responsibility, trust, continuity and sustainability, dignity, solidarity, transparency and availability, freedom and autonomy to make decisions, and the provision of guidelines to enforce harmless execution (Boddington, 2023).
MACHINE LEARNING

Machine learning is a specific application of artificial intelligence. According to Murty and Avinash (2023), machine learning is considered a mature, dynamic and crucial field that has evolved over more than six decades. The accelerated expansion of machine learning can be attributed, in part, to the recent surge in the availability of machine-processable data and to the improvement and accessibility of hardware capable of efficiently processing substantial amounts of data.

At its core, machine learning relies on the processing of data to facilitate learning, making the format of the data a crucial factor. Data can be broadly categorised into structured, unstructured, semi-structured, and descriptive data or metadata. Furthermore, the quality and appropriateness of the input data play a vital role, influencing the approach and the expected outcomes. Inadequate data, or data containing extraneous information, can significantly affect the results (see Caliskan et al., 2017).

There are many approaches to the use of machine learning, four of which are the most commonly used today (Sarker, 2021): supervised learning, which relies on processing the relationship between input and output data, utilising pre-processed data which is readily available for learning and training; unsupervised learning, which involves data processing without prior manipulation or human intervention; semi-supervised learning, which combines elements of both supervised and unsupervised approaches, utilising both pre-processed and non-pre-processed data; and reinforcement learning, which facilitates the automatic assessment of optimal behaviour within a given context or environment in order to enhance performance.

NATURAL LANGUAGE PROCESSING

One of the domains where machine learning is extensively used is natural language processing (NLP). Eisenstein (2018) draws a direct comparison between natural language processing and the term "computational linguistics". Despite a significant overlap, a distinction persists between the two: "computational linguistics" primarily emphasises linguistics, with various forms of computer processing playing a supporting role, while natural language processing focuses on the design and analysis of computer algorithms and approaches for processing natural human language.

The primary objective of natural language processing is to provide new computing capabilities related to human language, including tasks such as extracting information from texts, language translation, question answering and engaging in conversation. Khurana et al. (2022) define natural language processing as a branch of artificial intelligence and linguistics dedicated to enabling computers to comprehend statements or words written in human languages. Natural language processing was developed to simplify user tasks and to fulfil the desire to communicate with computers in natural language.
Additionally, the authors categorise the field of natural language processing into natural language understanding and natural language generation. Some of the most common uses of natural language processing with machine learning include (McMullen, 2023): text summarisation, automated chat rooms, machine translation, text classification, question answering, named entity recognition, natural language generation, word sense disambiguation, sentiment analysis, speech recognition and entity linking.

RESEARCH

Managing extensive digital content consisting of unknown and unstructured data, with regard to identifying potential archival value, poses a significant challenge. Providing swift, efficient and accurate handling of content through archival tasks such as description or arrangement becomes problematic, particularly when attempting to classify, edit or list vast quantities of digital material; when this is not undertaken immediately, it may lead to a poor preservation process and inferior quality of the records retained (Popovici, 2022).

When dealing with unstructured and content-ambiguous records, it is likely that many records are not worth retaining or preserving (Moss and Gollins, 2017). It is therefore essential to devise an approach for identifying and preserving records of archival value (Grigory, 2023) through content classification and evaluation. This involves ensuring the accessibility and usability of the retained material while distinguishing it from what should be discarded.

The research presented here outlines the zero shot classification (ZSC) approach to classifying unstructured texts of archival value using machine learning. The object of the research was also to determine how ZSC compares to the more guided approach of building a decision model through a learning process, and whether it is possible to create an implementation model for the processing of unstructured texts. The example directly covered in the research was text classification.

ZERO SHOT CLASSIFICATION

Zero shot classification is an approach that achieves classification by predicting classes that were not part of the initial training of the model (Hugging Face, s. d. b). ZSC still uses a pre-trained model to achieve its goal, but the aim is to provide an approach that can carry out a classification task where training data for supervised classification is scarce. One problem in particular can present a serious obstacle to building a dedicated classification model: the lack of a quality labelled data set for training, especially in the case of supervised classification. This is even more pronounced when dealing with multilanguage content. Generally, ZSC relies on learning to recognise a layer of semantic attributes while building the model, which can then be used to identify classes that were not visible during the training of the pre-trained model (Alcoforado et al., 2022). There are several practical cases where ZSC can be used, given that there are no prerequisites other than the pre-trained model, such as categorisation or topic classification, intent identification, sentiment analysis and even image classification.
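To make the mechanism concrete, the following minimal sketch shows how a zero shot classifier built on an NLI-style pre-trained model is typically invoked with the Hugging Face transformers library; the sample text and candidate labels are hypothetical illustrations and are not taken from the research data.

```python
from transformers import pipeline

# Minimal zero shot classification sketch using an NLI-based pre-trained model.
# The sample text and the candidate labels below are illustrative only.
classifier = pipeline("zero-shot-classification", model="roberta-large-mnli")

text = "The council approved additional funding for the restoration of the city archive."
candidate_labels = ["culture", "finance", "health"]  # classes never seen during training

result = classifier(text, candidate_labels=candidate_labels)

# The classifier returns the candidate labels ranked by entailment score;
# the first label is the predicted class for the record.
print(result["labels"][0], round(result["scores"][0], 3))
```

Because the candidate labels are supplied only at prediction time, the same pre-trained model can be reused for different classification schemes without any additional training.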
Some of the advantages that ZSC could bring to the aforementioned tasks relate mostly to process optimisation, such as the time required to complete the tasks, flexibility with regard to the inclusion of new material, and independence from the form of the data. There are also disadvantages that could affect the decision to use ZSC for archival tasks, such as the quality of the pre-trained model and of the associated class descriptions, and extreme variation between the content and classes used to pre-train the model and the data intended for zero shot classification. One of the biggest drawbacks of using ZSC is also the difficulty of providing a tangible evaluation process for measuring the performance of zero-shot learning, as there are no pre-existing labels that could provide any kind of quantification, as there are in supervised classification approaches (Xian et al., 2020).

Since ZSC is not a supervised approach, there are also concerns about using it for tasks that involve ethical considerations. Because ZSC relies heavily on the pre-trained model, certain points need to be taken into consideration, such as "content subjectivity", where an unsuitable class description process fails to capture the relationship between model decisions and the intended classification design; "biased decisions", for the same reason of unsuitable class descriptions; "misclassification" resulting from a failure to understand context; and "privacy concerns" (Van Otten, 2023).

An inherent limitation of the ZSC approach can also be its narrow focus when declaring the hypothesis used to implement the classification. If the approach fails to detect the specific semantics it is initially directed to identify, it is unlikely to retrieve them later. This underlines the importance of consistency and clear instructions in the criteria used to create the hypothesis. Any alteration of its concise meaning must be carefully considered to avoid disrupting the coherence of what the classifier needs to search for and how it should understand the similarities between classes. Machine learning and ZSC can provide the basis for autonomously established rules for assessing content relevance, given a suitable pre-trained model. This facilitates the identification and segregation of material with potential archival significance.

EXPERIMENT

The purpose of the research was to assess the efficiency of ZSC on individual text records when compared to the more controlled, supervised approach to text classification.

The data used to execute the ZSC consisted of publicly available unstructured news articles in the English language, aggregated in machine-readable textual form.

The following architecture was used to test the zero shot classification approach and to compare it with a variant of the supervised classification approach:

- 1000 individual records, each record containing one news article (Guardian Media Group, 2023); all individual records were pre-labelled under three sections or topics: "government", "business" and "sports".
- For Zero Shot Classification, the pre-trained model "Roberta-large-mnli" (Hugging Face, s. d. a) was used.
- For the comparison classification using the supervised approach, the pre-trained model "Bert-base-cased" (Hugging Face, s. d.) was used.

The workflow for executing the classification was divided into two parts: the first was the Zero Shot Classification approach, which directly executed predictions, and the second was the supervised approach, which included an additional learning step and, subsequently, the creation of a decision model to make predictions. Since the ZSC method does not include a learning step, evaluating its efficiency would otherwise be difficult, so for evaluation purposes the data was pre-labelled with the three topics. An illustrative code sketch of an equivalent set-up is given after the results below.

The Zero Shot Classifier parametrisation, with the use of a pre-trained model, was as follows:
- 1000 individual records as input for executing the Zero Shot Classification.
- A classifier was executed on each individual record.
- The content of each individual record was in its original, unstructured form.
- The candidate labels of the expected classes were "government", "business" and "sports".
- A custom hypothesis was used with the following narrative: "The topic of this content is {}".
- Batch size: 10.
- 1000 individual records as output, with predictions assigned to individual records in respect of the pre-labelled data.
- Analysis and quantified distances (efficiency) between pre-labelled classes and predicted classes.

The supervised classification approach, using a pre-trained model, was parametrised as follows:
- 1000 individual records as input for executing the learning process and creating a decision model for classification.
- A classifier was executed on each individual record.
- The content of each individual record was in its original, unstructured form.
- The classification learner included data (individual records) pre-labelled with one of the following classes (labels): "government", "business" and "sports".
- Maximum sequence length: 512.
- Number of epochs: 8.
- Batch size: 32.
- Validation batch size: 20.
- Optimiser: Adam; learning rate: 0.001.
- 1000 individual records as output, with predictions assigned to individual records in respect of the pre-labelled data.
- Analysis and degree of difference (efficiency) between pre-labelled classes and predicted classes.

RESULTS

Zero Shot Classification

The Zero Shot Classification result was as follows:
- 71% accuracy in respect of the pre-labelled data.
- 0.57 Cohen's Kappa.

Figure 1: Confusion matrix ZSC (Knime, 2024)

Supervised classification approach

The supervised classification result was as follows:
- 97% accuracy in respect of the pre-labelled data.
- 0.96 Cohen's Kappa.

Figure 2: Confusion matrix supervised classification (Knime, 2024)
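The experiment itself was run as a Knime workflow (Knime, 2024). As a purely illustrative aid, the sketch below shows how an equivalent zero shot run and its evaluation could be expressed in Python using the Hugging Face transformers pipeline and scikit-learn metrics with the parameters listed above; the record list and gold labels are placeholders standing in for the 1000 pre-labelled Guardian articles, so this is a sketch under those assumptions rather than the workflow actually used.

```python
from transformers import pipeline
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

# Placeholders: in the experiment these were 1000 Guardian articles together
# with their pre-assigned section labels ("government", "business", "sports").
records = ["<article text 1>", "<article text 2>"]   # unstructured article bodies
gold_labels = ["government", "business"]              # pre-labelled classes

candidate_labels = ["government", "business", "sports"]

# Zero shot classification with the pre-trained NLI model and the custom hypothesis.
classifier = pipeline("zero-shot-classification", model="roberta-large-mnli")
outputs = classifier(
    records,
    candidate_labels=candidate_labels,
    hypothesis_template="The topic of this content is {}",
    batch_size=10,
)
predicted = [out["labels"][0] for out in outputs]     # top-scoring class per record

# Efficiency against the pre-labelled data: accuracy, Cohen's Kappa, confusion matrix.
print("Accuracy:", accuracy_score(gold_labels, predicted))
print("Cohen's Kappa:", cohen_kappa_score(gold_labels, predicted))
print(confusion_matrix(gold_labels, predicted, labels=candidate_labels))
```

The supervised baseline would instead fine-tune "Bert-base-cased" on the pre-labelled records (maximum sequence length 512, 8 epochs, batch size 32, Adam with a learning rate of 0.001) and score its predictions with the same metrics.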
DISCUSSION AND CONCLUSIONS

Discussion

The results for the ZSC approach show that, in a classification where no explicit learning process was involved and a generic pre-trained model was used, an accuracy of 71% was recorded. This represents a good result, given that the only "fine tuning" undertaken was the provision of the hypothesis and the expected labels (classes), without any prior understanding of the data intended for classification. Because the ZSC approach is generally aimed at classifying data that cannot be used for a supervised learning process, it provides a good way to solve problems of classifying unstructured data with potential archival value where the content is unknown.

The confusion matrix (Figure 1) shows where discrepancies were identified. While the "sports" class showed some deviation towards being misclassified as "government", it was the "business" class that deviated most with regard to the "government" class, as it was in most cases classified as "government" instead of the expected "business". This may partly be explained by the fact that the content of the individual records pre-labelled with these two classes is very similar, and as such presented the biggest challenge in making a suitable distinction. The deviation could also be attributed to different weighting in the pre-labelling process, where the "pre-classification" carried out for evaluation purposes may have followed a slightly different approach, thus producing results that contrast with those of the ZSC approach.

While the ZSC classifier carried out the classification without any additional learning on the input data, this was not the case in the second approach, which created a decision model based on additional learning on the actual content intended for classification before the final classification task. Because the second approach included a learning phase on the input data provided, accuracy was measured at 97%. This certainly represents an excellent result, but it could only be achieved with pre-labelled data available for training purposes. When dealing with unstructured texts of potential archival value, the availability or unavailability of pre-labelled data may prove to be one of the biggest obstacles to using this approach.

The confusion matrix (Figure 2) for the second approach shows that the learning process was crucial for eliminating the confusion caused by content similarity. It shows that understanding the content and its attributed values before the actual classification provides a more coherent basis for accepting decisions.

Conclusions

Zero Shot Classification provides a useful approach when dealing with unknown content of potential archival value and no accessible means of training a decision model on pre-labelled data. It exhibits certain disadvantages when compared with the supervised classification approach, but it also shows advantages in specific real-life scenarios. Many archives and many creators could benefit from the use of ZSC in providing usable content for accessing archival records, for example through a virtual archive reading room (Sabadin, 2023) or any other digital platform accessible to the public.

With the proliferation of digital content creation and usage, the need for effective management of mass unstructured data of potential archival value has become paramount.
However, alongside this challenge comes the dilemma of determining what content to capture, when to capture it, and how to do it, raising the question of whether all digital material needs to be captured or only a selection. Given the vast volume of digital content being generated, it is imperative to establish criteria for evaluating material and identifying what warrants long-term retention or preservation as archival material. Adaptations and enhancements to evaluation approaches are essential to accommodate the sheer volume of digital content, ensuring that management processes maintain their quality without compromising efficiency. To address these challenges, it is crucial to exert control over the new methods and technologies used for evaluating the value of potential archival content and to ensure transparency throughout any process that uses such technologies. Additionally, clear procedures and mechanisms must be established to ensure compliance with archival regulations and standards, fostering a clear and comprehensive environment conducive to the effective management of unstructured text when machine learning approaches are used to classify records.

REFERENCES

Alcoforado, A. & Ferraz, T. P. & Gerber, R. & Bustos, E. & Oliveira, A. S. & Veloso, B. M. & Siqueira, F. L. & Costa, A. H. R. (2022). ZeroBERTo: Leveraging Zero-Shot Text Classification by Topic Modeling. In P. Gamallo, R. Amaro, C. Scarton, F. Batista, D. Silva, C. Magro & H. Pinto (eds), PROPOR 2022: Computational Processing of the Portuguese Language (pp. 125–136). Fortaleza, Brazil: Springer.

Baig, J. (2023). Unstructured Data Challenges for 2023 and their Solutions. Retrieved at https://www.astera.com/type/blog/unstructured-data-challenges/ (accessed on 10.03.2024).

Balič, J. (2004). Inteligentni obdelovalni sistemi. Maribor: Faculty of Mechanical Engineering, University of Maribor.

Barredo, A. A. & Díaz-Rodríguez, N. & Del Ser, J. & Bennetot, A. & Tabik, S. & Barbado, A. & Garcia, S. & Gil-Lopez, S. & Molina, D. & Benjamins, R. & Chatila, R. & Herrera, F. (2020). Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58, 82–115.

Bhosale, S. & Pujari, V. & Multani, Z. (2020). Advantages and Disadvantages of Artificial Intelligence. Aayushi International Interdisciplinary Research Journal, 77, 227–230.

Blei, D. M. & Ng, A. Y. & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022. Retrieved at http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf (accessed on 10.03.2024).

Boddington, P. (2023). AI Ethics. Singapore: Springer Nature Singapore Pte Ltd.

Burgener, E. & Rydning, J. (2022). High Data Growth and Modern Applications Drive New Storage Requirements in Digitally Transformed Enterprises. IDC White Paper. Retrieved at https://www.delltechnologies.com/asset/en-my/products/storage/industry-market/h19267-wp-idc-storage-reqs-digital-enterprise.pdf (accessed on 10.03.2024).

Caliskan, A. & Bryson, J. J. & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183–186.

Eisenstein, J. (2018). Natural language processing. MIT Press. Retrieved at https://cseweb.ucsd.edu/~nnakashole/teaching/eisenstein-nov18.pdf (accessed on 10.03.2024).
Grigory, L. N. (2023). Archival science in the postindustrial society. Atlanti+, 33(1), 38–44. Retrieved at https://journal.almamater.si/index.php/atlantiplus/issue/view/41 (accessed on 10.03.2024).

Guardian Media Group. (2023). The Guardian Open Platform. Retrieved at https://open-platform.theguardian.com/ (accessed on 10.03.2024).

Hugging Face. (s. d.). Bert-base-cased. Retrieved at https://huggingface.co/google-bert/bert-base-cased (accessed on 12.03.2024).

Hugging Face. (s. d. a). Roberta-large-mnli. Retrieved at https://huggingface.co/FacebookAI/roberta-large-mnli (accessed on 12.03.2024).

Hugging Face. (s. d. b). Zero shot classification. Retrieved at https://huggingface.co/tasks/zero-shot-classification (accessed on 10.03.2024).

Khanzode, C. A. & Sarode, R. D. (2020). Advantages and disadvantages of artificial intelligence and machine learning: a literature review. International Journal of Library & Information Science (IJLIS), 9(1), 30–36.

Khurana, D. & Koli, A. & Khatter, K. & Singh, S. (2022). Natural Language Processing: State of The Art, Current Trends and Challenges. Multimedia Tools and Applications, 82(6). Retrieved at https://www.researchgate.net/publication/319164243_Natural_Language_Processing_State_of_The_Art_Current_Trends_and_Challenges (accessed on 10.03.2024).

Klasinc, P. P. (2023). Archivistics, Archival science and Artificial intelligence. Atlanti+, 33(2), 25–36. Retrieved at https://journal.almamater.si/index.php/atlantiplus/issue/view/42/31 (accessed on 22.05.2024).

Knime. (2024). Knime. Retrieved at https://www.knime.com/ (accessed on 12.03.2024).

Kumar, P. & Jain, V. K. & Kumar, D. (2021). Artificial Intelligence and Global Society. Boca Raton: Taylor & Francis Group, LLC.

McMullen, M. (11. 5. 2023). 11 NLP Use Cases: Putting the Language Comprehension Tech to Work. Readwrite. Retrieved at https://readwrite.com/11-nlp-use-cases-putting-the-language-comprehension-tech-to-work/ (accessed on 10.03.2024).