Informatica 42 (2018) 127–136

Prediction of Sentiment from Macaronic Reviews

Sukhnandan Kaur and Rajni Mohana
Department of CSE, JUIT, Waknaghat, 173234, India
E-mail: sukhnandan.kaur@mail.juit.ac.in, rajni.mohana@juit.ac.in

Technical paper

Keywords: macaronic language, sentiment analysis, supervised learning, normalization

Received: March 11, 2017

The web is a vast ocean of data. It allows users to post their opinions and suggestions on various social platforms. Users often prefer to write in their native language or in hybrid content (i.e., a combination of two or more languages). It is also common for people to use a word or two of their native language within a text written in a base language. The presence of native words alongside a base language is known as a macaronic language, for example Dunglish (Dutch and English), Chinglish (Chinese and English) and Hinglish (Hindi and English). The use of macaronic languages on the web is on the rise. Such text generally does not follow any single syntactic structure, which makes processing it difficult. This paper deals with extracting meaningful information from text containing macaronic content. It also addresses the need for analysers capable of processing such content so that effective decisions can be taken; the performance of many decision support systems depends on these analysers. The paper therefore presents an algorithm which first normalizes the content to its base language and then performs sentiment analysis over it. The experimental results obtained with the proposed algorithm indicate a trade-off between various performance aspects.

Povzetek: The paper presents an approach to understanding macaronic text, i.e., text with words of another language mixed in.

1 Introduction

Online review communities allow their users to post opinions and suggestions on various social platforms. These reviews strongly affect decisions to buy or sell a product or to use a service, and they help manufacturers and service providers improve their offerings. Automatic decision support systems take these reviews into account for sentiment analysis. However, it is rare for reviews to be written in a single, uniform language. When processing reviews written online, it is found that two thirds of internet users are non-English speakers [5]. One reason is that most people can learn only two or three languages proficiently, and people are equally inclined to write on the internet in any of the languages they know. Reviewers belong to different communities from different regions of the world and are free to use their native language as well. A text that contains more than one language is called a multilingual text; if a single sentence contains more than one language, it is called a macaronic text [18].

Example 1: Samsung अच्छा cellphone

The text above is macaronic content containing Hindi and English.

Such irregularities in web data make processing more complex. Because language resources on the web are scarce, it is very difficult to handle every possible language; this remains a challenging task for the natural language processing community. Restricting sentiment analysis to a single language limits the system to specific users, whereas the reviews from all users of a particular entity are valuable.
This increases the need for automated systems that can handle multilingual content. Derkacz et al. [12] stated some of the requirements for a multilingual automated system; these requirements are then taken care of by language processors when building such a system. In a multilingual system the language of the whole document is taken into account, whereas for macaronic language processing the language of each word must be detected. This paper proposes a sentiment analyser that deals with macaronic text. Reviews are first normalized during a pre-processing stage and are later processed by the sentiment analyser.

The paper is organised as follows: Section 2 describes state-of-the-art sentiment analysers. Section 3 presents the system design and the proposed algorithm. An experimental analysis using various performance metrics is given in Section 4. Finally, the work is concluded in Section 5.

2 Related work

Numerous researchers have worked in the field of natural language processing. Kaur et al. [14] presented sentiment analysis of reviews written in Punjabi; the collected Punjabi reviews were segregated into positive and negative reviews. Das et al. [8] identified the need for a SentiWordNet for the Bengali language and annotated the required lexicon, which has helped researchers in the field of sentiment analysis. Das et al. [7] worked on sentiment analysis of reviews written in Bengali, using a support vector machine (SVM) with the Bengali SentiWordNet and presenting feature extraction for Bengali. Das et al. [6] developed subjectivity clues based on theme detection techniques; a Bengali corpus was used and the results were later compared with English subjectivity detection. Das et al. [9] developed a gaming approach by which researchers can easily build a SentiWordNet in the required language, although this work demands the respective linguistic experts. Joshi et al. [13] used a supervised learning approach together with the Hindi SentiWordNet; standard translation techniques were used to preserve the polarity of each document while translating it. Bakliwal et al. [2] worked on detecting subjectivity based on graph theory, exploring the effect of synonyms and antonyms on the subjective nature of a document; the results were good for Hindi and English, and the authors claimed that the strategy would work well for other languages too. Das et al. [10] developed a system for deducing emotion, and the intensity of emotion, from the sentiment hidden in the data, using supervised learning methods. Richa et al. [21] presented a survey of sentiment analysis for the Hindi language. The results show that sentiment analysis in Hindi is more complex than in English because of the non-uniform nature of the Hindi language; various research challenges are also discussed. The same researchers [21] developed a system that determines the polarity of a text and tested it on Hindi movie reviews. Parul et al. [1] developed a sentiment analyser for movie reviews written in Punjabi using various machine learning algorithms. Raksha et al. [20] used a semi-supervised technique for polarity detection in Hindi movie reviews.
In their work, the researchers reported 87% accuracy for the proposed system, using bootstrapping and a graph-based approach for sentiment analysis. Pooja et al. [17] used the Hindi SentiWordNet for finding the opinion orientation of reviews, employing unsupervised learning. Kerstin et al. [11] developed a system for multilingual text that obtains the polarity of reviews written in languages other than the resource-rich English; a standard translation methodology and supervised learning were used for sentiment analysis. C. Banea et al. [3] developed a system that performs sentiment analysis based on the translation of input documents from languages other than English, using English as the source language and a supervised learning approach. Various available translators, such as Google, Moses and Bing, were used to translate the text.

The work of these researchers is summarized in Table 1. It can be seen that researchers are active in the area of multilingual sentiment analysis, but they have focused on detecting the language of a whole document and translating it into a base language, rather than on the language of individual words. This sometimes discards opinion-bearing words written in a foreign language. In Example 1, the word अच्छा, which means good, is discarded if the document language is detected as English. Efficient processing of such documents is required to increase the effectiveness of decision support systems.

Table 1: State of the art in multilingual sentiment analysis

Author | Work | Level | Language | Results | Technique | Corpus | Year
Danet et al. [5] | Classification of reviews into positive or negative opinion | Document level | Punjabi | Accuracy = 75% | Machine learning | Blogs | 2014
Derkacz et al. [18] | Classification of reviews into positive, negative, neutral or emotion (sad, happy, etc.) | Document level | Bengali | Precision = 70.04%, Recall = 63.02% | Machine learning | Custom lexicon | 2010
Das et al. [14] | Documents separated by domain-independent subjectivity and factual content | Sentence level | Bengali | Precision = 70.04%, Recall = 63.02% | Machine learning | Custom lexicon | 2009
Bandyopadhyay et al. [6] | Sentiment analysis of Hindi and English reviews using Hindi SentiWordNet | Document level | Hindi, English | Precision = 70.04%, Recall = 63.02% | Supervised | Movie reviews | 2012
Joshi et al. [9] | Subjectivity clues based on antonyms and synonyms using graph theory | Document level | Hindi, English | Accuracy = 79% | Supervised | Movie reviews | 2012
Sharma et al. [10] | Polarity detection of movie reviews using unsupervised techniques | Sentence level | Punjabi | NA | Unsupervised | Movie reviews | 2015
Arora et al. [21] | Sentiment orientation of reviews written in Hindi | Document level | Hindi | Precision = 70.04%, Recall = 63.02% | Unsupervised | Movie reviews | 2014
Sharma et al. [1] | Sentiment analysis using semi-supervised techniques | Document level | Hindi | Accuracy = 87% | Semi-supervised | Movie reviews | 2014
Pandey et al. [20] | Opinion orientation of Hindi movie reviews deduced using Hindi WordNet | Document level | Hindi | NA | Unsupervised | Movie reviews | 2015
Denecke et al. [17] | Polarity detection from reviews via standard translation of German reviews into English | Document level | German, English | Accuracy = 66% | Supervised | Movie reviews | 2008
Banea et al. [11] | Enabling a multilingual question answering system | Document level | French, German and Spanish | NA | Supervised | Question answers | 2016

2.1 Motivation

Looking at this scenario, a SentiWordNet would be needed in almost every language across the globe, which is a very complex task. The motivation behind the proposed system is that existing systems for multilingual sentiment analysis are unable to process macaronic data, and the volume of macaronic data on the internet is rising. The reasons for the large volume of macaronic content on the web are as follows:

1. Scarcity of resources: The sentiment analysis task demands the availability of lexicons or data for each particular language, and there is huge variation between language models, so a model built for one language cannot be used for another. For example, a Chinese language model does not consider spaces, whereas other models rely mainly on spaces for tokenization.

2. Lack of uniformity across languages: Most languages follow their own traditional structures, so processing the data of each language with one general structural model gives unsatisfactory results. For example, English uses Subject-Verb-Object (SVO) order while Hindi follows Subject-Object-Verb (SOV).

3. Freedom to write in the native language: People nowadays have followers from different countries through various online applications and can propagate their ideas through them. Sometimes they prefer to write a few words in their own native language, which may not be understandable to some followers. In an automated system that pre-processes text with a single language model, these native words may be neglected as foreign-language words, and meaningful information can be lost during this type of pre-processing. For example: सैमसंग is in great demand.
Here सैमसंग (Samsung) is neglected by the English language model, and it therefore becomes difficult to extract Samsung as an entity.

4. Getting the point of attraction: People use multilingual content or fancy words in various applications such as product advertisements and shop names, which makes processing such web content complex. For example: samsung (Samsung) is in great demand. मोना (Mona) is feeling so good.

Hence, in the above examples, Samsung is hard to detect because it is neglected by the chosen language model.

For the reasons mentioned above, an efficient system for processing macaronic language content is very much needed. Our contribution is to enhance performance, i.e. precision, recall and accuracy, using supervised sentiment analysers. The proposed system also has a low fall-out, which indicates its high efficiency.

3 System design

The proposed system, shown in Figure 1, applies a set of techniques for the normalization of macaronic text and the classification of reviews. The system consists of three major components:

1. Language Processing,
2. Text Processing, and
3. Sentiment Analysis.

The language detection component is carried out using Algorithm 1; its core idea is to normalize the macaronic content. The other two components are carried out using Algorithm 2, which normalizes the content and extracts the SentiStrength of each document. Together, Algorithms 1 and 2 carry out sentiment analysis for multilingual or macaronic documents.
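To make this flow concrete, the following is a minimal structural sketch of the pipeline in Python. The component functions and the toy lexicon values are hypothetical placeholders, not the authors' implementation; each step is elaborated in the components and algorithms described below.

def language_processing(tokens, base_lang="en"):
    # Placeholder: detect each token's language and translate non-base-language
    # tokens into the base language (detailed in component 1 and Algorithm 1).
    return tokens

def text_processing(tokens):
    # Placeholder: normalization (slang, idioms, case folding) and re-tokenization
    # (detailed in component 2).
    return [t.lower() for t in tokens]

def sentiment_analysis(tokens):
    # Placeholder: aggregate a SentiWordNet-style score for the document
    # (detailed in component 3 and Algorithm 2); toy weights, not SentiWordNet.
    demo_lexicon = {"good": 0.47, "bad": -0.47}
    return sum(demo_lexicon.get(t, 0.0) for t in tokens)

def analyse_review(review, base_lang="en"):
    tokens = review.split()                          # word-level tokenization
    tokens = language_processing(tokens, base_lang)
    tokens = text_processing(tokens)
    return sentiment_analysis(tokens)

print(analyse_review("media is a good source of knowledge"))   # 0.47

In the full system each placeholder is replaced by the corresponding component of Figure 1, but the control flow remains the same: per-token language handling, normalization, then document-level scoring.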
Figure 1: Proposed system design

1. Language Processing: This is the primary component of the proposed system. It carries out tokenization, language detection, and conversion of tokens to the base language. Its sub-components are as follows:

(a) Tokenization: Tokenization is the basic unit of any language processing task. A sequence of sentences, words or characters is passed as input to the system, and the output of this phase is tokens. It can be done at different levels of granularity, i.e. sentence level, word level or character level, as shown in Table 2. The proposed system is based on word-level tokenization of the macaronic text. E.g. "Samsung has a good market value. Users are happy with its mobile products."

Table 2: Tokenization at different levels

Level of processing | Number of tokens
Sentence level | 2
Word level | 13
Character level | 74

(b) Language Detection / Translation: For language detection we use PoS tagging [19], as shown in Table 3. Unrecognized or untagged tokens are passed to the language detection module. The output of this phase is tokens in the base language of the system, here English. If a token is found in the Hindi WordNet, a Hindi-to-English translator is applied to it; if the word belongs to Punjabi, it is passed through a Punjabi-to-English translator. This is a general procedure that can be applied to various other languages as well. A minimal sketch of this per-token detection and translation step is given at the end of this section.

2. Text Processing: This is the second major component of the proposed system. It carries out the following sub-tasks:

(a) Normalization: After filtration of subjective sentences, normalization is performed. Normalization regularizes the grammatical variants present in a sentence, such as past and present verb forms (regular and irregular) and singular and plural noun phrases, and also handles abbreviations, case folding, etc. In other words, normalization brings the data into the well-formed shape required for appropriate processing. It includes:

i. Handling slang: Slang plays an indispensable role in opinion mining, so it would be wasteful to reject all slang terms as stop words. Various procedures are applied to handle the different types of slang [5]:
– Emoticons, mapped to words (e.g. a sad face to "bad", a smiley to "happy");
– Interjections: "Mmmmm" (pleasure), "hmmmm" (wondering), "Mhmmm" (confirmation);
– Intentionally misspelled words: "cooooooool", "goooooooood", "nyt", etc.;
– Alphanumeric strings: "gr8", "9t", etc.
Test sentence: "She is flying high by having this cellphone. [emoticon]" becomes "She is flying high by having this cellphone. Happy".

ii. Idiomization, i.e. replacement of idioms with their actual meaning: In English, idioms play a very important role in fixing the opinion a sentence expresses about a particular entity. If stop words are removed, words that may or may not be part of an idiom can be rejected even though they contribute strongly to the opinion.
Test sentence: "She is flying high by having this cellphone. Happy" becomes "She is very happy by having this cellphone. Happy".

(b) Tokenization: In our work we use a word-level tokenizer, as mentioned in Table 2, so that each token can be processed according to its own language rather than according to the language of the document.
(c) PoS Tagging: Part-of-speech tagging plays a vital role in natural language processing tasks. We first examined whether state-of-the-art PoS taggers are able to recognise a foreign word, using the NLTK tagger [15] and the Stanford tagger [16]. The results of both taggers on various test sentences are shown in Table 3. The untagged tokens found in this way are then processed through the language processing phase.

Table 3: Tagging of various test sentences using the NLTK and Stanford taggers

Test sentence | NLTK tagger | Stanford tagger
मीडिया ज्ञान का एक अच्छा स्रोत है | मीडिया—NN ज्ञान—: का—: एक—: अच्छा— स्रोत— है— | मीडिया/VBZ ज्ञान/NNP का/NNP एक/NNP अच्छा/NNP स्रोत/NNP है/NNP
media is अच्छा source of knowledge | media—NNS is—VBZ अच्छा—: source—NN of—IN knowledge—NN | media/NNS is/VBZ अच्छा/JJ source/NN of/IN knowledge/NN
मीडिया ज्ञान का एक good स्रोत है | मीडिया—NN ज्ञान—: का—: एक—: good—JJ स्रोत— है— | मीडिया/VBZ ज्ञान/NNP का/NNP एक/NNP good/JJ स्रोत/NNP है/NNP
media ज्ञान का एक अच्छा स्रोत है | media—NNS ज्ञान—: का—: एक—: अच्छा— स्रोत— है— | media/NNS ज्ञान/NNP का/NNP एक/NNP अच्छा/NNP स्रोत/NNP है/NNP

3. Sentiment Analysis: In this module the potency of each review is calculated. The magnitude of the sentiment associated with each document is computed by aggregating the senti-scores of all reviews corresponding to that document. SentiWordNet is the basis for obtaining the actual magnitude of a document's sentiment; in this work we use SentiWordNet v3.0.0. The senti-score corresponding to each document is produced as output, as shown in Table 4.

Table 4: Senti-score associated with each review

Test sentence | SentiStrength
मीडिया is good source of knowledge | 0.47
media is good source of knowledge | 0.47
मीडिया ज्ञान का एक अच्छा स्रोत है | 0
media is अच्छा source of knowledge | 0
मीडिया ज्ञान का एक good स्रोत है | 0.47
media ज्ञान का एक अच्छा स्रोत है | 0
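As referenced in the Language Detection/Translation step above, the following is a minimal, self-contained Python sketch of per-token language handling. It uses Unicode-block inspection as a simple stand-in for the encoding-based segmentation of Algorithm 1 and the tagger-based detection of component 1(b); the tiny DEMO_DICT is a hypothetical placeholder for a real Hindi-to-English or Punjabi-to-English translator.

def token_language(token):
    # Rough per-token language guess from Unicode blocks, standing in for
    # the encoding-based segmentation of Algorithm 1.
    for ch in token:
        if "\u0900" <= ch <= "\u097F":
            return "hi"    # Devanagari block -> Hindi
        if "\u0A00" <= ch <= "\u0A7F":
            return "pa"    # Gurmukhi block -> Punjabi
    return "en"            # default to the base language

# Hypothetical stand-in for a Hindi/Punjabi-to-English translator.
DEMO_DICT = {"अच्छा": "good", "सैमसंग": "samsung"}

def to_base_language(tokens, base="en"):
    out = []
    for tok in tokens:
        if token_language(tok) == base:
            out.append(tok)
        else:
            out.append(DEMO_DICT.get(tok, tok))   # translate when a mapping is known
    return out

print(to_base_language("सैमसंग अच्छा cellphone".split()))
# -> ['samsung', 'good', 'cellphone']

In the full system a WordNet lookup and a proper machine translator would replace the dictionary, but the control flow is the same: tokens already in the base language pass through unchanged, and everything else is translated before normalization and scoring.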
4 Evaluation

4.1 Dataset

We extracted a corpus of reviews of 10 movies containing 200 movie reviews, 100 positive and 100 negative; 160 reviews were used for training and 40 for testing. Each review ranges in size from 500 to 1000 words. The corpus was initially classified according to the users' scores: reviews with a 3 to 5 star rating are classified as positive, whereas reviews with a 0 to 2 star rating are taken as negative. This prior classification rests on the assumption that the star rating is correlated with the sentiment of the review. For the experimental evaluation, the data was pre-processed with the TreeTagger PoS tagger and a lemmatization tool. We used Support Vector Machine (SVM), Naive Bayes, kNN and a convolutional network as classification models to train the system and classify the movie reviews. The reviews are not monolingual; they are macaronic in nature, i.e. a single review contains more than one language (Hindi and English). We manually annotated the reviews with the language of each token; the annotation guidelines stipulate the need to retain the semantic structure of the tokens. Five graduate students participated in the reviewing process to formulate the gold standard. To evaluate inter-annotator disagreement we used the kappa measure [4] and obtained a score of 0.61.

4.2 Performance

Formally, the performance of the proposed sentiment analyser, PSA, is a function of four factors:

PSA(l, Ld, t, Es)

where Ld is the language detection, l is the learning algorithm, t is the tagger and Es is the experimental setup. The performance of the analyser is directly affected by the choice of optimal parameters for each of these factors; with optimal parameters for every factor, the analyser gives its maximum performance (PSAmax). On the other hand, training uses machine-translated data while testing of the learning algorithm is based on the human-annotated dataset, i.e. the gold standard. The performance of the sentiment analyser (PSA) is therefore negatively affected by the error of the language detection phase (ELd), as given in equation (1):

PSA = PSAmax − ELd    (1)

With optimal parameters, ELd → 0 and PSA = PSAmax.

4.3 Performance metrics

For the analysis of the results, the performance metrics commonly used in natural language processing tasks, including sentiment analysis, are employed: precision, recall, F-measure, accuracy and fall-out. These measures are calculated from the confusion matrix given in Table 5.

Table 5: Confusion matrix used to evaluate performance

 | Target | Not target
Selected | tp | fp
Not selected | fn | tn

Precision is the fraction of retrieved documents that are relevant. It is calculated using equation (2):

P = (number of positive/negative documents correctly detected by the system) / (number of positive/negative documents detected by the system)    (2)

Recall is the fraction of relevant documents that are retrieved. It is calculated using equation (3):

R = (number of positive/negative documents correctly detected by the system) / (number of positive/negative documents present in the gold standard test set)    (3)

The F-measure is a weighted harmonic mean that takes both precision and recall into account; with α = 1, precision and recall are weighted equally. It is calculated using equation (4):

F = ((α^2 + 1) × P × R) / (α^2 × P + R)    (4)

Accuracy is the fraction of classifications that are correct. It is calculated using equation (5):

A = (tp + tn) / (tp + tn + fn + fp)    (5)

Fall-out measures the proportion of non-target items that are mistakenly selected. It is calculated using equation (6):

FO = fp / (tn + fp)    (6)

4.4 Results and analysis

The outcomes of our experimental study are presented in Table 6 and Table 7. Every machine learning approach has its own pros and cons, and each is valuable in a different respect, i.e. precision, recall, accuracy, fall-out or execution time. To validate our results we used 10-fold cross validation.
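The cross-validation protocol itself is standard; purely as an illustration, the sketch below shows how a 10-fold comparison of the three classical classifiers could be run with scikit-learn over bag-of-words features. The review texts and labels here are hypothetical placeholders, LinearSVC stands in for the SVM, and the convolutional network used in the paper is not reproduced.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical placeholders for the normalized reviews and their 0/1 polarity labels.
reviews = ["the movie was good and the songs were great",
           "bad acting and a boring story"] * 20
labels = [1, 0] * 20

models = {"NB": MultinomialNB(),
          "SVM": LinearSVC(),
          "kNN": KNeighborsClassifier(n_neighbors=5)}

for name, clf in models.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)      # bag-of-words features
    scores = cross_val_score(pipe, reviews, labels, cv=10, scoring="accuracy")
    print(name, "mean accuracy:", round(scores.mean(), 2))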
For the experimental setup, we used Support Vector Machines (SVM), Naive Bayes (NB), kNN and a convolutional network (deep learning) to analyse the performance of the proposed algorithm. The results are shown in Table 6 and Table 7; precision, recall, accuracy and fall-out are given as percentages, and time in seconds.

Table 6: Un-normalized macaronic sentiment analysis

Learning approach | Precision | Recall | Accuracy | Fall-out | Time (s)
NB | 51.58 | 50.4 | 50.4 | 92.8 | 422
SVM | 62.29 | 62 | 62 | 45.6 | 428
kNN | 52.01 | 52 | 52 | 49.6 | 421
Convolutional network | 54.96 | 54 | 54 | 24 | 751

Table 7: Proposed normalized macaronic sentiment analysis

Learning approach | Precision | Recall | Accuracy | Fall-out | Time (s)
NB | 69.46 | 68.62 | 68.63 | 28.79 | 18
SVM | 71.72 | 71.69 | 71.75 | 20.21 | 21
kNN | 65.41 | 65.31 | 65.47 | 40.21 | 29
Convolutional network | 58.03 | 54.56 | 55.00 | 13.04 | 440

Table 8: Comparison with existing sentiment analysis

Approach | Precision | Recall | Accuracy | Fall-out
Baseline | 55.21 | 54.6 | 54.6 | 53
Proposed | 66.15 | 65.04 | 65.21 | 25.56

Figure 2: Comparison of the learning approaches on (a) precision, (b) recall, (c) accuracy and (d) fall-out.

Figure 3: Comparison of the execution time of the machine learning algorithms under the proposed scheme for normalized and un-normalized data.

Figure 4: Comparison of the proposed technique with the state of the art.

The time taken by each learning technique depends strongly on data size, data types, number of columns, computer hardware, memory, background processes, cores, etc., and may vary with a change in any of these attributes. Hence, the times in Table 6 and Table 7 serve to indicate the time trend of each learning model, shown below in increasing order; the time drops to a marginal level for normalized content.

Order for un-normalized content: kNN < Naive Bayes < SVM < convolutional network.
Order for normalized content: Naive Bayes < SVM < kNN < convolutional network.

The results in Figure 2 clearly show the performance of the proposed system under the various learning approaches and highlight its behaviour in different respects. The proposed scheme outperforms the existing system using Naive Bayes, raising precision and recall by 17.88% and 18.22% respectively. The results of the other classifiers, i.e. SVM, kNN and the convolutional network, also show significant improvements: with SVM and kNN, improvements of more than 9% and 13% are observed in precision and recall using the proposed approach. It is also noticeable that there is a trade-off between the various performance aspects; the convolutional network is effective but takes more time than the other classifiers for macaronic sentiment analysis. From Figure 3 we find that the proposed algorithm also greatly affects the time taken by each model: normalized content reduces the training time for every learning approach. In Table 8 the results are compared to the baseline approach; the average values of precision and recall increase while the fall-out decreases significantly.
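The percentage figures above follow directly from the definitions in Section 4.3; purely as an illustration, the following self-contained snippet computes precision, recall, F-measure, accuracy and fall-out from hypothetical confusion-matrix counts (not the paper's actual counts).

def metrics(tp, fp, fn, tn, alpha=1.0):
    # Equations (2)-(6): precision, recall, F-measure, accuracy, fall-out.
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = (alpha**2 + 1) * p * r / (alpha**2 * p + r)
    a = (tp + tn) / (tp + tn + fp + fn)
    fo = fp / (fp + tn)
    return {"precision": p, "recall": r, "f_measure": f,
            "accuracy": a, "fall_out": fo}

# Hypothetical counts for a 40-review test set.
print(metrics(tp=15, fp=6, fn=5, tn=14))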
Figure 4 shows how effective the proposed approach is compared with state-of-the-art sentiment analysis for macaronic language.

5 Conclusion

On a web where a huge amount of user-generated content already exists, the need for sensible computation in decision support systems is rising. Multilingual online content has increased the amount of web debris, which inevitably and negatively affects information retrieval and extraction for decision support systems. To analyse this negative trend and propose a possible solution, this paper focused on bag-of-words sentiment analysis for macaronic reviews. Different supervised machine learning approaches gave different cross-validated results, obtained by borrowing the concept of training and testing from the field of machine learning. The evaluation shows that there is a trade-off between the various performance measures. In this study we investigated the need to normalize macaronic text and performed sentiment analysis over macaronic text consisting of English and Hindi, finding an average rise of about 11% in precision and recall. Training time is also reduced significantly with the proposed approach. We plan to extend the system to handle macaronic text with more than two languages and to apply the proposed algorithm to entity extraction.

References

[1] Arora, P. and B. Kaur (2015). "Sentiment Analysis of Political Reviews in Punjabi Language." International Journal of Computer Applications 126(14).

[2] Bakliwal, A., P. Arora, et al. (2012). Hindi subjective lexicon: A lexical resource for Hindi polarity classification. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC).

[3] Banea, C., R. Mihalcea, et al. (2008). Multilingual subjectivity analysis using machine translation. Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics.

[4] Bunt, H., V. Petukhova, et al. (2016). Dialogue Act Annotation with the ISO 24617-2 Standard. Multimodal Interaction with W3C Standards, Springer: 109-135.

[5] Danet, B. and S. C. Herring (2003). "Introduction: The multilingual internet." Journal of Computer-Mediated Communication 9(1).

[6] Das, A. and S. Bandyopadhyay (2009). Theme detection: an exploration of opinion subjectivity. Affective Computing and Intelligent Interaction and Workshops (ACII 2009), 3rd International Conference on, IEEE.

[7] Das, A. and S. Bandyopadhyay (2010). Opinion-Polarity Identification in Bengali. International Conference on Computer Processing of Oriental Languages.

[8] Das, A. and S. Bandyopadhyay (2010). "SentiWordNet for Bangla." Knowledge Sharing Event-4: Task 2.

[9] Das, A. and S. Bandyopadhyay (2010). "SentiWordNet for Indian languages." Asian Federation for Natural Language Processing, China: 56-63.

[10] Das, D. and S. Bandyopadhyay (2010). Labeling emotion in Bengali blog corpus: a fine-grained tagging at sentence level. Proceedings of the 8th Workshop on Asian Language Resources.

[11] Denecke, K. (2008). Using SentiWordNet for multilingual sentiment analysis. Data Engineering Workshop (ICDEW 2008), IEEE 24th International Conference on, IEEE.
[12] Derkacz, J., M. A. Leszczuk, et al. Definition of Requirements for Accessing Multilingual Information and Opinions. Multimedia and Network Information Systems, Springer: 273-282.

[13] Joshi, A., A. Balamurali, et al. (2010). "A fall-back strategy for sentiment analysis in Hindi: a case study." Proceedings of the 8th ICON.

[14] Kaur, A. and V. Gupta (2014). "Proposed Algorithm of Sentiment Analysis for Punjabi Text." Journal of Emerging Technologies in Web Intelligence 6(2): 180-183.

[15] Kothapalli, M., E. Sharifahmadian, et al. "Data Mining of Social Media for Analysis of Product Review." International Journal of Computer Applications 156(12).

[16] Nguyen, D. Q., D. Q. Nguyen, et al. "A robust transformation-based learning approach using ripple down rules for part-of-speech tagging." AI Communications 29(3): 409-422.

[17] Pandey, P. and S. Govilkar (2015). "A Framework for Sentiment Analysis in Hindi using HSWN." International Journal of Computer Applications 119(19).

[18] Renduchintala, A., R. Knowles, et al. "Creating interactive macaronic interfaces for language learning." ACL 2016: 133.

[19] Seih, Y.-T., S. Beier, et al. "Development and Examination of the Linguistic Category Model in a Computerized Text Analysis Method." Journal of Language and Social Psychology.

[20] Sharma, R. and P. Bhattacharyya. "A Sentiment Analyzer for Hindi Using Hindi Senti Lexicon."

[21] Sharma, R., S. Nigam, et al. (2014). "Polarity detection of movie reviews in Hindi language." arXiv preprint arXiv:1409.3942.

Algorithm 1:

Input: Document set D = d1, d2, d3, ..., dk
  k is the total number of documents
  m is the total number of words in a document
  Ls = language of a segment
  Lb = base language (English)
Output: Ws (weighted SentiStrength of each document)

Begin
for d = 1 to k do
    Tokenization
    for i = 1 to m do
        Encoding based on UTF-8
    end for
    {Segments of the same category are combined}
    Segmentation based on encoding
    Language detection for each segment
    if Ls = Lb then
        go to S1
    else
        Apply translation
    end if
    S1: Assemble segments
    Compute SentiStrength
end for

Algorithm 2:

Input: Document set D = d1, d2, d3, ..., dk
  k is the total number of documents
  m is the total number of tokens in a document
Output: Ws (weighted SentiStrength of each document)

{Token list TL = (t1, t2, ..., tq)}
{Word list WL = (w1, w2, w3, ..., wx)}
{q is the total number of tokens in a document}
{P = list of positive-category words}
{N = list of negative-category words}
{Pw = weight assigned to a token of the positive category as per SentiWordNet}
{Nw = weight assigned to a token of the negative category as per SentiWordNet}

Begin
for d = 1 to k do
    Tokenization
    Stemming
    Normalization
    for j = 1 to m do
        if (tj ∈ WL) ∩ (tj ∈ P) then
            wpos(j) = Pw(tj)
        else if (tj ∈ WL) ∩ (tj ∈ N) then
            wneg(j) = Nw(tj)
        else if (tj ∈ WL) ∩ (tj ∉ P) ∩ (tj ∉ N) then
            wneu(j) = 0
        end if
    end for

    Ws = Σ_{j=1}^{m} wpos(j) − Σ_{j=1}^{m} wneg(j)    (7)

end for
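For illustration, a minimal runnable rendering of Algorithm 2's scoring loop in Python is given below. The positive and negative weight tables are tiny hypothetical stand-ins for SentiWordNet 3.0, and stemming/normalization are reduced to lower-casing for brevity.

# Hypothetical toy weight tables standing in for SentiWordNet 3.0.
POSITIVE = {"good": 0.47, "great": 0.62, "happy": 0.56}
NEGATIVE = {"bad": 0.47, "boring": 0.28}

def weighted_sentistrength(document):
    # Sketch of Algorithm 2: sum positive weights, subtract negative weights (Ws in eq. (7)).
    tokens = document.lower().split()      # tokenization; stemming and normalization omitted
    w_pos = sum(POSITIVE.get(t, 0.0) for t in tokens)
    w_neg = sum(NEGATIVE.get(t, 0.0) for t in tokens)
    return w_pos - w_neg

docs = ["media is a good source of knowledge",
        "the story was boring and the acting was bad"]
for d in docs:
    print(round(weighted_sentistrength(d), 2), "|", d)
# 0.47 | media is a good source of knowledge
# -0.75 | the story was boring and the acting was bad

A document first normalized to the base language by Algorithm 1 would be passed to this function; in the full system the lookups would go through SentiWordNet scores rather than a fixed dictionary.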