The descriptors are called tags, and the
automatic assignment of descriptors to the given tokens is called tagging.
The process of assigning
one of the parts of speech to the given word is called Parts Of Speech tagging,
commonly referred to as POS tagging. Parts of speech include nouns, verbs,
adverbs, adjectives, pronouns, conjunctions, and their sub-categories.
A Part-Of-Speech tagger (POS tagger) is software that reads text and assigns a
part of speech, such as noun, verb, or adjective, to each word (and other
token). It uses different kinds of information such as dictionaries, lexicons,
and rules. Because dictionaries list the category or categories of a particular
word, a word may belong to more than one category; for example, "run" is both
a noun and a verb. To resolve this ambiguity, taggers use probabilistic
information. There are mainly two types of taggers:
– Rule-based: uses hand-written rules to resolve tag ambiguity.
– Stochastic: such taggers are either HMM-based, choosing the tag sequence which
maximizes the product of word likelihood and tag sequence probability, or
cue-based, using decision trees or maximum entropy models to combine
probabilistic features.
A Hidden Markov Model (HMM) is a statistical Markov model in which the system
being modeled is assumed to be a Markov process with unobserved (i.e. hidden)
states. In an ordinary Markov model, the state is directly visible, and thus
the state transition probabilities are the only parameters; in a hidden Markov
model, the state is not directly visible, but the output, which depends on the
state, is visible. Each state has a probability distribution over the possible
output tokens, so the sequence of tokens generated by an HMM gives some
information about the sequence of states.
The most probable tag sequence given some word string can be computed by taking
the product of two probabilities for each tag sequence, and choosing the tag
sequence for which this product is greatest. The two terms are the prior
probability of the tag sequence and the likelihood of the word string. This
product is difficult to compute directly, so HMM taggers make two simplifying
assumptions. The first is that the probability of a word appearing depends only
on its own part-of-speech tag; that is, it is independent of the other words
and the other tags around it:

P(w_1^n | t_1^n) ≈ ∏_{i=1}^{n} P(w_i | t_i)

The second is that the probability of a tag appearing depends only on the
previous tag (the bigram assumption):

P(t_1^n) ≈ ∏_{i=1}^{n} P(t_i | t_{i-1})
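Under these two assumptions, the score of a candidate tag sequence is simply the product of bigram tag-transition probabilities and per-word emission likelihoods. A minimal sketch, with all probability values invented purely for illustration:

```python
# Toy illustration of the two HMM assumptions: score a candidate tag
# sequence as the product of bigram transition probabilities P(t_i | t_{i-1})
# and emission likelihoods P(w_i | t_i).
# All numbers below are made up for illustration, not from any corpus.

transition = {  # P(tag_i | tag_{i-1}); "<s>" marks the sentence start
    ("<s>", "PRON"): 0.4, ("PRON", "VERB"): 0.5, ("VERB", "NOUN"): 0.3,
}
emission = {  # P(word_i | tag_i)
    ("i", "PRON"): 0.2, ("run", "VERB"): 0.01, ("miles", "NOUN"): 0.002,
}

def score(words, tags):
    """Return P(tags) * P(words | tags) under the bigram/emission assumptions."""
    p = 1.0
    prev = "<s>"
    for w, t in zip(words, tags):
        p *= transition.get((prev, t), 0.0) * emission.get((w, t), 0.0)
        prev = t
    return p

print(score(["i", "run", "miles"], ["PRON", "VERB", "NOUN"]))
```

A real tagger would compare this score across all candidate tag sequences (e.g. with the Viterbi algorithm) and pick the largest.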
A European group developed CLAWS, a tagging program that did exactly this, and
achieved accuracy in the 93–95% range.
Many machine learning methods have also been applied to the problem of POS tagging.
Methods such as SVM, maximum entropy classifier, perceptron, and
nearest-neighbor have all been tried, and most can achieve accuracy above 95%.
A more recent development is the use of the structure regularization method for
part-of-speech tagging, achieving 97.36% on the standard benchmark dataset.
Dataset and Pre-processing
A merged Bhojpuri dataset of 2,85,536 tokens is used, containing approximately
20,000 Bhojpuri sentences with the corresponding labels for the words. The
inputs and labels were constructed by splitting the tags from the tokens in
each sentence (placing a space between them); any blank spaces were also
removed.
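A minimal sketch of this splitting step, assuming each token and its tag are joined by an underscore (the delimiter, tag names, and sample words are assumptions for illustration):

```python
def split_tagged_sentence(line):
    """Split a 'token_TAG token_TAG ...' line into parallel token and
    label lists. str.split() with no argument also discards any extra
    blank spaces in the line."""
    tokens, labels = [], []
    for pair in line.split():
        # rpartition keeps any underscores inside the token itself intact
        token, _, tag = pair.rpartition("_")
        tokens.append(token)
        labels.append(tag)
    return tokens, labels

# Illustrative romanized words, not actual corpus lines.
print(split_tagged_sentence("ham_PRP ghare_NN jaat_VB"))
```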
MODELS & ALGORITHM
DictVectorizer is a useful representation
transformation for training sequence classifiers in Natural Language Processing
models, which typically work by extracting features around a particular word of
interest. It is used to convert feature
arrays represented as lists of standard Python dict objects to the NumPy/SciPy
representation used by scikit-learn estimators.
Decision Tree is a non-parametric supervised learning method used for
classification and regression. It creates a model that predicts the value of a
target variable by learning simple decision rules inferred from the data
features.
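A minimal scikit-learn illustration of such a classifier, on toy feature vectors (all values invented; in practice the vectors would come from the DictVectorizer step):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy feature vectors and POS labels, purely illustrative.
X = [[0, 1], [1, 0], [0, 0], [1, 1]]
y = ["NOUN", "VERB", "NOUN", "VERB"]

# The tree learns simple threshold rules over the features.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)
print(clf.predict([[1, 0]]))
```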
Natural Language Processing (NLP) with NLTK: NLTK
is a leading platform for building Python programs to work with human language
data. It provides easy-to-use interfaces to over 50 corpora and lexical
resources such as WordNet, along with a suite of text processing libraries for
classification, tokenization, stemming, tagging, parsing, and semantic
reasoning, wrappers for industrial-strength NLP libraries, and an active
discussion forum. It has many libraries to work on natural language. Using it,
we can tokenize and tag some text, identify named entities, and display a
parse tree.
In order to study the relation between words and their tags, I performed
the modelling in two phases. In the first phase, I trained my DictVectorizer
model on the pre-processed dataset. The second phase involved training
different classifier models and prediction based on the sentence vectors fed
into them. The predictions of the trained models were then compared to the
original tags for evaluation using the accuracy-score metric. First the
sentences were pre-processed and tokenized; those tokens were used to train
the DictVectorizer model. The list of sentence representations was then
assigned to the variable X, and the label list for the corresponding sentences
to the variable y. I divided the dataset into two parts, training (75%) and
testing (25%). The training dataset was used to train the classifier. The
accuracy score for the model, calculated using the metrics module of the
scikit-learn toolkit, was 70.2%. The total number of sentences in the dataset
was 19,439, of which I used 14,579 as training sentences and 4,870 for testing.
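The two-phase procedure above can be sketched end to end. Everything below (the toy corpus, the particular features, the underscore-joined tag format) is an illustrative assumption, not the actual experiment or its data:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy tagged corpus standing in for the Bhojpuri dataset (invented words).
corpus = [
    [("ham", "PRP"), ("jaat", "VB")],
    [("ghare", "NN"), ("jaat", "VB")],
] * 20

def features(sent, i):
    """Simple per-token features; a real tagger would use richer context."""
    word = sent[i][0]
    return {"word": word, "suffix": word[-2:], "is_first": i == 0}

# Phase 1: build feature dicts and fit the DictVectorizer.
X_dicts, y = [], []
for sent in corpus:
    for i, (_, tag) in enumerate(sent):
        X_dicts.append(features(sent, i))
        y.append(tag)
vec = DictVectorizer()
X = vec.fit_transform(X_dicts)

# Phase 2: 75% / 25% split, train the classifier, score with accuracy.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```

On this trivially separable toy data the tree is exact; on the real corpus, ambiguous words and unseen tokens pull the score down to the reported 70.2%.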
I express my profound and sincere gratitude to my mentor Dr. Anil Kumar
Singh for providing me with all the facilities and support during my winter
internship. I would like to thank my guide Mr. Rajesh Mundotiya for his
valuable guidance, constructive criticism, and encouragement, and for laying
out the requisite guidelines that enabled me to complete my work with the
utmost dedication.
At last, I would like to acknowledge my family and friends for the motivation,
inspiration, and support in boosting my morale, without which my efforts would
have been in vain.