Tagging

The descriptors are called tags, and the automatic assignment of descriptors to the given tokens is called tagging.

POS Tagging

The process of assigning one of the parts of speech to a given word is called part-of-speech tagging, commonly referred to as POS tagging. Parts of speech include nouns, verbs, adverbs, adjectives, pronouns, conjunctions and their sub-categories.

POS Tagger

A part-of-speech tagger (POS tagger) is software that reads text and assigns a part of speech to each word (and other token), such as noun, verb, adjective, etc. It uses different kinds of information, such as dictionaries, lexicons and rules, because a dictionary may list one or more categories for a particular word; that is, a word may belong to more than one category.
For example, "run" is both a noun and a verb, so to resolve this ambiguity taggers use probabilistic information.

There are mainly two types of taggers:

Rule-based taggers use hand-written rules to resolve tag ambiguity.

Stochastic taggers are either HMM-based, choosing the tag sequence that maximizes the product of word likelihood and tag sequence probability, or cue-based, using decision trees or maximum entropy models to combine probabilistic features.

HMM

A Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (i.e. hidden) states. In ordinary Markov models, the state is directly visible, and thus the state transition probabilities are the only parameters, while in a hidden Markov model the state is not directly visible, but the output, which depends on the state, is visible. Each state has a probability distribution over the possible output tokens. Therefore, the sequence of tokens generated by an HMM gives some information about the sequence of states.
The most probable tag sequence given a word string can be computed by taking the product of two probabilities for each tag sequence and choosing the tag sequence for which this product is greatest. The two terms are the prior probability of the tag sequence and the likelihood of the word string. Since these are difficult to compute exactly, HMM taggers make two assumptions.

The first assumption is that the probability of a word appearing depends only on its own part-of-speech tag; it is independent of the other words around it and of the other tags around it:

P(w_1^n | t_1^n) ≈ ∏_{i=1}^{n} P(w_i | t_i)

The second assumption is that the probability of a tag appearing depends only on the previous tag, the bigram assumption:

P(t_1^n) ≈ ∏_{i=1}^{n} P(t_i | t_{i-1})

Accuracy achieved

The European group developed CLAWS, a tagging program that did exactly this, and achieved accuracy in the 93–95% range.
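The two assumptions above can be sketched in a small scoring function that multiplies emission probabilities P(w_i | t_i) and bigram transition probabilities P(t_i | t_{i-1}). The probability values below are made-up illustrative numbers, not estimates from a real corpus.

```python
# Minimal sketch of HMM tag-sequence scoring under the two assumptions.
# All probability values are hypothetical, chosen only for illustration.

# P(word | tag): emission probabilities (hypothetical values)
emission = {
    ("I", "PRON"): 0.050,
    ("run", "VERB"): 0.010,
    ("run", "NOUN"): 0.002,
}

# P(tag | previous tag): bigram transition probabilities (hypothetical)
transition = {
    ("<s>", "PRON"): 0.30,
    ("PRON", "VERB"): 0.40,
    ("PRON", "NOUN"): 0.05,
}

def score(words, tags):
    """Product of P(w_i | t_i) * P(t_i | t_{i-1}) over the sentence."""
    prev = "<s>"  # sentence-start pseudo-tag
    p = 1.0
    for w, t in zip(words, tags):
        p *= emission.get((w, t), 0.0) * transition.get((prev, t), 0.0)
        prev = t
    return p

words = ["I", "run"]
# The tagger picks the tag sequence with the highest product:
candidates = [["PRON", "VERB"], ["PRON", "NOUN"]]
best = max(candidates, key=lambda tags: score(words, tags))
print(best)  # the VERB reading of "run" wins under these numbers
```

A real tagger would search over all candidate sequences efficiently (e.g. with the Viterbi algorithm) rather than enumerating them.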
Many machine learning methods have also been applied to the problem of POS tagging. Methods such as SVMs, maximum entropy classifiers, perceptrons, and nearest-neighbor classifiers have all been tried, and most can achieve accuracy above 95%. A more recent development is the structure regularization method for part-of-speech tagging, achieving 97.36% on the standard benchmark dataset.
Data set and Pre-processing

Dataset

A merged Bhojpuri dataset of 2,85,536 tokens is used, containing approximately 20,000 sentences of Bhojpuri and the corresponding labels for the words.

Pre-processing

The inputs and labels were constructed by removing the tags from the sentences: the tag and the token were split apart by placing a space between them, and any blank spaces were also removed.

MODELS & ALGORITHM

1. DICTVECTORIZER

It is a useful representation transformation for training sequence classifiers in Natural Language Processing models, which typically work by extracting features around a particular word of interest. It is used to convert feature arrays represented as lists of standard Python dict objects to the NumPy/SciPy representation used by scikit-learn estimators.

2. DECISION TREE

It is a non-parametric supervised learning method used for classification and regression. It creates a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

Natural Language Processing (NLP) with Python

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum. It has many libraries to work on natural language.
Using it, we can tokenize and tag text, identify named entities and display a parse tree.

DEVELOPMENT

In order to study the relation between words and their tags, I performed the modelling in two phases. In the first phase, I trained my DictVectorizer model against the pre-processed dataset. The second phase involved training and prediction with different classifier models based on the sentence vectors fed into them. The trained models' predictions were then compared to the original labels for evaluation using the accuracy score metric.
First the sentences were pre-processed and tokenized; those tokens were then used to train the DictVectorizer model. The sentence representations list was assigned to the variable X, and the label list for the corresponding sentences was assigned to the variable y. I divided the dataset into two parts, training (75%) and testing (25%). The training dataset was used to train the classifier. The accuracy score for the model, calculated using the metrics module of the scikit-learn toolkit, was 70.2%.
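The two-phase setup described above can be sketched end to end: vectorize token features, make a 75/25 split, fit a decision tree, and score with accuracy. The toy (token, tag) pairs below stand in for the Bhojpuri corpus, which is not reproduced here.

```python
# End-to-end sketch of the pipeline described above, on toy data that
# stands in for the pre-processed Bhojpuri dataset.

from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy (token, tag) pairs; a real run would use the actual corpus.
tokens = ["the", "dog", "runs", "a", "cat", "sleeps"] * 10
tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "VERB"] * 10

features = [{"word": w, "suffix_1": w[-1]} for w in tokens]

vec = DictVectorizer()
X = vec.fit_transform(features)   # phase 1: fit the vectorizer
y = tags

# 75% training / 25% testing split, as in the report
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = DecisionTreeClassifier()    # phase 2: train the classifier
clf.fit(X_train, y_train)

acc = accuracy_score(y_test, clf.predict(X_test))
print(f"accuracy: {acc:.2f}")
```

On this toy data the tags are fully determined by the word, so the tree scores near-perfectly; the 70.2% figure reported above reflects the much harder real corpus.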
The total number of sentences in the dataset was 19,439, out of which I used 14,579 as training sentences and 4,870 for testing.

ACKNOWLEDGEMENT

I express my profound and sincere gratitude to my mentor Dr. Anil Kumar Singh for providing me with all the facilities and support during my winter internship period.

I would like to thank my guide Mr. Rajesh Mundotiya for his valuable guidance, constructive criticism and encouragement, and also for setting the requisite guidelines enabling me to complete my work with utmost dedication and efficiency.

At last, I would like to acknowledge my family and friends for the motivation, inspiration and support in boosting my morale, without which my efforts would have been in vain.