Sentiment analysis using NLTK

Jan 18, 2021

Sentiment analysis, or opinion mining, is the computational study of people’s opinions, sentiments, attitudes, and emotions expressed in written language. It has been one of the most active research areas in natural language processing and text mining in recent years. Its popularity stems mainly from its wide range of applications: opinions are central to almost all human activities and are key influencers of our behavior.

Sentiment classification analyzes the subjective information in a text in order to mine the opinion it expresses. More broadly, sentiment analysis extracts information from people’s opinions, appraisals, and emotions regarding entities, events, and their attributes.

A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level — whether the expressed opinion in a document, a sentence, or an entity feature/aspect is positive, negative, or neutral. Advanced, “beyond polarity” sentiment classification looks, for instance, at emotional states such as “angry”, “sad”, and “happy”.

Sentiment analysis of text involves the following stages:

collection of data and data extraction,
preprocessing,
detection and classification of sentiment, and
presentation of results.

1. Data Extraction:

The first step of the analysis is data extraction, the process of retrieving the text that people have posted. The extracted data is then preprocessed, i.e., cleaned by removing noise.

2. Pre-Processing of Data:

The next important step is preprocessing, which cleans the raw text so that only the data needed for analysis remains. In this step, noisy characters are removed from the text so that it can be analyzed further. Misspelled words, grammatical errors, punctuation errors, unnecessary capitalization, stop words, and non-dictionary words such as abbreviations or acronyms of common terms are a few examples of noise in the text.

The main goal of preprocessing is to standardize the text into a consistent form from which the user’s sentiment can be derived.

The following steps pre-process text into useful data for classification:

a. Tokenization:

First, the text is tokenized. Tokenization is the natural language processing (NLP) step in which a large body of text is divided into smaller parts called tokens.

This is a crucial step in NLP. NLTK’s word_tokenize() function is used to tokenize the input data, splitting each sentence into words. The output of the tokenization is then converted into a data frame.
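A minimal sketch of the tokenization step, assuming NLTK is installed and its tokenizer data has been downloaded. The helper name `tokenize` and the regex fallback are illustrative assumptions, not part of NLTK; the fallback only exists so the sketch still runs where the tokenizer data is unavailable.

```python
import re


def tokenize(text):
    """Split text into word and punctuation tokens."""
    try:
        # NLTK's word tokenizer; it needs the 'punkt' (or 'punkt_tab') data
        # package, which is installed once with nltk.download("punkt").
        from nltk.tokenize import word_tokenize
        return word_tokenize(text)
    except (ImportError, LookupError):
        # Illustrative fallback when NLTK or its data is missing:
        # runs of word characters, or single punctuation marks.
        return re.findall(r"\w+|[^\w\s]", text)


tokens = tokenize("The movie was surprisingly good.")
print(tokens)
```

The resulting list of tokens can be passed to the later preprocessing steps or loaded into a data frame, one token per row.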

b. Lemmatization and Stemming:

The next step of preprocessing is lemmatization and stemming. The two seem similar but differ: stemming strips affixes from a word, usually the ending (suffix) and sometimes the beginning (prefix), which can leave a string that is not a real word. For example, a stemmer such as NLTK’s PorterStemmer reduces studies to studi, which has no meaning in the dictionary.

Lemmatization, on the other hand, is the more powerful method: it uses morphological analysis of the word to convert it into its base form (the lemma) without changing its meaning. For example, the lemma of studies is study. Because its output consists of real words, lemmatization helps in creating better machine learning features.
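The difference between the two methods can be sketched with NLTK’s PorterStemmer and WordNetLemmatizer. This assumes NLTK is installed; the lemmatizer also needs the WordNet data, so the sketch sets the lemma to None when that data is missing rather than failing.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

word = "studies"

# The Porter stemmer applies suffix-stripping rules and needs no extra data.
stem = PorterStemmer().stem(word)

try:
    # The lemmatizer looks the word up in WordNet; pos="n" treats it as a noun.
    lemma = WordNetLemmatizer().lemmatize(word, pos="n")
except LookupError:
    # WordNet data is missing; install it once with nltk.download("wordnet").
    lemma = None

print(f"stem:  {stem}")
print(f"lemma: {lemma}")
```

With the WordNet data installed, the stem should come out as studi and the lemma as study, which is why the lemmatized form is usually the better input for later steps.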

c. Stop words removal:

Stop word removal is one of the major preprocessing steps, as it filters out words that carry little information. In natural language, stop words are frequently used words such as is, am, are, an, and the, which have very little meaning on their own. These words are removed because they add little value to the analysis.

Code snippet for removing stop words and punctuation:
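A minimal sketch of this step, assuming the text has already been tokenized. It uses NLTK’s English stop word list when the `stopwords` corpus is available; the function names and the small fallback set are illustrative assumptions so the sketch still runs without that corpus.

```python
import string


def load_stop_words():
    """Return a set of English stop words."""
    try:
        # NLTK's list; needs the corpus from nltk.download("stopwords").
        from nltk.corpus import stopwords
        return set(stopwords.words("english"))
    except (ImportError, LookupError):
        # Small illustrative fallback, not a complete stop word list.
        return {"a", "an", "the", "is", "am", "are", "this", "was", "it"}


def remove_stop_words_and_punctuation(tokens):
    """Lowercase tokens and drop stop words and pure-punctuation tokens."""
    stop_words = load_stop_words()
    cleaned = []
    for token in tokens:
        word = token.lower()
        if word in stop_words:
            continue
        if all(ch in string.punctuation for ch in word):
            continue
        cleaned.append(word)
    return cleaned


tokens = ["This", "movie", "is", "great", ",", "the", "acting", "was", "superb", "!"]
cleaned = remove_stop_words_and_punctuation(tokens)
print(cleaned)
```

Lowercasing before the lookup matters because the stop word list is lowercase, so "This" would otherwise slip through.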

3. Determination of sentiment polarity:

The final step of the analysis is determining the sentiment polarity and its intensity. The polarity is determined by counting the words associated with each sentiment.

Code snippet for determining polarity:
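A minimal sketch of count-based polarity, assuming the tokens have already been cleaned. The two word sets and the `polarity` function are hypothetical and far too small for real use; in practice a full sentiment lexicon would take their place, for example the VADER lexicon behind NLTK’s SentimentIntensityAnalyzer.

```python
# Tiny hypothetical lexicons, for illustration only.
POSITIVE_WORDS = {"good", "great", "superb", "love", "excellent", "happy"}
NEGATIVE_WORDS = {"bad", "poor", "terrible", "hate", "awful", "sad"}


def polarity(tokens):
    """Return (label, score), where score = positive count - negative count."""
    positive = sum(1 for token in tokens if token.lower() in POSITIVE_WORDS)
    negative = sum(1 for token in tokens if token.lower() in NEGATIVE_WORDS)
    score = positive - negative
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return label, score


print(polarity(["movie", "great", "acting", "superb"]))
print(polarity(["plot", "terrible"]))
print(polarity(["movie", "acting"]))
```

The sign of the score gives the polarity and its magnitude gives a rough intensity. A plain count like this ignores negation ("not good"), which is one reason to prefer an established lexicon-based scorer for real data.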