Sentiment Analysis Python: How To – Data Science Projects

Sentiment Analysis Python: Understanding Emotions through Data Science Projects

In data science projects, sentiment analysis in Python is a practical way to measure the emotions expressed in text. Sentiment analysis, also known as opinion mining, uses machine-learning algorithms to classify the opinions and emotions that people express in writing. Python, with its rich ecosystem of libraries and frameworks, offers a flexible and accessible platform for these tasks. In this article, we explore how Python can be used to extract sentiment from text data so that businesses and researchers can gain insight into customer feedback, social media trends, and more.

Sentiment Analysis Python: Natural Language Processing (NLP) Libraries

Sentiment analysis in Python relies on natural language processing (NLP) libraries, such as NLTK and spaCy, to process text and classify the sentiment it expresses, typically as positive, negative, or neutral.

These libraries provide a comprehensive toolkit for working with text data. Sentiment analysis is a key application of NLP: it lets us interpret the emotions expressed in textual content, so that businesses and researchers can base their decisions on them.

Sentiment Analysis Python: Understanding

To understand sentiment analysis in Python, it helps to grasp three underlying concepts: tokenization (splitting text into words), part-of-speech tagging (labeling each word’s grammatical role), and sentiment classification (assigning a sentiment label to the text). Python’s NLP libraries provide pre-trained models and tools for these steps.

Sentiment analysis is a branch of natural language processing (NLP) that extracts and interprets emotions from text, usually with machine learning. This article aims to give college students a clear understanding of the topic: the fundamental principles, practical applications in various industries, and how Python can be used to extract insights from text data.
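The tokenization and classification steps described above can be sketched without any library. The example below is only an illustration: the `LEXICON` dictionary and the `tokenize` and `classify` helpers are hypothetical names invented for this sketch, and the six-word lexicon is an assumption, not a real sentiment resource. Libraries such as NLTK and spaCy offer far more capable tokenizers.

```python
import re

# Tiny illustrative lexicon (an assumption for this sketch, not a real resource).
LEXICON = {"great": 1, "love": 1, "happy": 1, "bad": -1, "terrible": -1, "hate": -1}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def classify(text):
    """Sum the lexicon scores of the tokens and map the total to a label."""
    score = sum(LEXICON.get(token, 0) for token in tokenize(text))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(tokenize("I love this dress!"))
print(classify("I love this dress!"))
print(classify("Terrible fit, I hate it."))
print(classify("It arrived on Tuesday."))
```

A word-counting rule like this ignores negation and context, which is one reason the trained models discussed in the next section are usually preferred.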

Sentiment Analysis Python: Techniques

With the growing volume of text in social media, customer reviews, and online discussions, the demand for effective sentiment analysis techniques has grown significantly. This section gives college students an overview of three popular techniques, including their underlying principles and practical applications.

Naive Bayes

Naive Bayes is a widely used machine learning algorithm for sentiment analysis in Python. It is based on Bayes’ theorem and assumes that the features (here, the words of a document) are independent of one another given the class. Naive Bayes is computationally efficient and performs well with large datasets. It is particularly useful for sentiment classification tasks, where the aim is to predict the sentiment (positive, negative, or neutral) of a text.
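To make the idea concrete, here is a minimal, dependency-free sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing. The function names and the four toy training documents are assumptions made for illustration. The code section later in this article uses scikit-learn’s `MultinomialNB` instead, which is the better choice in practice.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Collect class counts and per-class word counts from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_naive_bayes(model, tokens):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data (made up for this sketch)
train_docs = [["love", "this", "dress"], ["great", "fit"],
              ["terrible", "quality"], ["hate", "the", "fabric"]]
train_labels = ["positive", "positive", "negative", "negative"]

model = train_naive_bayes(train_docs, train_labels)
print(predict_naive_bayes(model, ["love", "the", "fit"]))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.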

Support Vector Machines (SVM)

Support Vector Machines (SVMs) are machine-learning models commonly used for sentiment analysis in Python. An SVM separates classes with the hyperplane that leaves the widest margin between them, which works well in the high-dimensional feature spaces that text produces. SVMs have been successfully applied to sentiment classification tasks, where the goal is to predict the sentiment expressed in text data.
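The separating-hyperplane idea can be illustrated with a small linear SVM trained by subgradient descent on the hinge loss. This is a simplified sketch under stated assumptions: the two features (counts of positive and negative words), the toy data, and the hyperparameters are all made up, and scikit-learn’s `SVC`, used later in this article, relies on a different and more sophisticated solver.

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, reg=0.01):
    """Minimize hinge loss plus an L2 penalty with plain subgradient descent.

    X is a list of feature vectors; y holds labels in {-1, +1}.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Inside the margin or misclassified: move the hyperplane.
                w = [wj - lr * (reg * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Correct with enough margin: only the penalty shrinks w.
                w = [wj - lr * reg * wj for wj in w]
    return w, b


def predict_svm(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane x falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical features: [number of positive words, number of negative words]
X = [[2, 0], [1, 0], [0, 1], [0, 2]]
y = [1, 1, -1, -1]

w, b = train_linear_svm(X, y)
print(predict_svm(w, b, [3, 0]), predict_svm(w, b, [0, 3]))
```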

Recurrent Neural Networks (RNN)

Recurrent Neural Networks (RNNs) are a type of neural network architecture widely used for sentiment analysis in Python. Unlike traditional feed-forward neural networks, RNNs carry a hidden state from one step to the next, which allows them to capture sequential information in text. This makes them particularly suitable for sentiment analysis tasks where context and dependencies among words are important. Note that the code later in this article trains a feed-forward network (scikit-learn’s MLPClassifier), not an RNN.
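The recurrence that gives an RNN its memory fits in a few lines. The sketch below runs only the forward pass of a simple (Elman) RNN in plain Python. The weights are hand-picked and untrained, and the three-word vocabulary is an assumption; the point is only to show that the final hidden state depends on word order.

```python
import math


def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a simple (Elman) RNN over a sequence and return the final hidden state.

    Each step computes h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).
    """
    hidden_size = len(b_h)
    h = [0.0] * hidden_size
    for x in inputs:
        h = [
            math.tanh(
                sum(W_xh[i][j] * x[j] for j in range(len(x)))
                + sum(W_hh[i][k] * h[k] for k in range(hidden_size))
                + b_h[i]
            )
            for i in range(hidden_size)
        ]
    return h


# One-hot vectors for a three-word vocabulary: "not", "good", "bad"
NOT, GOOD, BAD = [1, 0, 0], [0, 1, 0], [0, 0, 1]

# Hand-picked, untrained weights, used only to keep the example deterministic
W_xh = [[0.5, 1.0, -1.0], [-1.0, 0.3, 0.2]]
W_hh = [[0.1, 0.8], [0.4, -0.2]]
b_h = [0.0, 0.0]

h_a = rnn_forward([NOT, GOOD], W_xh, W_hh, b_h)  # "not good"
h_b = rnn_forward([GOOD, NOT], W_xh, W_hh, b_h)  # "good not"
print(h_a)
print(h_b)
```

A bag-of-words model would treat both sequences identically, while the recurrent hidden state differs between them. A real sentiment model would learn the weights from data and add an output layer on top of the hidden state.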

Sentiment Analysis: Feature Engineering

Feature engineering plays a pivotal role in sentiment analysis projects. Extracting relevant features such as n-grams, word embeddings, and sentiment lexicon scores can improve the accuracy of sentiment classification models.

In this step we extract meaningful features from the text so that the model can capture the information and patterns that distinguish one sentiment from another.

Common techniques include bag-of-words, n-grams, and word embeddings, each of which can make sentiment analysis models more accurate and robust.
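As a minimal illustration of n-gram features, the sketch below extracts unigrams and bigrams from a tokenized sentence. The `ngrams` helper is a hypothetical name written for this example; scikit-learn’s `CountVectorizer` can produce the same kind of features through its `ngram_range` parameter.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "this dress is not good".split()
unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))

print(ngrams(tokens, 2))
```

The bigram `("not", "good")` keeps the negation attached to the word it modifies, which single-word features cannot do.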

Sentiment Analysis Python: Practical Implementation and Use Cases

Implementing sentiment analysis in Python involves steps such as data preprocessing, feature extraction, model training, and sentiment prediction. Python’s extensive libraries, like scikit-learn and TensorFlow, provide efficient tools for each stage. Here is an overview of the steps involved:

Data Collection

Collect relevant text data from various sources, such as social media, customer reviews, or online discussions.

Data Preprocessing

Clean the text data by removing irrelevant information, special characters, and punctuation. Normalize the text by converting it to lowercase and removing stop words.
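A minimal sketch of this cleaning step, using only the standard library, might look as follows. The `preprocess` function and the short `STOP_WORDS` set are assumptions made for illustration; real projects usually take a fuller stop-word list from a library such as NLTK.

```python
import re

# A deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "and", "to", "of"}


def preprocess(text):
    """Lowercase, strip non-letters, split into tokens, and drop stop words."""
    text = text.lower()
    # Replace digits, punctuation, and other symbols with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("This dress is AMAZING!!! Fits true to size :)"))
```

For sentiment work it is worth checking the stop-word list before using it, because dropping words such as "not" can reverse the meaning of a review.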

Feature Extraction

Transform the preprocessed text data into numerical feature representations that machine learning models can understand. Techniques like bag-of-words, n-grams, or word embeddings can be used for feature extraction.
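The bag-of-words transformation can be sketched in plain Python as below. The helper names `build_vocabulary` and `vectorize` are hypothetical. The code section later in this article performs the same step with scikit-learn’s `CountVectorizer`, which also returns a sparse matrix that is much more memory-efficient.

```python
def build_vocabulary(tokenized_docs):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({token for doc in tokenized_docs for token in doc})
    return {token: index for index, token in enumerate(vocab)}


def vectorize(tokenized_docs, vocabulary):
    """Turn each document into a row of token counts."""
    rows = []
    for doc in tokenized_docs:
        row = [0] * len(vocabulary)
        for token in doc:
            if token in vocabulary:  # tokens outside the vocabulary are skipped
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows


docs = [["love", "love", "dress"], ["hate", "dress"]]
vocabulary = build_vocabulary(docs)
matrix = vectorize(docs, vocabulary)

print(vocabulary)  # {'dress': 0, 'hate': 1, 'love': 2}
print(matrix)      # [[1, 0, 2], [1, 1, 0]]
```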

Labeling

Assign labels to the preprocessed data based on the sentiment expressed, such as positive, negative, or neutral.
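When the data already carries a star rating, the label can be derived from it. The sketch below follows the rule used in the code section of this article (4 or higher is positive, 2 or lower is negative, 3 is neutral); the function name `rating_to_sentiment` is a hypothetical choice for this example.

```python
def rating_to_sentiment(rating):
    """Map a 1-5 star rating to a sentiment label."""
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return "neutral"


print([rating_to_sentiment(r) for r in [5, 3, 1]])
```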

Model Selection and Training

Choose an appropriate machine learning algorithm, such as Naive Bayes, Support Vector Machines (SVM), or Recurrent Neural Networks (RNN), and train the model using the labeled data.

Model Evaluation

Assess the performance of the trained model using evaluation metrics like accuracy, precision, recall, and F1 score. Adjust the model parameters if necessary.
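The four metrics named above all follow from the counts of true and false positives and negatives. The sketch below computes them by hand for a binary task; the `binary_metrics` function and the toy labels are assumptions for illustration, and scikit-learn’s `classification_report`, used later in this article, reports the same quantities.

```python
def binary_metrics(y_true, y_pred, positive=True):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (made up): True means a positive review
y_true = [True, True, True, False, False]
y_pred = [True, True, False, True, False]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here two of the three positive reviews are found (recall 2/3) and two of the three positive predictions are correct (precision 2/3), while accuracy is 3/5.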

Prediction

Apply the trained model to new, unseen text data to predict the sentiment expressed in the text.

By following these steps, college students can successfully implement sentiment analysis, gain insights from textual data, and make informed decisions based on the emotions expressed within the text.

Applications of Sentiment Analysis Python

Sentiment analysis in Python finds applications in social media monitoring, brand reputation management, customer feedback analysis, and market research. Through sentiment analysis, businesses can gain valuable insights into public opinion and make data-driven decisions.

It also helps improve customer support systems. By analyzing reviews and other feedback at scale, businesses can respond to problems earlier and enhance customer satisfaction and brand perception.

Sentiment Analysis Python: Code

The complete code and dataset are available on GitHub.

1: Importing Modules and Reading the Dataset

import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_curve, auc
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
import sklearn.metrics as mt
from plotly import tools
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
 
df1 = pd.read_csv('../input/Womens Clothing E-Commerce Reviews.csv')
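# .copy() makes df an independent frame, so the later column assignments
# do not trigger pandas' SettingWithCopyWarning
df = df1[['Review Text','Rating','Class Name','Age']].copy()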
#df1.info()
#df1.describe()
df1.head()

2: Adding the Word Counts to the data frame and Finding out How Many Times Some Words Were Used

# replace missing reviews with an empty string
df['Review Text'] = df['Review Text'].fillna('')

# CountVectorizer() converts a collection 
# of text documents to a matrix of token counts
vectorizer = CountVectorizer()
# build_analyzer() returns the function that
# preprocesses and tokenizes a string
analyzer = vectorizer.build_analyzer()

def wordcounts(s):
    c = {}
    # tokenize the string and continue only if it yields tokens
    if analyzer(s):
        d = {}
        # count each word of this review; w has shape (1, number of words)
        w = vectorizer.fit_transform([s]).toarray()
        # vocabulary_ maps each word to its column index in w
        vc = vectorizer.vocabulary_
        # invert the mapping so that a column index gives back its word
        for k,v in vc.items():
            d[v]=k # d -> index:word 
        for index,i in enumerate(w[0]):
            c[d[index]] = i # c -> word:count
    return  c

# add new column to the dataframe
df['Word Counts'] = df['Review Text'].apply(wordcounts)
df.head()

3: Demonstrating the Densities of Class Names, Some Selected Words, and All Words in the Reviews By Using WordCloud

# select some words to examine in detail
selectedwords = ['awesome','great','fantastic','extraordinary','amazing','super',
                 'magnificent','stunning','impressive','wonderful','breathtaking',
                 'love','content','pleased','happy','glad','satisfied','lucky',
                 'shocking','cheerful','wow','sad','unhappy','horrible','regret',
                 'bad','terrible','annoyed','disappointed','upset','awful','hate']

def selectedcount(dic,word):
    if word in dic:
        return dic[word]
    else:
        return 0
    
dfwc = df.copy()  
for word in selectedwords:
    dfwc[word] = dfwc['Word Counts'].apply(selectedcount,args=(word,))
    
word_sum = dfwc[selectedwords].sum()
print('Selected Words')
print(word_sum.sort_values(ascending=False).iloc[:5])

print('\nClass Names')
print(df['Class Name'].fillna("Empty").value_counts().iloc[:5])

fig, ax = plt.subplots(1,2,figsize=(20,10))
wc0 = WordCloud(background_color='white',
                      width=450,
                      height=400 ).generate_from_frequencies(word_sum)

cn = df['Class Name'].fillna(" ").value_counts()
wc1 = WordCloud(background_color='white',
                      width=450,
                      height=400 
                     ).generate_from_frequencies(cn)

ax[0].imshow(wc0)
ax[0].set_title('Selected Words\n',size=25)
ax[0].axis('off')

ax[1].imshow(wc1)
ax[1].set_title('Class Names\n',size=25)
ax[1].axis('off')

rt = df['Review Text']
plt.subplots(figsize=(18,6))
wordcloud = WordCloud(background_color='white',
                      width=900,
                      height=300
                     ).generate(" ".join(rt))
plt.imshow(wordcloud)
plt.title('All Words in the Reviews\n',size=25)
plt.axis('off')
plt.show()

4: Viewing the Relation Between Rating, Class Name, and Age

# name the column explicitly; pandas 2.x would otherwise call it 'count'
df1=df['Rating'].value_counts().to_frame('Rating')
avgdf1 = df.groupby('Class Name').agg({'Rating': 'mean'})
avgdf2 = df.groupby('Class Name').agg({'Age': 'mean'})
avgdf3 = df.groupby('Rating').agg({'Age': 'mean'})

trace1 = go.Bar(
    x=avgdf1.index,
    y=round(avgdf1['Rating'],2),
    marker=dict(
        color=avgdf1['Rating'],
        colorscale = 'RdBu')
)

trace2 = go.Bar(
    x=df1.index,
    y=df1.Rating,
    marker=dict(
        color=df1['Rating'],
        colorscale = 'RdBu')
)

trace3 = go.Bar(
    x=avgdf2.index,
    y=round(avgdf2['Age'],2),
    marker=dict(
        color=avgdf2['Age'],
        colorscale = 'RdBu')
)

trace4 = go.Bar(
    x=avgdf3.index,
    y=round(avgdf3['Age'],2),
    marker=dict(
        color=avgdf3['Age'],
        colorscale = 'Reds')
)

fig = tools.make_subplots(rows=2, cols=2, print_grid=False)

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 2, 1)
fig.append_trace(trace4, 2, 2)

fig['layout']['xaxis1'].update(title='Class')
fig['layout']['yaxis1'].update(title='Average Rating')
fig['layout']['xaxis2'].update(title='Rating')
fig['layout']['yaxis2'].update(title='Count')
fig['layout']['xaxis3'].update(title='Class')
fig['layout']['yaxis3'].update(title='Average Age of the Reviewers')
fig['layout']['xaxis4'].update(title='Rating')
fig['layout']['yaxis4'].update(title='Average Age of the Reviewers')

fig['layout'].update(height=800, width=900,showlegend=False)
fig.update_layout({'plot_bgcolor':'rgba(0,0,0,0)',
                   'paper_bgcolor':'rgba(0,0,0,0)'})
#fig['layout'].update(plot_bgcolor='rgba(0,0,0,0)')
#fig['layout'].update(paper_bgcolor='rgba(0,0,0,0)')
py.iplot(fig)
cv = df['Class Name'].value_counts()

trace = go.Scatter3d( x = avgdf1.index,
                      y = avgdf1['Rating'],
                      z = cv[avgdf1.index],
                      mode = 'markers',
                      marker = dict(size=10,color=avgdf1['Rating']),
                      hoverinfo ="text",
                      # <br> starts a new line in Plotly hover text
                      text="Class: "+avgdf1.index+"<br>Average Rating: "+avgdf1['Rating'].map(' {:,.2f}'.format).apply(str)+"<br>Number of Reviewers: "+cv[avgdf1.index].apply(str)
                      )

data = [trace]
layout = go.Layout(title="Average Rating & Class & Number of Reviewers",
                   scene = dict(
                    xaxis = dict(title='Class'),
                    yaxis = dict(title='Average Rating'),
                    zaxis = dict(title='Number of Reviewers'),),
                   margin = dict(l=30, r=30, b=30, t=30))
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)
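# plt.savefig() would save an empty Matplotlib figure here, because the
# 3D scatter is a Plotly figure; export it with Plotly instead
fig.write_html('3D_Scatter.html')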

5: Building a Sentiment Classifier

# Rating of 4 or higher -> positive
# Rating of 2 or lower  -> negative
# Rating of 3 -> neutral; these reviews are dropped so the task is binary
df = df[df['Rating'] != 3].copy()
df['Sentiment'] = df['Rating'] >=4
df.head()

# split data
train_data,test_data = train_test_split(df,train_size=0.8,random_state=0)
# select the columns and 
# prepare data for the models 
X_train = vectorizer.fit_transform(train_data['Review Text'])
y_train = train_data['Sentiment']
X_test = vectorizer.transform(test_data['Review Text'])
y_test = test_data['Sentiment']

6: Logistic Regression

start=dt.datetime.now()
# a higher max_iter gives the solver room to converge on sparse count features
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train,y_train)
print('Elapsed time: ',str(dt.datetime.now()-start))

7: Naive Bayes

start=dt.datetime.now()
nb = MultinomialNB()
nb.fit(X_train,y_train)
print('Elapsed time: ',str(dt.datetime.now()-start))

8: Support Vector Machine (SVM)

start=dt.datetime.now()
svm = SVC()
svm.fit(X_train,y_train)
print('Elapsed time: ',str(dt.datetime.now()-start))

9: Neural Network

start=dt.datetime.now()
nn = MLPClassifier()
nn.fit(X_train,y_train)
print('Elapsed time: ',str(dt.datetime.now()-start))

10: Evaluating Models

# define a dataframe for the prediction probabilities of the models
#df1 = train_data.copy()
#df1['Logistic Regression'] = lr.predict_proba(X_train)[:,1]
#df1['Naive Bayes'] = nb.predict_proba(X_train)[:,1]
#df1['SVM'] = svm.decision_function(X_train)
#df1['Neural Network'] = nn.predict_proba(X_train)[:,1]
#df1=df1.round(2)
#df1.head()

# define a dataframe for the predictions on the training data
df2 = train_data.copy()
df2['Logistic Regression'] = lr.predict(X_train)
df2['Naive Bayes'] = nb.predict(X_train)
df2['SVM'] = svm.predict(X_train)
df2['Neural Network'] = nn.predict(X_train)
df2.head()

11: ROC Curves and AUC

pred_lr = lr.predict_proba(X_test)[:,1]
fpr_lr,tpr_lr,_ = roc_curve(y_test,pred_lr)
roc_auc_lr = auc(fpr_lr,tpr_lr)

pred_nb = nb.predict_proba(X_test)[:,1]
fpr_nb,tpr_nb,_ = roc_curve(y_test.values,pred_nb)
roc_auc_nb = auc(fpr_nb,tpr_nb)

pred_svm = svm.decision_function(X_test)
fpr_svm,tpr_svm,_ = roc_curve(y_test.values,pred_svm)
roc_auc_svm = auc(fpr_svm,tpr_svm)

pred_nn = nn.predict_proba(X_test)[:,1]
fpr_nn,tpr_nn,_ = roc_curve(y_test.values,pred_nn)
roc_auc_nn = auc(fpr_nn,tpr_nn)

f, axes = plt.subplots(2, 2,figsize=(15,10))
axes[0,0].plot(fpr_lr, tpr_lr, color='darkred', lw=2, label='ROC curve (area = {:0.2f})'.format(roc_auc_lr))
axes[0,0].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
axes[0,0].set(xlim=[-0.01, 1.0], ylim=[-0.01, 1.05])
axes[0,0].set(xlabel ='False Positive Rate', ylabel = 'True Positive Rate', title = 'Logistic Regression')
axes[0,0].legend(loc='lower right', fontsize=13)

axes[0,1].plot(fpr_nb, tpr_nb, color='darkred', lw=2, label='ROC curve (area = {:0.2f})'.format(roc_auc_nb))
axes[0,1].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
axes[0,1].set(xlim=[-0.01, 1.0], ylim=[-0.01, 1.05])
axes[0,1].set(xlabel ='False Positive Rate', ylabel = 'True Positive Rate', title = 'Naive Bayes')
axes[0,1].legend(loc='lower right', fontsize=13)

axes[1,0].plot(fpr_svm, tpr_svm, color='darkred', lw=2, label='ROC curve (area = {:0.2f})'.format(roc_auc_svm))
axes[1,0].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
axes[1,0].set(xlim=[-0.01, 1.0], ylim=[-0.01, 1.05])
axes[1,0].set(xlabel ='False Positive Rate', ylabel = 'True Positive Rate', title = 'Support Vector Machine')
axes[1,0].legend(loc='lower right', fontsize=13)

axes[1,1].plot(fpr_nn, tpr_nn, color='darkred', lw=2, label='ROC curve (area = {:0.2f})'.format(roc_auc_nn))
axes[1,1].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
axes[1,1].set(xlim=[-0.01, 1.0], ylim=[-0.01, 1.05])
axes[1,1].set(xlabel ='False Positive Rate', ylabel = 'True Positive Rate', title = 'Neural Network')
axes[1,1].legend(loc='lower right', fontsize=13);

12: Confusion Matrices

# preparation for the confusion matrix
lr_cm=confusion_matrix(y_test.values, lr.predict(X_test))
nb_cm=confusion_matrix(y_test.values, nb.predict(X_test))
svm_cm=confusion_matrix(y_test.values, svm.predict(X_test))
nn_cm=confusion_matrix(y_test.values, nn.predict(X_test))

plt.figure(figsize=(15,12))
plt.suptitle("Confusion Matrices",fontsize=24)

plt.subplot(2,2,1)
plt.title("Logistic Regression")
sns.heatmap(lr_cm, annot = True, cmap="Greens",cbar=False);

plt.subplot(2,2,2)
plt.title("Naive Bayes")
sns.heatmap(nb_cm, annot = True, cmap="Greens",cbar=False);

plt.subplot(2,2,3)
plt.title("Support Vector Machine (SVM)")
sns.heatmap(svm_cm, annot = True, cmap="Greens",cbar=False);

plt.subplot(2,2,4)
plt.title("Neural Network")
sns.heatmap(nn_cm, annot = True, cmap="Greens",cbar=False);

13: Precision-Recall – F1-Score

print("Logistic Regression")
print(mt.classification_report(y_test, lr.predict(X_test)))
print("\n Naive Bayes")
print(mt.classification_report(y_test, nb.predict(X_test)))
print("\n Support Vector Machine (SVM)")
print(mt.classification_report(y_test, svm.predict(X_test)))
print("\n Neural Network")
print(mt.classification_report(y_test, nn.predict(X_test)))

Conclusion

Sentiment analysis in Python helps data scientists uncover the emotions expressed in large volumes of text. With the libraries and techniques covered in this article, it becomes a valuable tool for understanding opinions and acting on them.
