How to perform a Sentiment Analysis
(by Andrea Nobile - January 16th 2020)
This example shows how T-LAB can be used for sentiment analysis tasks like classifying short texts (e.g. tweets, movie reviews, etc.) according to their opinion polarity, i.e. as ‘negative’, ‘positive’ or ‘neutral’. However, the same logic can be applied to other cases of top-down (i.e. supervised) classification which use two or more categories.
First of all, as T-LAB doesn’t have a built-in dictionary for sentiment analysis, the user needs a well-tested dictionary in the correct format. If one is available, the process is quite straightforward (see ‘First Option’ below); otherwise various steps must be performed first (see ‘Second Option’ below).
1 - FIRST OPTION
When the user has at their disposal a well-tested dictionary which is in the correct format the process is quite straightforward; in fact just three simple steps are required (see the pictures below):
1- select a T-LAB tool which allows performing a ‘supervised classification’ (e.g. Thematic analysis of Elementary Contexts, Thematic Document Classification or Dictionary-Based Classification);
2- import the dictionary file;
3- explore the results.
2 - SECOND OPTION
The first time I used T-LAB for a sentiment analysis I didn’t have a well-tested dictionary at my disposal; I therefore performed the following steps, which I would recommend to anyone who finds themselves in the same position as I was:
a) select/build a training dataset that includes lots of texts pre-classified into the three above mentioned categories (i.e. ‘negative’, ‘positive’ and ‘neutral’);
b) select/customize a dictionary which includes a list of key-terms grouped into the three above mentioned categories and which – within each category – have a weight varying from a minimum to a maximum;
c) test the classification ‘accuracy’, which is defined as follows
d) perform further tests for tuning both the model and the dictionary used;
e) use the tested dictionary in further sentiment analyses, i.e. perform the three simple steps quoted at the beginning of this document (see ‘First Option’ above).
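A common way to write the accuracy mentioned in step (c), assumed here as a standard reading rather than quoted from T-LAB, is the share of classified texts whose predicted polarity matches the pre-assigned one. With only ‘positive’ and ‘negative’ cases, TP and TN are the correctly predicted positive and negative texts, FP and FN the wrongly predicted ones:

```latex
\[
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]
```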
A - THE TRAINING DATASET
The training dataset I used is made available by The Stanford NLP Group (see https://nlp.stanford.edu/sentiment/code.html).
To be more specific, in order to perform my test, I proceeded as follows:
1- I downloaded a zipped folder including various txt files and a README;
2- after consulting the README I merged two files, namely ‘dictionary.txt’ including 239,232 text fragments and ‘sentiment_labels.txt’ including the sentiment scores assigned to the various text fragments;
3- since it makes sense to filter out the ‘neutral’ cases when measuring ‘accuracy’, I selected 102,250 text fragments which were classified as either ‘positive’ or ‘negative’ and I built a T-LAB corpus accordingly.
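The merge and filter in steps 2 and 3 could be sketched as follows. This is a minimal illustration, not the procedure actually used: it assumes that ‘dictionary.txt’ stores one `phrase|id` record per line, that ‘sentiment_labels.txt’ stores `id|score` records after a header row, and that scores up to 0.4 count as ‘negative’ and scores above 0.6 as ‘positive’. All of these should be checked against the README; the function names are hypothetical.

```python
def merge_phrases_with_scores(dictionary_lines, label_lines):
    """Join 'phrase|id' records with 'id|score' records on the phrase id."""
    scores = {}
    for line in label_lines:
        phrase_id, _, score = line.strip().partition("|")
        try:
            scores[phrase_id] = float(score)
        except ValueError:
            continue  # skips a header row or a malformed line
    merged = []
    for line in dictionary_lines:
        # rpartition: the phrase itself may contain the '|' character
        phrase, _, phrase_id = line.rstrip("\n").rpartition("|")
        if phrase_id in scores:
            merged.append((phrase, scores[phrase_id]))
    return merged


def polarity(score, low=0.4, high=0.6):
    """Map a score in [0, 1] to a polarity; the cut-offs are assumptions."""
    if score <= low:
        return "negative"
    if score > high:
        return "positive"
    return "neutral"


def keep_polar(merged):
    """Drop the 'neutral' fragments and attach a polarity label to the rest."""
    labelled = [(phrase, polarity(score)) for phrase, score in merged]
    return [(phrase, label) for phrase, label in labelled if label != "neutral"]
```

With real files, the two line lists would come from `open(...)` calls on the downloaded txt files.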
The corpus ready to be imported by T-LAB was a txt file subdivided into two sections, each of them preceded by a simple coding line (see below), in which each text fragment ended with a full stop and a carriage return (this instructs T-LAB to treat every record/line as a distinct elementary context).
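A corpus file of this shape could be produced with a few lines of Python. The sketch below is only an illustration: the coding-line syntax (`**** *POLARITY_...`) is an assumption that should be checked against the T-LAB manual, and the function name is hypothetical.

```python
def build_corpus_text(fragments_by_label, coding_prefix="**** *POLARITY_"):
    """Return the corpus as text: one coding line per section, one fragment per line.

    coding_prefix is an assumed coding-line syntax, not taken from the article.
    """
    parts = []
    for label, fragments in fragments_by_label.items():
        parts.append(coding_prefix + label.upper())
        for fragment in fragments:
            text = " ".join(fragment.split())  # collapse inner line breaks and extra spaces
            if not text:
                continue  # skip empty fragments
            if not text.endswith("."):
                text += "."  # every record must end with a full stop
            parts.append(text)
    # carriage return + line feed after every record
    return "\r\n".join(parts) + "\r\n"
```

The returned string can then be written to a txt file, e.g. with `open(path, "w", encoding="utf-8", newline="")` so that the line endings are kept as they are.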
After importing the above corpus (see the ‘Open Corpus’ option of the Corpus Builder), in order to assess the correctness of the applied procedure, I used a T-LAB tool (i.e. Specificity Analysis) which, by using either the Test Value or the Chi Square test, allows us to compare the lexicons of two corpus subsets (i.e. the ‘positive’ and the ‘negative’ texts).
The outputs I obtained (see pictures below) confirmed that I was on the ‘right’ path.
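To give an idea of what a chi-square comparison of two lexicons involves, here is a minimal sketch based on the textbook 2x2 chi-square statistic. It is not T-LAB's implementation, whose details are not described here; the function names and the threshold of 3.84 (the usual critical value for one degree of freedom at p = 0.05) are choices made for this illustration.

```python
from collections import Counter


def chi_square_2x2(a, b, c, d):
    """Chi-square for the table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def specific_words(tokens_a, tokens_b, threshold=3.84):
    """List words whose frequency differs between two subsets.

    Returns (word, chi2, subset) tuples, where subset is 'A' or 'B'
    depending on where the word is relatively more frequent.
    """
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    results = []
    for word in set(count_a) | set(count_b):
        a, c = count_a[word], count_b[word]
        # rows: subset A / subset B; columns: this word / all other tokens
        chi2 = chi_square_2x2(a, total_a - a, c, total_b - c)
        if chi2 >= threshold:
            side = "A" if a * total_b > c * total_a else "B"
            results.append((word, chi2, side))
    return sorted(results, key=lambda r: r[1], reverse=True)
```

Here subset A would hold the tokens of the ‘positive’ texts and subset B those of the ‘negative’ ones.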
N.B.: When analysing a corpus which includes a good classification, the Specificity Analysis tool can also be used for exporting a dictionary ready to be used in further analyses (see the picture below).
B - THE DICTIONARY
Lots of dictionaries for sentiment analysis are freely available on the Internet.
After testing a few of them, I decided to adopt the so-called ‘Subjectivity Lexicon’ made available by the MPQA project (see https://mpqa.cs.pitt.edu/lexicons/subj_lexicon/).
Such a lexicon includes 8,222 records, each of them referring to a word classified for ‘type’, ‘pos’ (i.e. part of speech), ‘polarity’ etc. (see some examples below).
As each dictionary file imported in T-LAB must have a specific format, by using a text editor and a spreadsheet I rearranged the ‘Subjectivity Lexicon’ as follows:
- I interpreted the two categories ‘weaksubj’ and ‘strongsubj’ as referring to the ‘intensity’ and, accordingly, I transformed them into two different integers: weaksubj=1, strongsubj=3;
- the dictionary that I saved in the T-LAB format is a sort of CSV file (semicolon delimited format) the extension of which is ‘.dictio’ (i.e. ‘MPQA.dictio’) as required by the software.
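The same rearrangement can also be scripted. The sketch below assumes that each lexicon record is a line of space-separated `key=value` pairs with the fields `type`, `word1` and `priorpolarity`, and that the target file has one `word;category;weight` row per key-term. The column order is a guess made for this illustration and should be checked against the T-LAB documentation for dictionary files; the function names are hypothetical.

```python
# Mapping described above: weaksubj=1, strongsubj=3
INTENSITY = {"weaksubj": 1, "strongsubj": 3}


def parse_lexicon_line(line):
    """Turn 'key=value key=value ...' into a dict."""
    return dict(item.split("=", 1) for item in line.split() if "=" in item)


def to_dictio_rows(lexicon_lines):
    """Build semicolon-delimited rows 'word;category;weight' (assumed layout)."""
    rows, seen = [], set()
    for line in lexicon_lines:
        fields = parse_lexicon_line(line)
        word = fields.get("word1")
        category = fields.get("priorpolarity")
        weight = INTENSITY.get(fields.get("type"))
        if not word or weight is None:
            continue
        if category not in ("positive", "negative", "neutral"):
            continue  # records with any other polarity value are skipped
        if (word, category) in seen:
            continue  # same word listed for several parts of speech: keep the first
        seen.add((word, category))
        rows.append(f"{word};{category};{weight}")
    return rows
```

Keeping only the first occurrence of a repeated word is an arbitrary simplification; a different rule (e.g. keeping the highest weight) may suit other cases better.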
C - THE ACCURACY TEST
In order to perform the accuracy test, after importing the above-mentioned corpus (see section ‘A’ above), I used the T-LAB tool Dictionary-Based Classification. The reason why I used such a tool is that my corpus included only two documents. In other cases, i.e. when the corpus consists of hundreds or thousands of documents, the Thematic Document Classification tool can be used (see section ‘D’ below), which is more intuitive and provides a wider range of outputs.
Step by step:
1- I imported the dictionary named ‘MPQA.dictio’ (see section ‘B’ above);
2- I executed the classification by using the default options (see the picture below);
3- I clicked the button which allows us to build a two-way table with predicted and actual values (see the image below);
4- I exported the two-way table and computed the accuracy by using a spreadsheet (see the image below).
- The reason why only 44,367 out of the 102,250 Elementary Contexts were correctly classified is that T-LAB only processes context units which include at least two keywords of the corpus list;
- According to the scientific literature, when computing the accuracy of the sentiment analysis only the ‘positive’ and ‘negative’ cases are taken into consideration;
- As stated in Wikipedia (see https://en.wikipedia.org/wiki/Sentiment_analysis), “according to research human raters typically only agree about 80% of the time (see Inter-rater reliability). Thus, a program that achieves 70% accuracy [see definition ‘c’ above] in classifying sentiment is doing nearly as well as humans, even though such accuracy may not sound impressive”;
- T-LAB allows us to verify the results in many ways, including a file which contains all text segments and the polarities assigned by the dictionary (see the image below).
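The spreadsheet computation in step 4 amounts to dividing the diagonal of the two-way table by its total. A minimal sketch follows, assuming the exported table can be read into a nested dictionary indexed as `table[actual][predicted]`; the function name is hypothetical and the counts in the usage example are made up, not taken from the T-LAB output.

```python
def accuracy_from_table(table, labels=("positive", "negative")):
    """Share of cases on the diagonal of a predicted-vs-actual table.

    table[actual][predicted] holds a count; only the given labels are
    considered, so 'neutral' cases are left out by default.
    """
    correct = sum(table.get(label, {}).get(label, 0) for label in labels)
    total = sum(
        table.get(actual, {}).get(predicted, 0)
        for actual in labels
        for predicted in labels
    )
    return correct / total if total else 0.0


# Illustrative counts only
example = {
    "positive": {"positive": 40, "negative": 10},
    "negative": {"positive": 20, "negative": 30},
}
print(accuracy_from_table(example))  # 70 correct out of 100 -> 0.7
```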
D – FURTHER TESTS
When exploring the datasets available on the Internet for this purpose, I discovered a CSV file which includes 14,640 tweets concerning airline companies. The tweets are pre-classified as ‘positive’, ‘negative’ or ‘neutral’ and a first analysis gives the following results:
I was intrigued by this file and, through a simple re-ordering function, I discovered that, leaving aside the name of the company, lots of these tweets consist of just one word and that identical tweets are sometimes assigned to different categories (see the image below).
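A check of this kind can also be done programmatically. The sketch below groups tweets by their text after removing @mentions, on the assumption that the company name appears in the tweet as a mention, and reports the texts that carry more than one label. The function name and the sample rows are hypothetical.

```python
from collections import defaultdict


def conflicting_duplicates(rows):
    """rows: iterable of (label, text) pairs.

    Returns {normalized text: sorted labels} for texts that appear
    with more than one label.
    """
    labels_by_text = defaultdict(set)
    for label, text in rows:
        # ignore case, extra spaces and @mentions (e.g. the airline handle)
        words = [w for w in text.lower().split() if not w.startswith("@")]
        labels_by_text[" ".join(words)].add(label)
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }


rows = [
    ("positive", "@AirlineA thanks"),
    ("neutral", "@AirlineB Thanks"),
    ("negative", "@AirlineA late again"),
]
print(conflicting_duplicates(rows))  # {'thanks': ['neutral', 'positive']}
```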
Subsequently I decided to import the above file into T-LAB by using only the two main columns (i.e. ‘sentiment’ and ‘text’) and I performed some tests.
Below is a picture which refers to the first step (N.B.: for privacy reasons, the names of the companies are greyed).
As in the case of the Stanford corpus (see section ‘A’ above), I used the T-LAB Specificity Analysis tool for comparing the lexicons of two corpus subsets. Here are the main results.
As some keywords which appeared to be typically ‘positive’ or ‘negative’ in this corpus were not included in the original MPQA dictionary, I decided to add some of these (see the above words marked in red) to my MPQA.dictio file. In fact, to boost the quality of results, in my opinion every dictionary should be customized to some extent.
Regarding the top-down (i.e. supervised) classification of the tweets, I decided to use the T-LAB tool named Thematic Document Classification. This is because such a tool allows us to save the results and to use the obtained classification for further analyses.
At the end of the process, only 6,697 tweets (45.74%) were correctly classified. This is because, as explained in section ‘C’ above, T-LAB only processes context units which include at least two keywords of the corpus list.
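The effect of such a threshold can be illustrated with a toy dictionary-based classifier. This is a simplified sketch and not T-LAB's algorithm: it counts hits against the sentiment dictionary (whereas T-LAB refers to the keywords of the corpus list), it sums the weights per category, and it leaves a text unclassified when fewer than two key-terms are found. The function name and the sample dictionary are hypothetical.

```python
def classify(text, dictionary, min_keywords=2):
    """Return the category with the highest summed weight, or None.

    dictionary maps a word to a (category, weight) pair. A text with
    fewer than min_keywords dictionary words is left unclassified.
    """
    scores = {}
    hits = 0
    for token in text.lower().split():
        token = token.strip(".,;:!?\"'()")  # drop surrounding punctuation
        if token in dictionary:
            category, weight = dictionary[token]
            scores[category] = scores.get(category, 0) + weight
            hits += 1
    if hits < min_keywords:
        return None
    return max(scores, key=scores.get)


toy = {
    "great": ("positive", 3),
    "nice": ("positive", 1),
    "awful": ("negative", 3),
    "late": ("negative", 1),
}
print(classify("Great crew, nice seats.", toy))  # positive
print(classify("Awful.", toy))                   # None (only one key-term)
```

Very short tweets, such as the one-word examples mentioned earlier, can never reach the threshold, which is one reason why a large share of the corpus may remain unclassified.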
Interestingly, the percentages of the three polarities obtained by using the dictionary-based classification turned out to be quite different from the original polarities coded in the dataset (see the image below). And the accuracy turned out to be 71.00%.
In general, my main interest is not to ascertain how many people express ‘negative’ (or ‘positive’) opinions about a product or a service, but rather to understand WHY they express such opinions. For this purpose, I found several exploratory tools available in the T-LAB menu very helpful.
Here I will quote just two of them.
Since the previous analysis saved the new classification of the tweets in three categories (i.e. corpus subsets), I decided to perform a thematic analysis of just one subset, i.e. the one including only ‘negative’ opinions. The T-LAB tool I used for this purpose (i.e. Thematic analysis of Elementary Contexts) provides several outputs; here, however, I include just the summary table (see below), which helped me to better understand the different reasons ‘WHY’ customers express ‘negative’ opinions about the services provided by the airline companies.