www.tlab.it

Thematic Analysis of Elementary Contexts


This T-LAB tool allows you to obtain and explore a representation of corpus contents through a few significant thematic clusters (from 3 to 50), each of which:

a) consists of a set of elementary contexts (i.e. sentences, paragraphs or short texts like responses to open-ended questions) characterized by the same patterns of key-words;

b) is described through the lexical units (words, lemmas or categories) and the variables (if present) most characteristic of the context units of which it is composed.

In many ways, the analysis results can be considered an isotopy map (iso = same; topoi = places) in which each cluster, as a generic or specific theme (Rastier, 2002: 204), is characterized by the co-occurrence of semantic traits.

A T-LAB dialog box (see above) allows the user to set some analysis parameters.
In particular:
- the (A) parameter allows the user to set the maximum number of cluster partitions to be included in T-LAB outputs. However, the clustering algorithm stops earlier if any further partition does not meet the statistical criteria;
- the (B) parameter allows the user to exclude from the analysis any context unit that does not contain at least a minimum number of key-words from the list in use.
N.B.: Both the above parameters produce significant changes in the analysis results only when the number of context units is very large and/or when they are short texts.

The analysis procedure consists of the following steps:

a - construction of a data table context units x lexical units (up to 150,000 rows x 1,500 columns), with presence/absence values;
b - TF-IDF normalization and scaling of row vectors to unit length (Euclidean norm);
c - clustering of the context units (measure: cosine coefficient; method: bisecting K-means);
d - filing of the obtained partitions and, for each of them:
e - construction of a contingency table lexical units x clusters (n x k);
f - chi-square test applied to all the intersections of the contingency table;
g - correspondence analysis of the contingency table lexical units x clusters.
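
Steps a to c above can be sketched in a few lines of Python. This is a minimal illustration, not T-LAB's implementation: the function names, the log(N/df) variant of IDF, the rule of always bisecting the largest cluster and the random seeding are all assumptions made for the example.

```python
import math
import random

def tfidf_unit_rows(table):
    """Steps a-b: weight a presence/absence table by IDF, then scale rows to unit length."""
    n_rows, n_cols = len(table), len(table[0])
    df = [sum(row[j] for row in table) for j in range(n_cols)]
    # Assumed IDF variant: log(N / df). The exact weighting used by T-LAB is not stated.
    idf = [math.log(n_rows / d) if d else 0.0 for d in df]
    out = []
    for row in table:
        w = [x * idf[j] for j, x in enumerate(row)]
        norm = math.sqrt(sum(v * v for v in w))
        out.append([v / norm for v in w] if norm else w)
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def centroid(rows):
    """Mean vector of the rows, rescaled to unit length."""
    c = [sum(col) / len(rows) for col in zip(*rows)]
    norm = math.sqrt(dot(c, c))
    return [v / norm for v in c] if norm else c

def two_means(rows, idx, rng, n_iter=20):
    """Split the rows listed in idx into two groups by cosine similarity."""
    first, second = rng.sample(idx, 2)
    cents = [rows[first], rows[second]]
    for _ in range(n_iter):
        groups = [[], []]
        for i in idx:
            # For unit-length vectors the dot product equals the cosine coefficient.
            g = 0 if dot(rows[i], cents[0]) >= dot(rows[i], cents[1]) else 1
            groups[g].append(i)
        if not groups[0] or not groups[1]:
            # Degenerate seeds: fall back to an arbitrary split into halves.
            half = len(idx) // 2
            return [idx[:half], idx[half:]]
        new = [centroid([rows[i] for i in g]) for g in groups]
        if new == cents:
            break
        cents = new
    return groups

def bisecting_kmeans(rows, k, seed=0):
    """Step c: bisect clusters repeatedly until k clusters are obtained."""
    rng = random.Random(seed)
    clusters = [list(range(len(rows)))]
    while len(clusters) < k:
        # Assumed rule: always bisect the largest remaining cluster.
        clusters.sort(key=len)
        parent = clusters.pop()
        if len(parent) < 2:
            clusters.append(parent)
            break
        clusters.extend(two_means(rows, parent, rng))
    return clusters

# Four context units (rows) x four lexical units (columns), presence/absence values.
table = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
rows = tfidf_unit_rows(table)
clusters = bisecting_kmeans(rows, 2)
print(sorted(sorted(c) for c in clusters))  # [[0, 1], [2, 3]]
```

Because each bisection only splits one existing cluster, the successive partitions are nested.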

This procedure therefore performs a type of co-occurrence analysis (steps a-b-c) and, subsequently, a type of comparative analysis (steps e-f-g). In particular, comparative analysis uses the categories of the "new variable" derived from the co-occurrence analysis (categories of the new variable = thematic clusters) to form the contingency table columns.
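
The use of the thematic clusters as contingency table columns (step e) can be illustrated with a short Python sketch. The function name and the convention of numbering clusters from 0 are assumptions made for this example. The input is the presence/absence table of step a plus one cluster label per context unit.

```python
def contingency_table(table, labels, k):
    """Build a lexical units x clusters table (n x k) from a context units x
    lexical units presence/absence table and one cluster label per context unit."""
    n_cols = len(table[0])
    cont = [[0] * k for _ in range(n_cols)]
    for row, label in zip(table, labels):
        for j, present in enumerate(row):
            cont[j][label] += present
    return cont

# Four context units, three lexical units, two clusters (labels 0 and 1).
table = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 1, 0]]
labels = [0, 0, 1, 1]
cont = contingency_table(table, labels, 2)
print(cont)  # [[2, 0], [1, 2], [1, 1]]
```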

N.B.: When the user decides to repeat/apply the results of a previous analysis (i.e. a Thematic Analysis of Elementary Contexts or a Modeling of Emerging Themes), T-LAB performs a comparative analysis only (steps e-f-g).

On completion of the analysis you can easily perform the following operations:

1 - explore the characteristics of the clusters;
2 - explore the relationships between the clusters;
3 - explore the relationships between clusters and variables;
4 - explore the various cluster partitions (from 3 to 50);
5 - refine the results of the chosen partition and, if necessary, repeat the above steps (1,2,3);
6 - assign labels to the clusters;
7 - verify which elementary contexts belong to each cluster;
8 - verify the score of each elementary context within the cluster to which it belongs;
9 - export a thematic document classification (only provided when the corpus is made up of at least 2 primary documents and when they are not short texts like responses to open-ended questions);
10 - save the selected partition for exploration with other T-LAB tools.

In detail:

1 - Explore the characteristics of the clusters


Clicking on the CHARACTERISTICS button shows the lexical units and the variable values which characterize each cluster. For each characteristic, the table reports its chi-square value and the number of elementary contexts in which it is found, both in the selected cluster ("IN CLUST") and in the analysed total ("IN TOT"). The "CAT" column also indicates whether the characteristic has been selected by the user ("A") with the Customized Settings function or has been suggested by T-LAB as a "supplementary" description ("S").

In the case of the chi square test the structure of the analysed table is the following:

Where:
nij refers to occurrences of word (a) within the selected cluster (A)
Nj refers to all occurrences of word (a) within the corpus (or the corpus sub-set) analysed
Ni refers to all word occurrences within the selected cluster (A)
N refers to all word occurrences of the contingency table word by cluster.
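
As an illustration, the four quantities above are enough to fill a 2 x 2 table (word (a) vs. other words, cluster (A) vs. other clusters) and to compute a chi-square value. The sketch below assumes the standard Pearson formula without continuity correction. The function name is hypothetical, and T-LAB may apply further conventions (e.g. thresholds) not covered here.

```python
def chi_square_2x2(n_ij, N_j, N_i, N):
    """Pearson chi-square for the 2 x 2 table word (a) / other words
    by cluster (A) / other clusters."""
    a = n_ij                    # word (a) inside cluster (A)
    b = N_j - n_ij              # word (a) in the other clusters
    c = N_i - n_ij              # other words inside cluster (A)
    d = N - N_j - N_i + n_ij    # other words in the other clusters
    denom = N_j * (N - N_j) * N_i * (N - N_i)  # product of the four marginals
    if denom == 0:
        return 0.0
    return N * (a * d - b * c) ** 2 / denom

# A word occurring 50 times overall, 30 of them in a cluster of 100 occurrences
# out of 1000: far more than the 5 occurrences expected under independence.
print(round(chi_square_2x2(30, 50, 100, 1000), 2))  # 146.2
```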

An HTML report (see below) is generated to permit detailed analysis of the cluster characteristics. In the report, in addition to the list of typical words, the most characteristic elementary contexts of the selected cluster are shown in descending order according to their respective score.

Pie charts and bar charts are used to verify the percentage of context units (i.e. elementary contexts) that belong to each cluster.

 

2 - Explore the relationships between the clusters

Some of the graphs obtained by Correspondence Analysis enable you to explore the relationships between clusters in bidimensional spaces.
More specifically:
- You can explore the various combinations of factorial axes, simply by selecting them in the appropriate boxes ("X axis", "Y axis");
- For each of the combinations (X-Y), you can display various types of elements (clusters, lemmas and variables).

All the graphs can be maximized and customized by using the appropriate dialog box (just right click on the chart). Moreover, when there are 4 or more thematic clusters, their relationships can be explored in a moving 3D view (see below).

Moreover, for every factorial axis, T-LAB supplies tables that facilitate the interpretation.
These are shown after every selection in the appropriate boxes (see below).

By selecting the Complete Results option it is possible to check all the results of the Correspondence Analysis lexical units x clusters.

A specific option (see below) allows us to visualise/export the contingency table and to create charts showing the distribution of each word within the clusters and their corresponding chi-square value.
Moreover, by clicking on specific cells of the table, it is possible to create an HTML file including all elementary contexts in which the word in the selected row is present within the corresponding cluster.

 

3 - Explore the relationships between clusters and variables

Bar charts allow you to verify the relationships between clusters and variables.

You can explore additional relationships between clusters and variables using the functions provided in the Factor Analysis section (see above).

4 - Explore the various cluster partitions

Because the algorithm used (bisecting K-means) produces a hierarchical clustering, the user can explore various analysis solutions: partitions from 3 to 50 clusters.

For each partition obtained, a specific table (see below) lists the following values:
- "Index", obtained by dividing the between cluster variance by the total variance;
- "Gap", corresponding to the difference between the index value and the value of the immediately previous partition;
- the number of the "child" cluster obtained from the bisection of the corresponding "parent".
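
The "Index" and "Gap" values can be illustrated with the following Python sketch. It assumes that variance is measured as the sum of squared Euclidean distances from the overall mean; the function names and the input layout are hypothetical, and the vectors on which T-LAB actually computes these values are not specified here.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def variance_index(rows, clusters):
    """Between-cluster variance divided by total variance (0 to 1)."""
    overall = mean_vector(rows)
    total = sum(sq_dist(r, overall) for r in rows)
    between = 0.0
    for members in clusters:
        center = mean_vector([rows[i] for i in members])
        between += len(members) * sq_dist(center, overall)
    return between / total if total else 0.0

def gaps(index_values):
    """Difference between each index value and that of the previous partition."""
    return [cur - prev for prev, cur in zip(index_values, index_values[1:])]

rows = [[0, 0], [0, 2], [10, 0], [10, 2]]
two_clusters = variance_index(rows, [[0, 1], [2, 3]])
four_clusters = variance_index(rows, [[0], [1], [2], [3]])
print(round(two_clusters, 4), four_clusters)  # 0.9615 1.0
print(gaps([two_clusters, four_clusters]))
```

A large gap indicates that the additional bisection explains noticeably more variance than the previous partition did.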

The Partition option allows you to easily explore the characteristics of the available clustering solutions (just click on a table item).

The dendrogram function (see below) allows you to check the tree structure of the various bisections.

5 - Refine the results of the chosen partition

After having explored different solutions, the user can refine the results of the chosen partition and, if necessary, repeat some of the first three operations described above (1, 2, 3).

In particular, this step allows the user to delete from the analysis all context units whose cluster membership does not satisfy both of the following criteria:
a) the cluster memberships of the i-th context unit, determined by the bisecting K-means first (unsupervised clustering) and by a Naïve Bayes Classifier later (supervised clustering), must be the same;
b) the maximum posterior value (see below) corresponding to the i-context unit cluster membership must be, in percentage terms, at least 50% higher than its remaining values (i.e. posterior value in other clusters).
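
The two criteria can be sketched as a simple filter. The example below is only one possible reading: it assumes that "at least 50% higher" means the maximum posterior is at least 1.5 times each of the other posteriors, and that a context unit is kept only when both criteria hold. The function name and the `margin` parameter are hypothetical.

```python
def keep_context_unit(kmeans_cluster, posteriors, margin=0.5):
    """Return True if a context unit passes both refinement criteria.

    kmeans_cluster: index of the cluster assigned by bisecting K-means
    posteriors: Naive Bayes posterior values (percentages), one per cluster
    """
    # Criterion a: both methods must assign the same cluster.
    nb_cluster = max(range(len(posteriors)), key=posteriors.__getitem__)
    if nb_cluster != kmeans_cluster:
        return False
    # Criterion b (assumed reading): the maximum posterior must exceed
    # every other posterior by at least `margin` (50% by default).
    best = posteriors[nb_cluster]
    others = [p for k, p in enumerate(posteriors) if k != nb_cluster]
    return all(best >= (1 + margin) * p for p in others)

print(keep_context_unit(0, [70.0, 20.0, 10.0]))  # True
print(keep_context_unit(1, [70.0, 20.0, 10.0]))  # False: the two methods disagree
print(keep_context_unit(0, [45.0, 40.0, 15.0]))  # False: 45 is less than 1.5 x 40
```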

All the results of this computation are in the following table exported by T-LAB (see below), where the posterior values for each cluster are in percentage format.

6 - Assign labels to the clusters

A specific T-LAB function allows you to assign labels to clusters.
(N.B.: The software proposes a number of labels automatically the first time you use this function.)

Labels assigned to clusters can be displayed in the various graphs available (see below).

7 - Verify which elementary contexts belong to each cluster
8 - Verify the score of each elementary context within the cluster to which it belongs

9 - Obtain a thematic document classification

Operations 7, 8 and 9 all rely on the Cluster Membership button, which lets you export three types of tables (see below) in MS Excel format:

a - "Cluster_Partitions.xls" listing the context unit correspondences for each cluster within the various partitions;


b - "Themes-Contexts.xls" (see below) listing the context unit correspondences for each cluster within the selected partition.

In particular, the relevance value (Score) assigned to each elementary context (j) belonging to the cluster (k) comes from the following formula:

Where:

Score_j = relevance value assigned to the elementary context (j);
ΣX_ij = sum of the chi-square values assigned to the key-words (i) found in the elementary context in question (j) which are "typical" of the cluster (k);
n_j = number of key-words (distinct words), typical of the cluster (k), found in the elementary context (j);
N = number of key-words (distinct words) typical of the cluster (k).

c - "Ec_Document_Classification.xls" (only provided when the corpus is made up of at least 2 primary documents and when they are not short texts like responses to open-ended questions) listing the mixed cluster membership of each document (see below).


In this case the values come from the above formula (see "b") by summing the scores of elementary contexts belonging to each document and by applying a percentage calculation.
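
The aggregation just described can be sketched as follows. The example takes the scores of the elementary contexts as given and assumes that the "percentage calculation" normalizes the cluster totals of each document so that they sum to 100. The function name and the input layout are hypothetical.

```python
from collections import defaultdict

def document_classification(contexts):
    """Sum context scores per document and cluster, then convert to percentages.

    contexts: iterable of (document_id, cluster, score) tuples, one per
    elementary context.
    """
    sums = defaultdict(lambda: defaultdict(float))
    for doc, cluster, score in contexts:
        sums[doc][cluster] += score
    result = {}
    for doc, by_cluster in sums.items():
        total = sum(by_cluster.values())
        result[doc] = {
            cluster: (100.0 * s / total if total else 0.0)
            for cluster, s in by_cluster.items()
        }
    return result

contexts = [
    ("doc1", 1, 30.0),
    ("doc1", 2, 10.0),
    ("doc1", 1, 20.0),
    ("doc2", 2, 5.0),
]
membership = document_classification(contexts)
print(membership["doc2"])  # {2: 100.0}
```

In this example "doc1" gets a mixed membership, with about 83% of its total score in cluster 1 and about 17% in cluster 2.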

10 - Save the selected partition for exploration with other T-LAB tools

When you exit the Thematic Analysis of Elementary Contexts function, the software displays messages to remind you that you can use other T-LAB tools to explore the clusters obtained.

If you select Save, the < CONT_CLUST > variable (clusters of elementary contexts) remains available only for certain types of analysis (e.g. Sequences of Themes, Word Associations, Comparison between Pairs of Key-Words, Co-Word Analysis and Concept Mapping) and only until the user modifies the word list.