www.tlab.it

Modeling of Emerging Themes


This T-LAB tool provides a simple way of discovering, examining and modeling the main themes or topics (henceforth 'theme' and 'topic' are used synonymously) emerging from texts. These themes can then be explored further with several tools, using qualitative and quantitative approaches either separately or in combination.

In fact, themes - which are described through their characteristic vocabulary and consist of co-occurrence patterns of key-terms - can be used as categories in further analyses or for automatically classifying the context units (i.e. documents or elementary contexts).

The only parameter (see below) that the user can set is the number of themes to be obtained. Note that the higher this number, the more consistent the co-occurrence patterns are; moreover, if necessary, some themes (e.g. those that are redundant or difficult to interpret) can be discarded later.

The analysis procedure consists of the following steps:

a - construction of a co-occurrence matrix (depending on the case, either a document by word or an elementary context by word matrix);
b - data analysis by a probabilistic model which uses Latent Dirichlet Allocation and Gibbs Sampling (see the related information on Wikipedia: http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation; http://en.wikipedia.org/wiki/Gibbs_sampling);
c - description of themes by means of the probability of their characteristic words, either "specific" or "shared" by two or more themes.
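
Steps "a" to "c" above can be illustrated with a minimal sketch of a collapsed Gibbs sampler for Latent Dirichlet Allocation. This is not T-LAB's own code: the function name lda_gibbs, the hyperparameters alpha and beta, the number of iterations and the toy documents are all assumptions chosen for illustration.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampler for LDA on tokenized documents (illustrative)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    # Count tables: theme-by-word, document-by-theme, tokens per theme.
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    doc_topic = [[0] * n_topics for _ in docs]
    topic_total = [0] * n_topics
    # Step a (in count form): give every token a random initial theme.
    assign = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            topic_word[z][w] += 1
            doc_topic[d][z] += 1
            topic_total[z] += 1
        assign.append(z_doc)
    # Step b: resample the theme of each token given all the others.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                topic_word[z][w] -= 1
                doc_topic[d][z] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                topic_word[z][w] += 1
                doc_topic[d][z] += 1
                topic_total[z] += 1
    # Step c: smoothed probability of each word within each theme.
    phi = [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + V * beta) for w in vocab}
        for k in range(n_topics)
    ]
    return phi, doc_topic, topic_word

# Toy corpus (hypothetical): each document is a list of key-terms.
docs = [
    ["tax", "budget", "tax", "deficit"],
    ["budget", "deficit", "tax"],
    ["hospital", "doctor", "patient", "hospital"],
    ["doctor", "patient", "hospital"],
]
phi, doc_topic, topic_word = lda_gibbs(docs, n_topics=2, n_iter=50)
for k, probs in enumerate(phi):
    top = sorted(probs, key=probs.get, reverse=True)[:3]
    print(f"theme {k}: {top}")
```

Here doc_topic holds, for each document, the number of tokens assigned to each theme (the "mixture" of themes), while phi holds the word probabilities used to describe the themes.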

On completion of the analysis you can easily perform the following operations:

1 - explore, rename and remove the characteristics of each theme;

2 - rename or discard specific themes;

3 - test the model by a Naïve Bayes Classifier which assigns context units (i.e. documents and/or elementary contexts) to themes;

4 - apply the model and visualize the relationships between the different themes.


In detail:

1 - Explore, rename and remove the characteristics of each theme

In this chart (see above) "high probability" indicates a probability >= 0.75.

By clicking on each theme label (see "A" above), tables and charts can be visualized (see "B" above); moreover, by clicking on words in the table (see "C" above), their distribution within the various themes is displayed and a "remove" option is available.


The reading keys of the table are as follows:
IN THEME = tokens of each word within the selected theme;
TOT = total tokens of each word within the corpus (or the subset) analysed;
IN (%) = percentage values of each word within the selected theme;
(p) = probability value of each word over themes;
TYPE = specific when the word belongs to the selected theme only (i.e. p=1); shared in the other cases.
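
As a rough illustration of these reading keys, the sketch below derives IN THEME, TOT, (p) and TYPE from a table of token counts per theme. It assumes that (p) can be approximated by the share of a word's tokens that fall in the selected theme; T-LAB may instead take this value from the fitted model. IN (%) is left out because the text does not state its denominator. The function name theme_rows and the counts are hypothetical.

```python
def theme_rows(counts, theme):
    """Build table rows for one theme from {theme: {word: tokens}} counts."""
    rows = []
    for word, in_theme in counts[theme].items():
        if in_theme == 0:
            continue
        # TOT: tokens of the word summed over all themes.
        tot = sum(c.get(word, 0) for c in counts.values())
        # Assumption: p is the share of the word's tokens in this theme.
        p = in_theme / tot
        rows.append({
            "word": word,
            "in_theme": in_theme,
            "tot": tot,
            "p": p,
            # "specific" when the word occurs in this theme only (p = 1).
            "type": "specific" if in_theme == tot else "shared",
        })
    # Highest probability first, ties broken alphabetically.
    return sorted(rows, key=lambda r: (-r["p"], r["word"]))

# Hypothetical token counts per theme.
counts = {
    "economy": {"tax": 8, "budget": 5, "policy": 3},
    "health": {"hospital": 6, "policy": 1},
}
rows = theme_rows(counts, "economy")
for r in rows:
    print(r["word"], r["in_theme"], r["tot"], round(r["p"], 2), r["type"])
```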

By selecting the complete results option (see "B" above) an HTML file is created including all themes and their characteristic vocabulary; moreover, two XLS files can be saved.


When the "shared words" option is selected (see below) it is possible to explore the corresponding table and create a chart for each item selected.


2 - Rename or discard specific themes

In order to rename or discard specific themes, just select one of them (see "A" below) and click on the "rename/remove" button (see "B" below).

When the appropriate box appears, depending on your goals, you can change the label by choosing among the available words or by typing a new label in the appropriate field (see "C" below); otherwise you can discard the selected theme just by clicking on the corresponding button (see "D" below).



3 - Test the Model

At the end of the analysis procedure (see points "a" and "b" above) each context unit (i.e. primary document or elementary context) is represented as a mixture of different topics; by contrast, the Naïve Bayes Classifier used in this step assigns each context unit to the single topic which is the most characteristic of it.
For this reason, when the "Test the Model" option is selected, T-LAB creates an HTML file and two XLS files including the classification of context units (see below).
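
The difference between a mixture of topics and a single assignment can be sketched as follows. This is a generic multinomial Naïve Bayes rule, not necessarily the exact variant used by T-LAB: the function name assign_theme, the uniform priors, the floor probability for unseen words and the word probabilities are all assumptions made for illustration.

```python
import math

def assign_theme(tokens, word_probs, priors=None, floor=1e-6):
    """Assign a context unit to the theme with the highest log posterior score."""
    themes = list(word_probs)
    if priors is None:
        # Assumption: all themes are equally likely a priori.
        priors = {t: 1.0 / len(themes) for t in themes}
    scores = {}
    for t in themes:
        score = math.log(priors[t])
        for w in tokens:
            # Words not listed for a theme get a small floor probability.
            score += math.log(word_probs[t].get(w, floor))
        scores[t] = score
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical P(word | theme) values, e.g. taken from a fitted topic model.
word_probs = {
    "economy": {"tax": 0.5, "budget": 0.3, "policy": 0.2},
    "health": {"hospital": 0.6, "doctor": 0.3, "policy": 0.1},
}
label_a, _ = assign_theme(["tax", "policy", "budget"], word_probs)
label_b, _ = assign_theme(["hospital", "policy"], word_probs)
print(label_a, label_b)
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow on long context units.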


4 - Apply the model


After having applied and saved the model (see "A" below), the results of the analysis can be immediately visualised in an MDS map.


Moreover, given that after exiting from the analysis (see "B" above) themes are recorded as clusters of context units (i.e. like the Thematic Analysis of Elementary Contexts and Thematic Classification of Documents results), the new thematic variables just created (i.e. CONT_CLUST and/or DOC_CLUST) can be explored by using various T-LAB tools (see below).

For example, you can perform a Correspondence Analysis of themes (see below);


produce a network map (see below) by using the Sequence of Themes tool;


obtain a Word Associations map by using the corresponding T-LAB tool (see below), and so on.