T-LAB 10.2 - ON-LINE HELP

Modeling of Emerging Themes


This T-LAB tool provides a simple way of discovering, examining and modeling the main themes or topics (henceforward 'theme' and 'topic' are used synonymously) emerging from texts. These themes can then be explored further with several other tools, either keeping qualitative and quantitative approaches separate or combining them.

In fact, themes - which are described through their characteristic vocabulary and consist of co-occurrence patterns of key terms - can be used as categories in further analyses or for automatically classifying context units (i.e. documents or elementary contexts).


A T-LAB dialog box (see above) allows the user to set two analysis parameters.
In particular:
- parameter (A) sets the number of themes to be obtained. (Note that the higher this number, the more consistent the co-occurrence patterns; moreover, if necessary, some themes - e.g. those that are redundant or difficult to interpret - can be discarded later);
- parameter (B) excludes from the analysis any context unit that does not contain a minimum number of key-words from the list in use.
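As an illustration of how parameter (B) works, the sketch below filters context units by a minimum key-word count. The helper name and data are invented for illustration; this is not T-LAB's actual code.

```python
# Illustrative sketch of parameter (B): drop any context unit that
# contains fewer than `min_keywords` terms from the key-word list in use.
def filter_context_units(units, keyword_list, min_keywords=2):
    keywords = set(keyword_list)
    kept = []
    for unit in units:
        tokens = unit.lower().split()
        hits = sum(1 for t in tokens if t in keywords)  # key-words present
        if hits >= min_keywords:
            kept.append(unit)
    return kept

units = ["faith and hope", "the cat sat", "hope for change"]
print(filter_context_units(units, ["faith", "hope", "change"], min_keywords=1))
```

Units below the threshold are simply left out of the document-by-word matrix built in the next step.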

Only when the user chooses to customize all analysis parameters (the 'Yes' option above) will the following window appear, making more options available. (Note that in the picture below the number of context units is determined by parameter 'B' above.)

The analysis procedure consists of the following steps:

a - construction of a document-by-word matrix, where the documents are always elementary contexts corresponding to the context units (i.e. fragments, sentences, paragraphs) into which the corpus has been subdivided;
b - data analysis with a probabilistic model that uses Latent Dirichlet Allocation and Gibbs sampling (see the related information on Wikipedia: http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation; http://en.wikipedia.org/wiki/Gibbs_sampling);
c - description of the themes by means of the probability of their characteristic words, which may be specific to one theme or shared by two or more themes.
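Steps 'a' to 'c' can be sketched with a toy collapsed Gibbs sampler for Latent Dirichlet Allocation. This is only an illustration of the general technique named above, not T-LAB's actual implementation; documents are given as lists of word ids, K plays the role of parameter (A), and alpha/beta are assumed symmetric Dirichlet priors.

```python
import numpy as np

def lda_gibbs(docs, vocab_size, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not T-LAB code)."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))   # document-topic counts
    nkw = np.zeros((K, vocab_size))  # topic-word counts
    nk = np.zeros(K)                 # tokens per topic
    z = []                           # topic assignment of each token
    for d, doc in enumerate(docs):   # random initialization
        zd = rng.integers(0, K, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1
            nkw[t, w] += 1
            nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]          # remove the current assignment
                ndk[d, t] -= 1
                nkw[t, w] -= 1
                nk[t] -= 1
                # full conditional p(z = k | everything else)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + beta * vocab_size)
                t = rng.choice(K, p=p / p.sum())
                z[d][i] = t
                ndk[d, t] += 1
                nkw[t, w] += 1
                nk[t] += 1
    # step "c": per-topic word probabilities and per-document topic mixtures
    phi = (nkw + beta) / (nk[:, None] + beta * vocab_size)
    theta = (ndk + alpha) / (ndk.sum(1, keepdims=True) + alpha * K)
    return phi, theta
```

The `phi` rows describe each theme through its characteristic word probabilities; `theta` gives each document's topic mixture, which the later "Test the Model" step turns into hard assignments.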

On completion of the analysis you can easily perform the following operations:

1 - explore the characteristics of each theme;

2 - explore the relationships between the various themes;

3 - rename or discard specific themes;

4 - assess the semantic coherence of each theme;

5 - test the model and assign context units (i.e. documents and/or elementary contexts) to themes;

6 - apply the model by creating a new thematic variable, the values of which are the chosen topics;

7 - export a dictionary of categories, which can be used in further analyses.


In detail:

1 - Explore the characteristics of each theme

An overview of all themes is the first output which can be checked and saved; it can easily be re-accessed with the 'Preview' button (see below).

Other kinds of output are accessible through the options highlighted in the picture below.

N.B.: In the above chart 'high probability' indicates a probability >= 0.75.

When a topic is selected, clicking the 'table theme' option lets you check its characteristics; clicking on any word in the table makes a remove option available (see the picture below).


The columns of the above table are read as follows:


IN THEME = tokens of each word within the selected theme;
TOT = total tokens of each word within the corpus (or the subset) analysed;
IN (%) = percentage values of each word within the selected theme;
(p) = probability value of each word over themes;
TYPE = specific when the word belongs to the selected theme only (i.e. p=1); shared in the other cases.
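Under the definitions above, these columns could be derived from per-theme token counts roughly as in the sketch below (illustrative; the helper name, data, and exact percentage base are assumptions, not T-LAB's internal code):

```python
# Sketch: derive the theme-table columns from per-theme token counts.
# `counts` maps each word to {theme_name: tokens of that word in that theme}.
def theme_table(counts, theme):
    theme_total = sum(pt.get(theme, 0) for pt in counts.values())
    rows = []
    for word, per_theme in counts.items():
        in_theme = per_theme.get(theme, 0)   # IN THEME
        if in_theme == 0:
            continue
        tot = sum(per_theme.values())        # TOT: tokens over whole corpus
        rows.append({
            "WORD": word,
            "IN_THEME": in_theme,
            "TOT": tot,
            "IN_PCT": round(100 * in_theme / theme_total, 1),  # IN (%)
            "p": round(in_theme / tot, 2),   # (p): probability over themes
            "TYPE": "specific" if in_theme == tot else "shared",
        })
    return rows

counts = {"faith": {"Religion": 8, "Politics": 2}, "prayer": {"Religion": 4}}
for row in theme_table(counts, "Religion"):
    print(row)
```

Here 'prayer' occurs only in the selected theme, so p = 1 and its TYPE is 'specific', while 'faith' is 'shared'.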

When a topic is selected, by clicking the 'MDS Map' option, the semantic relationships between its most characteristic words can be easily explored (see the below picture).

Moreover, by using the 'Graph Maker' tool, more graphic options become available (see the below pictures).

 

When a topic is selected, clicking the 'meaningful contexts' option creates an HTML file in which the top 20 text segments - those that most closely match the topic characteristics - are displayed (see the picture below).



2 - Explore the relationships between the various themes

Two kinds of contingency tables can be created and explored through the Correspondence Analysis tool:

2.1) a word per topic table (see below)

2.2) a topic per variable table (N.B.: in the chart below the nine bubbles correspond to the chapters of one of Obama's books)

Two more graphic options are available which allow us to map the relationships between the various topics/themes:

2.3) an MDS map

2.4) a Network Graph obtained by exporting/importing the adjacency table created by T-LAB (see below)

N.B.: The above graph has been created by means of Gephi (https://gephi.org/), an open-source software package, after importing a table created by T-LAB.
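One common way to move an adjacency table into a tool like Gephi is to convert it into an edge-list CSV (Source, Target, Weight), a layout its spreadsheet import accepts. The sketch below does that conversion; the theme labels and weights are invented for illustration, and the file name is arbitrary.

```python
import csv

# Sketch: convert a theme-by-theme adjacency table into an edge-list CSV.
def adjacency_to_edge_list(labels, matrix, path):
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["Source", "Target", "Weight"])
        for i, a in enumerate(labels):
            for j, b in enumerate(labels):
                if j > i and matrix[i][j] > 0:  # upper triangle: undirected
                    w.writerow([a, b, matrix[i][j]])

labels = ["Religion", "Politics", "Economy"]
matrix = [[0, 3, 1],
          [3, 0, 2],
          [1, 2, 0]]
adjacency_to_edge_list(labels, matrix, "themes_edges.csv")
```

Only the upper triangle is written because the adjacency table is symmetric, so each undirected edge appears once.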

3 - Rename or discard specific themes

In order to rename or discard specific themes, just select one of them (see "A" below) and click on the "rename/remove" button (see "B" below).

When the appropriate box appears, depending on your goals, you can change the label by choosing among the available words or by typing a new label in the appropriate field (see "C" below); otherwise you can discard the selected theme just by clicking on the corresponding button (see "D" below).


4 - Assess the semantic coherence of each theme

When clicking the Quality Indices button (see the picture above), T-LAB computes the average similarity between the top 10 words of each theme.
More specifically:
- the top 10 words are those with the highest probability values over themes;
- the average similarity is computed using the cosine index;
- the cosine index of each word pair is computed, as in the Word Associations tool, at the text segment (i.e. elementary context) level.
As a result, T-LAB creates an HTML table in which the 'k' themes are listed according to their 'semantic coherence' (i.e. the first theme in the list is the one with the highest average similarity index).
N.B.: Because the above measures vary with the selected words, the user is advised to repeat the procedure whenever any of the top 10 words of a theme is removed.
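The coherence measure just described can be sketched as the average pairwise cosine similarity between a theme's top words, computed on word occurrence vectors over text segments. This is an illustrative reconstruction, not T-LAB's code; the argument names are assumptions.

```python
import numpy as np

def theme_coherence(top_words, segment_word_counts):
    """Average pairwise cosine between top words of a theme.

    segment_word_counts: 2D array, rows = text segments (elementary
    contexts), columns = the theme's top words, values = occurrences.
    """
    X = np.asarray(segment_word_counts, dtype=float).T  # one vector per word
    norms = np.linalg.norm(X, axis=1)
    sims = []
    for i in range(len(top_words)):
        for j in range(i + 1, len(top_words)):
            denom = norms[i] * norms[j]
            sims.append(X[i] @ X[j] / denom if denom else 0.0)
    return float(np.mean(sims))
```

Words that always co-occur in the same segments yield a coherence of 1, while words that never share a segment yield 0, which matches the intuition of ranking themes by how tightly their vocabulary hangs together.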

5 - Test the Model

At the end of the analysis procedure (see points "a" and "b" above) each context unit (i.e. primary document or elementary context) is represented as a mixture of different topics; in contrast, the classification process used in this step assigns each context unit to the topic which is most characteristic of it.

For this reason, when the "Test the Model" option is selected, T-LAB creates two files including the classification of context units (see below).

In the above table, each document has a probability value associated with each topic.
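The assignment rule used in this step amounts to picking, for each context unit, the topic with the highest probability in its mixture. A minimal sketch (the labels and probabilities below are invented for illustration):

```python
# Sketch of the "Test the Model" step: assign each context unit to the
# theme with the highest probability in its topic mixture.
def assign_units(doc_topic_probs, theme_labels):
    assignments = []
    for probs in doc_topic_probs:
        best = max(range(len(probs)), key=lambda k: probs[k])
        assignments.append((theme_labels[best], probs[best]))
    return assignments

mix = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
print(assign_units(mix, ["Religion", "Politics", "Economy"]))
# [('Religion', 0.7), ('Economy', 0.6)]
```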


6 - Apply the model


After the model has been applied and saved, and given that on exiting the analysis (see "B" above) themes are recorded as clusters of context units (i.e. like the results of Thematic Analysis of Elementary Contexts and Thematic Document Classification), the new thematic variables just created (i.e. CONT_CLUST and/or DOC_CLUST) can be explored with various T-LAB tools (see below).

For example, by using the Word Associations tool and selecting the subset (i.e. topic) 'Religion', the following graph can be created.


7 - Export a dictionary of categories

When this option is selected, a dictionary file with the .dictio extension is created, ready to be imported by any T-LAB tool for thematic analysis. In such a dictionary each theme (or category) is described by its characteristic words.
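The internal layout of .dictio files is not documented here, so the sketch below uses an assumed generic category/word CSV layout, purely to illustrate the idea of exporting each theme's characteristic vocabulary; the file name, themes, and words are all invented.

```python
import csv

# Hypothetical export of a theme dictionary as category,word rows
# (the real .dictio layout is T-LAB-specific and not shown here).
def export_dictionary(themes, path):
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        for theme, words in themes.items():
            for word in words:
                w.writerow([theme, word])

export_dictionary({"Religion": ["faith", "prayer"],
                   "Economy": ["market", "jobs"]}, "themes_dictionary.csv")
```

A dictionary exported this way maps each category to its characteristic words, which is exactly what the dictionary-based classification tools consume.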