T-LAB 10.2 - ON-LINE HELP

Thematic Document Classification


This function is enabled only when the corpus under analysis includes between 20 (min) and 99,999 (max) primary documents.

The analysis can be performed either as an unsupervised clustering (i.e. a bottom-up approach), which is the default option, or as a supervised classification (i.e. a top-down approach). When supervised classification is chosen, a dictionary of categories must be imported; this dictionary can be either created by a previous T-LAB analysis or prepared by the user.

You can use this function to construct document clusters and explore their characteristics by means of operations (including algorithms) similar to those described in the help section on Thematic Analysis of Elementary Contexts.

What distinguishes this function is that the table analysed has one row for each document in the corpus, and each document is represented as a vector of values indicating the occurrences of the words found in it.
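The table layout described above can be sketched in a few lines of Python. This is only an illustration of a documents-by-words occurrence table, not T-LAB's own implementation: the function name `build_document_term_table` and the whitespace tokenization are assumptions made for the example (T-LAB applies its own lemmatization and key-term selection).

```python
from collections import Counter


def build_document_term_table(documents):
    """Return (vocabulary, rows) with one occurrence vector per document.

    Hypothetical helper: tokenization is a plain lowercase whitespace split.
    """
    tokenized = [doc.lower().split() for doc in documents]
    # Columns of the table: every distinct word, in alphabetical order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per document; each cell is the word's occurrence count.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


docs = ["the cat sat", "the dog sat on the mat"]
vocabulary, table = build_document_term_table(docs)
for row in table:
    print(row)
```

Each printed row corresponds to one document, and each column to one word of the vocabulary.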

N.B.: When performing an unsupervised clustering on no more than 3,000 documents, it is possible to obtain a similarity measure (i.e. cosine) for each pair of documents (see below). However, only the similarities with a cosine coefficient greater than or equal to 0.05 are recorded.
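As a rough illustration of the note above, the following Python sketch computes the cosine coefficient for every pair of document vectors and keeps only the pairs that reach the 0.05 threshold. The function names and the sample vectors are assumptions made for the example; how T-LAB weights the vectors before comparing them is not stated here, so raw occurrence counts are used.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine coefficient between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


def recorded_similarities(vectors, threshold=0.05):
    """Keep only the document pairs whose cosine is >= threshold."""
    pairs = {}
    for i, j in combinations(range(len(vectors)), 2):
        value = cosine(vectors[i], vectors[j])
        if value >= threshold:
            pairs[(i, j)] = value
    return pairs


# Three toy occurrence vectors (assumed data); documents 0 and 2 share no word.
vectors = [[1, 0, 0, 0, 1, 1], [0, 1, 1, 1, 1, 2], [0, 0, 3, 0, 0, 0]]
similarities = recorded_similarities(vectors)
for pair, value in sorted(similarities.items()):
    print(pair, round(value, 4))
```

Because the number of pairs grows quadratically with the number of documents, a cap such as the 3,000-document limit keeps this kind of computation manageable.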

Accordingly, the following outputs are different:

The documents belonging to each cluster are listed in order of decreasing relevance value (see above) and can be browsed in HTML format.


In this case the relevance value (score) assigned to each document (i) in the cluster (k) is obtained by applying the following formula:

Where:

i - refers to document i;
k - refers to cluster k;
cos - denotes the cosine;
d_i - is the normalized vector of TF_{j,i} · IDF_j, where j ranges over the words in document i;
c_k - is the normalized vector of TF_{j,k} · IDF_j, where j ranges over the words in cluster k.
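
Read together, the definitions above suggest that the score is the cosine between the two normalized TF-IDF vectors. The following is a hedged reconstruction, not a quotation of the T-LAB formula: it assumes unit-length (L2) normalization and writes d_{j,i} and c_{j,k} for the components of the two vectors (symbols introduced here for illustration).

```latex
\[
\mathrm{score}_{i,k}
  = \cos(d_i, c_k)
  = \frac{d_i \cdot c_k}{\lVert d_i \rVert \, \lVert c_k \rVert}
  = \sum_{j} d_{j,i} \, c_{j,k}
\]
```

The last equality holds only when both vectors have unit length, in which case the denominator equals 1.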

T-LAB uses the scores obtained with the above formula, transformed into percentage values, to produce the file "Document_Membership_Degree.xls" (see below). This file lists the clusters to which the documents are assigned, either by the bisecting K-Means (mutually exclusive memberships) or by the TF-IDF scores (mixed or fuzzy memberships).
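
The scoring step can be sketched in Python as follows. Several details are assumptions made for illustration, because the text does not specify them: IDF is taken as log(N / df), the cluster vector is built from occurrences summed over the cluster's documents, and the percentage transformation divides each score by the document's total over all clusters. The function names are hypothetical and this is not T-LAB's code.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def membership_percentages(doc_counts, hard_labels, n_clusters):
    """Cosine score of each document against each cluster, as percentages."""
    n_docs, n_words = len(doc_counts), len(doc_counts[0])
    # Assumed IDF variant: log(N / df_j), with df_j = documents containing word j.
    df = [sum(1 for row in doc_counts if row[j] > 0) for j in range(n_words)]
    idf = [math.log(n_docs / d) if d > 0 else 0.0 for d in df]
    # d_i: normalized TF-IDF vector of each document.
    doc_vectors = [l2_normalize([row[j] * idf[j] for j in range(n_words)])
                   for row in doc_counts]
    # c_k: normalized TF-IDF vector of each cluster (TF summed over members).
    cluster_vectors = []
    for k in range(n_clusters):
        tf = [sum(row[j] for row, label in zip(doc_counts, hard_labels)
                  if label == k) for j in range(n_words)]
        cluster_vectors.append(l2_normalize([tf[j] * idf[j]
                                             for j in range(n_words)]))
    result = []
    for d in doc_vectors:
        # Dot product of unit vectors equals their cosine.
        scores = [sum(a * b for a, b in zip(d, c)) for c in cluster_vectors]
        total = sum(scores)
        # Assumed transformation: share of the document's total score.
        result.append([100.0 * s / total if total > 0 else 0.0 for s in scores])
    return result


# Toy data (assumed): 4 documents, 3 words, hard labels from a prior clustering.
counts = [[2, 0, 1], [3, 0, 0], [0, 2, 1], [0, 3, 0]]
labels = [0, 0, 1, 1]
percentages = membership_percentages(counts, labels, n_clusters=2)
for row in percentages:
    print([round(p, 2) for p in row])
```

In this sketch a document that shares words with more than one cluster receives a non-zero percentage for each of them, which is one way to read the "mixed or fuzzy memberships" mentioned above, while the hard labels play the role of the mutually exclusive assignment.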

When the Document Similarity button is enabled, clicking it lets you check how similar each document is to the others.
As in other cases, the similarity measure is the cosine coefficient, and its value can vary according to how many features (i.e. words) have been used for the thematic classification.
Below is a short description of how this tool works.

When you exit this function, the software displays messages to remind you that you can use other T-LAB tools to explore the clusters obtained.

 

If you select "Save", the <DOC_CLUST> variable (document cluster) remains available for all subsequent analyses of the corpus performed with other T-LAB tools.