www.tlab.it

Thematic Document Classification


This function is only enabled if the corpus under analysis includes at least 20 primary documents (max 30,000).

You can use this function to construct document clusters and explore their characteristics by means of operations/functions similar to those described in the section of the help dedicated to Thematic Analysis of Elementary Contexts.

The specificity of this function is that the table analysed consists of one row for each document in the corpus, and each row is a vector of values giving the number of occurrences of each word found in that document.
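As an illustration only, the kind of document-by-word table described above can be sketched in a few lines of Python. This is not T-LAB code: the function name `document_term_matrix`, the whitespace tokenization, and the sample texts are assumptions made for the example.

```python
from collections import Counter

def document_term_matrix(documents):
    """Return (vocabulary, rows): one row of word counts per document."""
    # Assumption: lowercase + whitespace split stands in for real lemmatization.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Each row is the vector of occurrences of every vocabulary word.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

# Hypothetical three-document corpus.
docs = ["the cat sat", "the dog sat down", "cat and dog"]
vocabulary, rows = document_term_matrix(docs)
```

Each element of `rows` corresponds to one document, and each column to one word of `vocabulary`.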

Moreover, the following outputs are different:

The documents belonging to each cluster are ordered by their decreasing relevance value (see above) and can be browsed in HTML format.


In this case, the relevance value (score) assigned to each document (i) in the cluster (k) is obtained by applying the following formula:

$$\mathrm{score}_{i,k} = \cos(d_i, c_k)$$

Where:

i - refers to document i;
k - refers to cluster k;
cos - is the cosine symbol;
$d_i$ - is the normalized vector of the $TF_{j,i} \cdot IDF_j$ values, where j refers to a word in document i;
$c_k$ - is the normalized vector of the $TF_{j,k} \cdot IDF_j$ values, where j refers to a word in cluster k.
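The definitions above describe a cosine between two normalized TF-IDF vectors, which can be sketched as follows. This is a minimal illustration, not T-LAB's implementation. It assumes that IDF is computed as log(N / df), that the cluster term frequency is the total count of a word over the documents of the cluster, and that normalization means dividing by the Euclidean length. All function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(counts, idf):
    """Return an L2-normalized TF-IDF vector as a dict word -> weight."""
    weights = {word: tf * idf[word] for word, tf in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    if norm == 0:
        return weights
    return {word: v / norm for word, v in weights.items()}

def relevance(doc_vec, cluster_vec):
    """Cosine of two unit-length vectors, i.e. their dot product."""
    return sum(v * cluster_vec.get(word, 0.0) for word, v in doc_vec.items())

# Hypothetical corpus: four tokenized documents in two clusters.
docs = [
    ["market", "price", "market"],
    ["price", "stock", "market"],
    ["goal", "match", "team"],
    ["team", "goal", "goal"],
]
clusters = {0: [0, 1], 1: [2, 3]}

# Assumed IDF definition: log(N / document frequency).
doc_freq = Counter(word for doc in docs for word in set(doc))
idf = {word: math.log(len(docs) / df) for word, df in doc_freq.items()}

doc_vecs = [tfidf_vector(Counter(doc), idf) for doc in docs]
cluster_vecs = {
    k: tfidf_vector(Counter(w for i in members for w in docs[i]), idf)
    for k, members in clusters.items()
}

score_own = relevance(doc_vecs[0], cluster_vecs[0])
score_other = relevance(doc_vecs[0], cluster_vecs[1])
```

With this toy data, the first document shares no words with the second cluster, so `score_other` is zero, while `score_own` is close to 1.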

The scores obtained by the above formula are transformed into percentage values and saved by T-LAB in the file "Document_Membership_Degree.xls" (see below). This file lists the clusters to which the documents are assigned, either by the bisecting K-Means (mutually exclusive memberships) or by the TF-IDF scores (mixed or fuzzy memberships).
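The difference between the two kinds of membership can be illustrated with a small sketch. The help text does not state how scores are converted to percentages, so the example assumes that each document's scores are rescaled to sum to 100 and that the TF-IDF assignment is the highest-scoring cluster. The function `membership_table`, its field names, and the sample scores are hypothetical and do not reflect the actual layout of the file.

```python
def membership_table(scores, kmeans_labels):
    """Combine hard K-Means labels with score-based (fuzzy) memberships.

    scores: one list per document, holding its cosine score for each cluster.
    kmeans_labels: one cluster index per document (mutually exclusive).
    """
    table = []
    for doc_id, (row, hard) in enumerate(zip(scores, kmeans_labels)):
        total = sum(row)
        # Assumed conversion: share of the document's total score, in percent.
        percents = [round(100 * s / total, 1) if total > 0 else 0.0 for s in row]
        # Highest-scoring cluster; ties go to the lowest cluster index.
        best = max(range(len(row)), key=row.__getitem__)
        table.append({
            "doc": doc_id,
            "kmeans_cluster": hard,
            "tfidf_cluster": best,
            "percent": percents,
        })
    return table

# Hypothetical scores for three documents and two clusters.
scores = [[0.8, 0.2], [0.3, 0.3], [0.1, 0.4]]
labels = [0, 0, 1]
table = membership_table(scores, labels)
```

A document such as the second one, with equal scores for both clusters, shows why a fuzzy membership can carry more information than a single exclusive label.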

When you exit this function, the software displays messages to remind you that you can use other T-LAB tools to explore the clusters obtained.

 

If you select "Save", the <DOC_CLUST> variable (document cluster) remains available for all subsequent analyses of the corpus performed with other T-LAB tools.