www.tlab.it

Normalization


In T-LAB, corpus normalization has two goals:

a) allowing words to be correctly detected as raw forms;

b) resolving some cases of ambiguity.

To this end, T-LAB first carries out a number of operations on the file under analysis: removing excess blank spaces, marking apostrophes, adding a space after punctuation marks, reducing capital letters, etc.

Secondly, T-LAB marks a set of strings recognized as proper nouns; it then converts the sequences of raw forms recognized as multiwords into unitary strings, so that they are used in that form during the analysis (for example, "in terms of" and "point of view" become "in_terms_of" and "point_of_view" respectively).

The parameters of these operations cannot be modified by the user.
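
As an illustration only, the kind of processing described above can be sketched in a few lines of Python. This is not T-LAB's own code: the function name normalize, the MULTIWORDS list and the regular expressions are assumptions made for the example. Capital letter reduction is simplified to lower-casing the whole text, and apostrophe and proper noun marking are left out.

```python
import re

# Hypothetical multiword list, taken from the two examples in the text;
# T-LAB's actual multiword dictionary is not reproduced here.
MULTIWORDS = ["in terms of", "point of view"]


def normalize(text, multiwords=MULTIWORDS):
    """Rough sketch of a normalization pass (assumed behaviour, not T-LAB's)."""
    # Add a space after a punctuation mark that is directly followed by a letter.
    text = re.sub(r"([,;:.!?])(?=[^\W\d_])", r"\1 ", text)
    # Remove excess blank space: collapse runs of whitespace, trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Simplified capital letter reduction: lower-case everything.
    text = text.lower()
    # Join each multiword into one string, longest first so that a longer
    # multiword is not broken up by a shorter one it contains.
    for mw in sorted(multiwords, key=len, reverse=True):
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in mw.split()) + r"\b"
        joined = mw.replace(" ", "_")
        text = re.sub(pattern, lambda m, j=joined: j, text)
    return text


print(normalize("From  my point of view,this works   In Terms Of speed."))
# prints: from my point_of_view, this works in_terms_of speed.
```

The order of the steps matters in this sketch: the multiword substitution runs last, so that it sees text with regular spacing and lower-case letters.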


To recognize raw forms correctly, the normalization routine in T-LAB uses the following marks:

, ; : . ! ? ' " ( ) < > + / = [ ] { }
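
A minimal sketch of how such a set of marks could be used to isolate raw forms, assuming that each mark (together with blank space) simply acts as a separator. The names MARKS and raw_forms are hypothetical, and T-LAB's real routine may treat some marks differently, the apostrophe in particular.

```python
import re

# The marks listed above, written as one string.
MARKS = ",;:.!?'\"()<>+/=[]{}"


def raw_forms(text):
    """Split text into raw forms, treating marks and whitespace as separators."""
    # Build a character class from the escaped marks plus whitespace.
    separators = "[" + re.escape(MARKS) + r"\s]+"
    # Drop the empty strings that re.split yields at the edges of the text.
    return [form for form in re.split(separators, text) if form]


print(raw_forms("Hello, world! (T-LAB) x=1"))
# prints: ['Hello', 'world', 'T-LAB', 'x', '1']
```

The hyphen is not in the list of marks, so "T-LAB" stays a single raw form in this sketch.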