ticclat.tokenize module
Generators that produce term-frequency vectors of documents in a corpus.
A document in ticclat is a term-frequency vector (collections.Counter). This module contains generators that return term-frequency vectors for certain types of input data.
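As a minimal sketch of what a term-frequency vector looks like, the snippet below builds a collections.Counter from a list of word forms. The helper name term_frequency_vector and the sample words are hypothetical and are not part of ticclat.

```python
from collections import Counter


def term_frequency_vector(words):
    """Count how often each word form occurs in one document.

    Hypothetical helper for illustration; not a ticclat function.
    """
    return Counter(words)


doc = term_frequency_vector(["de", "kat", "zit", "op", "de", "mat"])
print(doc["de"])    # frequency of a word form that occurs in the document
print(doc["hond"])  # a Counter returns 0 for word forms it has not seen
```

Because a Counter returns 0 for missing keys, documents with different vocabularies can be compared without first checking whether a word form is present.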
ticclat.tokenize.terms_documents_matrix_ticcl_frequency(in_files)
Returns a terms document matrix and related objects of a corpus.
A terms document matrix contains frequencies of wordforms, with wordforms along one matrix axis (columns) and documents along the other (rows).
Inputs:
- in_files: list of ticcl frequency files (one per document in the corpus)
Returns:
- corpus: a sparse terms documents matrix
- vocabulary: the vectorizer object containing the vocabulary (i.e., all word forms in the corpus)
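The sketch below shows one way a single frequency file could be turned into a term-frequency vector. It assumes each line holds a word form and its frequency separated by a tab; that layout is an assumption made for illustration, not something this page specifies, and read_frequency_file is a hypothetical helper rather than a ticclat function.

```python
import io
from collections import Counter


def read_frequency_file(handle):
    """Parse 'wordform<TAB>frequency' lines into a Counter.

    The tab-separated layout is an assumption for this sketch.
    """
    counts = Counter()
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so the frequency is always the final field.
        word, freq = line.rsplit("\t", 1)
        counts[word] += int(freq)
    return counts


# An in-memory stand-in for one frequency file (one document).
sample = io.StringIO("kat\t3\nmat\t1\n")
doc = read_frequency_file(sample)
print(doc)
```

One such vector per file corresponds to one row of the terms document matrix.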
ticclat.tokenize.terms_documents_matrix_word_lists(word_lists)
Returns a terms document matrix and related objects of a corpus.
A terms document matrix contains frequencies of wordforms, with wordforms along one matrix axis and documents along the other.
Inputs:
- word_lists: iterator over lists of words
Returns:
- corpus: a sparse terms documents matrix
- vocabulary: the vectorizer object containing the vocabulary (i.e., all word forms in the corpus)
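To illustrate the idea of a sparse terms document matrix, here is a standard-library sketch that stores only non-zero cells in a dictionary keyed by (row, column), with documents as rows and word forms as columns. It is not ticclat's implementation: the function build_terms_documents_matrix and the plain dict used as the vocabulary are stand-ins chosen for this example, whereas the function documented above returns a vectorizer object.

```python
from collections import Counter


def build_terms_documents_matrix(word_lists):
    """Return (matrix, vocabulary) for an iterable of word lists.

    matrix maps (row, column) to a frequency and holds only non-zero cells;
    vocabulary maps each word form to its column index.
    Hypothetical helper for illustration; not a ticclat function.
    """
    vocabulary = {}
    matrix = {}
    for row, words in enumerate(word_lists):
        for word, freq in Counter(words).items():
            # Assign the next free column the first time a word form is seen.
            col = vocabulary.setdefault(word, len(vocabulary))
            matrix[(row, col)] = freq
    return matrix, vocabulary


# An iterator over word lists, one list per document.
docs = iter([["de", "kat", "de"], ["kat", "mat"]])
matrix, vocabulary = build_terms_documents_matrix(docs)
print(matrix)
print(vocabulary)
```

Cells that are absent from the dictionary are implicitly zero, which is what keeps the representation small when most word forms do not occur in most documents.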