ticclat.tokenize module

Generators that produce term-frequency vectors of documents in a corpus.

A document in ticclat is a term-frequency vector (collections.Counter). This module contains generators that return term-frequency vectors for certain types of input data.
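To make the notion of a document as a term-frequency vector concrete, here is a minimal sketch using only the standard library. The function name `term_frequency_vectors` and the sample words are hypothetical and not part of ticclat; the sketch only illustrates how a list of words maps onto a `collections.Counter`.

```python
from collections import Counter


def term_frequency_vectors(word_lists):
    """Yield one Counter (term-frequency vector) per list of words.

    Hypothetical helper for illustration; not a ticclat function.
    """
    for words in word_lists:
        yield Counter(words)


# Two small example documents, each given as a list of words.
docs = [["de", "kat", "zit", "op", "de", "mat"], ["de", "hond", "blaft"]]
vectors = list(term_frequency_vectors(docs))

# Each vector maps a wordform to its frequency in that document;
# wordforms that do not occur have frequency 0.
print(vectors[0]["de"], vectors[1]["kat"])
```

Because a `Counter` returns 0 for missing keys, looking up a wordform that does not occur in a document needs no special handling.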

ticclat.tokenize.do_nothing(list_of_words)

Return the argument unchanged.

ticclat.tokenize.terms_documents_matrix_ticcl_frequency(in_files)

Returns a terms document matrix and related objects for a corpus.

A terms document matrix contains frequencies of wordforms, with wordforms along one matrix axis (columns) and documents along the other (rows).

Inputs:
    in_files: list of TICCL frequency files (one per document in the corpus)
Returns:
    corpus: a sparse terms documents matrix
    vocabulary: the vectorizer object containing the vocabulary (i.e., all word forms in the corpus)

ticclat.tokenize.terms_documents_matrix_word_lists(word_lists)

Returns a terms document matrix and related objects for a corpus.

A terms document matrix contains frequencies of wordforms, with wordforms along one matrix axis and documents along the other.

Inputs:
    word_lists: iterator over lists of words
Returns:
    corpus: a sparse terms documents matrix
    vocabulary: the vectorizer object containing the vocabulary (i.e., all word forms in the corpus)
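
The idea of a terms document matrix can be sketched with plain Python data structures. This is not ticclat's implementation: the function name `build_terms_documents_matrix` is hypothetical, a dict keyed by (row, column) stands in for the sparse matrix, and a plain dict stands in for the vectorizer's vocabulary. The sketch assumes documents along the rows and wordforms along the columns.

```python
from collections import Counter


def build_terms_documents_matrix(word_lists):
    """Return (matrix, vocabulary) for an iterable of word lists.

    Hypothetical helper for illustration; not a ticclat function.
    matrix:     dict mapping (document_row, wordform_column) -> frequency;
                zero entries are not stored, which keeps it sparse.
    vocabulary: dict mapping wordform -> column index.
    """
    vocabulary = {}
    matrix = {}
    for row, words in enumerate(word_lists):
        for word, freq in Counter(words).items():
            # Assign the next free column the first time a wordform is seen.
            col = vocabulary.setdefault(word, len(vocabulary))
            matrix[(row, col)] = freq
    return matrix, vocabulary


# An iterator over lists of words, one list per document.
word_lists = iter([["a", "b", "a"], ["b", "c"]])
matrix, vocabulary = build_terms_documents_matrix(word_lists)
print(vocabulary)
print(matrix)
```

Storing only the non-zero cells mirrors why a sparse matrix is returned: most wordforms in a corpus do not occur in most documents.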

ticclat.tokenize.ticcl_frequency(in_files, max_word_length=255)

Generate word-frequency pairs from TICCL frequency files.

For each file in in_files, open it and yield a dictionary that maps each word (key) to its frequency (value).
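
A generator of this shape might look like the sketch below. Several details are assumptions, not taken from the ticclat source: the name `read_frequency_files` is hypothetical, each line is assumed to hold a word and its frequency separated by a tab, files are assumed to be UTF-8, and `max_word_length` is assumed to mean that longer words are skipped.

```python
import os
import tempfile


def read_frequency_files(in_files, max_word_length=255):
    """Yield one {word: frequency} dict per file.

    Hypothetical helper for illustration; not the ticclat implementation.
    Assumes one "word<TAB>frequency" pair per line.
    """
    for path in in_files:
        frequencies = {}
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue
                word, _, count = line.partition("\t")
                # Assumption: lines without a count and over-long words are skipped.
                if not count or len(word) > max_word_length:
                    continue
                frequencies[word] = int(count)
        yield frequencies


# Usage: write a small example file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "doc1.tsv")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("kat\t3\nhond\t1\n" + "x" * 300 + "\t7\n")
    # Consume the generator while the temporary file still exists.
    result = list(read_frequency_files([path]))

print(result)
```

Yielding one dictionary per file keeps only a single document's frequencies in memory at a time, which matters when the corpus contains many files.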