ticclat.sacoreutils module

SQLAlchemy core utility functionality

Functionality for faster bulk inserts without using the ORM. More info: https://docs.sqlalchemy.org/en/latest/faq/performance.html

ticclat.sacoreutils.add_corpus_core(session, corpus_matrix, vectorizer, corpus_name, document_metadata=pandas.DataFrame(), batch_size=50000)

Add a corpus to the database.

A corpus is a collection of documents, each of which is a collection of words. This function adds all words as wordforms to the database, records their “attestation” (the fact that they occur in a certain document and with what frequency), adds the documents they belong to, adds the corpus, and adds the corpus ID to the documents.

Inputs:

session: SQLAlchemy session (e.g. from dbutils.get_session)
corpus_matrix: the dense corpus term-document matrix, like from tokenize.terms_documents_matrix_ticcl_frequency
vectorizer: the terms in the term-document matrix, as given by tokenize.terms_documents_matrix_ticcl_frequency
corpus_name: the name of the corpus in the database
document_metadata: see ticclat_schema.Document for all the possible metadata. Make sure the index of this dataframe matches the document identifiers in the term-document matrix, which can be easily achieved by resetting the index of the Pandas dataframe.
batch_size: batch handling of wordforms to avoid memory issues.

ticclat.sacoreutils.bulk_add_anahashes_core(engine, iterator, **kwargs)

Insert anahashes in iterator in batches into anahashes database table.

Convenience wrapper around sql_insert_batches for anagram hashes. Take care: no session is used, so relationships can’t be added automatically.

ticclat.sacoreutils.bulk_add_textattestations_core(engine, iterator, **kwargs)

Insert text attestations in iterator in batches into text_attestations database table.

Convenience wrapper around sql_insert_batches for text attestations. Take care: no session is used, so relationships can’t be added automatically.

ticclat.sacoreutils.bulk_add_wordforms_core(engine, iterator, **kwargs)

Insert wordforms in iterator in batches into wordforms database table.

Convenience wrapper around sql_insert_batches for wordforms. Take care: no session is used, so relationships can’t be added automatically.

ticclat.sacoreutils.get_engine(user, password, dbname, dburl='mysql://{}:{}@localhost/{}?charset=utf8mb4')

Returns an engine that can be used for fast bulk inserts.
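
The default dburl is a template with three positional placeholders. As a minimal sketch, assuming the placeholders are filled in the order user, password, dbname (the signature suggests this, but it is not confirmed here), the connection URL could be built as follows. The helper name build_db_url and the URL-escaping of the credentials are illustrative additions, not part of ticclat.

```python
from urllib.parse import quote_plus

# Default template copied from the get_engine signature.
DBURL_TEMPLATE = "mysql://{}:{}@localhost/{}?charset=utf8mb4"


def build_db_url(user, password, dbname, dburl=DBURL_TEMPLATE):
    """Fill the URL template (hypothetical helper, not a ticclat function).

    Assumption: the placeholders are user, password and database name,
    in that order. Credentials are URL-escaped so that characters such
    as '@' or '/' in a password do not break the URL.
    """
    return dburl.format(quote_plus(user), quote_plus(password), dbname)


url = build_db_url("ticclat", "p@ss/word", "ticclat_db")
print(url)
```

The resulting string is the kind of value that SQLAlchemy's create_engine accepts as a database URL.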

ticclat.sacoreutils.get_tas(corpus, doc_ids, wf_mapping, word_from_tdmatrix_id)

Get term attestation from wordform frequency matrix.

Term attestation records the occurrence and frequency of a word in a given document.

Inputs:
corpus: the dense corpus term-document matrix, like from tokenize.terms_documents_matrix_ticcl_frequency
doc_ids: list of indices of documents in the term-document matrix
wf_mapping: dictionary mapping wordforms (key) to database wordform_id
word_from_tdmatrix_id: mapping of term-document matrix column index (key) to wordforms (value)
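
To make the relationship between these inputs concrete, here is a rough sketch of how attestation records could be derived from a dense matrix. It is not the get_tas implementation. It assumes that rows are documents and columns are terms, that doc_ids[i] identifies the document in row i, and that the record keys (wordform_id, document_id, frequency) are acceptable placeholders; all of these are assumptions, as is the function name iter_text_attestations.

```python
def iter_text_attestations(corpus, doc_ids, wf_mapping, word_from_tdmatrix_id):
    """Yield one dict per (document, word) pair with a non-zero frequency.

    Hypothetical sketch: `corpus` is any dense row-major matrix (here a
    list of lists), rows are documents, columns are terms.
    """
    for row, doc_id in zip(corpus, doc_ids):
        for column_index, frequency in enumerate(row):
            if frequency > 0:
                # Column index -> wordform -> database wordform_id.
                word = word_from_tdmatrix_id[column_index]
                yield {
                    "wordform_id": wf_mapping[word],
                    "document_id": doc_id,
                    "frequency": int(frequency),
                }


# Two documents, three terms.
corpus = [[2, 0, 1],
          [0, 3, 0]]
doc_ids = [10, 11]
word_from_tdmatrix_id = {0: "appel", 1: "boom", 2: "huis"}
wf_mapping = {"appel": 1, "boom": 2, "huis": 3}

attestations = list(
    iter_text_attestations(corpus, doc_ids, wf_mapping, word_from_tdmatrix_id)
)
```

Because the function is a generator, its output can be consumed lazily by a batched insert instead of being materialized as one large list.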
ticclat.sacoreutils.sql_insert(engine, table_object, to_insert)

Insert a list of objects into the database without using a session.

This is a fast way of (mass) inserting objects. However, because no session is used, no relationships can be added automatically. So, use with care!

This function is a simplified version of test_sqlalchemy_core from here: https://docs.sqlalchemy.org/en/13/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow

Inputs:

engine: SQLAlchemy engine or session
table_object: object representing a table in the database (i.e., one of the objects from ticclat_schema)
to_insert (list of dicts): list containing dictionary representations of the objects (rows) to be inserted
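
The list-of-dicts shape expected by to_insert can be illustrated without SQLAlchemy. The sketch below uses the standard-library sqlite3 module as a stand-in (not the SQLAlchemy Core call that sql_insert is based on) so that it runs without extra dependencies. The table layout, column names and the helper insert_rows are hypothetical.

```python
import sqlite3


def insert_rows(conn, table, rows):
    """Insert a list of dicts with a single executemany call.

    Hypothetical helper. `table` and the dict keys are interpolated into
    the SQL text, so they must come from trusted code, never user input.
    """
    if not rows:
        return 0
    columns = list(rows[0])
    placeholders = ", ".join(f":{name}" for name in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    with conn:  # commits on success, rolls back on error
        conn.executemany(sql, rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wordforms (wordform_id INTEGER PRIMARY KEY, wordform TEXT)"
)

to_insert = [{"wordform": "appel"}, {"wordform": "boom"}]
inserted = insert_rows(conn, "wordforms", to_insert)
row_count = conn.execute("SELECT COUNT(*) FROM wordforms").fetchone()[0]
```

As with sql_insert, nothing here tracks relationships between objects: the rows are written exactly as given.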
ticclat.sacoreutils.sql_insert_batches(engine, table_object, iterator, total=0, batch_size=10000)

Insert items in iterator in batches into database table.

Take care: no session is used, so relationships can’t be added automatically.

Inputs:
table_object: the ticclat_schema object corresponding to the database table.
total: used for tqdm, since iterator will often be a generator, which has no predefined length.
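
Batching a generator of unknown length can be sketched with itertools.islice. This illustrates the general technique only and is not the sql_insert_batches implementation. The names iter_batches and insert_in_batches are hypothetical, and insert_fn stands in for whatever performs the actual insert.

```python
from itertools import islice


def iter_batches(iterator, batch_size=10000):
    """Yield lists of at most batch_size items until the input is exhausted."""
    iterator = iter(iterator)  # also accept lists and other iterables
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def insert_in_batches(insert_fn, iterator, batch_size=10000):
    """Call insert_fn once per batch and return the number of items handled."""
    total = 0
    for batch in iter_batches(iterator, batch_size):
        insert_fn(batch)
        total += len(batch)
    return total


# Usage with a generator, which has no len(); batches are collected in a list.
seen = []
handled = insert_in_batches(seen.append, ({"id": i} for i in range(5)), batch_size=2)
```

Only one batch is held in memory at a time, which is the point of the batch_size parameter. A progress bar needs the total to be passed in separately because the generator cannot report its own length.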
ticclat.sacoreutils.sql_query_batches(engine, query, iterator, total=0, batch_size=10000)

Execute query on items in iterator in batches.

Take care: no session is used, so relationships can’t be added automatically.

Inputs:
total: used for tqdm, since iterator will often be a generator, which has no predefined length.