ticclat.sacoreutils module
SQLAlchemy core utility functionality
Functionality for faster bulk inserts without using the ORM. More info: https://docs.sqlalchemy.org/en/latest/faq/performance.html
ticclat.sacoreutils.add_corpus_core(session, corpus_matrix, vectorizer, corpus_name, document_metadata=<empty DataFrame>, batch_size=50000)
Add a corpus to the database.
A corpus is a collection of documents, each of which is a collection of words. This function adds all words to the database as wordforms, records their “attestation” (the fact that they occur in a certain document, and with what frequency), adds the documents they belong to, adds the corpus, and adds the corpus ID to the documents.
Inputs:
- session: SQLAlchemy session (e.g. from dbutils.get_session)
- corpus_matrix: the dense corpus term-document matrix, like from tokenize.terms_documents_matrix_ticcl_frequency
- vectorizer: the terms in the term-document matrix, as given by tokenize.terms_documents_matrix_ticcl_frequency
- corpus_name: the name of the corpus in the database
- document_metadata: see ticclat_schema.Document for all the possible metadata. Make sure the index of this dataframe matches the document identifiers in the term-document matrix, which can easily be achieved by resetting the index of the Pandas dataframe.
- batch_size: the wordforms are handled in batches of this size to avoid memory issues.
ticclat.sacoreutils.bulk_add_anahashes_core(engine, iterator, **kwargs)
Insert the anahashes in iterator in batches into the anahashes database table.
Convenience wrapper around sql_insert_batches for anagram hashes. Take care: no session is used, so relationships can’t be added automatically.
ticclat.sacoreutils.bulk_add_textattestations_core(engine, iterator, **kwargs)
Insert the text attestations in iterator in batches into the text_attestations database table.
Convenience wrapper around sql_insert_batches for text attestations. Take care: no session is used, so relationships can’t be added automatically.
ticclat.sacoreutils.bulk_add_wordforms_core(engine, iterator, **kwargs)
Insert the wordforms in iterator in batches into the wordforms database table.
Convenience wrapper around sql_insert_batches for wordforms. Take care: no session is used, so relationships can’t be added automatically.
ticclat.sacoreutils.get_engine(user, password, dbname, dburl='mysql://{}:{}@localhost/{}?charset=utf8mb4')
Returns an engine that can be used for fast bulk inserts.
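The default dburl is a template with three positional placeholders. The sketch below shows how such a template could be expanded, assuming it is filled with str.format in the order user, password, dbname; the source does not confirm this. The helper name build_db_url is hypothetical, and percent-encoding the password is a precaution added here, not something the documentation mentions.

```python
from urllib.parse import quote_plus

DEFAULT_DBURL = "mysql://{}:{}@localhost/{}?charset=utf8mb4"


def build_db_url(user, password, dbname, dburl=DEFAULT_DBURL):
    """Hypothetical helper: fill the template in (user, password, dbname) order."""
    # Percent-encode the password so characters such as '@' or '/'
    # cannot be mistaken for URL delimiters.
    return dburl.format(user, quote_plus(password), dbname)


url = build_db_url("alice", "p@ss/word", "ticclat")
print(url)  # mysql://alice:p%40ss%2Fword@localhost/ticclat?charset=utf8mb4
```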
ticclat.sacoreutils.get_tas(corpus, doc_ids, wf_mapping, word_from_tdmatrix_id)
Get term attestation from wordform frequency matrix.
Term attestation records the occurrence and frequency of a word in a given document.
Inputs:
- corpus: the dense corpus term-document matrix, like from tokenize.terms_documents_matrix_ticcl_frequency
- doc_ids: list of indices of documents in the term-document matrix
- wf_mapping: dictionary mapping wordforms (key) to database wordform_id
- word_from_tdmatrix_id: mapping of term-document matrix column index (key) to wordforms (value)
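To illustrate how these four inputs fit together, here is a minimal pure-Python sketch. It is not the library's implementation. It assumes that rows of the matrix are documents, that columns are terms, and that doc_ids[i] identifies the document in row i. The function name term_attestations and the record keys are hypothetical.

```python
def term_attestations(corpus, doc_ids, wf_mapping, word_from_tdmatrix_id):
    """Yield one record for every nonzero (document, term) cell of the matrix."""
    for row_index, document_id in enumerate(doc_ids):
        for column_index, frequency in enumerate(corpus[row_index]):
            if frequency > 0:
                # Column index -> wordform -> database wordform_id.
                word = word_from_tdmatrix_id[column_index]
                yield {
                    "wordform_id": wf_mapping[word],  # hypothetical key names
                    "document_id": document_id,
                    "frequency": frequency,
                }


# Two documents, three terms (made-up data).
corpus = [[2, 0, 1],
          [0, 3, 0]]
doc_ids = [10, 11]
word_from_tdmatrix_id = {0: "appel", 1: "peer", 2: "wijn"}
wf_mapping = {"appel": 101, "peer": 102, "wijn": 103}

records = list(term_attestations(corpus, doc_ids, wf_mapping, word_from_tdmatrix_id))
print(len(records))  # 3
```

Zero cells are skipped, so each record corresponds to an actual occurrence of a word in a document.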
ticclat.sacoreutils.sql_insert(engine, table_object, to_insert)
Insert a list of objects into the database without using a session.
This is a fast way of (mass) inserting objects. However, because no session is used, no relationships can be added automatically. So, use with care!
This function is a simplified version of test_sqlalchemy_core from here: https://docs.sqlalchemy.org/en/13/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow
Inputs:
- engine: SQLAlchemy engine or session
- table_object: object representing a table in the database (i.e., one of the objects from ticclat_schema)
- to_insert (list of dicts): list containing dictionary representations of the objects (rows) to be inserted
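The shape of to_insert (one dictionary per row, keyed by column name) can be illustrated without SQLAlchemy. The sketch below uses the standard-library sqlite3 module as a stand-in for the SQLAlchemy core insert that sql_insert performs. The table layout and column names are assumptions made for the example.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
# Hypothetical table layout, for illustration only.
connection.execute(
    "CREATE TABLE wordforms (wordform_id INTEGER PRIMARY KEY, wordform TEXT)"
)

# Each dict is one row; the keys are column names.
to_insert = [
    {"wordform_id": 1, "wordform": "appel"},
    {"wordform_id": 2, "wordform": "peer"},
]

# executemany runs the same INSERT once per dict, binding the named
# placeholders from the dict keys; no ORM objects are created.
with connection:
    connection.executemany(
        "INSERT INTO wordforms (wordform_id, wordform) "
        "VALUES (:wordform_id, :wordform)",
        to_insert,
    )

row_count = connection.execute("SELECT COUNT(*) FROM wordforms").fetchone()[0]
print(row_count)  # 2
```

Because plain rows are written, related rows (for example via foreign keys) have to be inserted explicitly, which is the caveat the description above warns about.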
ticclat.sacoreutils.sql_insert_batches(engine, table_object, iterator, total=0, batch_size=10000)
Insert items in iterator in batches into database table.
Take care: no session is used, so relationships can’t be added automatically.
Inputs:
- table_object: the ticclat_schema object corresponding to the database table.
- total: used for tqdm, since iterator will often be a generator, which has no predefined length.
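Batching an iterator of unknown length can be sketched as follows. This is one common way to do it with itertools.islice, not necessarily how sql_insert_batches is written. The names iter_batches and fake_rows are hypothetical.

```python
from itertools import islice


def iter_batches(iterator, batch_size=10000):
    """Yield lists of at most batch_size items until the iterator is exhausted."""
    iterator = iter(iterator)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_rows(n):
    """Generator of row dicts; like most generators, it has no len()."""
    for i in range(n):
        yield {"wordform": f"word{i}"}


batch_sizes = [len(batch) for batch in iter_batches(fake_rows(25), batch_size=10)]
print(batch_sizes)  # [10, 10, 5]
```

Only one batch is held in memory at a time. Since a generator cannot report its length, a progress bar needs the expected item count to be passed in separately, which is the role of total.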
ticclat.sacoreutils.sql_query_batches(engine, query, iterator, total=0, batch_size=10000)
Execute query on items in iterator in batches.
Take care: no session is used, so relationships can’t be added automatically.
Inputs:
- total: used for tqdm, since iterator will often be a generator, which has no predefined length.