data loader

dataloader.element_set module

class dataloader.element_set.ElementSet(name, data_format, options, raw_data_strings=None)[source]

Bases: object

Dataset Object

Parameters:
  • name (str) – dataset name
  • data_format (str) – dataset format, either “set” or “sip”
  • options (dict) – dataset parameters, including two dicts mapping element to element index
  • raw_data_strings (list) –

    a list of strings representing an element set.

    • If data_format is “set”, each string is of format “c0 {‘d93’, ‘d377’, ‘d141’, ‘d63’, ‘d166’}”.
    • If data_format is “sip”, each string is of format “{‘d93’, ‘d377’} d141 0”.
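
A minimal construction sketch; the exact keys expected inside options (here “element2index” and “index2element”) are assumptions inferred from the parameter description above:

from dataloader.element_set import ElementSet

raw_strings = ["c0 {'d93', 'd377', 'd141', 'd63', 'd166'}"]  # "set" format
options = {
    "element2index": {"d93": 0, "d377": 1, "d141": 2, "d63": 3, "d166": 4},
    "index2element": {0: "d93", 1: "d377", 2: "d141", 3: "d63", 4: "d166"},
}
dataset = ElementSet("train", "set", options, raw_data_strings=raw_strings)
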
_convert_set_format_to_sip_format(raw_sets, pos_strategy, neg_strategy, neg_sample_size=10, subset_size=5, max_set_size=50)[source]

Generate <set, instance> pairs (sip) from a collection of sets

Parameters:
  • raw_sets (list) – a list of sets
  • pos_strategy (str) – name of positive sampling method
  • neg_strategy (str) – name of negative sampling method
  • neg_sample_size (int) – negative sampling ratio
  • subset_size (int) – size of “set” in <set, instance> pairs, used only in “fix_size_repeat_set” pos_strategy
  • max_set_size (int) – maximum size of “set” in <set, instance> pairs, used only in “vary_size_enumerate” pos_strategy
Returns:

len(raw_sets) * (1 + neg_sample_size) sips, among which len(raw_sets) sips are positive and len(raw_sets) * neg_sample_size sips are negative

Return type:

list

Notes

  • if pos_strategy is “sample_size_repeat_set”, for each original set, we sample the size of “set” in sip, repeat this generated set neg_sample_size times, and pair them with each negative instance. This is the strategy used in the original AAAI submission.
  • if pos_strategy is “sample_size_random_set”, for each original set, we sample one size of “set” in sip, and generate one set for each negative instance.
  • if pos_strategy is “fix_size_repeat_set”, for each original set, we use the pre-determined subset size to generate one “set” in sip, repeat this generated set neg_sample_size times, and pair them with each negative instance. This is the one used in cold-start training.
  • if pos_strategy is “vary_size_enumerate”, for each original set and for each subset size less than max_set_size, we enumerate the original set and generate all possible sips. This is the one used for converting a test set in “set” format to “sip” format.
  • if pos_strategy is “vary_size_enumerate_with_full_set”, it is basically the same as the “vary_size_enumerate” strategy, except that it also generates the full set paired with only negative instances.
  • if pos_strategy is “vary_size_enumerate_with_full_set_plus_group_id”, it is basically the same as the “vary_size_enumerate_with_full_set” strategy, except that it also returns the group id of each sip; the group id is the sip’s corresponding raw set index.
  • if pos_strategy is “enumerate”, this is the strategy used for pre-generating sip triplets (a toy sketch of the conversion follows these notes).
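
To make the conversion concrete, here is a toy sketch (not the actual implementation) mimicking one positive/negative split in the style of “sample_size_random_set”:

import random

def toy_set_to_sips(raw_set, vocab, neg_sample_size=2):
    """Split one raw set into a subset plus a held-out instance, then add negatives."""
    elements = list(raw_set)
    random.shuffle(elements)
    subset, pos_inst = set(elements[:-1]), elements[-1]
    sips = [(subset, pos_inst, 1)]                 # one positive sip
    negatives = [e for e in vocab if e not in raw_set]
    for neg_inst in random.sample(negatives, neg_sample_size):
        sips.append((subset, neg_inst, 0))         # neg_sample_size negative sips
    return sips

print(toy_set_to_sips({"d93", "d377", "d141"}, ["d%d" % i for i in range(500)]))
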
_convert_sip_format_to_tensor(max_set_size, batch_set, batch_inst, labels)[source]

Generate tensors for <set, instance> pairs

Parameters:
  • max_set_size (int) – maximum size of “set” in <set, instance> pairs
  • batch_set (list) – a list of “sets” in <set, instance> pairs
  • batch_inst (list) – a list of “instances” in <set, instance> pairs
  • labels (list) – a list of labels for each above <set, instance> pair
Returns:

a dict of pytorch tensors representing <set, instance> pairs with their corresponding labels

Return type:

dict
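
A plausible padding sketch; the dict keys and the use of 0 as the padding index are assumptions, not the verified implementation:

import torch

def toy_sips_to_tensors(max_set_size, batch_set, batch_inst, labels):
    padded = [list(s)[:max_set_size] + [0] * max(0, max_set_size - len(s))
              for s in batch_set]
    return {
        "set": torch.tensor(padded, dtype=torch.long),                    # (batch_size, max_set_size)
        "inst": torch.tensor(batch_inst, dtype=torch.long).unsqueeze(1),  # (batch_size, 1)
        "label": torch.tensor(labels, dtype=torch.float),                 # (batch_size,)
    }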

_generate_negative_samples_within_pool(positive_sets, neg_sample_size, remove_pos=True)[source]

Generate negative samples from vocabulary

Parameters:
  • positive_sets (list) – a list of positive sets
  • neg_sample_size (int) – negative sampling ratio
  • remove_pos (bool) – whether to remove instances in positive sets from the vocabulary
Returns:

a list of negative sets

Return type:

list
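
A sketch of what uniform negative sampling plausibly does here (hedged; the vocabulary is passed in explicitly to keep the sketch self-contained):

import random

def toy_negative_samples(positive_sets, vocab, neg_sample_size, remove_pos=True):
    negative_sets = []
    for pos in positive_sets:
        pool = [e for e in vocab if e not in pos] if remove_pos else list(vocab)
        negative_sets.append(random.sample(pool, neg_sample_size))
    return negative_sets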

_initialize_set_format(raw_set_strings)[source]

Initialize dataset from a collection of strings representing element sets

Parameters:raw_set_strings (list) – a list of strings representing element sets
Returns:None
Return type:None
_initialize_sip_format(raw_set_instance_strings)[source]

Initialize dataset from a collection of strings representing <set, instance> pairs

Parameters:raw_set_instance_strings (list) – a list of strings representing <set, instance> pairs
Returns:None
Return type:None
_shuffle()[source]

Shuffle dataset

Returns:None
Return type:None
get_test_batch(max_set_size=5, batch_size=32)[source]

Generate one testing batch of <set, instance> pairs

Parameters:
  • max_set_size (int) – maximum size of set S
  • batch_size (int) – number of <set, instance> pairs in one batch
Returns:

a testing batch containing “batch_size” <set, instance> pairs

Return type:

dict

get_train_batch(max_set_size=100, pos_sample_method='sample_size_random_set', neg_sample_size=1, neg_sample_method='complete_random', batch_size=32)[source]

Generate one training batch of <set, instance> pairs

Parameters:
  • max_set_size (int) – maximum size of set S
  • pos_sample_method (str) – name of positive sampling method
  • neg_sample_size (int) – number of negative samples for each set
  • neg_sample_method (str) – name of negative sampling method
  • batch_size (int) – number of sets in one batch
Returns:

a training batch containing “batch_size * (1+neg_sample_size)” <set, instance> pairs

Return type:

dict
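
A hypothetical training-loop usage pairing this method with model.train_step; num_epochs, train_set, and model are placeholders:

for epoch in range(num_epochs):
    train_batch = train_set.get_train_batch(
        max_set_size=100,
        pos_sample_method="sample_size_random_set",
        neg_sample_size=1,
        neg_sample_method="complete_random",
        batch_size=32,
    )
    loss, tp, fp, fn, tn = model.train_step(train_batch)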

model

class model.SSPM(params)[source]

Bases: torch.nn.Module

Synonym Set Prediction Model (SSPM), also known as SynSetMine

Parameters:params (dict) – a dictionary containing all model specifications
_get_test_sip_batch_size(x)[source]
_set_scorer(set_tensor)[source]

Return the quality score of a batch of sets

Parameters:set_tensor (tensor) – sets to be scored, size: (batch_size, max_set_size)
Returns:scores of all sets, size: (batch_size, 1)
Return type:tensor
initialize(params)[source]

Initialize model components

Parameters:params (dict) – a dictionary containing all model specifications
Returns:None
Return type:None
predict(batch_set_tensor, batch_inst_tensor)[source]

Make set-instance pair predictions

Parameters:
  • batch_set_tensor (tensor) – packed sets in a collection of <set, instance> pairs, size: (batch_size, max_set_size)
  • batch_inst_tensor (tensor) – packed instances in a collection of <set, instance> pairs, size: (batch_size, 1)
Returns:

  • scores of packed sets, (batch_size, 1)
  • scores of packed sets union with corresponding instances, (batch_size, 1)
  • the probability of adding the instance into the corresponding set, (batch_size, 1)

Return type:

tuple
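
A hypothetical usage sketch (batch_set_tensor and batch_inst_tensor as built by ElementSet); the returned probability is presumably derived from the gain between the two set scores:

set_scores, union_scores, probs = model.predict(batch_set_tensor, batch_inst_tensor)
predictions = (probs > 0.5).float()   # 1.0 where the instance should join the set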

train_step(train_batch)[source]

Train the model on the given train_batch

Parameters:train_batch (dict) – a dictionary containing training batch in <set, instance> pair format
Returns:batch_loss, true_positive_num, false_positive_num, false_negative_num, true_negative_num
Return type:tuple
model.initialize_weights(moduleList, itype='xavier')[source]

Initialize a list of modules

Parameters:
  • moduleList (list) – a list of nn.Module objects
  • itype (str) – name of initialization method
Returns:

None

Return type:

None

cluster predictor

cluster_predict.multiple_set_single_instance_prediction(model, sets, instance, size_optimized=False)[source]

Apply the given model to predict the probability of adding the given instance into each of the given sets

Parameters:
  • model (SSPM) – a trained SynSetMine model
  • sets (list) – a list of sets, each containing element indices
  • instance (int) – a single instance, represented by the element index
  • size_optimized (bool) – whether to optimize the multiple-set-single-instance prediction process. If the size of each set in the given ‘sets’ varies a lot and there exists a single huge set in the given ‘sets’, set this parameter to True
Returns:

  • scores of given sets, (batch_size, 1)
  • scores of given sets union with the instance, (batch_size, 1)
  • the probability of adding the instance into the corresponding set, (batch_size, 1)

Return type:

tuple

cluster_predict.set_generation(model, vocab, threshold=0.5, eid2ename=None, size_opt_clus=False, max_K=None, verbose=False)[source]

Set Generation Algorithm

Parameters:
  • model (SSPM) – a trained set-instance classifier
  • vocab (list) – a list of elements to be clustered, each element is represented by its index
  • threshold (float) – the probability threshold for deciding whether to create a new singleton cluster
  • eid2ename (dict) – a dictionary mapping element index to its corresponding (human-readable) name
  • size_opt_clus (bool) – a flag indicating whether to optimize the multiple-set-single-instance prediction process
  • max_K (int) – maximum number of clusters. If None, this number is inferred automatically
  • verbose (bool) – whether to print out all intermediate results
Returns:

a list of detected clusters

Return type:

list
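
A sketch of the greedy loop this algorithm plausibly performs, based on the functions above; in particular, treating the third value returned by multiple_set_single_instance_prediction as the per-set add-probability is an assumption:

from cluster_predict import multiple_set_single_instance_prediction

def toy_set_generation(model, vocab, threshold=0.5):
    clusters = []
    for inst in vocab:
        if not clusters:
            clusters.append([inst])
            continue
        _, _, probs = multiple_set_single_instance_prediction(model, clusters, inst)
        best = int(probs.argmax())
        if float(probs[best]) > threshold:
            clusters[best].append(inst)    # join the best matching cluster
        else:
            clusters.append([inst])        # start a new singleton cluster
    return clusters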

evaluator

evaluator.calculate_km_matching_score(weight_nm)[source]

Calculate maximum weighted matching score

Parameters:weight_nm (list) – a similarity matrix
Returns:weighted matching score
Return type:float
evaluator.calculate_precision_recall_f1(tp, fp, fn)[source]

Calculate precision, recall, and f1 score

Parameters:
  • tp (int) – true positive number
  • fp (int) – false positive number
  • fn (int) – false negative number
Returns:

(precision, recall, f1 score)

Return type:

tuple
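
These presumably follow the standard definitions (zero-division handling omitted in this sketch):

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)                          # correctness of predicted positives
    recall = tp / (tp + fn)                             # coverage of true positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1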

evaluator.end2end_evaluation_matching(groundtruth, result)[source]

Evaluate the maximum weighted Jaccard matching between the ground truth clustering and the predicted clustering

Parameters:
  • groundtruth (list) – a list of element lists representing the ground truth clustering
  • result (list) – a list of element lists representing the model predicted clustering
Returns:

best matching score

Return type:

float
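
A sketch of the underlying computation, assuming pairwise Jaccard similarities fed into a Kuhn-Munkres style maximum weighted matching (solved here with scipy’s assignment solver rather than whatever this module uses internally):

import numpy as np
from scipy.optimize import linear_sum_assignment

def toy_matching_score(groundtruth, result):
    sim = np.zeros((len(groundtruth), len(result)))
    for i, g in enumerate(groundtruth):
        for j, r in enumerate(result):
            g_set, r_set = set(g), set(r)
            sim[i, j] = len(g_set & r_set) / len(g_set | r_set)   # Jaccard similarity
    rows, cols = linear_sum_assignment(-sim)   # negate to maximize total similarity
    return sim[rows, cols].sum()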

evaluator.evaluate_clustering(cls_pred, cls_true)[source]

Evaluate clustering results

Parameters:
  • cls_pred (list) – a list of element lists representing model predicted clustering
  • cls_true (list) – a list of element lists representing the ground truth clustering
Returns:

a dictionary of clustering evaluation metrics

Return type:

dict

evaluator.evaluate_set_instance_prediction(model, dataset)[source]

Evaluate model on the given dataset for set-instance pair prediction task

Parameters:
  • model (SSPM) – a trained set-instance classifier
  • dataset (ElementSet) – an ElementSet dataset
Returns:

a dictionary of set-instance pair prediction metrics

Return type:

dict

utils

class utils.Metrics[source]

Bases: object

A metric class wrapping all metrics

add(metric_name, metric_value)[source]

Add metric value for the given metric name

Parameters:
  • metric_name (str) – metric name
  • metric_value – metric value
Returns:

None

Return type:

None

class utils.Results(filename)[source]

Bases: object

A result class for saving results to file

Parameters:filename (str) – name of result saving file
save_metrics(hyperparams, metrics)[source]

Save model hyper-parameters and evaluation results to the file

Parameters:
  • hyperparams (dict) – a dictionary of model hyper-parameters, keyed with the hyper-parameter names
  • metrics (Metrics) – a Metrics object containing all model evaluation results
Returns:

None

Return type:

None

utils.check_model_consistency(args)[source]

Check whether the model architecture is consistent with the loss function used

Parameters:args (dict) – a dictionary containing all model specifications
Returns:a flag indicating whether the model architecture is consistent with the loss function; if not, the error message is also returned
Return type:a tuple of (bool, str)
utils.get_num_lines(file_path)[source]

Return the number of lines in the file without actually reading them into memory. Used together with tqdm for tracking file reading progress.

Usage:

with open(inputFile, "r") as fin:
    for line in tqdm(fin, total=get_num_lines(inputFile)):
        ...
Parameters:file_path (str) – path of input file
Returns:number of lines in the file
Return type:int
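
One common way to implement this (a sketch, not necessarily the exact implementation here) is to memory-map the file and count newline-terminated lines:

import mmap

def get_num_lines_sketch(file_path):
    with open(file_path, "r+") as fp:
        buf = mmap.mmap(fp.fileno(), 0)   # map the file; no contents are read eagerly
        lines = 0
        while buf.readline():
            lines += 1
        return lines
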
utils.load_checkpoint(model, optimizer, load_dir, load_prefix, step)[source]

Load model checkpoint (including trained model, training epoch, and optimizer) from a file

Parameters:
  • model (torch.nn) – a model which has the same architecture as the model to be loaded
  • optimizer (torch.optim) – a pytorch optimizer
  • load_dir (str) – model save directory
  • load_prefix (str) – model snapshot prefix
  • step (int) – model training epoch
Returns:

None

Return type:

None

utils.load_embedding(fi, embed_name='word2vec')[source]

Load pre-trained embedding from file

Parameters:
  • fi (str) – embedding file name
  • embed_name (str) – name of the pre-trained embedding (e.g., “word2vec”)
Returns:

  • embedding: the loaded embedding, as gensim KeyedVectors
  • index2word: map from element index to element
  • word2index: map from element to element index
  • vocab_size: size of element pool
  • embed_dim: embedding dimension

Return type:

(gensim.KeyedVectors, list, dict, int, int)
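
Hypothetical usage (“embeddings.txt” is a placeholder path):

from utils import load_embedding

embedding, index2word, word2index, vocab_size, embed_dim = load_embedding(
    "embeddings.txt", embed_name="word2vec"
)
print(vocab_size, embed_dim)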

utils.load_model(model, load_dir, load_prefix, steps)[source]

Load model from file

Note: You need to first initialize a model which has the same architecture/size as the model to be loaded.

Parameters:
  • model (torch.nn) – a model which has the same architecture as the model to be loaded
  • load_dir (str) – model save directory
  • load_prefix (str) – model snapshot prefix
  • steps (int) – model training epoch
Returns:

None

Return type:

None

utils.load_raw_data(fi)[source]

Load raw data from file

Parameters:fi (str) – data file name
Returns:a list of raw data from file
Return type:list
utils.my_logger(name='', log_path='./')[source]

Create a python logger

Parameters:
  • name (str) – logger name
  • log_path (str) – path for saving logs
Returns:

a logger for logging messages

Return type:

logging.Logger

utils.print_args(args, interested_args='all')[source]

Print arguments in command line

Parameters:
  • args (Namespace) – parsed command line arguments
  • interested_args (list) – a list of interested argument names
Returns:

None

Return type:

None

utils.save_checkpoint(model, optimizer, save_dir, save_prefix, step)[source]

Save model checkpoint (including trained model, training epoch, and optimizer) to a file

Parameters:
  • model (torch.nn) – a trained model
  • optimizer (torch.optim) – a pytorch optimizer
  • save_dir (str) – model save directory
  • save_prefix (str) – model snapshot prefix
  • step (int) – model training epoch
Returns:

None

Return type:

None

utils.save_model(model, save_dir, save_prefix, steps)[source]

Save model to file

Parameters:
  • model (torch.nn) – a trained model
  • save_dir (str) – model save directory
  • save_prefix (str) – model snapshot prefix
  • steps (int) – model training epoch
Returns:

None

Return type:

None

utils.to_gpu(optimizer, device)[source]

Move optimizer from CPU to GPU

Parameters:
  • optimizer (torch.optim) – a pytorch optimizer
  • device (torch.device) – a pytorch device, CPU or GPU
Returns:

None

Return type:

None
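
A sketch of the standard PyTorch pattern such a helper typically follows (an assumption about this particular implementation): optimizer state tensors created on CPU are moved to the target device in place.

import torch

def to_gpu_sketch(optimizer, device):
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device)   # move each state tensor alongside the model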