Training and using classifiers
train_classifier
Train an sklearn classifier to produce a cleanness score for sentence pairs.
Parameters:
training_scores: a file containing filter scores for training, in JSON lines format, produced with the score function
criterion: the criterion used in classifier optimization (valid options are CE, ROC_AUC, SSE, AIC and BIC)
dev_scores: a file containing labeled filter scores in JSON lines format, produced with the score function, with an item label added to each entry; label has value 1 for clean pairs and 0 for noisy pairs (optional; dev_scores is only used when the criterion is ROC_AUC)
model_type: classifier model type selected from sklearn classifiers (default LogisticRegression)
model_parameters: parameters for the sklearn classifier
model: output model file
features: the features given to the classifier to be trained on, defined as a list of filter names
  ExampleFilter:
    clean-direction: the direction that indicates higher cleanness (valid options are high and low)
    quantiles: a dictionary whose items (min, max and initial) specify the minimum, maximum and initial quantile values used in classifier optimization to select negative and positive training examples (default {'min': 0, 'max': 1, 'initial': 0.1})
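The dev_scores file differs from the training scores only by the extra label item on each entry. A minimal sketch of adding that item is shown here; the helper name add_label and the flat ExampleFilter entries are hypothetical, and real entries are whatever the score function wrote.

```python
import json

def add_label(jsonl_lines, label):
    """Return JSON lines with a label item (1 = clean, 0 = noisy) added to each entry."""
    if label not in (0, 1):
        raise ValueError("label must be 1 (clean) or 0 (noisy)")
    labeled = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        entry["label"] = label
        labeled.append(json.dumps(entry, sort_keys=True))
    return labeled

# Hypothetical score entries; real ones are produced with the score function.
clean = ['{"ExampleFilter": 0.92}']
noisy = ['{"ExampleFilter": 0.11}']
dev_lines = add_label(clean, 1) + add_label(noisy, 0)
print("\n".join(dev_lines))
```

Keeping the label inside each JSON entry means the labeled file stays in the same line-per-pair layout as the unlabeled scores.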
The classifier is optimized by training multiple classifier models, each with the training data divided differently into positive and negative examples according to the quantile boundaries specified for each feature. The model that achieves the best criterion score is then saved to the output file.
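To illustrate how a quantile boundary can divide unlabeled training data into positive and negative examples, here is a rough sketch. The labeling rule (a pair is negative if any feature falls on the noisy side of its cutoff) and the nearest-rank quantile are assumptions made for illustration, not a description of the actual implementation; the function names are hypothetical.

```python
def oriented(value, direction):
    """Flip the sign for clean-direction 'low' so that higher always means cleaner."""
    return value if direction == "high" else -value

def quantile_cutoff(values, q):
    """Value at quantile q (0..1), using a simple nearest-rank position."""
    ordered = sorted(values)
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]

def label_examples(rows, features, q):
    """Label a row 0 (noisy) if any feature is below its cutoff, else 1 (clean)."""
    cutoffs = {}
    for name, spec in features.items():
        column = [oriented(row[name], spec["clean-direction"]) for row in rows]
        cutoffs[name] = quantile_cutoff(column, q)
    labels = []
    for row in rows:
        noisy = any(
            oriented(row[name], spec["clean-direction"]) < cutoffs[name]
            for name, spec in features.items()
        )
        labels.append(0 if noisy else 1)
    return labels

# Hypothetical scores for a single feature.
values = [0.05, 0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
rows = [{"ExampleFilter": v} for v in values]
features = {"ExampleFilter": {"clean-direction": "high"}}

# Different quantiles give different positive/negative splits of the same data.
for q in (0.1, 0.5):
    labels = label_examples(rows, features, q)
    print(q, labels.count(0), "negative,", labels.count(1), "positive")
```

Under these assumptions, each candidate quantile yields a different labeled training set; a classifier would be fitted on each one and the candidates compared by the chosen criterion.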
classify
Use a classifier model trained with train_classifier to assign a cleanness score or label to sentence pairs that have been scored with score.
Parameters:
model: classifier model trained with train_classifier
scores: scores of the sentence pairs to be classified, in JSON lines format, produced with the score function
output_probabilities: file to write the cleanness scores to; 1 is cleanest and 0 is noisiest (optional)
output_labels: file to write the cleanness labels to; 1 is a clean pair and 0 is a noisy pair (optional)
The probabilities and labels are written to the output files line by line, so that each output line corresponds to the same line in scores.
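Because the outputs are aligned with scores line by line, they can be joined back to the scored pairs by position. The sketch below assumes one numeric value per line in the probabilities file; the helper name select_clean, the 0.5 threshold and the sample entries are hypothetical.

```python
import json

def select_clean(score_lines, probability_lines, threshold=0.5):
    """Pair each scores entry with its probability and keep those at or above threshold."""
    score_lines = list(score_lines)
    probability_lines = list(probability_lines)
    if len(score_lines) != len(probability_lines):
        raise ValueError("scores and probabilities must have the same number of lines")
    kept = []
    for index, (score_line, prob_line) in enumerate(zip(score_lines, probability_lines)):
        probability = float(prob_line)
        if probability >= threshold:
            # index is the zero-based line position shared by both files
            kept.append((index, json.loads(score_line), probability))
    return kept

# Hypothetical contents of the scores file and the output_probabilities file.
score_lines = ['{"ExampleFilter": 0.92}', '{"ExampleFilter": 0.11}', '{"ExampleFilter": 0.75}']
probability_lines = ["0.97", "0.08", "0.66"]

kept = select_clean(score_lines, probability_lines)
for index, entry, probability in kept:
    print(index, entry, probability)
```

The length check guards against pairing values from mismatched files, which would silently shift every score by one or more lines.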