Training and using classifiers
train_classifier
Train an sklearn classifier to produce a cleanness score for sentence pairs.
Parameters:
training_scores: a file containing filter scores for training in JSON lines format, produced with the score function
criterion: the criterion used in classifier optimization (valid options are CE, ROC_AUC, SSE, AIC and BIC)
dev_scores: a file containing filter scores for training in JSON lines format, produced with the score function, with an item label added to each entry; label has value 1 for clean pairs and 0 for noisy pairs (optional; dev_scores is only used when the criterion is ROC_AUC)
model_type: classifier model type selected from sklearn classifiers (default LogisticRegression)
model_parameters: parameters for the sklearn classifier
model: output model file
features: the features given to the classifier to be trained on, defined as a list of filter names
  ExampleFilter:
    clean-direction: the direction that indicates higher cleanness (valid options are high and low)
    quantiles: a dictionary whose items (min, max and initial) specify the minimum, maximum and initial quantile values used in classifier optimization to select negative and positive training examples (default {'min': 0, 'max': 1, 'initial': 0.1})
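The label item required in dev_scores can be added with a few lines of Python. The following is a minimal sketch, not part of the tool itself: the helper add_labels and the sample entries are hypothetical, and real entries contain whatever the score function wrote for each sentence pair.

```python
import json


def add_labels(score_lines, labels):
    """Return JSON lines with a `label` item (1 = clean, 0 = noisy) added to each entry."""
    if len(score_lines) != len(labels):
        raise ValueError("need exactly one label per score line")
    labeled = []
    for line, label in zip(score_lines, labels):
        if label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {label!r}")
        entry = json.loads(line)
        entry["label"] = label
        labeled.append(json.dumps(entry))
    return labeled


# Hypothetical score entries, one JSON object per line (assumed shape).
scores = [
    '{"ExampleFilter": 0.93}',
    '{"ExampleFilter": 0.12}',
]
labeled_lines = add_labels(scores, [1, 0])
for line in labeled_lines:
    print(line)
```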
The classifier is optimized by training multiple classifier models, each with the training data divided differently into positive and negative examples based on the quantile boundaries specified for each feature. The model that achieves the highest criterion score is then saved in the output file.
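To make the idea of quantile boundaries concrete, here is a rough sketch of a single split of the training data. It is an illustration under stated assumptions and not the tool's actual implementation: the functions quantile_value and split_by_quantiles are hypothetical, the nearest-rank quantile is a simplification, and the rule that a pair counts as negative when any feature falls on the noisy side of its cutoff is an assumption. The sketch covers only the split; the optimization would repeat it with different quantile values between min and max, train a model on each split, and compare the criterion scores.

```python
def quantile_value(values, q):
    """Return the value at quantile q (0..1), using a simple nearest-rank position."""
    ordered = sorted(values)
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]


def split_by_quantiles(rows, features):
    """Label each row 1 (positive) or 0 (negative) from per-feature quantile cutoffs.

    Assumption: a row is negative if any feature lies on the noisy side of its cutoff.
    """
    labels = [1] * len(rows)
    for name, spec in features.items():
        # Flip the sign for "low" so that higher always means cleaner.
        sign = 1 if spec["clean-direction"] == "high" else -1
        column = [sign * row[name] for row in rows]
        cutoff = quantile_value(column, spec["quantile"])
        for i, value in enumerate(column):
            if value < cutoff:
                labels[i] = 0
    return labels


# Hypothetical feature values for four sentence pairs.
rows = [
    {"A": 0.1, "B": 2.0},
    {"A": 0.5, "B": 9.0},
    {"A": 0.9, "B": 1.0},
    {"A": 0.7, "B": 3.0},
]


def make_features(q):
    # "A" is cleaner when high, "B" is cleaner when low (illustrative choice).
    return {
        "A": {"clean-direction": "high", "quantile": q},
        "B": {"clean-direction": "low", "quantile": q},
    }


labels_q25 = split_by_quantiles(rows, make_features(0.25))
labels_q50 = split_by_quantiles(rows, make_features(0.5))
print(labels_q25)
print(labels_q50)
```

Raising the quantile moves more pairs into the negative set, which is why each candidate boundary yields a different training set and therefore a different model.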
classify
Use a classifier model trained with train_classifier to assign a cleanness score or label to sentence pairs that have been scored with score.
Parameters:
model: classifier model trained with train_classifier
scores: scores of the sentence pairs to be classified, in JSON lines format produced with the score function
output_probabilities: file to write the cleanness scores to; 1 is cleanest and 0 is noisiest (optional)
output_labels: file to write the cleanness labels to; 1 is a clean pair and 0 is a noisy pair (optional)
The probabilities and labels are written to the output files line by line, corresponding to the scores on each line in scores.
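The line-by-line correspondence between input and output can be sketched as follows. This is an illustration only: classify_lines is a hypothetical helper, predict_probability stands in for the trained model, the sample entries are made up, and the 0.5 threshold used to turn a probability into a label is an assumption.

```python
import io
import json


def classify_lines(score_lines, predict_probability, probabilities_out, labels_out,
                   threshold=0.5):
    """Write one probability and one label per input line, in input order."""
    for line in score_lines:
        entry = json.loads(line)
        probability = predict_probability(entry)
        probabilities_out.write(f"{probability}\n")
        # Assumed rule: probabilities at or above the threshold count as clean (1).
        labels_out.write(f"{1 if probability >= threshold else 0}\n")


# Hypothetical score entries and a stand-in for the trained model.
scores = [
    '{"ExampleFilter": 0.93}',
    '{"ExampleFilter": 0.12}',
]
probabilities = io.StringIO()
labels = io.StringIO()
classify_lines(scores, lambda entry: entry["ExampleFilter"], probabilities, labels)
print(probabilities.getvalue(), end="")
print(labels.getvalue(), end="")
```

Because both outputs are written in input order, line n of each output file describes the sentence pair scored on line n of scores.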