
Pretraining

Pretraining was added recently with the release of spaCy 2.1. It implements Language Modelling with Approximate Outputs (LMAO): the ‘token-to-vector’ (tok2vec) layer shared by the pipeline components is pretrained with a language-modelling objective that predicts word vectors rather than words.

Following the spaCy 2.1 release notes, I ran two pretraining jobs: the first produces weights that can be used with any model that does not use vectors, such as en_core_web_sm or a blank model, and the second produces weights for use with en_core_web_md/lg.

python -m spacy pretrain texts.jsonl en_vectors_web_lg ./pretrained-model
python -m spacy pretrain texts.jsonl en_vectors_web_lg ./pretrained-model-vecs --use-vectors
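
The texts.jsonl file is the raw text used for pretraining. Assuming the JSONL input format described in the spaCy docs, it contains one JSON object per line with a "text" key; in our case each line holds a title (the titles below are made up for illustration):

{"text": "Scientists discover a new species of deep-sea fish"}
{"text": "Local council approves the budget for a new library"}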

Now we are ready to train the TextCategorizer. We just disable the other pipeline components as we train.
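
As a rough sketch of what that looks like (loosely following spaCy's train_textcat.py example; the labels, batch size, dropout and file names below are placeholders rather than the exact settings used here):

import random
import spacy
from spacy.util import minibatch

# placeholder training data; replace with the real (text, cats) pairs
train_data = [
    ("some title", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
]

nlp = spacy.load("en_core_web_lg")          # or spacy.blank("en") for the blank models
textcat = nlp.create_pipe("textcat")
nlp.add_pipe(textcat, last=True)
textcat.add_label("POSITIVE")               # placeholder labels
textcat.add_label("NEGATIVE")

# disable every component except the textcat while training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    # optionally initialise the tok2vec layer with the pretrained weights
    # (model999.bin is a placeholder for whichever output of `spacy pretrain` you pick)
    with open("pretrained-model/model999.bin", "rb") as file_:
        textcat.model.tok2vec.from_bytes(file_.read())
    for i in range(5):                      # five training iterations, as in the experiment
        random.shuffle(train_data)
        losses = {}
        for batch in minibatch(train_data, size=8):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
        print(i, losses["textcat"])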

I trained six models: three blank models and three en_core_web_lg models. In each group, one model was trained without pretraining, one with the corresponding pretrained weights after 50 pretraining iterations, and one with the corresponding pretrained weights after 1000 pretraining iterations. I then evaluated the training loss as well as accuracy, precision, recall and F1 scores on the test set at each of the five training iterations. Here I report only the loss, accuracy and F1 scores; the full results as well as the full training code can be found on GitHub.
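
For reference, a minimal sketch of how precision, recall and F1 can be computed for one category on the test set, in the spirit of the evaluate() helper from spaCy's textcat example (the "POSITIVE" label and 0.5 decision threshold are placeholder assumptions):

def evaluate(nlp, texts, cats, label="POSITIVE", threshold=0.5):
    # count true/false positives/negatives for the chosen category
    tp = fp = fn = tn = 0
    for doc, gold in zip(nlp.pipe(texts), cats):
        predicted = doc.cats[label] >= threshold
        actual = gold[label] >= 0.5
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}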

Considering that our pretraining data was not very large (only about 120,000 words) and that it was supplied as sentence-sized chunks (titles, which might not even be complete sentences) rather than the recommended paragraph-sized chunks, pretraining with 1000 iterations yields a reasonable improvement.