Pretraining
The pretrain command was added with the release of spaCy 2.1 and implements Language Modelling with Approximate Outputs (LMAO): the ‘token-to-vector’ (tok2vec) layer shared by the pipeline components is pretrained with a language modelling objective that predicts each word’s vector rather than the word itself.
Following the spaCy 2.1 release notes, I ran two pretraining jobs: the first produces weights that can be used with any model that does not use vectors, such as en_core_web_sm or a blank model; the second, run with --use-vectors, produces weights for use with en_core_web_md/lg.
python -m spacy pretrain texts.jsonl en_vectors_web_lg ./pretrained-model
python -m spacy pretrain texts.jsonl en_vectors_web_lg ./pretrained-model-vecs --use-vectors
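For reference, texts.jsonl holds the raw texts to pretrain on, one JSON object per line with the text under a "text" key. A minimal sketch of that format (the sentences here are made up):

{"text": "The first raw text to pretrain on."}
{"text": "Each line is a separate JSON object with a text key."}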
Now we are ready to train the TextCategorizer, disabling the other pipeline components while we train, as in the sketch below.
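A minimal sketch of that training loop, assuming train_data is a list of (text, annotations) pairs where annotations looks like {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}; the labels and the checkpoint filename model999.bin are placeholders (the actual checkpoint name depends on how long pretraining ran):

import random
import spacy
from spacy.util import minibatch

nlp = spacy.load("en_core_web_sm")
textcat = nlp.create_pipe("textcat")
textcat.add_label("POSITIVE")  # hypothetical labels for illustration
textcat.add_label("NEGATIVE")
nlp.add_pipe(textcat, last=True)

# Disable every pipeline component except the textcat while we train it
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    # Initialise the textcat's tok2vec layer with the pretrained weights
    # (assumed checkpoint path from the pretraining job above)
    with open("pretrained-model/model999.bin", "rb") as file_:
        textcat.model.tok2vec.from_bytes(file_.read())
    for epoch in range(10):
        random.shuffle(train_data)
        losses = {}
        for batch in minibatch(train_data, size=8):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
        print(epoch, losses)

If you train through the command line instead, the same initialisation is done by passing the pretrained weights to spacy train via the --init-tok2vec argument.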