BERT – the best language model this year
BERT is a language model proposed this November that achieved state-of-the-art results on almost all NLP tasks.
In this post, a demo was implemented with BERT to accomplish a text classification task.
A short description of BERT
You can find many articles about BERT on the Internet. Before end-to-end models such as CNNs and Transformers appeared, tackling an NLP task required pre-processing steps like word segmentation and Word2Vec embeddings. End-to-end models free us from this tedious work: all we need to do is feed in a sentence and get back a vector-level representation.
BERT discovers the relationships between the words in a sentence with bidirectional Transformers, and it learns the connections between sentences by predicting whether one sentence actually appeared before another (next-sentence prediction).
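To make the two pre-training signals concrete, here is a small illustration (the example sentences are mine, not from the BERT code) of what a masked-language-modelling example and a next-sentence-prediction example look like:

```python
# Illustration of BERT's two pre-training objectives (example data is made up).
masked_lm_example = {
    "input":  "the man went to the [MASK] to buy milk",
    "target": "store",   # predict the masked word using context from both directions
}

next_sentence_example = {
    "sentence_a": "the man went to the store.",
    "sentence_b": "he bought a gallon of milk.",
    "is_next": True,      # sentence_b really does follow sentence_a in the corpus
}
```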
The implementation of BERT is available on GitHub [2]. You can use the pretrained models to solve your own NLP problems.
Dataset in the task
The dataset for the task comes from Sogou News [1]. There are 10 classes: 体育 (sports), 健康 (health), 军事 (military), 娱乐 (entertainment), 教育 (education), 文化 (culture), 时尚 (fashion), 科技 (technology), 财经 (finance), and 汽车 (cars). Each class has 5,000 training examples, 500 validation examples, and 1,000 examples for prediction.
The dataset is distributed in an XML format, so you need to reorganize the data first. The raw data only provides the link, title, news content, and URL of each article; based on the URL, each article can be labelled with a class.
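Below is a minimal sketch of this reorganization step. It is my own, not the original preprocessing script: it assumes each record in the dump is a `<doc>` block containing `<url>` and `<content>` fields, and that you define the mapping from the URL host to a class label yourself, so adjust it to the file you actually downloaded.

```python
import re

# Hypothetical mapping from the host part of the article URL to a class label;
# fill in the remaining classes according to the dump you downloaded.
URL_TO_LABEL = {
    "sports": "体育",   # e.g. http://sports.sohu.com/...
    "auto":   "汽车",
}

def parse_sogou(path):
    """Turn a raw Sogou dump into 'label<TAB>text' lines."""
    with open(path, encoding="gb18030", errors="ignore") as f:
        raw = f.read()
    examples = []
    for block in re.findall(r"<doc>(.*?)</doc>", raw, re.S):
        url = re.search(r"<url>(.*?)</url>", block)
        content = re.search(r"<content>(.*?)</content>", block)
        if not url or not content:
            continue
        host = re.match(r"https?://([^./]+)\.", url.group(1))
        label = URL_TO_LABEL.get(host.group(1)) if host else None
        text = content.group(1).strip()
        if label and text:
            examples.append("%s\t%s" % (label, text))
    return examples
```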
Implementation of the classification model
The implementation of the classification model is quite simple. The key point is that you need to write your own processor class, following the default processors in run_classifier.py [3].
Then map your new processor class to a task name, say sougou.
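The sketch below shows what such a processor might look like. It is my own illustration under some assumptions: the reorganized data is stored as tab-separated files named train.tsv, dev.tsv and test.tsv with the label in the first column and the text in the second (these file names and the format are mine, not from the original post), and the class lives inside run_classifier.py so that DataProcessor, InputExample and tokenization are in scope.

```python
import os

class SougouProcessor(DataProcessor):
  """Processor for the Sogou News classification task (sketch)."""

  def get_train_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

  def get_dev_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

  def get_test_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

  def get_labels(self):
    # The ten Sogou News classes.
    return ["体育", "健康", "军事", "娱乐", "教育",
            "文化", "时尚", "科技", "财经", "汽车"]

  def _create_examples(self, lines, set_type):
    examples = []
    for i, line in enumerate(lines):
      guid = "%s-%d" % (set_type, i)
      label = tokenization.convert_to_unicode(line[0])
      text_a = tokenization.convert_to_unicode(line[1])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples
```

The new class then has to be registered in the processors dictionary inside main(), e.g. "sougou": SougouProcessor, so that --task_name=sougou picks it up.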
Be careful when running the code, especially with the input and output directories. Even a misplaced slash can cause an error.
python run_classifier.py \
--task_name=sougou \
--do_train=true \
--do_eval=true \
--do_predict=true \
--data_dir=$MYDATA \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--max_seq_length=384 \
--train_batch_size=12 \
--learning_rate=3e-5 \
--num_train_epochs=2 \
--output_dir=/tmp/sougou_output/
Debugging is more difficult
Compared with writing the code, debugging is harder. After training, the evaluation accuracy was stuck at 0.1. Guessing randomly among 10 classes also gives an accuracy of 0.1, which means the model was useless. This bug bothered me for three days.
Finally, I found that only the first 5,000 examples were being used for training, and all of them were labelled as the 汽车 (car) class. This is a textbook case of overfitting: after training, the model classified every news article as car.
To solve the problem, simply shuffle the data before building the examples.
from random import shuffle
# Shuffle the raw lines in place so that every batch mixes all ten classes.
shuffle(lines)
The final results were satisfying, in line with what the BERT paper reports.
The hyper-parameters I used are listed below.
| Hyper-parameter | Value |
|---|---|
| Batch size | 12 |
| Learning rate | 3e-5 |
| Epochs | 2 |
| max_seq_length | 384 |
| Pretrained parameters | chinese_L-12_H-768_A-12 |
Lessons I learned
Print the loss! Print the loss! Print the loss! The loss reflects the training process. If it decreases too quickly or too slowly, something is probably wrong. For example, before the bug was fixed the loss dropped very fast, reaching values like 0.0008, then jumped back to around 8 at the final step. Eventually I realized that the model was training on a single class, which is why the loss looked so low. Normally, the loss should follow a curve like 2.8, 2.72, 2.65, …
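As an illustration, here is one way the loss could be surfaced every few steps when fine-tuning with a TF 1.x Estimator. This is a sketch of my own: in run_classifier.py the loss tensor inside model_fn_builder is called total_loss, but exactly where you attach such a hook in your copy of the script is an assumption.

```python
import tensorflow as tf

def loss_logging_hook(total_loss, every_n_iter=100):
  """Returns a hook that logs the current training loss every `every_n_iter` steps."""
  return tf.train.LoggingTensorHook(
      tensors={"loss": total_loss}, every_n_iter=every_n_iter)

# Inside model_fn, for the TRAIN mode, something along these lines:
#   output_spec = tf.estimator.EstimatorSpec(
#       mode=mode,
#       loss=total_loss,
#       train_op=train_op,
#       training_hooks=[loss_logging_hook(total_loss)])
```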
Check the hyper-parameters. Bad results sometimes come from bad combinations of hyper-parameters. The learning rate is important; 1e-5 is a little low. At first I set the number of epochs to 0.1, which is far too small. For pre-training, the number of epochs should be more than 100; for fine-tuning, 1 or 2 epochs are enough. Of course, this differs from task to task.
Be sensitive to the organization and characteristics of the dataset. I eventually noticed that the first 5,000 examples were all about cars and that the predictions were all 汽车 (car) as well.
References
[1] http://www.sogou.com/labs/resource/cs.php
[2] https://github.com/google-research/bert
[3] https://huhuhang.com/nlp-google-bert/