BERTScore: Evaluating Text Generation with BERT
By- Dikshant
BERT is an evolution of transformer-based architectures that uses the attention mechanism, i.e. it produces contextual information for every word. Standard (unidirectional) Transformer language models read text in only one direction, so it is hard for them to capture the actual context of a word in a sentence, whereas BERT is a bidirectional model and can therefore contextualize each word more effectively.
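To make the "contextual" part concrete, here is a minimal sketch (not from the paper, just an illustration) showing that the same word gets different BERT embeddings in different sentences. It assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint:

```python
# A sketch of what "contextual" means: the same word gets a different BERT
# embedding depending on the sentence it appears in. Assumes the Hugging Face
# `transformers` library and the public "bert-base-uncased" checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (num_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = embedding_of("he sat down on the river bank", "bank")
v2 = embedding_of("she opened an account at the bank", "bank")
# The two vectors differ because BERT reads the whole sentence in both directions.
print(torch.cosine_similarity(v1, v2, dim=0))                  # well below 1.0
```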
Why BERTScore?
BERTScore was developed to address the limitations of traditional evaluation metrics like BLEU and ROUGE in the field of Natural Language Processing (NLP). These traditional metrics, which are based on n-gram matching, often fail to capture the semantic similarity between texts, especially when the texts use different but semantically similar phrases.
One of the key features of BERTScore is its ability to provide robust paraphrase matching. It leverages the power of BERT’s contextual embeddings to compute similarity scores between tokens in a candidate sentence and a reference sentence. This allows BERTScore to effectively match paraphrases and evaluate the quality of generated text, such as in machine translation or text summarization.
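In practice you don't have to implement any of this by hand: there is an open-source bert-score package (pip install bert-score). A minimal usage sketch, with example sentences of my own; exact argument names may differ between versions:

```python
# Minimal sketch using the open-source `bert-score` package
# (pip install bert-score); exact arguments may differ between versions.
from bert_score import score

candidates = ["the weather is freezing today"]   # generated text
references = ["it is cold today"]                # ground-truth text

# Returns per-sentence precision, recall and F1 tensors.
P, R, F1 = score(candidates, references, lang="en", verbose=True)
print(f"P={P.mean():.3f}  R={R.mean():.3f}  F1={F1.mean():.3f}")
```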
The paper discusses evaluating text generation with the help of BERT.
Suppose some model generates a candidate text for us; BERTScore is the similarity between that generated text and the actual (reference) text, computed using BERT's contextual embeddings.
In the diagram in the paper, we see that the reference x and the candidate x̂ are passed through a pre-trained BERT model, which produces contextual embeddings for their tokens.
Each token embedding from the reference is then compared, via cosine similarity, with every token embedding in the candidate, giving a pairwise similarity matrix.
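A rough sketch of these two steps (embedding and pairwise similarity), again assuming Hugging Face transformers and bert-base-uncased; the official implementation picks a specific hidden layer, which is glossed over here:

```python
# Sketch of the embedding + pairwise-similarity step: embed reference and
# candidate with a pre-trained BERT, L2-normalise the token embeddings, and
# take dot products to get the cosine similarity between every token pair.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_embeddings(sentence: str) -> torch.Tensor:
    # add_special_tokens=False so only real words are matched, not [CLS]/[SEP]
    inputs = tokenizer(sentence, return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        return model(**inputs).last_hidden_state[0]           # (num_tokens, 768)

ref_emb = token_embeddings("it is cold today")                # reference x
cand_emb = token_embeddings("the weather is freezing today")  # candidate x̂

ref_emb = torch.nn.functional.normalize(ref_emb, dim=-1)
cand_emb = torch.nn.functional.normalize(cand_emb, dim=-1)

# sim[i, j] = cosine similarity between reference token i and candidate token j
sim = ref_emb @ cand_emb.T
print(sim.shape)   # (number of reference tokens, number of candidate tokens)
```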
For recall, we take, for each reference token, the maximum similarity against the candidate tokens, sum these maxima, and normalize by the number of tokens in the reference.
The same is done in the other direction, from candidate to reference, this time normalizing by the total number of tokens in the candidate; we call that precision.
Finally, the F-score is the harmonic mean of precision and recall.
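Continuing from the sim matrix built above, a sketch of this greedy matching (the paper also supports idf importance weighting and baseline rescaling, which are left out here):

```python
# Greedy matching on the pairwise similarity matrix:
#   recall    = average, over reference tokens, of each token's best match in the candidate
#   precision = average, over candidate tokens, of each token's best match in the reference
#   F1        = harmonic mean of the two
import torch

def bert_score_from_sim(sim: torch.Tensor) -> tuple:
    recall = sim.max(dim=1).values.mean()      # best candidate match per reference token
    precision = sim.max(dim=0).values.mean()   # best reference match per candidate token
    f1 = 2 * precision * recall / (precision + recall)
    return precision.item(), recall.item(), f1.item()

# `sim` is the (num_reference_tokens, num_candidate_tokens) matrix from the previous sketch.
print(bert_score_from_sim(sim))
```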