Universal Sentence Encoder

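A paper from Google Research in which the authors propose a universal sentence encoder. Compared with traditional word embeddings, the encoder achieves better accuracy on a range of NLP tasks and can be used for transfer learning.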
paper link
code link


In this paper, we present two models for producing sentence embeddings that demonstrate good transfer to a number of other NLP tasks.

import tensorflow_hub as hub

# The module URL is empty in the source; pass the TF Hub handle of the encoder here.
embed = hub.Module("")
embeddings = embed([
    "The quick brown fox jumps over the lazy dog.",
    "I am a sentence for which I would like to get its embedding"])


# The following is example embedding output, 512 dimensions per sentence
# Embedding for: The quick brown fox jumps over the lazy dog.
# [-0.016987282782793045, -0.008949815295636654, -0.0070627182722091675, ...]
# Embedding for: I am a sentence for which I would like to get its embedding.
# [0.03531332314014435, -0.025384284555912018, -0.007880025543272495, ...]

_This module is about 1GB. Depending on your network speed, it might take a while to load the first time you instantiate it. After that, loading the model should be faster as modules are cached by default (learn more about caching). Further, once a module is loaded to memory, inference time should be relatively fast._

The paper proposes two Universal Sentence Encoders based on different network architectures:

Our two encoders have different design goals. One based on the transformer architecture targets high accuracy at the cost of greater model complexity and resource consumption. The other targets efficient inference with slightly reduced accuracy.





Deep Averaging Network (DAN)

The second encoding model makes use of a deep averaging network (DAN) (Iyyer et al., 2015) whereby input embeddings for words and bi-grams are first averaged together and then passed through a feedforward deep neural network (DNN) to produce sentence embeddings.
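The averaging-then-feedforward idea can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the paper's trained model: the function name `dan_encode`, the tanh activation, the layer sizes, and the randomly initialized weights are all hypothetical choices made for the example.

```python
import numpy as np

def dan_encode(token_vectors, weights, biases):
    """Average the input embeddings, then apply a feedforward network."""
    # Step 1: average the word and bi-gram embeddings into a single vector.
    hidden = np.mean(token_vectors, axis=0)
    # Step 2: pass the average through each dense layer.
    # tanh is an assumed activation, chosen only for illustration.
    for w, b in zip(weights, biases):
        hidden = np.tanh(hidden @ w + b)
    return hidden

rng = np.random.default_rng(0)
embed_dim, hidden_dim, out_dim = 8, 16, 4  # toy sizes, not the paper's

# Hypothetical input: 5 word embeddings plus 4 bi-gram embeddings for one sentence.
token_vectors = rng.normal(size=(9, embed_dim))
weights = [rng.normal(size=(embed_dim, hidden_dim)),
           rng.normal(size=(hidden_dim, out_dim))]
biases = [np.zeros(hidden_dim), np.zeros(out_dim)]

sentence_embedding = dan_encode(token_vectors, weights, biases)
print(sentence_embedding.shape)
```

Because the first step is a plain average, the output does not depend on the order of the input embeddings, and the cost grows only linearly with sentence length. This is consistent with the efficiency goal stated for this encoder.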


Transfer Learning Models

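  • For text classification tasks, the output of either sentence encoder is used as the input to the classification model;
  • For semantic similarity tasks, similarity is computed directly from the output vectors of the sentence encoder: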

    As shown in Eq. 1, we first compute the cosine similarity of the two sentence embeddings and then use arccos to convert the cosine similarity into an angular distance. We find that using a similarity based on angular distance performs better on average than raw cosine similarity.
    $$\mathrm{sim}(u, v) = 1 - \frac{\arccos\left(\frac{u \cdot v}{\|u\| \, \|v\|}\right)}{\pi} \qquad (1)$$
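As a rough illustration of Eq. (1), the angular similarity can be computed with NumPy as follows. The function name `angular_similarity` and the clipping step are assumptions added for this sketch. Clipping only guards against floating-point rounding that could push the cosine slightly outside the domain of arccos.

```python
import numpy as np

def angular_similarity(u, v):
    """Similarity from Eq. (1): 1 - arccos(cosine similarity) / pi."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Guard against rounding errors such as 1.0000000002 (assumed safety step).
    cos = np.clip(cos, -1.0, 1.0)
    return 1.0 - np.arccos(cos) / np.pi

print(angular_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction: 1.0
print(angular_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal: 0.5
print(angular_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: 0.0
```

The score lies in [0, 1] and, unlike raw cosine similarity, changes linearly with the angle between the two vectors.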



  • A baseline that uses pretrained word2vec embeddings
  • A baseline that uses no pretrained model

Combined Transfer Models

The paper also tries combining the sentence-level and word-level models. The experimental results are as follows.


Table 2: Model performance on transfer tasks. USE_T is the universal sentence encoder (USE) using Transformer. USE_D is the universal encoder DAN model. Models tagged with w2v w.e. make use of pre-training word2vec skip-gram embeddings for the transfer task model, while models tagged with lrn w.e. use randomly initialized word embeddings that are learned only on the transfer task data. Accuracy is reported for all evaluations except STS Bench where we report the Pearson correlation of the similarity scores with human judgments. Pairwise similarity scores are computed directly using the sentence embeddings from the universal sentence encoder as in Eq. (1).

  • MR : Movie review snippet sentiment on a five star scale (Pang and Lee, 2005).

  • CR : Sentiment of sentences mined from customer reviews (Hu and Liu, 2004).

  • SUBJ : Subjectivity of sentences from movie reviews and plot summaries (Pang and Lee, 2004).

  • MPQA : Phrase level opinion polarity from news data (Wiebe et al., 2005).

  • TREC : Fine grained question classification sourced from TREC (Li and Roth, 2002).

  • SST : Binary phrase level sentiment classification (Socher et al., 2013).

  • STS Benchmark : Semantic textual similarity (STS) between sentence pairs scored by Pearson correlation with human judgments (Cer et al., 2017).
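For the STS Benchmark, the score is the Pearson correlation between model similarity scores and human judgments. A minimal sketch of that computation is shown here. The helper name `pearson_r` is hypothetical, and the numbers are made-up toy values used only to exercise the function. They are not results from the paper.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # center both variables
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

# Made-up toy values: model similarity scores and human ratings for four sentence pairs.
model_scores = [0.91, 0.35, 0.62, 0.78]
human_scores = [4.8, 1.0, 2.6, 4.0]

print(pearson_r(model_scores, human_scores))
```

Pearson correlation is invariant to linear rescaling, so the model scores in [0, 1] can be compared directly against human ratings on a different scale.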


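  1. The Transformer-based USE usually outperforms the DAN-based USE
  2. USE outperforms models that use only a word-level encoder
  3. The best results usually come from combining sentence-level and word-level transfer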

Table 3 illustrates transfer task performance for varying amounts of training data. We observe that, for smaller quantities of data, sentence level transfer learning can achieve surprisingly good task performance. As the training set size increases, models that do not make use of transfer learning approach the performance of the other models.

Table 3: Task performance on SST for varying amounts of training data. SST 67.3k represents the full training set. Using only 1,000 examples for training, transfer learning from USE_T is able to obtain performance that rivals many of the other models trained on the full 67.3 thousand example training set.


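The sentence-level USE models outperform word-level models on most transfer learning tasks, especially on small datasets, while combining sentence-level and word-level transfer achieves the best accuracy.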