JointReps: Jointly Learning Word Representations using a Corpus and a Knowledge Base (KB)
Overview.
JointReps is a joint model for learning distributed word vector representations (word embeddings) from both a large text corpus and a knowledge base (KB). It combines the rich semantic relations between words encoded in the KB with word co-occurrence statistics from the corpus to learn word representations in a vector space. Specifically, JointReps uses the corpus and the KB to define a single global joint objective function.
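The exact objective is defined in the papers below; purely as a rough, hypothetical sketch (not the released implementation), one can picture a GloVe-style weighted least-squares corpus term combined with a KB regularizer that pulls the vectors of related words together. All names, the toy data, and the interpolation weight `lam` here are illustrative assumptions:

```python
import numpy as np

# Hypothetical sketch of a joint corpus + KB objective.
# Toy data only; the real model, weighting, and KB term are in the papers.
rng = np.random.default_rng(0)
V, d = 5, 4                                 # toy vocabulary size and dimensionality
W = rng.normal(scale=0.1, size=(V, d))      # target word vectors
C = rng.normal(scale=0.1, size=(V, d))      # context word vectors
b_w = np.zeros(V)                           # word biases
b_c = np.zeros(V)                           # context biases
X = rng.integers(1, 20, size=(V, V)).astype(float)  # toy co-occurrence counts
kb_edges = [(0, 1), (2, 3)]                 # toy KB relation pairs (e.g. synonyms)
lam = 0.1                                   # corpus/KB interpolation weight (assumed)

def joint_objective(W, C, b_w, b_c, X, kb_edges, lam, x_max=100.0, alpha=0.75):
    # GloVe-style weighted least-squares fit to log co-occurrence counts
    f = np.minimum((X / x_max) ** alpha, 1.0)
    pred = W @ C.T + b_w[:, None] + b_c[None, :]
    corpus_term = 0.5 * np.sum(f * (pred - np.log(X)) ** 2)
    # KB term: squared distance between vectors of KB-related words
    kb_term = sum(np.sum((W[i] - W[j]) ** 2) for i, j in kb_edges)
    return corpus_term + lam * kb_term

print(joint_objective(W, C, b_w, b_c, X, kb_edges, lam))
```

Because the KB term is non-negative, increasing `lam` can only raise the loss for fixed vectors; during training it trades corpus fit against agreement with the KB.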
Utilizing a KB gives JointReps several advantages:
- It benefits from the knowledge encoded in the KB during the representation-learning phase
- Any KB that specifies semantic relations between words, such as WordNet, FrameNet or the Paraphrase Database, can be used with JointReps
- It offers three different novel mechanisms (SKB, NNE, HNE) for integrating knowledge from the KB; details are reported in the publications below
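The mechanisms themselves are detailed in the publications; purely as a hypothetical illustration of the nearest-neighbour-expansion idea, one might enrich the KB edge set with corpus-space nearest neighbours of related words. The function name, similarity choice, and `k` below are assumptions, not the released code:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity with a small epsilon to avoid division by zero
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def expand_edges(edges, vectors, k=1):
    """Illustrative expansion: keep the original KB edges and, for each
    edge (a, b), also link a to the k corpus-space nearest neighbours
    of b (excluding a and b themselves)."""
    expanded = set(edges)
    for a, b in edges:
        sims = [(cosine(vectors[b], vectors[c]), c)
                for c in range(len(vectors)) if c not in (a, b)]
        for _, c in sorted(sims, reverse=True)[:k]:
            expanded.add((a, c))
    return expanded
```

For example, with pre-trained corpus vectors and a single synonym edge, the word closest to the edge's endpoint in the corpus space would gain an edge as well.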
By incorporating the knowledge in the KB into the process of learning word vector representations from the corpus (as shown in the publications below), JointReps has been shown to achieve:
- A significant improvement over corpus-only approaches in the quality of the learnt word embeddings
- An improvement over various prior models that combine the two sources for learning word embeddings
- Stable performance across a range of word vector dimensionalities
Publications.
The JointReps model is described in the following papers; please cite them if you use any of the available resources.
- Mohammed Alsuhaibani, Danushka Bollegala, Takanori Maehara and Ken-ichi Kawarabayashi: Jointly Learning Word Embeddings using a Corpus and a Knowledge Base, PLOS ONE, 2018 [PDF]
- Danushka Bollegala, Mohammed Alsuhaibani, Takanori Maehara, and Ken-ichi Kawarabayashi: Joint Word Representation Learning using a Corpus and a Semantic Lexicon, 30th AAAI Conference on Artificial Intelligence (AAAI), pp. 2690-2696, Arizona, USA, February 2016 [PDF][BibTex]
Downloads.
The pre-trained word vectors reported in the above publications are available for download.
- 2 billion tokens (full ukWaC corpus), 400k vocab, 300d, 87K WordNet synonym edges, Static Knowledge Base (SKB) [JointReps_2B_87Ksyn_300d_SKB.zip]
- 2 billion tokens (full ukWaC corpus), 400k vocab, 300d, 108K WordNet synonym edges, Nearest Neighbour Expansion (NNE) [JointReps_2B_108Ksyn_300d_NNE.zip]
- 2 billion tokens (full ukWaC corpus), 400k vocab, 300d, 104K WordNet synonym edges, Hedged Nearest Neighbour Expansion (HNE) [JointReps_2B_104Ksyn_300d_HNE.zip]
Code.
The source code is available [here].
Authors.
- Mohammed Alsuhaibani, PhD student, Department of Computer Science, University of Liverpool, UK
- Danushka Bollegala, Associate Professor, Department of Computer Science, University of Liverpool, UK
- Takanori Maehara, Assistant Professor, Shizuoka University, Shizuoka, Japan
- Ken-ichi Kawarabayashi, Professor, National Institute of Informatics, Tokyo, Japan
Contact.
For any enquiries about JointReps, please feel free to contact: m[dot]a[dot]alsuhaibani[at]liverpool[dot]ac[dot]uk