An embedding layer turns positive integers (indexes) into dense vectors of fixed size.
Why should we use an embedding layer? Two main reasons:
1. One-hot encoded vectors are high-dimensional and sparse. Let's assume that we are doing Natural Language Processing (NLP) and have a dictionary of 2000 words. This means that, when using one-hot encoding, each word will be represented by a vector containing 2000 integers, 1999 of which are zeros. On a big dataset this approach is not computationally efficient.
2. The vector of each embedding gets updated while training the neural network. If you have seen the image at the top of this post, you can see how similarities between words can be found in a multi-dimensional space. This allows us to visualize relationships between words, but also between anything that can be turned into a vector through an embedding layer.
This concept might still be a bit vague, so let's look at what an embedding layer does with an example of words. (In fact, embeddings originate from word embeddings; look up word2vec if you are interested in reading more.) Let's take this sentence as an example (do not take it too seriously):
“deep learning is very deep”
The first step in using an embedding layer is to encode this sentence by indices: we assign an index to each unique word. The sentence then looks like this:
1 2 3 4 1
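As a minimal sketch, this index encoding can be done with a plain Python dict (starting the indices at 1, since index 0 is often reserved for padding):

```python
sentence = "deep learning is very deep".split()

word_to_index = {}
for word in sentence:
    if word not in word_to_index:
        # Assign the next free index to each new word; 0 is reserved for padding.
        word_to_index[word] = len(word_to_index) + 1

encoded = [word_to_index[word] for word in sentence]
print(encoded)  # [1, 2, 3, 4, 1]
```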
Next, the embedding matrix gets created. We decide how many 'latent factors' are assigned to each index; basically, this means how long we want each vector to be. Common choices are lengths like 32 and 50. Let's assign 6 latent factors per index in this post to keep it readable. The embedding matrix then looks like this:
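A minimal numpy sketch of the matrix and the lookup it enables (the row for "deep" is the vector used in this post; the other rows are made-up values for illustration):

```python
import numpy as np

# Embedding matrix: one row per index (row 0 is typically reserved for padding),
# 6 latent factors per row. Row 1 is the vector for "deep" from this post;
# the remaining rows are invented for illustration.
embedding_matrix = np.array([
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00],  # index 0: padding
    [0.32, 0.02, 0.48, 0.21, 0.56, 0.15],  # index 1: "deep"
    [0.65, 0.23, 0.41, 0.57, 0.03, 0.92],  # index 2: "learning" (made up)
    [0.45, 0.87, 0.68, 0.12, 0.34, 0.56],  # index 3: "is" (made up)
    [0.12, 0.05, 0.91, 0.08, 0.77, 0.33],  # index 4: "very" (made up)
])

encoded_sentence = [1, 2, 3, 4, 1]  # "deep learning is very deep"
embedded = embedding_matrix[encoded_sentence]  # a simple row lookup
# embedded has shape (5, 6): 5 words, 6 latent factors each
```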
So, instead of ending up with huge one-hot encoded vectors, we can use an embedding matrix to keep each vector much smaller. In short, the word "deep" gets represented by the vector [.32, .02, .48, .21, .56, .15]. Note that a word is not literally replaced by its vector; instead, it is replaced by an index that is used to look up the vector in the embedding matrix. Once again, this is computationally efficient for very big datasets. Because the embedding vectors also get updated during the training of the deep neural network, we can explore which words are similar to each other in a multi-dimensional space. By using dimensionality reduction techniques like t-SNE, these similarities can be visualized (https://lvdmaaten.github.io/tsne/).
t-SNE visualization of word embeddings
From: https://medium.com/towards-data-science/deep-learning-4-embedding-layers-f9a02d55ac12
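A rough sketch of how such a visualization could be produced with scikit-learn's t-SNE (the embedding matrix here is random stand-in data; in practice you would use the trained weights of the embedding layer):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for a trained embedding matrix: 20 words, 6 latent factors each.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(20, 6))

# Project the 6-dimensional vectors down to 2 dimensions for plotting.
coords = TSNE(n_components=2, perplexity=5, init="random",
              random_state=0).fit_transform(embeddings)
# coords has shape (20, 2): one (x, y) point per word, ready to scatter-plot.
```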
If you want to use the embedding, note that the output of the embedding layer has 3 dimensions. This works well with an LSTM or GRU (see below), but if you want a binary classifier you need to flatten it to 2 dimensions:
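A minimal sketch in the tensorflow.keras API (the vocabulary size of 2000 and sequence length of 10 are arbitrary choices for illustration):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense

model = Sequential([
    Input(shape=(10,)),                        # sequences of 10 word indices
    Embedding(input_dim=2000, output_dim=32),  # -> (batch, 10, 32): 3 dimensions
    Flatten(),                                 # -> (batch, 320): 2 dimensions
    Dense(1, activation="sigmoid"),            # binary classifier
])
```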
An LSTM layer has historical memory, so the 3-dimensional output of the embedding works directly in this case; there is no need to flatten anything:
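The same sketch with an LSTM consuming the embedding's 3-D output directly (layer sizes are again illustrative assumptions):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

model = Sequential([
    Input(shape=(10,)),                        # sequences of 10 word indices
    Embedding(input_dim=2000, output_dim=32),  # -> (batch, 10, 32)
    LSTM(16),                                  # consumes the 3-D output directly
    Dense(1, activation="sigmoid"),
])
```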
From: http://www.orbifold.net/default/2017/01/10/embedding-and-tokenizer-in-keras/
keras.layers.embeddings.Embedding(input_dim, output_dim, embeddings_initializer='uniform', embeddings_regularizer=None, embeddings_constraint=None, mask_zero=False, input_length=None)
input_dim: int > 0. Size of the vocabulary, i.e., maximum integer index + 1.
output_dim: int > 0. Dimension of the dense embedding; this is the number of 'latent factors' discussed above.
embeddings_initializer: Initializer for the embeddings matrix.
embeddings_regularizer: Regularizer function applied to the embeddings matrix.
embeddings_constraint: Constraint function applied to the embeddings matrix.
input_length: Length of the input sequences; it equals input.shape[1].
Input shape:
2D tensor with shape (batch_size, sequence_length)
Output shape:
3D tensor with shape (batch_size, sequence_length, output_dim).
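These shapes can be checked by calling the layer directly on a batch of indices (the batch size of 8, sequence length of 10, and the 2000/32 dimensions are illustrative):

```python
import numpy as np
from tensorflow.keras.layers import Embedding

layer = Embedding(input_dim=2000, output_dim=32)
batch = np.random.randint(0, 2000, size=(8, 10))  # (batch_size=8, sequence_length=10)
out = layer(batch)
# out has shape (8, 10, 32): (batch_size, sequence_length, output_dim)
```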
From: https://keras.io/layers/embeddings/