Notes on neural machine translation systems
This blog post is inspired by an NVIDIA tutorial on building a machine translation system using neural networks.
I want to write out my own thoughts after reading that post.
Thoughts on the NVIDIA post
People have built many MT (machine translation) systems using statistical methods; these are called Statistical Machine Translation (SMT) systems.
SMT with Training Data
We can obtain datasets for MT from the Internet, for example from the Workshop on Statistical Machine Translation or the International Workshop on Spoken Language Translation.
Using data in pairs \(D = \{(x^1, y^1), \cdots , (x^N, y^N)\}\), where \(x\) stands for the source sentence and \(y\) for the target sentence, we can build an SMT system by maximizing the log-likelihood over the dataset.
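Concretely, maximizing the log-likelihood here means choosing the model parameters \(\theta\) that maximize \(\sum_{n=1}^{N} \log p(y^n \vert x^n, \theta)\), the summed log-probability of each target sentence given its source sentence.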
SMT systems may adopt neural networks as feature functions or in reranking strategies, but this post focuses on building an MT system in a COMPLETELY neural-network way.
Recurrent Neural Network
To produce an output \(y_t\) at each step, a recurrent neural network first updates its hidden state with an affine transformation followed by a nonlinearity, e.g. \(h_t = \tanh(Wx_t + Uh_{t-1} + b)\).
This kind of RNN is easy to implement in Theano; see the RNN entry in the materials below.
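To make the recurrence concrete for myself, here is a minimal sketch in plain numpy rather than Theano; the shapes and names (W, U, b, hidden_size) are my own choices, not something from the NVIDIA post.

```python
import numpy as np

def rnn_forward(xs, W, U, b, h0):
    """Run h_t = tanh(W x_t + U h_{t-1} + b) over a whole sequence.

    xs: (T, input_size), one row per time step
    W:  (hidden_size, input_size), U: (hidden_size, hidden_size), b: (hidden_size,)
    """
    h, hs = h0, []
    for x_t in xs:
        h = np.tanh(W @ x_t + U @ h + b)
        hs.append(h)
    return np.stack(hs)              # all hidden states, shape (T, hidden_size)

# toy usage with random parameters
rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 4))         # 5 time steps, input_size 4
W = 0.1 * rng.normal(size=(8, 4))    # hidden_size 8
U = 0.1 * rng.normal(size=(8, 8))
hs = rnn_forward(xs, W, U, np.zeros(8), np.zeros(8))
print(hs.shape)                      # (5, 8)
```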
Other sophisticated activation functions may be useful:
- long short-term memory units [Hochreiter and Schmidhuber, 1997]
- gated recurrent units [Cho et al., 2014]
These are more sophisticated but still easy to implement in Theano; see the LSTM entry in the materials below.
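As an example of such a unit, here is a rough numpy sketch of one gated recurrent unit step in the spirit of Cho et al., 2014; the parameter names and the exact gating convention are my own assumptions (biases omitted), so treat it as illustration rather than a reference implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU update: gates decide how much of the old state to keep."""
    z = sigmoid(Wz @ x_t + Uz @ h_prev)                # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)                # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))    # candidate state
    return z * h_prev + (1.0 - z) * h_tilde            # mix old state and candidate
```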
In an RNN, we model a sentence by \(p(x_1, x_2, \dots, x_T) = \prod_i p(x_i \vert x_{i-1}, \dots, x_1)\). We use \(g_\theta\) to produce each probability from the previous hidden state: \(p(x_i \vert x_{i-1}, \dots, x_1) = g_\theta(h_{i-1})\), where \(h_{i-1} = \phi_\theta(x_{i-1}, h_{i-2})\).
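A common concrete choice (my assumption here, not something spelled out in the post) is to let \(g_\theta\) be a softmax over the vocabulary applied to an affine function of the hidden state. In numpy, scoring one sentence under such a model would look roughly like this:

```python
import numpy as np

def softmax(a):
    a = a - a.max()                      # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum()

def sentence_log_prob(token_ids, step, V, c, h0):
    """Sum of log p(x_i | x_{i-1}, ..., x_1) for one sentence.

    step(x_id, h_prev) -> h   plays the role of phi_theta,
    softmax(V h + c)          plays the role of g_theta.
    """
    h, total = h0, 0.0
    for x_id in token_ids:
        p = softmax(V @ h + c)           # distribution over the next word
        total += np.log(p[x_id])         # log-probability of the word that occurred
        h = step(x_id, h)                # fold the observed word into the state
    return total
```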
The author includes one of his slides to explain how to use an RNN for language modeling.
An Encoder-Decoder Architecture for MT
The encoder-decoder architecture is an old idea, but here we use RNNs for both the encoder and the decoder.
Using an RNN, the encoder reads in a sentence and stores it as a representation in the last hidden state \(h_T\). By using PCA to reduce the dimensionality to two, we can visualize what this sentence representation looks like.
The encoder flow is: input words (1-hot) -> continuous representation -> RNN hidden states.
The decoder flow is: RNN hidden state -> word probabilities -> word sample.
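Putting the two flows together, here is a heavily simplified greedy-decoding sketch in numpy; the embeddings, parameters, step functions and end-of-sentence handling are all placeholders I made up for illustration:

```python
import numpy as np

def encode(src_ids, emb_src, enc_step, h0):
    """Encoder flow: 1-hot word ids -> embeddings -> RNN hidden states; keep the last one."""
    h = h0
    for i in src_ids:
        h = enc_step(emb_src[i], h)      # emb_src[i] is the continuous representation
    return h                             # h_T summarizes the source sentence

def decode(h_T, emb_tgt, dec_step, V, c, eos_id, max_len=50):
    """Decoder flow: RNN state -> word probabilities -> pick a word, feed it back."""
    h, out = h_T, []
    for _ in range(max_len):
        logits = V @ h + c
        p = np.exp(logits - logits.max())
        p /= p.sum()                     # word probabilities
        y = int(np.argmax(p))            # greedy pick instead of a random sample
        if y == eos_id:
            break
        out.append(y)
        h = dec_step(emb_tgt[y], h)      # condition the state on the chosen word
    return out
```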
Training a model like this only needs traditional methods:
- Back-prop
- SGD
- MLE
Gradient computation is easy with Theano.
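For example, a tiny sketch of Theano's symbolic differentiation on a toy loss (the loss and shapes are my own, not taken from the post):

```python
import numpy as np
import theano
import theano.tensor as T

x = T.dvector('x')                              # symbolic input vector
W = theano.shared(np.random.randn(3, 3), name='W')
h = T.tanh(T.dot(W, x))                         # toy hidden layer
loss = T.sum(h ** 2)                            # toy scalar loss
gW = T.grad(loss, W)                            # symbolic gradient, no manual back-prop
grad_fn = theano.function([x], gW)              # compile into a callable
print(grad_fn(np.ones(3)))                      # 3x3 gradient array
```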
Hand-tuning the learning rate is frustrating, so we use adaptive learning rate algorithms instead.
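For reference, the Adam update [Kingma and Ba, 2015] listed in the materials below looks roughly like this in numpy (using the usual default hyperparameters; this is my sketch, not code from the post):

```python
import numpy as np

def adam_update(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: keep running averages of the gradient and its square."""
    m = beta1 * m + (1 - beta1) * grad            # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```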
Introducing the Attention Mechanism
The big problem with the encoder-decoder architecture: performance declines as sentences get longer.
We may use two RNNs: one reads the sentence from left to right, and the other from right to left.
I DID NOT QUITE UNDERSTAND THIS AT THE TIME
TODO: DO MORE INVESTIGATION
The idea is to include a small neural network in the decoder to do almost exactly this. This small network is called the attention mechanism, and it performs nicely.
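Here is my rough numpy sketch of what such an attention network computes at one decoder step, in the spirit of Bahdanau et al., 2015; the additive scoring form and all variable names are assumptions on my part:

```python
import numpy as np

def attention_context(s_prev, H, Wa, Ua, va):
    """Score every encoder state against the decoder state, then take a weighted average.

    s_prev: previous decoder state, shape (d,)
    H:      encoder states, shape (T, d_enc), e.g. the two reading directions concatenated
    """
    scores = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h_j) for h_j in H])
    scores -= scores.max()                           # numerical stability
    alphas = np.exp(scores) / np.exp(scores).sum()   # attention weights over source positions
    return alphas @ H                                # weighted sum = context vector
```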
Other things to investigate later
deeplearning.net materials:
Recurrent Neural Network with word embeddings
- how to do inference?
- how to train an RNN using back-prop?
- try to write one in 10 lines of Theano
- RNN for language modeling
- code sample: https://github.com/lisa-lab/DeepLearningTutorials/blob/master/code/lstm.py#L457
- docs: http://deeplearning.net/software/theano/library/gradient.html
Adaptive learning rate algorithms:
- Adadelta [Zeiler, 2012]
- Adam [Kingma and Ba, 2015]
Attention mechanism:
- What problem are we confronting?
- What is attention?
- Attention performance: Bahdanau et al., 2015
Other deep learning references:
- UFLDL: http://ufldl.stanford.edu/tutorial/
- Neural Networks and Deep Learning online book: http://neuralnetworksanddeeplearning.com/index.html
This work is licensed under a Creative Commons Attribution 4.0 International License.