by ZWZ
Algorithm overview:
Medianstop is a simple early-stopping strategy; see the referenced paper. Trial X is stopped at step S if its best objective value up to step S is clearly lower than the median of all completed trials' values at step S.
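The rule can be sketched in a few lines of Python (a hypothetical helper, not NNI's actual implementation — NNI's version compares running averages, while this sketch compares the raw reported values, assuming higher is better):

```python
def should_stop(trial_history, completed_histories, step):
    """Median stopping sketch: stop the running trial at `step` if its best
    objective so far is below the median of the completed trials' values
    reported at the same step (higher = better is assumed)."""
    best_so_far = max(trial_history[:step + 1])
    medians = sorted(h[step] for h in completed_histories if len(h) > step)
    if not medians:               # nothing to compare against yet
        return False
    median = medians[len(medians) // 2]
    return best_so_far < median
```

For example, with three completed trials, a new trial that is still below the step-1 median would be stopped, while one above it keeps running.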
Suitable scenarios:
It works with all kinds of performance curves and can be applied in many scenarios to speed up the tuning process.
Pros:
Cons:
Algorithm overview:
Curve Fitting Assessor is an LPA (learning, predicting, assessing) algorithm. It terminates Trial X early if the performance predicted for Trial X at step S is worse than that of the best trial so far. The algorithm fits the accuracy curve with 12 kinds of curves.
Suitable scenarios:
It works with all kinds of performance curves and can be applied in many scenarios to speed up the tuning process. Better still, it is able to handle and assess curves with similar performance.
Pros:
Cons:
NNI provides state-of-the-art tuning algorithms as built-in tuners and makes them easy to use. Here is the list:
Usually, the base configuration for a tuner looks like this:
```yaml
# config.yml
```
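Only the first line of the config survived above. As a hedged reconstruction, a minimal experiment config in the classic NNI v1 schema looks roughly like this (field values are placeholders; check the NNI docs for your version's exact schema):

```yaml
# config.yml — sketch of a minimal NNI v1 experiment config
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 50
trainingServicePlatform: local
searchSpacePath: search_space.json
useAnnotation: false
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: .
  gpuNum: 0
```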
The NNI web UI provides multiple ways to monitor the training process, which lets us kill bad trials from the webpage even while they are still running. With this all-round training information, we can analyze the various tuners from two different perspectives.
There are two types of tuners: one kind is brute-force and the other involves some intelligence. Let's classify our tuners accordingly:
Brute-force: Grid Search Tuner, Batch Tuner, Random Tuner
Intelligent: TPE Tuner, Evolutionary Tuner, Anneal Tuner, SMAC Tuner
Now, let's check the performance of each tuner based on Figure 1 above. It's easy to notice that as training goes on, the intelligent tuners perform better and better, since they learn how to schedule the hyperparameters. The best tuner in this setting is the Anneal Tuner. I was quite astonished until I read the official documentation on NNI's GitHub. It says:
This simple annealing algorithm begins by sampling from the prior, but tends over time to sample from points closer and closer to the best ones observed.
So it's not so surprising that Anneal achieves the best average score over its 10 best trials. However, it may fall into a local optimum because it is short-sighted; TPE addresses this with its tree-structured Parzen estimator approach. On the contrary, the brute-force methods perform quite poorly: you can see they still produce plenty of trials far below the average score. But given enough time, they might eventually hit the global optimum :)
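The quoted annealing behavior can be illustrated with a toy one-dimensional sampler (my own sketch, not NNI's code): early suggestions come from the uniform prior, and later ones are drawn closer and closer to the best point observed so far.

```python
import random

def anneal_suggest(history, bounds, t, t_max):
    """Toy annealing sampler: `history` is a list of (x, score) pairs.
    Early in the budget, sample uniformly from the prior; later, sample
    from a Gaussian centered on the best point, with shrinking width."""
    lo, hi = bounds
    if not history:
        return random.uniform(lo, hi)
    best_x, _ = max(history, key=lambda p: p[1])  # best observed point
    frac = t / t_max                              # fraction of budget used
    width = (hi - lo) * (1.0 - frac)              # shrinking search radius
    x = random.gauss(best_x, width / 2 + 1e-12)
    return min(hi, max(lo, x))                    # clip to the search space
```

The suggestion always stays inside the search bounds, and converges toward the incumbent as `t` approaches `t_max`.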
Besides the scatter plot, the NNI web UI also shows the best 10 trials of each experiment. From Figure 2, you can identify which tuner finds a possible optimum most efficiently. Although Grid Search and the Random Tuner achieve competitive scores within the 50 trials, their best 10 trials have a large variance. By comparison, TPE, the Anneal Tuner and SMAC achieve much higher scores in their best 10 trials; in other words, these tuners are more likely to approach the global optimum.
There are plenty of ways to get dummies for your dataset. Here are some methods that I've known so far:
But none of them provides a way to encode your dataset and then invert the encoding. In other words, if you have a dataset with categorical data, you can get dummies for it, but if you then want to invert your data to get the original dataset back, there seems to be no way. So if you are working on a generative model and want to check its performance, you will get a headache. Therefore, I wrote a simple class named MultiLabelBinarizer_lsb which helps you get over this problem. Here is an example of how to use it:
```python
import pandas as pd
```
Prepare your data:
```python
testdata = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'fish'],
```
Output
```
age  pet  salary
```
Get your sword:
```python
my_lsb = MultiLabelBinarizer_lsb()
```
Here is what test_X looks like:
Transform your dataset back:
```python
my_lsb.inverse_transform(test_X)
```
Output:
```
age  pet  salary
```
```python
class MultiLabelBinarizer_lsb(object):
```
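Since only the class's first line survived above, here is a minimal pure-Python sketch of the same round-trip idea (a hypothetical rewrite named `CategoricalCodec`, not the original `MultiLabelBinarizer_lsb`, which works on pandas DataFrames): one-hot encode the categorical columns, remember the category order, and invert later.

```python
class CategoricalCodec:
    """Sketch of an invertible one-hot encoder over lists of dicts."""

    def fit(self, rows, cat_cols):
        self.cat_cols = cat_cols
        # remember the sorted categories seen in each categorical column
        self.categories = {c: sorted({r[c] for r in rows}) for c in cat_cols}
        return self

    def transform(self, rows):
        out = []
        for r in rows:
            vec = {k: v for k, v in r.items() if k not in self.cat_cols}
            for c in self.cat_cols:
                for cat in self.categories[c]:
                    vec[f"{c}_{cat}"] = 1 if r[c] == cat else 0
            out.append(vec)
        return out

    def inverse_transform(self, rows):
        out = []
        for r in rows:
            # keep non-dummy columns, then recover each original category
            rec = {k: v for k, v in r.items()
                   if not any(k.startswith(c + "_") for c in self.cat_cols)}
            for c in self.cat_cols:
                rec[c] = next(cat for cat in self.categories[c]
                              if r[f"{c}_{cat}"] == 1)
            out.append(rec)
        return out
```

The point is simply that `inverse_transform(transform(rows))` returns the original records, which is what the generative-model use case needs.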
This post contains my partial solutions to the Cache Lab, one of the labs CMU provides; the code is hosted on my GitHub. Being limited in ability, and lazy, I only finished Part A and problems 1 and 3 of Part B, leaving out Part B's second problem, which I found too troublesome. = =
It took me an hour and a half to figure out what Part A was even asking me to do, and then another hour to understand how the getopt() function works, so a whole evening study session was gone. = = If it's already this scary, I dread to think what the malloc lab will look like.
The problem
Caveman version:
In Part A, you will write a cache simulator in csim.c that takes a valgrind memory trace as input, simulates the hit/miss behavior of a cache memory on this trace, and outputs the total number of hits, misses, and evictions.
Normal-person version:
Based on the cache material from the book: given values of s, E, b, simulate a cache and count the misses, hits, and evictions produced by a sequence of memory instructions (load, store, modify).
Hints
Use the getopt function to read the command-line arguments. Here is a usage example:
```c
int main(int argc, char** argv){
```
When a collision forces an eviction, the replacement policy is LRU (Least Recently Used): the line used longest ago is replaced. The way I simulate this is with a counter curr_pc that is incremented each time an instruction is read; whenever a cache line is used, or data is loaded into it, its curr_pc is updated. Then whenever a collision occurs, we only need to find the line with the smallest curr_pc in that set.
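The curr_pc idea is easy to prototype outside C; here is a toy single-set model in Python (my sketch, not the lab code):

```python
class LRUSet:
    """One cache set with the curr_pc trick: one global counter, bumped on
    every access; on a collision, evict the line with the smallest stamp."""

    def __init__(self, num_lines):
        self.lines = [None] * num_lines   # stored tags (None = empty line)
        self.stamp = [0] * num_lines      # curr_pc of each line's last use
        self.clock = 0                    # global instruction counter

    def access(self, tag):
        self.clock += 1
        if tag in self.lines:                     # hit: refresh the stamp
            self.stamp[self.lines.index(tag)] = self.clock
            return "hit"
        if None in self.lines:                    # cold miss: use a free line
            i = self.lines.index(None)
            self.lines[i], self.stamp[i] = tag, self.clock
            return "miss"
        i = self.stamp.index(min(self.stamp))     # collision: evict LRU line
        self.lines[i], self.stamp[i] = tag, self.clock
        return "miss eviction"
```

A short trace (fill two lines, re-touch tag 1, then bring in tag 3) shows tag 2 being evicted first because its stamp is the oldest.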
Implementation
First, build the data structure:
```c
typedef struct cacheLine{
```
Initialize the cache:
```c
void init_cache(){
```
Get the values of b, E, s, v:
```c
/* getopt returns -1 once all options are consumed */
while(-1 != (opt1 = getopt(argc, argv, "hvs:E:b:t:"))){
```
Process the instructions:
```c
void process_file(){
```
The most important part, the cache-simulation logic:
```c
cache_line* load_cache(int tag_num,int set_num){
```
P.S. In fact you only need to write the simulation for load: store just calls load, and modify just calls load and then store.
Results
This part asks you to exploit program locality to write matrix-transpose routines for three matrices: 32×32, 64×64 and 61×67. Under the given cache, get as many hits and as few misses as possible.
The writeup gives a hint that is very useful for our transpose routine:
Blocking is a useful technique for reducing cache misses. See http://csapp.cs.cmu.edu/public/waside/waside-blocking.pdf for more information.
That is, blocking: by transposing the matrix block by block, we avoid cache conflicts between the A and B matrices.
Since I'm lazy, and the 64×64 transpose is tedious and somewhat hard, I didn't do it; there should be plenty of solutions online. Here I only cover the transposes of the first and third matrices.
32×32
The given cache has 32 sets, 1 line per set, and 32 bytes per line, so internally it can hold 256 ints. To avoid unnecessary conflict misses, we split the A and B matrices into 16 blocks of 64 ints each and transpose them block by block. Once a block has been loaded into the cache (cold misses), no further misses occur for it, which greatly reduces the misses of the transpose. Based on this idea we can write the following code:
```c
void transpose_submit(int M, int N, int A[N][M], int B[M][N])
```
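Independent of the C details, the blocking idea itself can be sketched in Python (my own illustration):

```python
def transpose_blocked(A, bsize=8):
    """Transpose an n×m matrix tile by tile: each bsize×bsize tile of A and
    its destination tile in B stay cache-resident while being processed."""
    n, m = len(A), len(A[0])
    B = [[0] * n for _ in range(m)]
    for bi in range(0, n, bsize):          # walk the tiles...
        for bj in range(0, m, bsize):
            for i in range(bi, min(bi + bsize, n)):   # ...then the elements
                for j in range(bj, min(bj + bsize, m)):
                    B[j][i] = A[i][j]
    return B
```

The result is an ordinary transpose; only the memory access order changes, which is exactly what reduces the misses.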
But if you actually run this you'll find there are 314 misses, 14 more than the full-score requirement. Where is the remaining optimization? Thinking about it carefully, the diagonal elements still cause conflict misses, because the diagonal elements of A and B map to the same cache lines. To avoid this, for each row/column we handle the diagonal element first and then spread out to both sides, while reading A column-wise. This guarantees that after B has cached the row containing the diagonal element, none of A's subsequent accesses cause a conflict miss; in other words, B's cached row stays intact until that row is finished rather than being evicted. This saves roughly 4×8 misses: after all the cold misses, everything else is a hit.
```c
void transpose_submit(int M, int N, int A[N][M], int B[M][N])
```
64×64
I hear this one is really hard, so I gave up. = =
61×67
This one is a bit of a cheat: just change our block size and transpose directly, and you get full marks. = =
```c
if(M==61){
```
Results
Overall I feel I did this rather sloppily, but I still got the lab done, and my understanding of caches deepened a lot. This semester's computer architecture course covers caches, so maybe I can sleep through that part 😂
[1] Intro to Computer Systems: Assignments — CMU
[2] 深入理解计算机系统 Cache Lab Part B 实验报告 — 码龙的窝
[3] EthanYan27/CSAPP-Labs — GitHub
This post contains my reading notes for Chapter 6, The Memory Hierarchy, of Computer Systems: A Programmer's Perspective. I'll write down the key points so that later I can quickly recall what on earth I read...
How DRAM is accessed:
Disk storage, solid-state drives (SSD), and storage technology trends
The above are rote-memorization material, so I don't want to waste ink on them.
There are two kinds of locality:
The original text explains temporal locality and spatial locality:
Locality is typically described as having two distinct forms: temporal locality and spatial locality. In a program with good temporal locality, a memory location that is referenced once is likely to be referenced again multiple times in the near future. In a program with good spatial locality, if a memory location is referenced once, then the program is likely to reference a nearby memory location in the near future.
From this we can summarize locality as follows:
Below is the classic memory hierarchy:
This introduces a concept we have all heard many times: the cache. The book explains it as follows:
The central idea of a memory hierarchy is that for each k, the faster and smaller storage device at level k serves as a cache for the larger and slower storage device at level k + 1. In other words, each level in the hierarchy caches data objects from the next lower level.
Since a cache is necessarily smaller than the storage below it, one of the following two situations must occur:
Cache misses come in several types:
Summary:
The organization of a generic cache memory is the important part, and it is also the main content of the first cache lab assignment 😭: you have to write a cache simulator yourself, in C.
Its structure is described as follows:
Consider a computer system where each memory address has m bits that form M = 2^{m} unique addresses. As illustrated in Figure below, a cache for such a machine is organized as an array of S = 2^{s} cache sets. Each set consists of E cache lines. Each line consists of a data block of B = 2^{b} bytes, a valid bit that indicates whether or not the line contains meaningful information, and t = m − (b + s) tag bits (a subset of the bits from the current block’s memory address) that uniquely identify the block stored in the cache line. In general, a cache’s organization can be characterized by the tuple (S, E, B, m). The size (or capacity) of a cache, C, is stated in terms of the aggregate size of all the blocks. The tag bits and valid bit are not included. Thus, C = S × E × B.
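The (S, E, B, m) description above fixes how an m-bit address splits into tag, set index, and block offset; a quick sketch, using the lab's cache parameters (s = 5, b = 5, i.e. 32 sets of 32-byte lines) as example values:

```python
def split_address(addr, s, b, m=64):
    """Split an address into (tag, set index, block offset):
    b low bits are the offset, the next s bits pick the set,
    and the remaining t = m - (s + b) bits are the tag."""
    block_off = addr & ((1 << b) - 1)
    set_idx = (addr >> b) & ((1 << s) - 1)
    tag = addr >> (s + b)
    return tag, set_idx, block_off
```

For instance, with s = 5 and b = 5, address 0x12345 lands in set 26 at offset 5 with tag 72.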
The figure above shows, for a given address, the three steps (and the corresponding bit fields) needed to look it up in the cache:
There are several different kinds of caches:
1) Make the common case run fast.
2) Minimize the number of cache misses inside each inner loop.
Honestly, I think the above is pretty much empty talk. = =
A good example is the following stride example:
```c
// stride=1
```
The memory mountain
The book later discusses a matrix-multiplication problem that requires jointly considering stride and reusing variables; I won't repeat it here.
三点建议：
 Focus your attention on the inner loops, where the bulk of the computations and memory accesses occur.
 Try to maximize the spatial locality in your programs by reading data objects sequentially, with stride 1, in the order they are stored in memory.
 Try to maximize the temporal locality in your programs by using a data object as often as possible once it has been read from memory.
These are the notes for Improving Deep Neural Networks taught by Andrew Ng at Coursera. Here, lesson notes and assignments will be provided in order to enhance my comprehension about Neural Networks. You can view my github for the programming assignment.
There are two main parts in this week, which has been listed in the title:
Both of them are quite interesting, and at the end of this week's course, we'll build a face recognition system and a neural style transfer machine. Let's learn some basic techniques behind these two fun things.
This technique has two different categories:
But when you implement the assignment, you'll find face recognition is built right on top of face verification; there's not much difference between them.
To make recognition much faster, we use one-shot learning. The idea of one-shot learning is to transform the original image into a 128-dimensional vector. Then what we do is compare the difference between the two vectors generated from the input image and from the image in the database. Of course, it is convolutional computation that we use to generate the vectors, and this whole process is called the Siamese network. You can read the DeepFace paper by Taigman et al. for a detailed description of the Siamese network. Furthermore, the goal of learning is:
In order to train our neural network, we introduce the Triplet Loss as our cost function. If we call the image saved in the database the Anchor, abbreviated as $A$, the positive image (a picture of the right person) $P$, and the negative image $N$, then the triplet constraint can be expressed as below:

$\|f(A)-f(P)\|^{2} - \|f(A)-f(N)\|^{2} + \alpha \le 0$ (where $\alpha$ is the margin)
Thus, the cost function can be written like this:
Given 3 images $A, P, N$:

$\mathcal{L}(A,P,N)=\max\big(\|f(A)-f(P)\|^{2} - \|f(A)-f(N)\|^{2} + \alpha,\ 0\big)$

$J=\sum \mathcal{L}(A,P,N)$ over all triplets.
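The formulas above translate directly into code; here is a plain-Python sketch for a single triple of embedding vectors (not the Keras-based graded function shown later):

```python
def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Triplet loss for one (A, P, N) triple of embeddings:
    max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)."""
    d_ap = sum((a - p) ** 2 for a, p in zip(f_a, f_p))  # anchor-positive
    d_an = sum((a - n) ** 2 for a, n in zip(f_a, f_n))  # anchor-negative
    return max(d_ap - d_an + alpha, 0.0)
```

When the negative is already much farther than the positive, the hinge clips the loss to zero, which is why we must mine hard triplets.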
Attention: since the triplet constraint is satisfied easily for most images, you need to find those triplets that are hard to train on. Only in this way can we update the parameters to build a well-performing model.
Just note that when you pose verification as binary classification, you can also use a sigmoid activation in the final layer, since the final output is just 0/1.
This fun deep learning technique was published in the paper A Neural Algorithm of Artistic Style; you can read the paper yourself if you like. Here I only cover how to find the cost function we need in order to build our neural network.
Before implementing this technique, you need to know what the neural network actually does. Here is a visualization of neural networks: Visualizing and Understanding Convolutional Networks. As before, we'll define 3 abbreviations: the content image as $C$, the style image as $S$ and the generated image as $G$. Next, we will define the content cost function and the style cost function respectively; then we just add them together with different weights as our final cost function.
Content Cost Function
Here we just map the image into a vector, just like what we do in face recognition (here we use a pretrained VGG net as our mapping function), then use the L2 norm to measure the similarity of the two images. The smaller the value of the content cost function, the more similar the two pictures.
Style Cost Function
Again, we use a VGG net as our mapping function. But instead of taking the final output as our value, a middle layer should be chosen in order to capture the general style of the style image. Suppose you take the output of your chosen layer l, with shape $n_H \times n_W \times n_C$; we then use these activations to generate a Gram matrix, which lets us compare the styles of two images through their matrices. Here are the concrete steps for getting our style matrix:
Then, the cost function can be defined as this:
Finally, we can define the cost function as:
$J(G)=\alpha\, J_{content}(C,G)+\beta\, J_{style}(S,G)$
Using this formula, you will get your own neural style transfer machine!
We've covered the detailed implementation of the triplet loss function. Here is the code:
```python
def triplet_loss(y_true, y_pred, alpha = 0.2):
```
Three steps are needed with our given function:
1. Compute the encoding of the image from image_path.
2. Compute the distance between this encoding and the encoding of the identity image stored in the database.
3. Accept if the distance is less than 0.7; otherwise reject.
Code:
```python
# GRADED FUNCTION: verify
```
You will find there's little difference between this and face verification: just loop over the database dictionary and compare the images one by one.
```python
# GRADED FUNCTION: who_is_it
```
One thing to mention: we use transfer learning to shorten the process of training a new CNN. Here we use VGG-19 for our neural style transfer.
We would like the “generated” image G to have similar content as the input image C. Suppose you have chosen some layer’s activations to represent the content of an image. In practice, you’ll get the most visually pleasing results if you choose a layer in the middle of the network–neither too shallow nor too deep. (After you have finished this exercise, feel free to come back and experiment with using different layers, to see how the results vary.)
So, suppose you have picked one particular hidden layer to use. Now, set the image C as the input to the pretrained VGG network, and run forward propagation. Let $a^{(C)}$ be the hidden layer activations in the layer you had chosen. (In lecture, we had written this as $a^{(C)[l]}$, but here we'll drop the superscript $[l]$ to simplify the notation.) This will be a $n_H \times n_W \times n_C$ tensor. Repeat this process with the image G: set G as the input, and run forward propagation. Let $a^{(G)}$ be the corresponding hidden layer activation. We will define the content cost function as:
```python
# GRADED FUNCTION: compute_content_cost
```
The style matrix is also called a “Gram matrix.” In linear algebra, the Gram matrix G of a set of vectors $(v_1,\dots,v_n)$ is the matrix of dot products, whose entries are $G_{ij}=v_i^{T}v_j$ = np.dot(v_i, v_j). In other words, $G_{ij}$ compares how similar $v_i$ is to $v_j$: if they are highly similar, you would expect them to have a large dot product, and thus for $G_{ij}$ to be large.
Note that there is an unfortunate collision in the variable names used here. We are following common terminology used in the literature, but $G$ is used to denote the Style matrix (or Gram matrix) as well as to denote the generated image $G$. We will try to make sure which $G$ we are referring to is always clear from the context.
In NST, you can compute the Style matrix by multiplying the “unrolled” filter matrix with their transpose:
The result is a matrix of dimension $(n_C, n_C)$ where $n_C$ is the number of filters. The value $G_{ij}$ measures how similar the activations of filter $i$ are to the activations of filter $j$.
```python
# GRADED FUNCTION: gram_matrix
```
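As a sanity check on the definition, the Gram computation over unrolled activations can be sketched in plain Python (my own illustration; `unrolled` is assumed to hold one row of length $n_H \cdot n_W$ per filter):

```python
def gram_matrix(unrolled):
    """Gram matrix of unrolled activations: G[i][j] = dot(row_i, row_j),
    i.e. the correlation between the responses of filters i and j."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in unrolled]
            for ri in unrolled]
```

The result is an $n_C \times n_C$ symmetric matrix, matching the dimension stated above.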
After generating the Style matrix (Gram matrix), your goal will be to minimize the distance between the Gram matrix of the “style” image S and that of the “generated” image G. For now, we are using only a single hidden layer $a^{[l]}$, and the corresponding style cost for this layer is defined as:
where $G^{(S)}$ and $G^{(G)}$ are respectively the Gram matrices of the “style” image and the “generated” image, computed using the hidden layer activations for a particular hidden layer in the network.
```python
# GRADED FUNCTION: compute_layer_style_cost
```
You can combine the style costs for different layers as follows:
Finally, let’s create a cost function that minimizes both the style and the content cost. The formula is:
Exercise: Implement the total cost function which includes both the content cost and the style cost.
```python
# GRADED FUNCTION: total_cost
```
Finally, let’s put everything together to implement Neural Style Transfer!
Here’s what the program will have to do:
That's all for the CNN part. I've learnt to read some primary papers and gained basic knowledge about Convolutional Neural Networks. Next I'll go through the RNN, the sequence model!
These are the notes for Improving Deep Neural Networks taught by Andrew Ng at Coursera. Here, lesson notes and assignments will be provided in order to enhance my comprehension about Neural Networks. You can view my github for the programming assignment.
This week we learnt about a CNN application: object detection. There are several strategies to implement this task:
Now, let’s talk about them one by one~
You can find that all of the tasks we've done before are just about classification, not localization. Given a picture, we just output 0/1 to classify whether this image has a cat. But in this chapter we'd like to do not only classification but also object localization, which is to say, give the position of the detected object. In order to accomplish that, we are going to output 4 more values: bx, by, bh, bw. These four values describe the precise position of the object in the image. Thus, we are going to define a new format for the output y: a $(1+4+n_{classes}) \times 1$ vector.
I don't know why this strange landmark detection matters. As Andrew says, we use landmark detection to help us recognize someone's current movement, but he didn't cover the details... Let's omit it.
In the Machine Learning course, Andrew already covered some details about sliding-window detection (in the autonomous-driving part, anyway). Suppose you have a window for detecting the object you want: you just slide this window across the image and check whether the object is there. To detect objects of different sizes, windows of various sizes are needed. When it comes to the implementation of sliding windows, you need to change the original output from a single number into a vector. Furthermore, simply iterating over all window positions is computationally expensive. To handle this shortcoming, we choose the convolutional implementation of sliding windows, which convolves the whole image into the final output at once and removes much of the repeated computation.
In the previous part, we talked about sliding-window detection. Here we introduce another method, used in the YOLO algorithm: bounding boxes. The main idea is this: split the whole image into 3×3 grid cells (or 19×19; 3×3 is just for explanation). Then for each grid cell, we run object detection on it, and each cell only detects objects whose center lies in that grid cell.
For the purpose of evaluating object localization, we use Intersection over Union. Generally speaking, IoU is a measure of the overlap between two bounding boxes.
Say we have detected two cars in the picture, but several windows point to the same car; the extra detections should be discarded. Thus, non-max suppression is a good idea. Here is the non-max suppression algorithm:
If we want to detect multiple objects in one position, say a woman standing in front of a car, and we want to detect both the car and the woman, the previous bounding-box detection cannot work well. Therefore, anchor boxes were put forward. With two anchor boxes, each object in the training image is assigned to the grid cell that contains the object's midpoint and to the anchor box for that grid cell with the highest IoU. Also, the corresponding shape of the output y should change like this:
We will go through this algorithm in this assignment, so let’s just skip it :)
This assignment is to implement the YOLO model to detect cars on the road. At the end of the assignment, you'll have a well-trained model for detecting cars on the road! Let's begin.
First, we feed the preprocessed image into a deep CNN and get an encoding matrix.
Since we are using 5 anchor boxes, each of the 19 x19 cells thus encodes information about 5 boxes. Anchor boxes are defined only by their width and height.
For simplicity, we will flatten the last two dimensions of the (19, 19, 5, 85) encoding, so the output of the deep CNN is (19, 19, 425).
Now, for each box (of each cell) we will compute the following elementwise product and extract a probability that the box contains a certain class.
Following the steps mentioned above, you can write the code implementation below:
```python
def yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold = .6):
```
Here we only implement IoU. Non-max suppression uses this very important function, called “Intersection over Union”, or IoU:
```python
def iou(box1, box2):
```
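For reference, a plain-Python version of the corner-coordinate IoU computation might look like this (my sketch of the idea, not the graded answer):

```python
def iou(box1, box2):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    xi1, yi1 = max(box1[0], box2[0]), max(box1[1], box2[1])  # intersection
    xi2, yi2 = min(box1[2], box2[2]), min(box1[3], box2[3])  # rectangle
    inter = max(xi2 - xi1, 0) * max(yi2 - yi1, 0)            # clamp at 0
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)                   # inter / union
```

Identical boxes give 1.0, disjoint boxes give 0.0, and partial overlaps fall in between.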
You are now ready to implement non-max suppression. The key steps are: select the box with the highest score, then discard every remaining box whose overlap (IoU) with it exceeds iou_threshold. This removes all boxes that have a large overlap with the selected boxes; only the “best” boxes remain.
```python
def yolo_non_max_suppression(scores, boxes, classes, max_boxes = 10, iou_threshold = 0.5):
```
```python
def yolo_eval(yolo_outputs, image_shape = (720., 1280.), max_boxes=10, score_threshold=.6, iou_threshold=.5):
```
After running your model, you've seen how well it detects objects in the images.
In the next lesson, we'll learn about face recognition and neural style transfer.
These are the notes for Improving Deep Neural Networks taught by Andrew Ng at Coursera. Here, lesson notes and assignments will be provided in order to enhance my comprehension about Neural Networks. You can view my github for the programming assignment.
The main content of this week is about several well-known neural networks:
1×1 Convolution (Network in Network)
Here let's have a simple tutorial about these classic neural networks. Also, I'll read some of the papers about them and then write my notes.
ResNet uses a new kind of network connection called the residual block: it lets the output of some layer take a shortcut, or skip connection, to the output a few layers ahead. You can view this image:
1×1 Convolutions (Network in Network)
There are some useful aspects of 1×1 convolution. For example, if you want to shrink the number of channels, a 1×1 ConvNet can help you with it. It can also extract more detail from the previous layers' output.
The main idea of Inception is to compute all kinds of convolutions, stack them together, and let the data choose the best-fitting layers at each stage through the learned parameters.
This is a single Inception module. You will find the whole Inception network is a concatenation of this module.
Here is how to build an integrated neural network with Keras.
Note that Keras uses a different convention for variable names than we've previously used with numpy and TensorFlow. In particular, rather than creating and assigning a new variable at each step of forward propagation, such as X, Z1, A1, Z2, A2, etc. for the different layers, in Keras each line just reassigns X to a new value using X = …. In other words, during each step of forward propagation, we just write the latest value of the computation into the same variable X. The only exception is X_input, which we keep separate and do not overwrite, since we need it at the end to create the Keras model instance (model = Model(inputs = X_input, …) above).
```python
def model(input_shape):
```
```python
### START CODE HERE ### (1 line)
```
Now I will try to read some papers about these classic NNs to help me learn DL more deeply~
These are the notes for Improving Deep Neural Networks taught by Andrew Ng at Coursera. Here, lesson notes and assignments will be provided in order to enhance my comprehension about Neural Networks. You can view my github for the programming assignment.
This course will give us a quick view of convolutional neural networks (CNNs). For week 1, Andrew introduces a simple CNN with convolutional layers and pooling. Here is the list of this week's lessons:
Now, let’s go through them one by one!
In the previous chapter, we already trained an NN classifier which can successfully classify cats and dogs. However, in that case we only used 64×64×3 images as our dataset. With the development of camera technology, the number of pixels per image becomes larger and larger. Say we have a 3000×3000×3 image and we decide to build several fully connected layers: the number of parameters we would need to train is far too large (over a million). Consequently, we need some method to help us reduce the size of our images; thus we have convolution. We use a matrix-like filter (kernel) to convolve over the whole image, for instance to detect edges in different directions.

Also, even though there is some literature on how to hand-pick a useful edge-detection filter, with the development of deep learning we now treat the filter values as learnable parameters, which is to say, they are learnt by applying a learning algorithm to our dataset.

You can easily find that simply applying the convolutional computation shrinks the output, and part of the information at the edges is thrown away. To deal with this problem, padding was put forward. What padding really does is surround the original image with several rows and columns of zeros (usually; other values can also be used).

Therefore, we have two kinds of convolutions: valid convolution and same convolution. The latter pads so that the output size is the same as the input size.

In the basic convolutional computation, we only move one position at each step, so the shrinkage of the output size is not obvious. But if we change the movement per step, the result can be largely different. This is what we call the stride.
You can view the following image to check your understanding to the concepts of filter size, padding size and the stride size.
We have mentioned before that each image has 3 channels for RGB, so an image is not just a 2D item but a 3D one. The solution is to use a filter with 3 channels as well, e.g. of size 3×3×3. Also, if we want to detect edges in different directions, multiple filters are needed. Let's take an example:

By using 2 different filters, we get two outputs of the same size. Then we stack these together into a 4×4×2 volume. Up to now, we have covered most of the basic knowledge of convolutional neural networks.
In this part, Andrew picked up a simple ConvNet.
You can see the image above to check your understanding. By applying convolutions (steps 1, 2, 3), we get a 7×7×40 volume. Then flatten this into a 1960×1 vector, and you can use a logistic/softmax activation to get the final output y.
There are three different types of layers in a ConvNet: convolution, pooling, and fully connected. The main idea of pooling is to use a mask and extract one number from each window. The extraction rule has two categories: max pooling and average pooling; you can easily tell what each does just from its name.

This week's assignment is extremely disgusting 😠. Backpropagation for a ConvNet is not covered in the lectures, but you need to implement it on your own. What's more, the answers are just listed in the question descriptions; copying them from the assignment without knowing why makes me really crazy. And for the application assignment, even though I wrote the same thing as the standard code, the output of my CNN still went wrong and the cost value just would not converge! WTF???
Convolutional Neural Networks: Step by Step
Here is the outline of this assignment:
Zeropadding adds zeros around the border of an image:
```python
def zero_pad(X, pad):
```
```python
def conv_single_step(a_slice_prev, W, b):
```
The formulas relating the output shape of the convolution to the input shape are:
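From the lecture, the relation is $n_{out} = \lfloor (n + 2p - f)/s \rfloor + 1$ in each spatial dimension; a small helper makes it concrete:

```python
import math

def conv_output_shape(n_h, n_w, f, pad, stride):
    """Output height/width of a convolution with filter size f, padding
    pad and stride: floor((n + 2*pad - f) / stride) + 1 per dimension."""
    out_h = math.floor((n_h + 2 * pad - f) / stride) + 1
    out_w = math.floor((n_w + 2 * pad - f) / stride) + 1
    return out_h, out_w
```

For example, a 3×3 filter with pad 1 and stride 1 preserves a 5×5 input ("same" convolution), while pad 0 and stride 2 on a 7×7 input gives 3×3.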
```python
def conv_forward(A_prev, W, b, hparameters):
```
The pooling (POOL) layer reduces the height and width of the input. It helps reduce computation, as well as helps make feature detectors more invariant to its position in the input. The two types of pooling layers are:
These pooling layers have no parameters for backpropagation to train. However, they have hyperparameters such as the window size $f$. This specifies the height and width of the fxf window you would compute a max or average over.
```python
def pool_forward(A_prev, hparameters, mode = "max"):
```
This part is optional and ungraded, but I still finished it. If you want to view them, you can view my github~
Convolutional Neural Networks: Application
In this assignment, we just use TensorFlow instead of numpy to help us avoid those disgusting computations. I met some trouble while writing it, though I think the main idea is right, so I've decided not to post my answer; you can view my github if you want.

This week's lessons finally gave me a tutorial on how to build a CNN. In the next few weeks, we will go deeper into CNNs. Looking forward to that!
These are the notes for Improving Deep Neural Networks taught by Andrew Ng at Coursera. Here, lesson notes and assignments will be provided in order to enhance my comprehension about Neural Networks. You can view my github for the programming assignment. A large part of these notes is copied straight from Andrew Ng's notes 😂.
Since there's no assignment for this week's course, I will focus on the machine learning strategies mentioned by Andrew Ng. Here is the list of points from this week's lessons:
Now let’s go through these knowledge points one by one~
Error analysis is of great benefit when building your machine learning system. It can help you quickly decide which direction is the most valuable to try and optimize. Say we are building a cat classifier for our machine learning project; what you need to do is tabulate the mislabeled examples from your dataset:
| Mislabeled category | Error rate |
| --- | --- |
| Dog mislabeled | 20% |
| Great cat mislabeled | 70% |
| Blurry mislabeled | 20% |
You can find that in this case you probably need to focus more attention on great cats, since 70% of the mislabeled cases are great cats. Thus, you've found the next step you and your team should take.
Here, I am not willing to treat correcting incorrectly labeled dev/test set examples as a separate point to talk about. Only a couple of points should be clear:
In some cases, your training set and your dev set may not come from the same distribution. In previous chapters we always emphasized that the datasets should come from the same distribution, but now something is different: you need to guarantee that your dev and test sets consist mostly of your target data. Say we want to build a cat classifier for a local dataset of 10000 examples, and we scraped 200000 examples from the Internet as our training dataset. The difference between them is that the local examples are mostly blurry while the scraped ones are mostly high-quality and easy to classify. Consequently, you need to make sure that your dev and test sets have at least 2500 examples from the local dataset, in order to make sure your model really does a great job on the local dataset, which is the data you actually care about.
Continuing the topic above: you want to classify some blurry images of cats, but you feed your model plenty of high-quality, easy-to-recognize images. This may result in data mismatch. You may find your model performs well on the training set while its performance on the dev set is really poor. Of course, it may be overfitting of the training set, but another possibility is data mismatch. How can you tell whether this is overfitting or data mismatch? Andrew provides us with this way:
Set up a training-dev set carved out of your training set, then compute the following metrics:

If the data mismatch value is much larger than the variance value, then you can be sure your model suffers from data mismatch. But how can we tackle it? Unluckily, there's no systematic way to deal with this problem, but you can try a few things:
In this lesson, Andrew only told us a little institutions about what does the tranfer learning do and some pros and cons about tranfer learning. Here is when transfer learning makes sense:
Multi-task learning often works better than a single NN for each task. Say, it is better to recognize signals, pedestrians, cars, and so on simultaneously than to train 4 separate models, one per object type. Here is when multi-task learning makes sense:
This is a new method that emerged with the growth of dataset size and model capacity. End-to-end deep learning simply feeds your model data, and the model gives you the results. In the past, many tasks had to go through a list of pipeline stages before producing results. But if you have enough data (really a lot), end-to-end deep learning may help you avoid these troublesome pipelines. The key question is: do you have sufficient data to learn a function of the complexity needed to map x to y?
A course without any programming assignment is really boring 😭 But I still learned some useful machine learning strategies in the process of building a better model. The next course is about CNNs. Hope it will be more fun :)
This is the notes for Improving Deep Neural Networks taught by Andrew Ng at Coursera. Here, lesson notes and assignments will be provided in order to enhance my comprehension of neural networks. You can view my github for the programming assignment.
Since there's no assignment for this week's course, I will focus on the machine learning strategies mentioned by Andrew Ng. Here is the list of points from this week's lessons:
Now let’s go through these knowledge points one by one~
Orthogonalization (or orthogonality) is a system design property that assures that modifying an instruction or a component of an algorithm will not create or propagate side effects to other components of the system. In one sentence: if you view these functions as knobs, then each knob controls exactly one property. Thus, it becomes easier to verify the algorithms independently from one another.
When you build a supervised learning system, four assumptions should hold, and they should be orthogonal to one another:
I had already read about these metrics when I was learning machine learning last summer. They are precision, recall, and the F1-score. Here are the formulas for computing them:
But sometimes we want a trade-off between precision and recall, and the F1-score is a good fit for this case. It is a harmonic mean, combining both precision and recall. Here is the formula:
Here, $p$ refers to precision and $r$ refers to recall.
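As a quick sketch, precision, recall, and F1 can be computed from raw confusion-matrix counts. The counts below (8 true positives, 2 false positives, 2 false negatives) are illustrative only:

```python
def f1_score(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1) from raw counts."""
    precision = tp / (tp + fp)           # of all positive predictions, how many were right
    recall = tp / (tp + fn)              # of all actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = f1_score(tp=8, fp=2, fn=2)
print(p, r, f1)
```

With these counts, both precision and recall are 0.8, so the harmonic mean is also 0.8.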
In the last chapter we learned about the F1-score, precision, and recall. However, there are other metrics to evaluate the performance of a classifier; they are called evaluation metrics and can be categorized as satisficing and optimizing metrics. Satisficing metrics are those for which we only need to meet a minimum standard; once we meet it, we don't care much about improving them further. The optimizing metric is the one we want to push as far as possible, once the satisficing metrics meet their standards.
Three aspects should be focused on.
Firstly, you need to guarantee that the data in your training set, dev set, and test set come from the same distribution. A recommended way to generate such datasets is to mix all the data you collected and split it randomly into training, dev, and test sets.
The second aspect is the size of these sets. In the past, we often used a 70/30 split for the dataset. Nowadays, since more and more data is accessible, we don't have to compromise as much and can use a greater portion to train the model; maybe a 98/1/1 split is just fine. Here are the guidelines provided by Andrew (though I think they are not so instructive or practical):
The last is to customize your metric with respect to your application. Say you want to train a classifier for cats: algorithm A achieves a classification error of 3% and algorithm B achieves 5%. But algorithm A may mistake some pornographic images for cat images, which cannot be allowed or tolerated by your company or your users. Thus algorithm B should be the one to choose. What's more, in this case another option is to modify your cost function, which can be written as follows:
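The idea of weighting unacceptable mistakes can be sketched like this. The weight of 10, the helper name `weighted_error`, and the toy labels are all made up for illustration:

```python
import numpy as np

def weighted_error(y_true, y_pred, is_porn, w=10):
    """Classification error where mistakes on pornographic images count w times more."""
    weights = np.where(is_porn, w, 1)              # per-example weight
    mistakes = (y_true != y_pred).astype(float)
    return np.sum(weights * mistakes) / np.sum(weights)

y_true  = np.array([1, 1, 0, 0])
y_pred  = np.array([1, 0, 1, 0])                   # one ordinary miss, one porn-as-cat miss
is_porn = np.array([False, False, True, False])
print(weighted_error(y_true, y_pred, is_porn))     # the porn mistake dominates the error
```

Under this metric, an algorithm that misclassifies pornographic images is penalized far more heavily, so the metric itself encodes the company's preference.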
Before we dive deeper into our machine learning strategies, let's talk about human-level performance, which is often mentioned in the deep learning field.
Bayes optimal error is defined as the best possible error. In other words, no function mapping from x to y can surpass a certain level of accuracy. We often treat human-level performance as a proxy for Bayes optimal error in unstructured-data tasks, such as natural language processing.
Furthermore, you can easily observe that progress slows down once the accuracy of a machine learning method surpasses human-level performance.
By knowing what the human-level performance is, it is possible to tell whether a model is performing well. Say in scenario A, human-level performance reaches 1% classification error, our model reaches 8%, and the dev error is 10%. If we view human-level performance as the Bayes optimal error, the avoidable bias is 8% - 1% = 7% and the variance is 10% - 8% = 2%. In this case, you should try different ways to reduce bias, since you have more room to optimize in bias (7%) than in variance (2%). But in scenario B, you may have 0.5% bias and 2% variance, and then you should probably think about how to reduce the variance instead.
The main content of this week is also about improving your neural network. Here is the list of points mentioned in this week's lessons:
I'll talk about them in as much detail as I can in the following passage. Since this week's assignment is not closely related to this week's content, I need to summarize it myself 😭.
This is the first point covered in this week's lesson. There are two methods to tackle hyperparameter tuning:
Let’s explain them with some vivid examples.
For the first case, if you use grid search, the number of distinct values you try for each parameter will be smaller than with random search, as this image shows:

If you use grid search, you can only try 5 different values for hyperparameter 1 or hyperparameter 2. But if you sample random values, as the method on the right does, many more values can be tried within the same 25 attempts, which helps you find a better tuning range in fewer trials.

For the second case, the phrase "coarse to fine" means trying a large scale first and then shrinking the scale step by step in order to zero in on the best hyperparameter values.
Then, here are some additional tips. Use a log scale to help you sample parameter values. Say you need to find a good value between 0.001 and 1; a recommended way is like this:

```python
temp = -3 * np.random.rand()   # uniform in [-3, 0] (note: rand, not randn)
learning_rate = 10 ** temp     # log-uniform in [0.001, 1]
```
Another method for choosing hyperparameters is to use exponentially weighted averages (you can search online for further information, since I haven't fully understood it yet 😂).
Remember the normalization we talked about in the first week? Have you thought about whether normalizing the values at each layer could speed up training? That's the main idea of batch norm. The whole process is like this:
Google described batch normalization clearly in their ICML paper: for each SGD mini-batch, normalize the activations using the mini-batch statistics. This way, the mean becomes 0 and the variance becomes 1.
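A minimal numpy sketch of just the normalization step batch norm performs on one layer's pre-activations; `gamma`, `beta`, and `eps` are the usual learnable scale/shift and numerical-stability constant (names mine, not from the course code):

```python
import numpy as np

def batch_norm_forward(Z, gamma=1.0, beta=0.0, eps=1e-8):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu = Z.mean(axis=1, keepdims=True)       # per-feature mean over the batch
    var = Z.var(axis=1, keepdims=True)       # per-feature variance over the batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)   # mean 0, variance 1
    return gamma * Z_norm + beta

Z = np.random.randn(4, 32) * 5 + 3           # 4 features, batch of 32, shifted/scaled
Z_tilde = batch_norm_forward(Z)
print(Z_tilde.mean(axis=1))                  # each feature's mean is now ~0
print(Z_tilde.var(axis=1))                   # each feature's variance is now ~1
```

This is only the inference-free forward pass; the real layer also learns `gamma`/`beta` and tracks running statistics.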
For logistic regression, we use the sigmoid activation to output 0/1. But if we want to output one of C classes (C >= 2), softmax is an activation that generalizes sigmoid.
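Softmax itself is easy to sketch in numpy (this is not the assignment's TensorFlow version, just the bare formula):

```python
import numpy as np

def softmax(z):
    """Softmax over the class dimension; subtracting the max is for numerical stability."""
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)

z = np.array([[1.0], [2.0], [3.0]])   # scores for 3 classes, 1 example
p = softmax(z)
print(p.sum())                        # the outputs form a probability distribution
```

With C = 2 it reduces to the familiar sigmoid decision, which is why it is viewed as sigmoid's generalization.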
I'm not familiar with it yet, so I'll just mention some syntax from the assignment.
Writing and running programs in TensorFlow has the following steps:
```python
import numpy as np
```
I don't really know what this thing is or what it's for, but you can just view it as a placeholder: a variable that hasn't been given a value yet. Since the idea of TensorFlow is to build a computation graph without any exact values at first, this method fits well. Just use feed_dict when you run the model to pass the actual numbers to the placeholder.
For the sigmoid function:
```python
def sigmoid(z):
    ...
```
Many times in deep learning you will have a y vector with numbers ranging from 0 to C-1, where C is the number of classes. If C is, for example, 4, then you might have the following y vector, which you will need to convert as follows:
This is called a “one hot” encoding, because in the converted representation exactly one element of each column is “hot” (meaning set to 1). To do this conversion in numpy, you might have to write a few lines of code. In tensorflow, you can use one line of code:
```python
# GRADED FUNCTION: one_hot_matrix
def one_hot_matrix(labels, C):
    ...
```
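For comparison, the same conversion in plain numpy is also short. The helper name `one_hot` is mine, not the assignment's:

```python
import numpy as np

def one_hot(labels, C):
    """Convert a vector of labels in {0, ..., C-1} to a (C, m) one-hot matrix."""
    return np.eye(C)[labels].T   # pick rows of the identity, then transpose to (C, m)

y = np.array([1, 2, 3, 0, 2, 1])
print(one_hot(y, C=4))           # exactly one 1 per column
```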
There are many built-in cost functions that help us compute costs. Here we use `tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = ..., labels = ...))`.
```python
def compute_cost(Z3, Y):
    ...
```
I think this week's assignment has little to say, so you can just skip it (hhh). As for the usage of TensorFlow and other deep learning frameworks, like Keras, I think I'll write more blogs about them later. But for now, the theoretical knowledge is the most important. Let's go on to course 3!
The main content of this week is about optimization method. Here is the list for the points of this week’s lessons:
I'll discuss them in detail in the following part. Optimization is an important part of machine learning and deep learning, for it helps us train better neural networks on large training sets. Say 100 hours are required to train a NN using batch gradient descent; then maybe 10 hours are enough when applying the Adam algorithm. These methods let us spend more time iterating on our models rather than waiting for training results. Let's see what these optimization methods really are.
This is the simplest optimization method in machine learning, and we have used it throughout the past 4 assignments, so I won't bother showing the process of GD. A variant of GD is Stochastic Gradient Descent (SGD), which is equivalent to mini-batch gradient descent where each mini-batch has just 1 example. The code examples below illustrate the difference between SGD and GD.
(Batch) Gradient Descent
```python
X = data_input
```
Stochastic Gradient Descent
```python
X = data_input
```
When the training set is large, SGD can be faster but the parameters will oscillate toward the minimum rather than converge smoothly. Here is an illustration of this:
Mini-batch GD is also a common way to accelerate the training process; it is also shown in the Machine Learning course taught by Andrew Ng, if I remember correctly. Two steps are needed to build mini-batches from the training set (X, Y):

- Shuffle: create a shuffled version of (X, Y), so that examples end up randomly distributed across mini-batches.
- Partition: split the shuffled (X, Y) into mini-batches of size `mini_batch_size`. Note that the number of training examples cannot always be divided evenly by `mini_batch_size`, so the last mini-batch may be smaller than the specified `mini_batch_size`, which should look like this:

This way, you don't need to traverse the whole training set on each step; just one mini-batch is enough.
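The two steps above can be sketched as follows; this is a simplified version of what the assignment's `random_mini_batches` does (examples stored as columns, toy sizes chosen so the last batch is ragged):

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Shuffle the columns of (X, Y), then partition; the last batch may be smaller."""
    np.random.seed(seed)
    m = X.shape[1]                           # number of examples (one per column)
    perm = np.random.permutation(m)          # step 1: shuffle
    X_s, Y_s = X[:, perm], Y[:, perm]
    batches = []
    for k in range(0, m, mini_batch_size):   # step 2: partition
        batches.append((X_s[:, k:k + mini_batch_size],
                        Y_s[:, k:k + mini_batch_size]))
    return batches

X = np.random.randn(5, 148)                  # 148 is deliberately not a multiple of 64
Y = np.random.randn(1, 148)
batches = random_mini_batches(X, Y)
print([mb[0].shape[1] for mb in batches])    # [64, 64, 20]
```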
Since we don't pass the whole training set through on each step, just a subset of examples, we don't compute the exact gradient of the full cost function. Consequently, the path taken by mini-batch gradient descent oscillates as it converges. Using momentum can reduce these oscillations.
Momentum takes the past gradients into account to smooth out the updates. We store the "direction" of the previous gradients in a variable $v$; formally, $v$ is the exponentially weighted average of the gradients on previous steps.
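A sketch of the momentum update for a single parameter matrix; `beta = 0.9` and `lr = 0.1` are illustrative defaults:

```python
import numpy as np

def momentum_update(W, dW, v, beta=0.9, lr=0.1):
    """v accumulates an exponentially weighted average of past gradients."""
    v = beta * v + (1 - beta) * dW   # smooth the gradient direction
    W = W - lr * v                   # step along the smoothed direction
    return W, v

W = np.zeros((2, 2))
v = np.zeros_like(W)                 # the velocity starts at zero
dW = np.ones((2, 2))
for _ in range(3):                   # three identical gradients: v grows toward dW
    W, v = momentum_update(W, dW, v)
print(v[0, 0])                       # ~0.271, i.e. 1 - 0.9**3
```

With a constant gradient, $v$ approaches the gradient itself; with an oscillating gradient, the oscillating components cancel out, which is exactly why momentum damps the zig-zag path.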
Adam is one of the most effective optimization algorithms for training neural networks. It combines ideas from RMSProp (described in lecture) and Momentum.
How it works: Adam keeps an exponentially weighted average of past gradients (the Momentum part) and of past squared gradients (the RMSProp part), applies bias correction to both, and combines them in the parameter update.
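A single-parameter sketch of the Adam update just described; `beta1 = 0.9`, `beta2 = 0.999`, and `epsilon = 1e-8` are the standard defaults, and the constant gradient of 2.0 is a toy input:

```python
import numpy as np

def adam_update(W, dW, v, s, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum on dW, RMSProp on dW**2, with bias correction."""
    v = beta1 * v + (1 - beta1) * dW        # 1st moment (Momentum part)
    s = beta2 * s + (1 - beta2) * dW ** 2   # 2nd moment (RMSProp part)
    v_hat = v / (1 - beta1 ** t)            # bias correction for the early steps
    s_hat = s / (1 - beta2 ** t)
    W = W - lr * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s

W, v, s = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 101):                     # feed a constant gradient of 2.0
    W, v, s = adam_update(W, np.array([2.0]), v, s, t)
print(W)                                    # W has steadily decreased from 1.0
```

Note that the effective step size is roughly `lr` regardless of the gradient's magnitude, because the first moment is divided by the square root of the second; that scale invariance is much of Adam's appeal.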
The easiest part in this homework:
```python
def update_parameters_with_gd(parameters, grads, learning_rate):
    ...
```
Note that the last mini-batch might end up smaller than `mini_batch_size=64`. Let $\lfloor s \rfloor$ represent $s$ rounded down to the nearest integer (this is `math.floor(s)` in Python). If the total number of examples is not a multiple of `mini_batch_size=64`, then there will be $\lfloor \frac{m}{64} \rfloor$ mini-batches with a full 64 examples, and the number of examples in the final mini-batch will be $m - 64 \times \lfloor \frac{m}{64} \rfloor$.
The code comments are quite useful for understanding how mini-batch generation is executed.
```python
def random_mini_batches(X, Y, mini_batch_size = 64, seed = 0):
    ...
```
You can view the steps mentioned above to understand the following code.
Initialize the velocity $v$:
```python
def initialize_velocity(parameters):
    ...
```
Implement update process with momentum
```python
def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    ...
```
Here we should initialize two extra variables, v and s.
```python
def initialize_adam(parameters):
    ...
```
And following the steps of the Adam algorithm, you can check the code below to verify your understanding of Adam.
```python
def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate = 0.01,
                                beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8):
    ...
```
Though the code above is quite long, you just need to focus on the four parts between `### START CODE HERE ###` and `### END CODE HERE ###`.
Finally, the assignment uses a dataset to show the training differences between GD, mini-batch GD, momentum, and Adam.

For the mini-batch GD method, you get this image after 9000 epochs:

For momentum, you get this image after 9000 epochs; as you can see, it offers little improvement over the former.

The last one uses Adam optimization, which improves accuracy markedly, by almost 20%, after 9000 epochs.

After this assignment, which method to choose seems obvious. You can view the programming code in detail on my github.
There are three parts in this week's lessons:

Three individual assignments correspond to the above parts. Here I just list the most important points (in my view).
First of all, initializing the parameter $w$ with zeros should be avoided, since our neural network would not be able to break symmetry. This problem was explained last week in Ng's course with a concrete example, so I won't belabor it. Therefore, you should initialize the parameter $w$ with random values.
Another point worth mentioning: at what scale should $w$ be initialized? If you have heard of "Xavier initialization", this is similar: Xavier uses a scaling factor of `sqrt(1/layers_dims[l-1])` for the weights $W^{[l]}$, whereas He initialization uses `sqrt(2/layers_dims[l-1])`. If we initialize $w$ too large, then $J$ starts out large, which makes convergence difficult (as we mentioned before, the derivative of the activation at relatively large or small values is very small).
This technique is widely used both in deep learning and machine learning. Deep Learning models have so much flexibility and capacity that overfitting can be a serious problem, if the training dataset is not big enough. Sure it does well on the training set, but the learned network doesn’t generalize to new examples that it has never seen! Thus, we can use regularization to help us avoid this problem. There are two common ways to implement this technique: L1/L2 regularization and Dropout.
This method is also used in machine learning with linear regression and logistic regression. First you need to standardize your data (in other words, make the variance 1 and the mean 0). Then we add a regularization term, which may be either the L1 norm or the L2 norm, depending on the case. Here is a math expression:

And here is a brief illustration of what L2 regularization is actually doing:
L2regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the square values of the weights in the cost function you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes.
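The L2 penalty added to the cross-entropy cost can be sketched like this; `lambd`, the layer count, and the toy weight matrices are illustrative:

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """(lambda / 2m) * sum of squared entries, summed over all weight matrices."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
W2 = np.array([[3.0, 1.0]])
penalty = l2_penalty([W1, W2], lambd=0.7, m=10)
print(penalty)   # this gets added on top of the cross-entropy cost
```

Because the penalty grows with the squared weights, gradient descent on the total cost pushes every weight toward smaller values, which is precisely the "smoother model" effect described above.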
Dropout is a regularization technique specific to neural networks, and it is also widely used. It randomly shuts down some neurons in each iteration. Why does the dropout method make sense?
When you shut some neurons down, you actually modify your model. The idea behind dropout is that at each iteration, you train a different model that uses only a subset of your neurons. With dropout, your neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time.
When it comes to implementing dropout, it is quite easy: just apply a mask each time you do a forward propagation and a backward propagation, and the mask for these two passes must be the same. The mask is a boolean matrix with the same shape as the layer's activation matrix.
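The mask idea can be sketched as follows; this is inverted dropout on a single activation matrix, with `keep_prob = 0.8` and the matrix sizes chosen arbitrarily:

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.8
A1 = np.random.randn(3, 5)                  # activations of some hidden layer

D1 = np.random.rand(*A1.shape) < keep_prob  # boolean mask: True with prob keep_prob
A1_drop = (A1 * D1) / keep_prob             # shut neurons down, rescale (inverted dropout)

# During backprop, the SAME mask is reapplied to the gradient:
dA1 = np.random.randn(3, 5)
dA1_drop = (dA1 * D1) / keep_prob
print(A1_drop[~D1])                         # the dropped entries are exactly 0
```

Dividing by `keep_prob` keeps the expected value of the activations unchanged, so the cost has the same scale with or without dropout.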
Here is a small story about dropout provided in assignment:
To understand dropout, consider this conversation with a friend:
 Friend: “Why do you need all these neurons to train your network and classify images?”.
 You: "Because each neuron contains a weight and can learn specific features/details/shapes of an image. The more neurons I have, the more features my model learns!"
 Friend: “I see, but are you sure that your neurons are learning different features and not all the same features?”
 You: "Good point… Neurons in the same layer actually don't talk to each other. It should be definitely possible that they learn the same image features/shapes/forms/details… which would be redundant. There should be a solution."
And here is a video that helps you understand it better:

In the assignment we will dive deeper into the implementation of dropout.
This part is optional, I think, since nowadays we rarely need to derive the gradients of the cost function by hand (we have frameworks like TensorFlow and Keras :) ). However, this part also sharpened my numpy skills. And I think this part of the assignment is the only one that cost me an extra 10 minutes, because the assignment has some small mistakes; I will mention them later. Here we only focus on how to use gradient checking. The formula is:
$$ \frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon} \tag{1}$$
If we let $\varepsilon$ be relatively small, then our computed gradient and the value computed with the above formula should be quite similar. To quantify this similarity, we use the following formula:
$$ difference = \frac {\mid\mid grad - gradapprox \mid\mid_2}{\mid\mid grad \mid\mid_2 + \mid\mid gradapprox \mid\mid_2} \tag{2}$$
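For a single scalar parameter, the whole check fits in a few lines. The cost $J(\theta) = \theta^3$ here is a toy function, not the assignment's network:

```python
def J(theta):
    """Toy cost function with known analytic gradient 3*theta**2."""
    return theta ** 3

theta, eps = 2.0, 1e-7
grad = 3 * theta ** 2                                    # analytic gradient: 12.0
gradapprox = (J(theta + eps) - J(theta - eps)) / (2 * eps)

difference = abs(grad - gradapprox) / (abs(grad) + abs(gradapprox))
print(difference)   # tiny, so the analytic gradient checks out
```

If the analytic gradient had a bug (say, `2 * theta ** 2`), the difference would jump to around 0.1 instead of ~1e-10, which is how the check catches backprop mistakes.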
I think I've covered the main part of this week's content. Now let's have a look at the assignment.
In this assignment, only six lines are needed if you just want to finish it, and all of them are quite easy. But we should know that a well-chosen initialization can:
For the ReLU activation, we often use He initialization: just multiply the randomly initialized parameter $w$ by $\sqrt{\frac{2}{\text{dimension of the previous layer}}}$. Here is a sample of how to implement "He initialization":
```python
def initialize_parameters_he(layers_dims):
    ...
```
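As a sketch of the same idea (the layer sizes in the example and the helper name `initialize_he` are mine):

```python
import numpy as np

def initialize_he(layers_dims, seed=3):
    """Random weights scaled by sqrt(2 / fan_in); biases start at zero."""
    np.random.seed(seed)
    params = {}
    for l in range(1, len(layers_dims)):
        params["W" + str(l)] = (np.random.randn(layers_dims[l], layers_dims[l - 1])
                                * np.sqrt(2.0 / layers_dims[l - 1]))
        params["b" + str(l)] = np.zeros((layers_dims[l], 1))
    return params

params = initialize_he([4, 10, 1])                 # 4 inputs, 10 hidden units, 1 output
print(params["W1"].shape, params["b1"].shape)      # (10, 4) (10, 1)
```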
Here is the dataset we use in this assignment
A non-regularized model is more likely to overfit the training data, which gives a not-so-perfect prediction on the test set, like this:

Though it gets an accuracy of 94.7% on the training set, it only achieves 91.5% accuracy on the test set.
Here, we use L2 regularization to see if anything changes.

The same image is shown to aid your comprehension of this process.

For forward propagation, we need to add a regularization term:
```python
def compute_cost_with_regularization(A3, Y, parameters, lambd):
    ...
```
Also, since we added a new term, the derivative of the cost function with respect to the parameters also changes. For each layer, we have to add the regularization term's gradient ($\frac{d}{dW} ( \frac{1}{2}\frac{\lambda}{m} W^2) = \frac{\lambda}{m} W$).
```python
# GRADED FUNCTION: backward_propagation_with_regularization
def backward_propagation_with_regularization(X, Y, cache, lambd):
    ...
```
Here is an image of our L2-regularization model:

You can see that though the accuracy on the training set drops a little, the accuracy on the test set improves by 2%! This is the power of regularization 😃
The last regularization method, dropout, is the main point we need to discuss. The idea behind dropout is that at each iteration, you train a different model that uses only a subset of your neurons. Dropout has two parts: one in forward propagation and the other in backward propagation. For each layer, you need to implement the following steps:
1. Use `np.random.rand()` to randomly get numbers between 0 and 1. Here, you will use a vectorized implementation, so create a random matrix $D^{[1]} = [d^{(1)[1]} d^{(2)[1]} … d^{(m)[1]}]$ of the same dimension as $A^{[1]}$.
2. Set each entry of $D^{[1]}$ to 0 with probability (1 - `keep_prob`) or 1 with probability `keep_prob`, by thresholding values in $D^{[1]}$ appropriately. Hint: to set all the entries of a matrix X to 0 (if the entry is less than 0.5) or 1 (if the entry is more than 0.5) you would do: `X = (X < 0.5)`. Note that 0 and 1 are respectively equivalent to False and True.
3. Multiply $A^{[1]}$ by $D^{[1]}$, shutting down the neurons whose mask entry is 0.
4. Divide $A^{[1]}$ by `keep_prob`. By doing this you are assuring that the result of the cost will still have the same expected value as without dropout. (This technique is also called inverted dropout.)

Honestly speaking, I couldn't follow what the first step was saying 😂. I think what you need to do is just generate a new matrix whose values are all between 0 and 1, then use `keep_prob` to turn it into a boolean matrix used as a mask. Here is the code for a 3-layer neural network:
```python
def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5):
    ...
```
For backward propagation, you also have to implement two steps:

1. You had previously shut down some neurons during forward propagation, by applying the mask $D^{[1]}$ to `A1`. In backpropagation, you will have to shut down the same neurons, by reapplying the same mask $D^{[1]}$ to `dA1`.
2. During forward propagation, you had divided `A1` by `keep_prob`. In backpropagation, you'll therefore have to divide `dA1` by `keep_prob` again (the calculus interpretation is that if $A^{[1]}$ is scaled by `keep_prob`, then its derivative $dA^{[1]}$ is also scaled by the same `keep_prob`).

But I think the code illustrates this process more clearly.
```python
def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    ...
```
And here is the image for the dropout method. You can see it works rather well: the accuracy on the test set has improved to 95%! Like magic!
What’s more:
Note that regularization hurts training set performance! This is because it limits the ability of the network to overfit to the training set. But since it ultimately gives better test accuracy, it is helping your system.
This part checks whether your backward propagation is correct. Here are the instructions on how to implement gradient checking:
Instructions: Here is pseudocode that will help you implement the gradient check.
For each i in num_parameters:

- To compute `J_plus[i]`: set $\theta^{+}$ to `np.copy(parameters_values)`, add $\varepsilon$ to $\theta^{+}_i$, then compute `J_plus[i]` with `forward_propagation_n(x, y, vector_to_dictionary(theta_plus))`.
- To compute `J_minus[i]`: do the same thing with $\theta^{-}$.
- Compute $gradapprox[i] = \frac{J^{+}_i - J^{-}_i}{2 \varepsilon}$.

Thus, you get a vector gradapprox, where gradapprox[i] is an approximation of the gradient with respect to `parameter_values[i]`. You can now compare this gradapprox vector to the gradients vector from backpropagation. Just like for the 1-D case (Steps 1', 2', 3'), compute: $$ difference = \frac{\| grad - gradapprox \|_2}{\| grad \|_2 + \| gradapprox \|_2} \tag{3}$$

Here is part of the code:
```python
def gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7):
    ...
```
Attention: you won't get the exact expected output written in the material even if your answer is correct; there is a small discrepancy in the `difference` value.
This is the first assignment set for course 2. I've learned a lot this week: initialization, regularization, standardization, and gradient checking. Let's go on with next week's material!
This post corresponds to the third assignment (i.e., the week-4 assignment) of Andrew Ng's Neural Networks and Deep Learning course on Coursera. Since I don't have a Visa card, I could only take the course on NetEase Cloud Classroom, and the assignments are versions found circulating online QAQ. This post reviews the week's content and walks through the assignment.
Here Ng first has us implement a two-layer neural network, and then an L-layer one. Implementing the two side by side is presumably meant to show their similarities and differences (there is hardly any difference), though switching back and forth while writing was a bit annoying. The key part is implementing backpropagation through multiple layers; once you grasp that, the rest is easy. Ng gives us a computation diagram to help understand it.

And here is the computation process of a two-layer neural network:

Let's start with the two-layer network. First we go from the input layer ($A^{[0]}$) to the hidden layer, then from the hidden layer to the output layer, which makes two layers. The hidden layer's activation uses the ReLU function, and the output layer uses sigmoid. Since this is a 0/1 classification problem, the output layer uses sigmoid; elsewhere sigmoid is rarely used because it converges slowly. Then comes the concrete forward-propagation computation, i.e., computing the loss function.
For each layer $l$:

$$Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]}\tag{4}$$

where $A^{[0]} = X$.

The computed $Z^{[l]}$ is then fed into an activation function. Ng's assignment provides ready-made sigmoid and ReLU implementations for us:
Sigmoid: $\sigma(Z) = \sigma(W A + b) = \frac{1}{ 1 + e^{-(W A + b)}}$. We have provided you with the `sigmoid` function. This function returns two items: the activation value `a` and a `cache` that contains `Z` (it's what we will feed to the corresponding backward function). To use it you can just call:

```python
A, activation_cache = sigmoid(Z)
```
ReLU: The mathematical formula for ReLU is $A = RELU(Z) = \max(0, Z)$. We have provided you with the `relu` function. This function returns two items: the activation value `A` and a `cache` that contains `Z` (it's what we will feed to the corresponding backward function). To use it you can just call:

```python
A, activation_cache = relu(Z)
```
For an L-layer network, we just repeat the hidden-layer computation L-1 times. A for loop is used here; since the computation is inherently sequential, it cannot be parallelized across layers with matrix tricks.

Also note that backpropagation later needs the Z value computed at every layer, so during forward propagation we must cache these Z values, along with the corresponding W and b, all stored in a dict for later use.

Here we split backpropagation into two parts: computing dZ as one part, and computing $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$ as the other.

For the latter, given dZ we have the following formulas:
$$ dW^{[l]} = \frac{\partial \mathcal{L} }{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1] T} \tag{8}$$
$$ db^{[l]} = \frac{\partial \mathcal{L} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{(i)[l]} \tag{9}$$
$$ dA^{[l-1]} = \frac{\partial \mathcal{L} }{\partial A^{[l-1]}} = W^{[l] T} dZ^{[l]} \tag{10}$$
You can refer to the more concrete implementation diagram below:

Computing dZ is then straightforward, because Ng provides built-in functions: just pass dA to one of them and you get dZ. Still, we should know how dZ is computed:
$$dZ^{[l]} = dA^{[l]} * g’(Z^{[l]}) \tag{11}$$
Here I only comment on selected points; for the complete assignment you can visit my github.

The most important part here is keeping the shape of each W and b straight: for each layer, W has shape (size of current layer, size of previous layer), and b has shape (size of current layer, 1). I learned this lesson the hard way in assignment 2, and I won't forget it 😭

Here I only give the initialization for an L-layer network:
```python
def initialize_parameters_deep(layer_dims):
    ...
```
One more point: the assignment splits the cache into linear_cache and activation_cache. linear_cache holds (A, W, b), while activation_cache holds Z. Keep these straight; I mixed the two up at first, which broke the final cost computation.

Then we integrate these pieces into the complete forward-propagation function for an L-layer model:
```python
def L_model_forward(X, parameters):
    ...
```
Attention: one thing to note is that each step computes dW, db, and dA, where dA is the partial derivative with respect to the previous layer's A.

All three assignments use the same loss function:
$$-\frac{1}{m} \sum\limits_{i = 1}^{m} \left(y^{(i)}\log\left(a^{(i)[L]}\right) + (1-y^{(i)})\log\left(1- a^{(i)[L]}\right)\right) \tag{7}$$
```python
def compute_cost(AL, Y):
    ...
```
Analyzing a single layer:

The corresponding formulas are written out below:
$$ dW^{[l]} = \frac{\partial \mathcal{L} }{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1] T} \tag{8}$$
$$ db^{[l]} = \frac{\partial \mathcal{L} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{(i)[l]}\tag{9}$$
$$ dA^{[l-1]} = \frac{\partial \mathcal{L} }{\partial A^{[l-1]}} = W^{[l] T} dZ^{[l]} \tag{10}$$
```python
def linear_backward(dZ, cache):
    ...
```
For the integrated `def L_model_backward(AL, Y, caches):`, note one thing: the initial dAL is computed directly from AL and Y, outside the loop, and has nothing to do with the per-layer gradients computed afterwards. Finally, combine the backpropagation of every layer. Oddly, the instructions say this should take about 5 lines, but I finished it in one; not sure if I missed something...
```python
def L_model_backward(AL, Y, caches):
    ...
```
Finally, update w and b after each step and you're done~

The fourth assignment is an application and fairly simple: wire the functions from the previous assignment together into a complete neural network that predicts whether an image is a cat.

Below is a complete L-layer model, which also plots the cost curve:
```python
def L_layer_model(X, Y, layers_dims, learning_rate=0.0075, num_iterations=3000, print_cost=False):  # lr was 0.009
    """
    Implements a L-layer neural network: [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID.

    Arguments:
    X -- data, numpy array of shape (number of examples, num_px * num_px * 3)
    Y -- true "label" vector (containing 0 if cat, 1 if non-cat), of shape (1, number of examples)
    layers_dims -- list containing the input size and each layer size, of length (number of layers + 1).
    learning_rate -- learning rate of the gradient descent update rule
    num_iterations -- number of iterations of the optimization loop
    print_cost -- if True, it prints the cost every 100 steps

    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """
    np.random.seed(1)
    costs = []  # keep track of cost

    # Parameters initialization.
    parameters = initialize_parameters_deep(layers_dims)

    # Loop (gradient descent)
    for i in range(0, num_iterations):
        # Forward propagation: [LINEAR -> RELU]*(L-1) -> LINEAR -> SIGMOID.
        AL, caches = L_model_forward(X, parameters)

        # Compute cost.
        cost = compute_cost(AL, Y)

        # Backward propagation.
        grads = L_model_backward(AL, Y, caches)

        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)

        # Print the cost every 100 training examples
        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" % (i, cost))
        if print_cost and i % 100 == 0:
            costs.append(cost)

    # plot the cost
    plt.plot(np.squeeze(costs))
    plt.ylabel('cost')
    plt.xlabel('iterations (per tens)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

    return parameters
```
I used the trained model to predict a cute-looking cat, and it worked.

The model also reaches 80% accuracy on the test set, which is much better than our earlier single-layer logistic regression.

With that, all the assignments of the first course are done. Neural networks are not as complicated as I imagined... well, I've only scratched the surface.
This post corresponds to the second assignment (i.e., the week-3 assignment) of Andrew Ng's Neural Networks and Deep Learning course on Coursera. Since I don't have a Visa card, I could only take the course on NetEase Cloud Classroom, and the assignments are versions found circulating online QAQ. This post reviews the week's content and walks through the assignment.
This week, Ng shows us a simple multi-layer NN: a two-layer network with one hidden layer and one output layer. We again train the parameters iteratively with gradient descent. It isn't too different from logistic regression; the only tricky part is the chain-rule differentiation needed for backpropagation from the hidden layer back to the input layer. That's not hard for calculus experts, but it was painful for me qwq.

Forward propagation computes the loss, i.e., the cost function. The concrete process is as follows:
For one example $x^{(i)}$:
$$z^{[1] (i)} = W^{[1]} x^{(i)} + b^{[1] (i)}\tag{1}$$
$$a^{[1] (i)} = \tanh(z^{[1] (i)})\tag{2}$$
$$z^{[2] (i)} = W^{[2]} a^{[1] (i)} + b^{[2] (i)}\tag{3}$$
$$\hat{y}^{(i)} = a^{[2] (i)} = \sigma(z^{[2] (i)})\tag{4}$$
$$y^{(i)}_{prediction} = \begin{cases} 1 & \text{if } a^{[2] (i)} > 0.5 \\ 0 & \text{otherwise} \end{cases}\tag{5}$$

In short: predict 1 when $a^{[2] (i)} > 0.5$, and 0 otherwise.

Then you can easily compute the cost function:
$$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \large\left(\small y^{(i)}\log\left(a^{[2] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[2] (i)}\right) \large \right) \small \tag{6}$$
First, here is one of Ng's diagrams, which makes things clearer:

Based on the figure above, we can write out, in reverse order, the formulas for each derivative we need.

What we actually compute are the following equations:
The notation you will use is common in deep learning coding:
The parameters are updated each time as follows:

$\theta = \theta - \alpha \frac{\partial J }{ \partial \theta }$, where $\alpha$ is the learning rate and $\theta$ stands for our parameters (w1, b1, w2, b2). Run the dataset through this a few times, and our neural network is trained~
Here I only comment on selected points; for the complete assignment you can visit my github.

As you can see, this is clearly a linearly non-separable problem; logistic regression achieves only 47% accuracy, which is why we need a neural network model.

The parameter shapes deserve special attention.

Each layer has its own $W_i$, $b_i$, and activation function (sigmoid, tanh, ReLU). $W_i$ has shape (size of current layer, size of previous layer), and $b_i$ has shape (size of current layer, 1).
```python
def initialize_parameters(n_x, n_h, n_y):
    ...
```
This is of course the forward propagation described earlier; just be careful about matrix sizes when multiplying (transposition issues).
```python
def forward_propagation(X, parameters):
    ...
```
That is, computing the loss function; the formula provided here is:

$$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(} \small y^{(i)}\log\left(a^{[2] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[2] (i)}\right) \large{)} \small\tag{13}$$

The assignment asks us to use np.multiply, though I'm not sure why it matters; a plain element-wise multiplication should be fine.
```python
def compute_cost(A2, Y, parameters):
    ...
```
The first layer's activation function here is tanh, so:
To compute dZ1 you'll need to compute $g^{[1]\prime}(Z^{[1]})$. Since $g^{[1]}(.)$ is the tanh activation function, if $a = g^{[1]}(z)$ then $g^{[1]\prime}(z) = 1 - a^2$. So you can compute $g^{[1]\prime}(Z^{[1]})$ using `(1 - np.power(A1, 2))`.
If you don't remember the differentiated equations, scroll back up. One more thing!!! A pitfall that cost me an extra five minutes!!! The db1 that comes out of the equation's sum has shape (layer_size, 1), exactly the transpose of the (1, layer_size) shape we gave b1, so a naive update broadcasts b1 into a (4, 4) matrix... I therefore reshape it here.
def backward_propagation(parameters, cache, X, Y):
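A sketch of the backward pass; using keepdims=True in the bias sums keeps db1/db2 as column vectors, which sidesteps the (1, layer_size) vs. (layer_size, 1) trap described above:

```python
import numpy as np

def backward_propagation(parameters, cache, X, Y):
    """Backprop for the 2-layer net (the equations from the figure above)."""
    m = X.shape[1]
    A1, A2 = cache["A1"], cache["A2"]
    W2 = parameters["W2"]
    dZ2 = A2 - Y                                   # sigmoid + cross-entropy
    dW2 = (dZ2 @ A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)             # tanh derivative
    dW1 = (dZ1 @ X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}
```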
Then update the parameters; choose a suitable learning_rate to approach the current optimum. Finally, assemble everything into a complete NN model.
def nn_model(X, Y, n_h, num_iterations = 10000, print_cost=False):
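Only the signature of the model function survived above. Below is a self-contained sketch of the whole training loop (initialization, forward pass, backprop, update); the hyper-parameter defaults are my guesses, not necessarily the assignment's values:

```python
import numpy as np

def nn_model(X, Y, n_h, num_iterations=10000, learning_rate=1.2, seed=3):
    """Train the 1-hidden-layer net end to end; returns the parameters."""
    n_x, m = X.shape
    n_y = Y.shape[0]
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((n_h, n_x)) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = rng.standard_normal((n_y, n_h)) * 0.01
    b2 = np.zeros((n_y, 1))
    for _ in range(num_iterations):
        # forward pass
        A1 = np.tanh(W1 @ X + b1)
        A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))
        # backward pass
        dZ2 = A2 - Y
        dW2 = (dZ2 @ A1.T) / m
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
        dW1 = (dZ1 @ X.T) / m
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m
        # gradient-descent update
        W1 -= learning_rate * dW1
        b1 -= learning_rate * db1
        W2 -= learning_rate * dW2
        b2 -= learning_rate * db2
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}

def predict(parameters, X):
    """Threshold the output activation at 0.5, as in equation (5)."""
    A1 = np.tanh(parameters["W1"] @ X + parameters["b1"])
    A2 = 1.0 / (1.0 + np.exp(-(parameters["W2"] @ A1 + parameters["b2"])))
    return (A2 > 0.5).astype(int)
```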
Running the freshly written NN shows that things really are different this time! I then explored the size of the hidden layer: when it grows too large the model tends to overfit, though regularization can address that later on. And with that, this assignment is done. See you in the next one 👋
]]>This corresponds to the first assignment (the week 2 assignment) of Andrew Ng's Coursera course Neural Networks and Deep Learning. Having no Visa card, I could only follow it on NetEase Cloud Classroom, so the assignment itself is a copy circulating online QAQ. This post reviews the week's material and walks through the assignment.
Here Ng uses a logistic regression model to show how to build a single-layer neural network and explains the training process. We learn with gradient descent: iterate the parameters along the direction of steepest descent until the value of the loss function stabilizes.
That is, for an example $x^{(i)}$:$$z^{(i)} = w^T x^{(i)} + b \tag{1}$$$$\hat{y}^{(i)} = a^{(i)} = sigmoid(z^{(i)})\tag{2}$$ $$ \mathcal{L}(a^{(i)}, y^{(i)}) = - y^{(i)} \log(a^{(i)}) - (1-y^{(i)} ) \log(1-a^{(i)})\tag{3}$$
$$ J = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(a^{(i)}, y^{(i)})\tag{6}$$
Forward propagation here means computing the value of the loss from the input X and the current parameters w and b through the single-layer network, i.e. working through the graph to obtain $\mathcal{L}(a, y)$. Backpropagation then uses the chain rule to compute dL/dw and dL/db, and iterating these updates is what trains the model.
Basic calculus gives the following backpropagation results:
$$ \frac{\partial J}{\partial w} = \frac{1}{m}X(A-Y)^T\tag{7}$$$$ \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^m (a^{(i)}-y^{(i)})\tag{8}$$
We need to set a learning_rate here (sometimes called a shrinkage rate) to control how fast w and b change: too large and the loss oscillates, too small and training crawls.
Our goal is to train $w$ and $b$. For a parameter $\theta$, the update is $ \theta = \theta - \alpha \text{ } d\theta$, where $\alpha$ is the learning rate mentioned above.
This step is straightforward, so no further comment.
The assignment's requirements are:
1. Build the general architecture of a learning algorithm, including:
2. Gather all three functions above into a main model function, in the right order.
First, a sigmoid function.
# GRADED FUNCTION: sigmoid
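A minimal version of such a sigmoid function:

```python
import numpy as np

def sigmoid(z):
    """sigmoid(z) = 1 / (1 + e^-z); works on scalars and NumPy arrays."""
    return 1.0 / (1.0 + np.exp(-z))
```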
Then the forward and backward propagation, which return the cost and the gradients.
# FORWARD PROPAGATION (FROM X TO COST)
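Only the header comment of the propagation code survived; a self-contained sketch implementing equations (1)-(3) and (7)-(8) might look like this (the name `propagate` follows the course's convention but is an assumption here):

```python
import numpy as np

def propagate(w, b, X, Y):
    """One forward/backward pass of logistic regression.

    w is (n, 1), X is (n, m), Y is (1, m). Note np.dot for the matrix
    products and plain * for the element-wise products inside the cost.
    """
    m = X.shape[1]
    A = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))           # (1, m)
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dw = np.dot(X, (A - Y).T) / m                              # eq. (7)
    db = np.sum(A - Y) / m                                     # eq. (8)
    return {"dw": dw, "db": db}, float(cost)
```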
Note the difference between np.dot() and * here: the former is matrix multiplication, while the latter is an element-wise operation.
The so-called training process is just running over the dataset several times (what Keras calls epochs). It ends with the final w and b, and training is done.
for i in range(num_iterations):
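The loop above lost its body; a self-contained sketch of the optimization loop, with the propagation logic inlined, could be:

```python
import numpy as np

def optimize(w, b, X, Y, num_iterations, learning_rate):
    """Plain gradient descent: each iteration is one pass over the data.

    Records the cost every 100 iterations, as the assignment's plots do.
    """
    m = X.shape[1]
    costs = []
    for i in range(num_iterations):
        A = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))
        cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
        dw = np.dot(X, (A - Y).T) / m
        db = np.sum(A - Y) / m
        w = w - learning_rate * dw
        b = b - learning_rate * db
        if i % 100 == 0:
            costs.append(float(cost))
    return w, b, costs
```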
The homework also plots the learning_rate & cost figure. I picked 0.01, 0.003, and 0.0001 as the values; the curves are shown below.
With learning_rate = 0.01 the value of the loss oscillates heavily early on before settling down, but the model overfits. 0.003 is a good value: it performs better on the test set than the other two.
But the last part, making a prediction on a picture of my own choosing, is where it fell apart. I grabbed a random image, and the model failed miserably...
After all, it is only a logistic regression; it might not even beat a random forest. But I believe the neural networks coming up next will be equal to the task!
You can visit my GitHub to see the full assignment in detail.
]]>PhantomJS is a tool that renders JavaScript for us when we browse websites. Selenium can control the computer (to some extent) and drive the browser to input whatever we want; in a word, it imitates human activity in a browser. Combining these two tools makes crawling web pages much more convenient.
This post gives you a simple example of how to crawl information about available hotels in Shanghai tonight.
You can view here to check how to install Chrome and Selenium properly. You may need to download ChromeDriver if Selenium does not run correctly; put that file in your current working directory. Then the code below will get this Chrome driver:
options = webdriver.ChromeOptions()
import datetime
As we always do, inspect the elements we need. First visit the website https://hotel.qunar.com to find the elements we want.
From the image above, we can write this code:
1  # find the element we need: the city, 
If you run your code at this point, the page below will be shown:
Scrolling to the bottom, we find that the hotels on each page are loaded in two separate batches: only when you scroll to the bottom does the second batch load. So we need the driver to scroll to the bottom automatically.
# continue to scroll to end
Attention: we need to scroll to the bottom twice, because after the second batch of hotels loads we are left in the middle of the page rather than at the bottom.
Here we use BeautifulSoup to parse the HTML source. You can find that all the hotel information sits in a div whose id equals 'jsContentPanel'.
Then we just process the text extracted from the HTML and write it into hotel.txt.
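The post uses BeautifulSoup for this step. As a dependency-free illustration of the same idea, here is a stdlib-only sketch that extracts the text inside the div with id 'jsContentPanel' (the id comes from the post; everything else here is my own):

```python
from html.parser import HTMLParser

class PanelTextExtractor(HTMLParser):
    """Collects the text nodes inside <div id="jsContentPanel">...</div>."""

    def __init__(self, target_id="jsContentPanel"):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # div nesting depth, counted from the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":        # nested div inside the panel
                self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1          # entered the target div

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

def panel_text(html):
    parser = PanelTextExtractor()
    parser.feed(html)
    return parser.chunks
```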
# get the html source of current page
We use WebDriverWait to find the next-page button. Careful here: you may need to sleep for about 2 seconds, otherwise it will probably raise an exception:
Message: stale element reference: element is not attached to the page document
I don't know why, but after waiting a moment the exception disappeared.
# find the next-page button and click to the next page
The hotel information probably needs to be selected more carefully, since the raw output looks rather messy.
You can view here for the entire code; the comments are detailed as well. If you have any problems with this code, just comment below.
]]>Humans don't start their thinking from scratch every second. As you read this post, you understand each word based on your understanding of the previous ones. You don't throw everything away and start thinking from zero at every word; your thinking has persistence.
Traditional neural networks can't do this, and it is a major shortcoming. For example, if you want to classify what is happening at each scene of a movie, it's unclear how a traditional network could use its reasoning about earlier scenes to inform the next prediction.
Recurrent neural networks address this problem. They are networks with loops in them, which allow previous information to persist.
Recurrent Neural Networks have loops
In the diagram above, a neural network A reads an input $x_t$ and outputs a value $h_t$. The loop allows information from one step to be used in the next prediction.
These loops make RNNs seem somewhat mysterious. But if you think about it a bit more, you'll find they aren't all that different from ordinary neural networks. An RNN can be viewed as multiple copies of the same network, each passing a message to its successor. Consider what happens if we unroll the loop, as in the figure below:
An unrolled recurrent neural network
This chain-like structure reveals that RNNs are intimately tied to sequence problems; they are the natural network architecture for such data.
And people really do use them! In the last few years, RNNs have achieved remarkable success on problems such as speech recognition, language modeling, machine translation, image recognition... and so on. For the many impressive things people can do with RNNs, see Andrej Karpathy's excellent post, The Unreasonable Effectiveness of Recurrent Neural Networks.
Essential to those successes is the "LSTM", a modified kind of RNN that is used in many settings and performs much better than the standard version. Almost all the exciting results are based on LSTMs, and they are exactly what we explore today.
One highlight of RNNs is that they can connect previous information with the current prediction, for example using earlier video frames to interpret the current frame. If RNNs can really do this, the capability is extremely useful. But can they? It depends.
Sometimes we only need recent information to make the current prediction. Suppose a language model tries to predict the last word of a sentence from the preceding words. To predict the end of "the clouds are in the sky", we don't need any wider context: the next word is obviously sky. In cases like this, where very little preceding information is needed, RNNs can be trained to solve the problem.
But there are also cases that need much more context. Suppose we are predicting the end of "I grew up in France... I speak fluent French." The most recent information suggests the blank is a language, but to know which language we need the context of France, which appeared long, long ago. The gap we must bridge to reach that information can become very large.
Unfortunately, as the gap grows, RNNs become unable to connect the information.
In theory, RNNs are capable of handling such long-range problems; a human could carefully hand-tune the parameters to solve toy versions of them. In practice, however, RNNs don't seem able to learn them. The problem was explored in depth by Hochreiter (1991) [German] and Bengio, et al. (1994), who found the fundamental reasons why RNNs fail here.
Thankfully, LSTMs don't have this problem!
Long short term memory networks, usually abbreviated LSTM, are a special kind of RNN capable of learning long-range dependencies. They were introduced by Hochreiter & Schmidhuber (1997), were improved by many people in later work, and remain very popular today.
LSTMs are designed precisely for long-range dependency problems. Remembering information for long periods is their bread and butter, not something they struggle with.
All RNNs have a repeating chain structure. In standard RNNs, the repeating unit has a very simple structure, such as a single tanh layer.
The repeating module in a standard RNN contains a single layer
LSTMs also have this chain structure, but the repeating unit is organized completely differently. Instead of a single neural network layer, there are four, interacting in a very particular way.
The repeating module in an LSTM contains four interacting layers
Don't worry if the details above aren't clear yet; we will walk through the diagram step by step. For now, let's just try to get comfortable with the notation we'll be using.
In the diagram, each line carries a vector from the output of one node to the input of another. The pink circles represent vector operations, such as vector addition, and the yellow boxes are neural network layers (such as ReLU or tanh). Two lines merging means two vectors being combined, and a line splitting in two means its content is copied and sent in different directions.
The key to the LSTM is its cell state, the horizontal line running across the top of the diagram.
The cell state is like a conveyor belt. It runs across the whole chain, with only a few minor linear interactions with the other units. (I am not sure how to translate here.) This makes it easy for information to flow along stably.
The LSTM's cell has the ability to remove or add information; the structures that do this, to be precise, are called gates.
Gates can selectively let some information through. They consist of a sigmoid layer and a pointwise vector operation.
The sigmoid layer outputs values between 0 and 1, indicating how much information may pass: 0 means nothing passes, 1 means everything passes.
An LSTM has three such gates, used to protect and control its cell state.
The first question for our LSTM is: which information should we throw away from the cell, and which should we keep? This decision is made by a sigmoid layer called the "forget gate layer". It computes on $h_{t-1}$ and $x_t$, then picks a number between 0 and 1 for each entry of the cell state $C_{t-1}$ (I am not sure here): 1 means keep this information entirely, 0 means discard it all.
Now back to our language model, which predicts the next word from the preceding ones. In this problem, the cell state may carry information about the current subject, so that the correct word can be predicted. When we see a brand-new subject, we'd like to forget the information about the old one.
The next step is to decide which new information the cell should store. This has two parts: a sigmoid layer called the "input gate layer" decides which values we will update, and then a tanh layer creates a vector of new candidate values, $\tilde{C}_t$ (in the original this is C_t with a tilde on top, which I couldn't typeset; below, (~) after a symbol denotes that tilde), which will be added to the cell. In the next step we combine the two to see how they update the cell state.
In the language-model example, we want to add the new subject into the cell so that the model forgets the old one.
Now we want to update the cell state, from $C_{t-1}$ to $C_t$. The previous steps already decided what to do; we just need to actually do it.
We map the old state through $f_t$, forgetting what we decided to forget. Then we add $i_t * \tilde{C}_t$, the new candidate values, scaled by how much we decided to update each component of the state.
In the language model, this is the step where we actually drop the old subject's information and add the new subject's, just as we planned above.
Finally, we need to decide what to output. The output is based on the cell state, but passes through a filter. First a sigmoid layer decides which parts of the cell state we will output. Then we push the cell state through tanh (whose output lies between -1 and 1) and multiply it by the sigmoid gate's output, so that we only emit the parts we chose to output.
Back to the language model once more: having just seen a subject, it may want to output something relevant to a verb, in case that is what comes next. For example, it might output whether the subject is singular or plural, so that we know what form a following verb should take.
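The four gates walked through above can be condensed into a single NumPy time step. This is an illustrative sketch, not code from the post; the weight names (Wf, Wi, Wc, Wo) are my own labels:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step, following the four gates described above.

    Every gate looks at the concatenation [h_{t-1}; x_t].
    """
    concat = np.vstack([h_prev, x_t])                        # [h_{t-1}; x_t]
    f = sigmoid(params["Wf"] @ concat + params["bf"])        # forget gate
    i = sigmoid(params["Wi"] @ concat + params["bi"])        # input gate
    c_tilde = np.tanh(params["Wc"] @ concat + params["bc"])  # candidate values
    c = f * c_prev + i * c_tilde                             # new cell state
    o = sigmoid(params["Wo"] @ concat + params["bo"])        # output gate
    h = o * np.tanh(c)                                       # new hidden state
    return h, c
```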
The LSTM described so far is a fairly ordinary one, and not every LSTM looks like it. In fact, nearly every paper discussing LSTMs uses a slightly different variant. The differences are small, but worth mentioning.
A popular LSTM variant, introduced by Gers & Schmidhuber (2000), adds "peephole connections", meaning the gate layers are allowed to look at the cell state.
The figure above gives peepholes to all the gates, but many papers add peepholes to only some of them.
Another variant couples the forget and input gates. Instead of deciding independently what to forget and what to add, the two decisions are made together: we only forget when something new will take that place, and we only input new values when we are forgetting something.
A more interesting variant is the Gated Recurrent Unit (GRU), introduced by Cho, et al. (2014). It merges the forget and input gates into a single "update gate", and also merges the cell state and hidden state, along with a few other changes. The resulting model is simpler than the standard LSTM and has become very popular.
There are other popular LSTM variants too, such as the Depth Gated RNNs of Yao, et al. (2015). There are also quite different approaches to long-range dependencies, such as the Clockwork RNN of Koutnik, et al. (2014).
So which of these is best? Do the differences matter? Greff, et al. (2015) did a nice comparison and found the variants much the same. Jozefowicz, et al. (2015) tested more than ten thousand model architectures and found some that beat LSTMs on certain tasks.
Earlier, I mentioned some of the great results people achieve with RNNs. All of them can be reached with LSTMs; they really do work better on many tasks!
Written down as a pile of equations, LSTMs look somewhat intimidating. Fortunately, now that this post has walked you through the process step by step, don't they feel a little easier? (Maybe not.)
LSTMs were a big step forward from plain RNNs. The natural question is: is there another big step? A common view among researchers is: certainly, and it is attention! The idea is to let every step of an RNN look ahead and pick up information from a larger collection. For example, if you use an RNN to caption an image, it may pick a part of the image to look at for each word it outputs. In fact, Xu, et al. (2015) did exactly this experiment; if you want to explore attention, it may be a great starting point, and there seems to be much more waiting to be discovered.
Attention isn't the only thread in RNN research. The Grid LSTMs of Kalchbrenner, et al. (2015) look equally promising, and work using RNNs in generative models, e.g. Gregor, et al. (2015), Chung, et al. (2015), or Bayer & Osendorfer (2015), also looks very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise even more!
(I didn't bother translating the acknowledgments below; they're unrelated to the content.) I'm grateful to a number of people for helping me better understand LSTMs, commenting on the visualizations, and providing feedback on this post.
I’m very grateful to my colleagues at Google for their helpful feedback, especially Oriol Vinyals, Greg Corrado, Jon Shlens, Luke Vilnis, and Ilya Sutskever. I’m also thankful to many other friends and colleagues for taking the time to help me, including Dario Amodei, and Jacob Steinhardt. I’m especially thankful to Kyunghyun Cho for extremely thoughtful correspondence about my diagrams.
Before this post, I practiced explaining LSTMs during two seminar series I taught on neural networks. Thanks to everyone who participated in those for their patience with me, and for their feedback.
]]>