






EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

Mingxing Tan¹  Quoc V. Le¹

¹Google Research, Brain Team, Mountain View, CA. Correspondence to: Mingxing Tan <tanmingxing@google.com>.
Abstract

Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet. To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters. Source code is at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet.
1. Introduction

Scaling up ConvNets is widely used to achieve better accuracy. For example, ResNet (He et al., 2016) can be scaled up from ResNet-18 to ResNet-200 by using more layers; recently, GPipe (Huang et al., 2018) achieved 84.3% ImageNet top-1 accuracy by scaling up a baseline model to be four times larger. However, the process of scaling up ConvNets
Preprint, to appear in ICML 2019.
[Figure 1 plot: number of parameters (millions) vs. ImageNet top-1 accuracy (%) for ResNet, DenseNet, Inception-v2, Inception-ResNet-v2, ResNeXt-101, Xception, NASNet-A, AmoebaNet-A/C, SENet, and EfficientNet-B0 through B7. Inset table:]

Model                            Top-1 Acc.   #Params
ResNet-152 (He et al., 2016)     77.8%        60M
EfficientNet-B1                  78.8%        7.8M
ResNeXt-101 (Xie et al., 2017)   80.9%        84M
EfficientNet-B3                  81.1%        12M
SENet (Hu et al., 2018)          82.7%        146M
NASNet-A (Zoph et al., 2018)     82.7%        89M
EfficientNet-B4                  82.6%        19M
GPipe (Huang et al., 2018)       84.3%        556M
EfficientNet-B7†                 84.4%        66M
†Not plotted
Figure 1. Model Size vs. ImageNet Accuracy. All numbers are for single-crop, single-model. Our EfficientNets significantly outperform other ConvNets. In particular, EfficientNet-B7 achieves new state-of-the-art 84.4% top-1 accuracy while being 8.4x smaller and 6.1x faster than GPipe. EfficientNet-B1 is 7.6x smaller and 5.7x faster than ResNet-152. Details are in Tables 2 and 4.
has never been well understood, and there are currently many ways to do it. The most common way is to scale up ConvNets by their depth (He et al., 2016) or width (Zagoruyko & Komodakis, 2016). Another less common, but increasingly popular, method is to scale up models by image resolution (Huang et al., 2018). In previous work, it is common to scale only one of the three dimensions – depth, width, or image size. Though it is possible to scale two or three dimensions arbitrarily, arbitrary scaling requires tedious manual tuning and still often yields sub-optimal accuracy and efficiency.

In this paper, we want to study and rethink the process of scaling up ConvNets. In particular, we investigate the central question: is there a principled method to scale up ConvNets that can achieve better accuracy and efficiency? Our empirical study shows that it is critical to balance all dimensions of network width/depth/resolution, and surprisingly such balance can be achieved by simply scaling each of them with a constant ratio. Based on this observation, we propose a simple yet effective compound scaling method. Unlike conventional practice that arbitrarily scales these factors, our method uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients.
Figure 2. Model Scaling. (a) is a baseline network example; (b)-(d) are conventional scaling methods that each increase only one dimension of network width, depth, or resolution. (e) is our proposed compound scaling method that uniformly scales all three dimensions with a fixed ratio.
For example, if we want to use 2^N times more computational resources, then we can simply increase the network depth by α^N, width by β^N, and image size by γ^N, where α, β, γ are constant coefficients determined by a small grid search on the original small model. Figure 2 illustrates the difference between our scaling method and conventional methods.
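To make the arithmetic concrete, here is a minimal sketch (ours, not the paper's released code; the function name, baseline numbers, and coefficient values are illustrative assumptions) of how such constant coefficients scale a baseline configuration:

```python
import math

def compound_scale(base_depth, base_width, base_resolution, alpha, beta, gamma, n):
    """Scale a baseline configuration for roughly 2^n times more compute:
    depth grows by alpha^n, width (channels) by beta^n, and input
    resolution by gamma^n, mirroring the scheme described above."""
    return (
        int(math.ceil(base_depth * alpha ** n)),        # number of layers
        int(math.ceil(base_width * beta ** n)),         # number of channels
        int(math.ceil(base_resolution * gamma ** n)),   # input image side length
    )

# Example values only; here alpha * beta^2 * gamma^2 is close to 2.
depth, width, resolution = compound_scale(
    base_depth=18, base_width=32, base_resolution=224,
    alpha=1.2, beta=1.1, gamma=1.15, n=2)
print(depth, width, resolution)  # -> 26 39 297
```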
Intuitively, the compound scaling method makes sense because if the input image is bigger, then the network needs more layers to increase the receptive field and more channels to capture more fine-grained patterns in the bigger image. In fact, previous theoretical (Raghu et al., 2017; Lu et al., 2018) and empirical results (Zagoruyko & Komodakis, 2016) both show that there is a certain relationship between network width and depth, but to the best of our knowledge, we are the first to empirically quantify the relationship among all three dimensions of network width, depth, and resolution.
We demonstrate that our scaling method works well on existing MobileNets (Howard et al., 2017; Sandler et al., 2018) and ResNet (He et al., 2016). Notably, the effectiveness of model scaling heavily depends on the baseline network; to go even further, we use neural architecture search (Zoph & Le, 2017; Tan et al., 2019) to develop a new baseline network, and scale it up to obtain a family of models, called EfficientNets. Figure 1 summarizes the ImageNet performance, where our EfficientNets significantly outperform other ConvNets. In particular, our EfficientNet-B7 surpasses the best existing GPipe accuracy (Huang et al., 2018), while using 8.4x fewer parameters and running 6.1x faster on inference. Compared to the widely used ResNet (He et al., 2016), our EfficientNet-B4 improves the top-1 accuracy from 76.3% for ResNet-50 to 82.6% with similar FLOPS. Besides ImageNet, EfficientNets also transfer well and achieve state-of-the-art accuracy on 5 out of 8 widely used datasets, while
reducing parameters by up to 21x compared to existing ConvNets.
2. Related Work

ConvNet Accuracy: Since AlexNet (Krizhevsky et al., 2012) won the 2012 ImageNet competition, ConvNets have become increasingly accurate by going bigger.
ConvNet Efficiency: Deep ConvNets are often over-parameterized. Model compression (Han et al., 2016; He et al., 2018; Yang et al., 2018) is a common way to reduce model size by trading accuracy for efficiency. As mobile phones become ubiquitous, it is also common to hand-craft efficient mobile-size ConvNets, such as SqueezeNets (Iandola et al., 2016; Gholami et al., 2018), MobileNets (Howard et al., 2017; Sandler et al., 2018), and ShuffleNets (Zhang et al., 2018; Ma et al., 2018). Recently, neural architecture search has become increasingly popular for designing efficient mobile-size ConvNets (Tan et al., 2019).
[Figure 3 plots: three panels of ImageNet top-1 accuracy (%) vs. FLOPS (billions) when scaling the baseline with different width coefficients w (up to w=5.0), depth coefficients d (up to d=8.0), and resolution coefficients r (up to r=2.5).]
Figure 3. Scaling Up a Baseline Model with Different Network Width (w), Depth (d), and Resolution (r) Coefficients. Bigger networks with larger width, depth, or resolution tend to achieve higher accuracy, but the accuracy gain quickly saturates after reaching 80%, demonstrating the limitation of single-dimension scaling. The baseline network is described in Table 1.
Width (w): Scaling network width is commonly used for small size models (Howard et al., 2017; Sandler et al., 2018; Tan et al., 2019)². As discussed in (Zagoruyko & Komodakis, 2016), wider networks tend to be able to capture more fine-grained features and are easier to train. However, extremely wide but shallow networks tend to have difficulties in capturing higher level features. Our empirical results in Figure 3 (left) show that the accuracy quickly saturates when networks become much wider with larger w.
Resolution (r): With higher resolution input images, ConvNets can potentially capture more fine-grained patterns. Starting from 224x224 in early ConvNets, modern ConvNets tend to use 299x299 (Szegedy et al., 2016) or 331x331 (Zoph et al., 2018) for better accuracy. Recently, GPipe (Huang et al., 2018) achieves state-of-the-art ImageNet accuracy with 480x480 resolution. Higher resolutions, such as 600x600, are also widely used in object detection ConvNets (He et al., 2017; Lin et al., 2017). Figure 3 (right) shows the results of scaling network resolutions: higher resolutions indeed improve accuracy, but the accuracy gain diminishes for very high resolutions (r = 1.0 denotes resolution 224x224 and r = 2.5 denotes resolution 560x560).
The above analyses lead us to the first observation:
Observation 1 – Scaling up any dimension of network width, depth, or resolution improves accuracy, but the accu- racy gain diminishes for bigger models.
3.3. Compound Scaling
We empirically observe that different scaling dimensions are not independent. Intuitively, for higher resolution images, we should increase network depth, such that the larger receptive fields can help capture similar features that include more pixels in bigger images. Correspondingly, we should also increase network width when resolution is higher, in order to capture more fine-grained patterns with more pixels in high resolution images.
²In some literature, scaling the number of channels is called the "depth multiplier", which means the same as our width coefficient w.
[Figure 4 plot: ImageNet top-1 accuracy (%) vs. FLOPS (billions) for width scaling under four baseline settings, ranging from (d=1.0, r=1.0) to (d=2.0, r=1.3).]
Figure 4. Scaling Network Width for Different Baseline Networks. Each dot on a line denotes a model with a different width coefficient (w). All baseline networks are from Table 1. The first baseline network (d=1.0, r=1.0) has 18 convolutional layers with resolution 224x224, while the last baseline (d=2.0, r=1.3) has 36 layers with resolution 299x299.
These intuitions suggest that we need to coordinate and balance different scaling dimensions rather than apply conventional single-dimension scaling. To validate our intuitions, we compare width scaling under different network depths and resolutions, as shown in Figure 4.
Observation 2 – In order to pursue better accuracy and efficiency, it is critical to balance all dimensions of network width, depth, and resolution during ConvNet scaling.
In fact, a few prior works (Zoph et al., 2018; Real et al., 2019) have already tried to arbitrarily balance network width and depth, but they all require tedious manual tuning.
In this paper, we propose a new compound scaling method, which uses a compound coefficient φ to uniformly scale network width, depth, and resolution in a principled way:
depth:      d = α^φ
width:      w = β^φ
resolution: r = γ^φ
s.t.  α · β² · γ² ≈ 2,   α ≥ 1, β ≥ 1, γ ≥ 1        (3)
where α, β, γ are constants that can be determined by a small grid search. Intuitively, φ is a user-specified coefficient that controls how many more resources are available for model scaling, while α, β, γ specify how to assign these extra resources to network width, depth, and resolution respectively. Notably, the FLOPS of a regular convolution op is proportional to d, w², r², i.e., doubling network depth will double FLOPS, but doubling network width or resolution will increase FLOPS by four times. Since convolution ops usually dominate the computation cost in ConvNets, scaling a ConvNet with Equation 3 will approximately increase total FLOPS by (α · β² · γ²)^φ. In this paper, we constrain α · β² · γ² ≈ 2 so that for any new φ, the total FLOPS will approximately³ increase by 2^φ.
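As a quick check of this FLOPS argument, the following sketch (our illustration, not from the paper; the coefficient values are merely examples satisfying the constraint) computes the multiplier (α · β² · γ²)^φ and compares it with 2^φ:

```python
def flops_multiplier(alpha, beta, gamma, phi):
    """Total FLOPS grow roughly as (alpha * beta^2 * gamma^2)^phi:
    depth scales FLOPS linearly, while width and resolution each scale
    it quadratically for a regular convolution op."""
    return (alpha * beta ** 2 * gamma ** 2) ** phi

alpha, beta, gamma = 1.2, 1.1, 1.15   # example values with alpha * beta^2 * gamma^2 ~= 1.92
for phi in (1, 2, 3):
    print(phi, round(flops_multiplier(alpha, beta, gamma, phi), 2), 2 ** phi)
    # prints roughly: 1 1.92 2 / 2 3.69 4 / 3 7.08 8
```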
4. EfficientNet Architecture

Since model scaling does not change the layer operators F̂i in the baseline network, having a good baseline network is also critical. We will evaluate our scaling method using existing ConvNets, but in order to better demonstrate the effectiveness of our scaling method, we have also developed a new mobile-size baseline, called EfficientNet.
Inspired by (Tan et al., 2019), we develop our baseline network by leveraging a multi-objective neural architecture search that optimizes both accuracy and FLOPS. Specifically, we use the same search space as (Tan et al., 2019), and use ACC(m) × [FLOPS(m)/T]^w as the optimization goal, where ACC(m) and FLOPS(m) denote the accuracy and FLOPS of model m, T is the target FLOPS, and w = -0.07 is a hyperparameter controlling the trade-off between accuracy and FLOPS. Unlike (Tan et al., 2019; Cai et al., 2019), here we optimize FLOPS rather than latency since we are not targeting any specific hardware device. Our search produces an efficient network, which we name EfficientNet-B0. Since we use the same search space as (Tan et al., 2019), the architecture is similar to MnasNet.
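As an illustration of that objective (a sketch, not the authors' search code; only T = 400M FLOPS and w = -0.07 come from the text, and the function name is an assumption):

```python
TARGET_FLOPS = 400e6   # T: the FLOPS target mentioned in the text
TRADEOFF_W = -0.07     # w: accuracy/FLOPS trade-off exponent

def search_reward(accuracy, flops, target=TARGET_FLOPS, w=TRADEOFF_W):
    """Multi-objective reward ACC(m) x [FLOPS(m)/T]^w.
    With a negative w, models above the FLOPS target are penalized and
    cheaper models receive a mild bonus."""
    return accuracy * (flops / target) ** w

# A 76%-accurate model at exactly 400M FLOPS scores 0.760; the same
# accuracy at 800M FLOPS scores about 0.760 * 2 ** -0.07 ~= 0.724.
print(round(search_reward(0.76, 400e6), 3), round(search_reward(0.76, 800e6), 3))
```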
³FLOPS may differ from the theoretical value due to rounding.
Table 1. EfficientNet-B0 baseline network – each row describes a stage i with L̂i layers, input resolution ⟨Ĥi, Ŵi⟩, and output channels Ĉi. Notation is adopted from Equation 2.

Stage i   Operator F̂i              Resolution Ĥi × Ŵi   #Channels Ĉi   #Layers L̂i
1         Conv3x3                   224 × 224            32             1
2         MBConv1, k3x3             112 × 112            16             1
3         MBConv6, k3x3             112 × 112             24             2
4         MBConv6, k5x5             56 × 56              40             2
5         MBConv6, k3x3             28 × 28              80             3
6         MBConv6, k5x5             28 × 28              112            3
7         MBConv6, k5x5             14 × 14              192            4
8         MBConv6, k3x3             7 × 7                320            1
9         Conv1x1 & Pooling & FC    7 × 7                1280           1
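The stage layout in Table 1 can be written down as a small configuration structure; the sketch below (field and variable names are ours, not the released implementation) simply transcribes the table:

```python
from dataclasses import dataclass

@dataclass
class Stage:
    operator: str     # F_i: block type
    resolution: int   # H_i = W_i: input spatial size
    channels: int     # C_i: output channels
    layers: int       # L_i: number of repeated layers

# EfficientNet-B0 stages, transcribed from Table 1.
EFFICIENTNET_B0 = [
    Stage("Conv3x3",               224, 32,   1),
    Stage("MBConv1, k3x3",         112, 16,   1),
    Stage("MBConv6, k3x3",         112, 24,   2),
    Stage("MBConv6, k5x5",          56, 40,   2),
    Stage("MBConv6, k3x3",          28, 80,   3),
    Stage("MBConv6, k5x5",          28, 112,  3),
    Stage("MBConv6, k5x5",          14, 192,  4),
    Stage("MBConv6, k3x3",           7, 320,  1),
    Stage("Conv1x1 & Pooling & FC",  7, 1280, 1),
]
```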
Our EfficientNet-B0 is slightly bigger than MnasNet due to the larger FLOPS target (our FLOPS target is 400M). Table 1 shows the architecture of EfficientNet-B0. Its main building block is the mobile inverted bottleneck MBConv (Sandler et al., 2018; Tan et al., 2019), to which we also add squeeze-and-excitation optimization (Hu et al., 2018).

Starting from the baseline EfficientNet-B0, we apply our compound scaling method to scale it up in two steps:

STEP 1: we first fix φ = 1, assuming twice more resources are available, and do a small grid search of α, β, γ based on Equation 3. In particular, we find the best values for EfficientNet-B0 are α = 1.2, β = 1.1, γ = 1.15, under the constraint α · β² · γ² ≈ 2.

STEP 2: we then fix α, β, γ as constants and scale up the baseline network with different φ using Equation 3, to obtain EfficientNet-B1 through B7.
Notably, it is possible to achieve even better performance by searching for α, β, γ directly around a large model, but the search cost becomes prohibitively more expensive on larger models. Our method solves this issue by only doing the search once on the small baseline network (step 1), and then using the same scaling coefficients for all other models (step 2).
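A minimal sketch of the step-1 grid search described above (our own illustration; the candidate grid, tolerance, and the evaluate callback are assumptions, not details given in the paper):

```python
import itertools

def grid_search_coefficients(evaluate, step=0.05, tol=0.05):
    """Step 1: with phi = 1 (about 2x resources), try (alpha, beta, gamma)
    combinations that satisfy alpha * beta^2 * gamma^2 ~= 2, and keep the
    triple whose scaled baseline reaches the best accuracy.

    `evaluate(alpha, beta, gamma)` is assumed to train the scaled baseline
    and return its validation accuracy."""
    candidates = [round(1.0 + i * step, 2) for i in range(11)]   # 1.00 .. 1.50
    best, best_acc = None, -1.0
    for alpha, beta, gamma in itertools.product(candidates, repeat=3):
        if abs(alpha * beta ** 2 * gamma ** 2 - 2.0) > tol:
            continue                                  # enforce the FLOPS constraint
        acc = evaluate(alpha, beta, gamma)            # accuracy of the scaled baseline
        if acc > best_acc:
            best, best_acc = (alpha, beta, gamma), acc
    return best

# Step 2 then fixes the returned (alpha, beta, gamma) and scales the baseline
# with larger phi to obtain the B1..B7 family.
```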
5. Experiments

In this section, we will first evaluate our scaling method on existing ConvNets and then on the newly proposed EfficientNets.
5.1. Scaling Up MobileNets and ResNets

As a proof of concept, we first apply our scaling method to the widely used MobileNets (Howard et al., 2017; Sandler et al., 2018) and ResNet (He et al., 2016). Table 3 shows the ImageNet results of scaling them in different ways. Compared to single-dimension scaling methods, our compound scaling method improves the accuracy on all these models, suggesting the effectiveness of our proposed scaling method for general existing ConvNets.
Table 5. EfficientNet Performance Results on Transfer Learning Datasets. Our scaled EfficientNet models achieve new state-of-the-art accuracy for 5 out of 8 datasets, with 9.6x fewer parameters on average.
Comparison to best publicly available results:
Dataset            Model          Acc.    #Param   Our Model         Acc.    #Param (ratio)
CIFAR-10           NASNet-A       98.0%   85M      EfficientNet-B0   98.1%   4M (21x)
CIFAR-100          NASNet-A       87.5%   85M      EfficientNet-B0   88.1%   4M (21x)
Birdsnap           Inception-v4   81.8%   41M      EfficientNet-B5   82.0%   28M (1.5x)
Stanford Cars      Inception-v4   93.4%   41M      EfficientNet-B3   93.6%   10M (4.1x)
Flowers            Inception-v4   98.5%   41M      EfficientNet-B5   98.5%   28M (1.5x)
FGVC Aircraft      Inception-v4   90.9%   41M      EfficientNet-B3   90.7%   10M (4.1x)
Oxford-IIIT Pets   ResNet-152     94.5%   58M      EfficientNet-B4   94.8%   17M (5.6x)
Food-101           Inception-v4   90.8%   41M      EfficientNet-B4   91.5%   17M (2.4x)
Geo-Mean                                                                     (4.7x)

Comparison to best reported results:
Dataset            Model    Acc.    #Param   Our Model         Acc.    #Param (ratio)
CIFAR-10           †GPipe   99.0%   556M     EfficientNet-B7   98.9%   64M (8.7x)
CIFAR-100          GPipe    91.3%   556M     EfficientNet-B7   91.7%   64M (8.7x)
Birdsnap           GPipe    83.6%   556M     EfficientNet-B7   84.3%   64M (8.7x)
Stanford Cars      ‡DAT     94.8%   -        EfficientNet-B7   94.7%   -
Flowers            DAT      97.7%   -        EfficientNet-B7   98.8%   -
FGVC Aircraft      DAT      92.9%   -        EfficientNet-B7   92.9%   -
Oxford-IIIT Pets   GPipe    95.9%   556M     EfficientNet-B6   95.4%   41M (14x)
Food-101           GPipe    93.0%   556M     EfficientNet-B7   93.0%   64M (8.7x)
Geo-Mean                                                               (9.6x)

†GPipe (Huang et al., 2018) trains giant models with a specialized pipeline parallelism library. ‡DAT denotes domain adaptive transfer learning (Ngiam et al., 2018). Here we only compare ImageNet-based transfer learning results. Transfer accuracy and #params for NASNet (Zoph et al., 2018), Inception-v4 (Szegedy et al., 2017), and ResNet-152 (He et al., 2016) are from (Kornblith et al., 2019).
[Figure 6 plots: eight panels of transfer accuracy (%) vs. number of parameters (millions, log scale) on CIFAR-10, CIFAR-100, Birdsnap, Stanford Cars, Flowers, FGVC Aircraft, Oxford-IIIT Pets, and Food-101, comparing DenseNet, GPipe, Inception-ResNet-v2, Inception-v3/v4, ResNet, NASNet-A, and EfficientNet.]
Figure 6. Model Parameters vs. Transfer Learning Accuracy – All models are pretrained on ImageNet and finetuned on new datasets.
5.2. ImageNet Results for EfficientNet

We train our EfficientNet models on ImageNet with weight decay 1e-5 and an initial learning rate of 0.256 that decays by 0.97 every 2.4 epochs. We also use the swish activation (Ramachandran et al., 2018; Elfwing et al., 2018), a fixed AutoAugment policy (Cubuk et al., 2019), and stochastic depth (Huang et al., 2016) with drop connect ratio 0.3. Since bigger models need more regularization, we linearly increase the dropout (Srivastava et al., 2014) ratio from 0.2 for EfficientNet-B0 to 0.5 for EfficientNet-B7.
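For reference, the hyperparameters stated above can be collected into a single configuration; the sketch below only restates values given in the text (the dictionary layout and the per-model dropout interpolation helper are our own assumptions):

```python
TRAIN_CONFIG = {
    "weight_decay": 1e-5,
    "initial_learning_rate": 0.256,
    "lr_decay_factor": 0.97,    # multiply the learning rate by 0.97...
    "lr_decay_epochs": 2.4,     # ...every 2.4 epochs
    "activation": "swish",
    "autoaugment": "fixed policy",
    "stochastic_depth_drop_connect": 0.3,
}

def dropout_rate(model_index, low=0.2, high=0.5, num_models=8):
    """Linearly interpolate dropout from 0.2 (B0) to 0.5 (B7)."""
    return low + (high - low) * model_index / (num_models - 1)

print(round(dropout_rate(0), 2), round(dropout_rate(7), 2))  # 0.2 for B0, 0.5 for B7
```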
Table 2 shows the performance of all EfficientNet models that are scaled from the same baseline EfficientNet-B0. Our EfficientNet models generally use an order of magnitude fewer parameters and FLOPS than other ConvNets with similar accuracy. In particular, our EfficientNet-B7 achieves 84.4% top-1 / 97.1% top-5 accuracy with 66M parameters and 37B FLOPS, being more accurate but 8.4x smaller than the previous best GPipe (Huang et al., 2018).
Figure 1 and Figure 5 illustrate the parameters-accuracy and FLOPS-accuracy curves for representative ConvNets, where our scaled EfficientNet models achieve better accuracy with far fewer parameters and FLOPS than other ConvNets. Notably, our EfficientNet models are not only small, but also computationally cheaper. For example, our EfficientNet-B3 achieves higher accuracy than ResNeXt-101 (Xie et al., 2017) using 18x fewer FLOPS. To validate the computational cost, we have also measured the inference latency for a few representative ConvNets on a real CPU, as shown in Table 4, where we report average latency over 20 runs. Our EfficientNet-B1 runs 5.7x faster than the widely used ResNet-152 (He et al., 2016), while EfficientNet-B7 runs about 6.1x faster than GPipe (Huang et al., 2018), suggesting our EfficientNets are indeed fast on real hardware.
[Figure 7 panels: class activation maps for two ImageNet images ("bakeshop" and "maze"), with columns for the original image, the baseline model, deeper (d=4), wider (w=2), higher resolution (r=2), and compound scaling.]
Figure 7. Class Activation Map (CAM) (Zhou et al., 2016) for Different Models in Table 7 - Our compound scaling method allows the scaled model (last column) to focus on more relevant regions with more object details. Model details are in Table 7.
Table 6. Transfer Learning Datasets.
Dataset                                    Train Size   Test Size   #Classes
CIFAR-10 (Krizhevsky & Hinton, 2009)       50,000       10,000      10
CIFAR-100 (Krizhevsky & Hinton, 2009)      50,000       10,000      100
Birdsnap (Berg et al., 2014)               47,386       2,443       500
Stanford Cars (Krause et al., 2013)        8,144        8,041       196
Flowers (Nilsback & Zisserman, 2008)       2,040        6,149       102
FGVC Aircraft (Maji et al., 2013)          6,667        3,333       100
Oxford-IIIT Pets (Parkhi et al., 2012)     3,680        3,369       37
Food-101 (Bossard et al., 2014)            75,750       25,250      101
5.3. Transfer Learning Results for EfficientNet
We have also evaluated our EfficientNet on a list of commonly used transfer learning datasets, as shown in Table 6.
Table 5 shows the transfer learning performance: (1) Compared to publicly available models, such as NASNet-A (Zoph et al., 2018) and Inception-v4 (Szegedy et al., 2017), our EfficientNet models achieve better accuracy with 4.7x average (up to 21x) parameter reduction. (2) Compared to state-of-the-art models, including DAT (Ngiam et al., 2018) that dynamically synthesizes training data and GPipe (Huang et al., 2018) that is trained with specialized pipeline parallelism, our EfficientNet models still surpass their accuracy on 5 out of 8 datasets, while using 9.6x fewer parameters.
Figure 6 compares the accuracy-parameters curve for a variety of models. In general, our EfficientNets consistently achieve better accuracy with an order of magnitude fewer parameters than existing models, including ResNet (He et al., 2016), DenseNet (Huang et al., 2017), Inception (Szegedy et al., 2017), and NASNet (Zoph et al., 2018).
To disentangle the contribution of our proposed scaling method from the EfficientNet architecture, Figure 8 compares the ImageNet performance of different scaling methods applied to the same EfficientNet-B0 baseline network.
[Figure 8 plot: ImageNet top-1 accuracy (%) vs. FLOPS (billions) for EfficientNet-B0 scaled by width, by depth, by resolution, and by compound scaling.]
Figure 8. Scaling Up EfficientNet-B0 with Different Methods.
Table 7. Scaled Models Used in Figure 7.
Model                                    FLOPS   Top-1 Acc.
Baseline model (EfficientNet-B0)         0.4B    76.3%
Scale model by depth (d=4)               1.8B    79.0%
Scale model by width (w=2)               1.8B    78.9%
Scale model by resolution (r=2)          1.9B    79.1%
Compound scale (d=1.4, w=1.2, r=1.3)     1.8B    81.1%
In general, all scaling methods improve accuracy at the cost of more FLOPS, but our compound scaling method further improves accuracy, by up to 2.5%, over the single-dimension scaling methods, suggesting the importance of our proposed compound scaling.

In order to further understand why our compound scaling method is better than the others, Figure 7 compares the class activation maps (Zhou et al., 2016) for a few representative models with different scaling methods. All these models are scaled from the same baseline, and their statistics are shown in Table 7. Images are randomly picked from the ImageNet validation set. As shown in the figure, the model with compound scaling tends to focus on more relevant regions with more object details, while the other models either lack object details or fail to capture all objects in the images.
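For readers unfamiliar with CAM, the computation cited above (Zhou et al., 2016) reduces to weighting the last convolutional feature maps by the classifier weights of the chosen class; a generic NumPy sketch (not the paper's visualization code, and the array shapes are assumptions) looks like this:

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_index):
    """CAM (Zhou et al., 2016): sum the last conv feature maps weighted by
    the final linear layer's weights for the chosen class.

    feature_maps: (H, W, C) activations before global average pooling.
    fc_weights:   (C, num_classes) weights of the final linear layer.
    """
    weights = fc_weights[:, class_index]                          # (C,)
    cam = np.tensordot(feature_maps, weights, axes=([2], [0]))    # (H, W)
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()           # normalize to [0, 1] for display
    return cam

# Random stand-ins for real activations and classifier weights.
cam = class_activation_map(np.random.rand(7, 7, 320), np.random.rand(320, 1000), 42)
print(cam.shape)  # (7, 7) heat map, usually upsampled to the image size
```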
References

Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. Feature pyramid networks for object detection. CVPR, 2017.
Liu, C., Zoph, B., Shlens, J., Hua, W., Li, L.-J., Fei-Fei, L., Yuille, A., Huang, J., and Murphy, K. Progressive neural architecture search. ECCV, 2018.
Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. The expressive power of neural networks: A view from the width. NeurIPS, 2018.
Ma, N., Zhang, X., Zheng, H.-T., and Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. ECCV, 2018.
Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., and van der Maaten, L. Exploring the limits of weakly supervised pretraining. arXiv preprint arXiv:1805.00932, 2018.
Maji, S., Rahtu, E., Kannala, J., Blaschko, M., and Vedaldi, A. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
Ngiam, J., Peng, D., Vasudevan, V., Kornblith, S., Le, Q. V., and Pang, R. Domain adaptive transfer learning with specialist models. arXiv preprint arXiv:1811.07056, 2018.

Nilsback, M.-E. and Zisserman, A. Automated flower classification over a large number of classes. ICVGIP, pp. 722–729, 2008.
Parkhi, O. M., Vedaldi, A., Zisserman, A., and Jawahar, C. Cats and dogs. CVPR, pp. 3498–3505, 2012.
Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J. On the expressive power of deep neural networks. ICML, 2017.

Ramachandran, P., Zoph, B., and Le, Q. V. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2018.

Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. AAAI, 2019.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. CVPR, 2018.
Sharir, O. and Shashua, A. On the expressive power of overlapping architectures of deep learning. ICLR, 2018.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. CVPR, pp. 1–9, 2015.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. CVPR, pp. 2818–2826, 2016.
Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. A. Inception-v4, inception-resnet and the impact of residual connections on learning. AAAI, 4:12, 2017.
Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., and Le, Q. V. MnasNet: Platform-aware neural architecture search for mobile. CVPR, 2019.
Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. CVPR, pp. 5987–5995, 2017.

Yang, T.-J., Howard, A., Chen, B., Zhang, X., Go, A., Sze, V., and Adam, H. Netadapt: Platform-aware neural network adaptation for mobile applications. ECCV, 2018.
Zagoruyko, S. and Komodakis, N. Wide residual networks. BMVC, 2016.
Zhang, X., Li, Z., Loy, C. C., and Lin, D. Polynet: A pursuit of structural diversity in very deep networks. CVPR, pp. 3900–3908, 2017.
Zhang, X., Zhou, X., Lin, M., and Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. CVPR, 2018.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. CVPR, pp. 2921–2929, 2016.
Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. ICLR, 2017.
Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. CVPR, 2018.