This tutorial is based on Lecture 4 of the fast.ai course Deep Learning for Coders. I will go through, step by step, how to build an image classifier using fastai and PyTorch.

Image Classification

from fastai.vision.all import *  # provides untar_data, URLs, DataBlock, cnn_learner, etc.
path = untar_data(URLs.PETS)

Why are we using BASE_PATH?

We want our data paths displayed nicely, relative to the dataset path. Look at path.ls():

Path.BASE_PATH = path
path.ls()
(#4) [Path('annotations'),Path('images'),Path('models'),Path('crappy')]
(path/"images").ls()
(#7394) [Path('images/Sphynx_245.jpg'),Path('images/miniature_pinscher_55.jpg'),Path('images/havanese_20.jpg'),Path('images/miniature_pinscher_34.jpg'),Path('images/samoyed_91.jpg'),Path('images/chihuahua_123.jpg'),Path('images/yorkshire_terrier_155.jpg'),Path('images/Egyptian_Mau_79.jpg'),Path('images/scottish_terrier_23.jpg'),Path('images/basset_hound_198.jpg')...]

Remember

Most of the functions we use in fastai return objects of class "L" rather than a plain Python list. L is an enhanced list: it shows the number of items, and extra items are denoted by "...".
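As a quick, hedged illustration of how L behaves (not from the lecture; the exact printed items depend on what you pass in):

from fastcore.foundation import L   # L ships with fastcore, which fastai imports for you

xs = L(range(20))
print(xs)              # (#20) [0,1,2,3,4,5,6,7,8,9...] -- shows the count and truncates long lists
print(xs[0], len(xs))  # indexing and len() work just like a normal Python list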

Last time, if the first letter of the file name was a capital it was a cat, otherwise it was a dog.

Here our case is different.

Regular expressions help us extract the labels.

Please look up Python's re module if you haven't gone through it before.

The fast.ai NLP course has a couple of lessons on regex.

Regex can be a bit hard to grasp sometimes.

Let's pick a file name and see what it looks like.

fname = (path/"images").ls()[0]
fname.name
'Sphynx_245.jpg'

A Little Experiment with re (Regular Expressions)

  1. re is Python's regular-expression module.
  2. findall grabs all the parts of the string matched by the pieces of the regular expression that have parentheses around them (the capture groups).
  3. The r prefix makes a raw string, which tells Python not to treat backslashes specially (remember that in an ordinary Python string "\n" means a newline).
  4. r'(.+)_\d+.jpg$' means: "(.+)" captures any character repeated one or more times, followed by an underscore "_", then "\d+" one or more digits, then "." any single character, then "jpg", and finally "$" marks the end of the string.
re.findall(r'(.+)_\d+.jpg$',fname.name)
['Sphynx']

DataBlock

  1. blocks expects the types of the independent and dependent variables (here ImageBlock and CategoryBlock).
  2. get_items gets the image files.
  3. splitter randomly splits the data into training and validation sets.
  4. get_y uses using_attr, which takes the RegexLabeller function and applies it to the "name" attribute of each path.
  5. aug_transforms, which we saw in the lesson 2 section on augmentation, generates synthetic variations of each image.
  6. Resize first to a very large image (460 pixels), then aug_transforms produces the smaller final size (the full DataBlock call appears further below).

Why?

This is called presizing.

The detailed steps are below:

  1. First, Resize grabs a square crop at the large size: if the image is portrait it grabs the full width and a random vertical position (and the full height with a random horizontal position for landscape images).
  2. Second, aug_transforms takes a random warped crop, possibly rotated (rotation, warping, zooming), and turns it into a square at the smaller size of 224 by 224.

Note: the first step, turning the image into a square, happens per item on the CPU, but the second step can happen on the GPU. Normally operations like rotating and cropping are pretty slow.

Rotation, warping and zooming are actually destructive to the image, because each one requires an interpolation step, which is not just slow but also lowers the image quality.

What's unique in fastai

We keep track of the changing coordinate values in a non-lossy way, as full floating-point values, and only once at the very end do we do the interpolation.

Look at the teddy bears:

Left: the presizing approach. Right: the traditional approach using other Python libraries.

There are some weird things in the right-hand image. Flaws:

  1. Less nicely focused
  2. Artifacts in the grass
  3. Distortion on the sides of the legs

Presizing in Detail

We need our images to have the same dimensions, so that they can collate into tensors to be passed to the GPU. We also want to minimize the number of distinct augmentation computations we perform. The performance requirement suggests that we should, where possible, compose our augmentation transforms into fewer transforms (to reduce the number of computations and the number of lossy operations) and transform the images into uniform sizes (for more efficient processing on the GPU).

The challenge is that, if performed after resizing down to the augmented size, various common data augmentation transforms might introduce spurious empty zones, degrade data, or both. For instance, rotating an image by 45 degrees fills corner regions of the new bounds with emptiness, which will not teach the model anything. Many rotation and zooming operations will require interpolating to create pixels. These interpolated pixels are derived from the original image data but are still of lower quality.

To work around these challenges, presizing adopts two strategies, shown in the figure below:

  1. Resize images to relatively "large" dimensions—that is, dimensions significantly larger than the target training dimensions.
  2. Compose all of the common augmentation operations (including a resize to the final target size) into one, and perform the combined operation on the GPU only once at the end of processing, rather than performing the operations individually and interpolating multiple times.

The first step, the resize, creates images large enough that they have spare margin to allow further augmentation transforms on their inner regions without creating empty zones. This transformation works by resizing to a square, using a large crop size. On the training set, the crop area is chosen randomly, and the size of the crop is selected to cover the entire width or height of the image, whichever is smaller.

In the second step, the GPU is used for all data augmentation, and all of the potentially destructive operations are done together, with a single interpolation at the end.

<img alt="Presizing on the training set" width="600" caption="Presizing on the training set" id="presizing" src="images/att_00060.png">

This picture shows the two steps:

  1. Crop full width or height: This is in item_tfms, so it's applied to each individual image before it is copied to the GPU. It's used to ensure all images are the same size. On the training set, the crop area is chosen randomly. On the validation set, the center square of the image is always chosen.
  2. Random crop and augment: This is in batch_tfms, so it's applied to a batch all at once on the GPU, which means it's fast. On the validation set, only the resize to the final size needed for the model is done here. On the training set, the random crop and any other augmentations are done first.

To implement this process in fastai you use Resize as an item transform with a large size, and RandomResizedCrop as a batch transform with a smaller size. RandomResizedCrop will be added for you if you include the min_scale parameter in your aug_transforms function, as was done in the DataBlock call in the previous section. Alternatively, you can use pad or squish instead of crop (the default) for the initial Resize.
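As a small, hedged sketch (argument names follow the fastai docs as I recall them), the alternative initial Resize methods look roughly like this:

item_tfms = Resize(460)                                      # default: crop
item_tfms = Resize(460, ResizeMethod.Squish)                 # squish the whole image into a square
item_tfms = Resize(460, ResizeMethod.Pad, pad_mode='zeros')  # pad to a square with black borders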

The teddy bear comparison shown earlier illustrates the difference between an image that has been zoomed, interpolated, rotated, and then interpolated again (the approach used by all other deep learning libraries), shown on the right, and an image that has been zoomed and rotated as one operation and then interpolated just once (the fastai approach), shown on the left.

pets = DataBlock( blocks =(ImageBlock, CategoryBlock),
                get_items = get_image_files,
                splitter= RandomSplitter(seed=42),
                get_y = using_attr(RegexLabeller(r'(.+)_\d+.jpg$'),'name'),
                item_tfms=Resize(460),
                batch_tfms=aug_transforms(size=224, min_scale=0.75))
dls = pets.dataloaders(path/"images")

Let's Debug the DataLoaders

show_batch displays the items of a mini-batch, so you can check that the data was loaded properly.

dls.show_batch(nrows=1,ncols=3)

Let's Debug the Augmentations

Pass unique=True to show the same image repeated with different augmentations applied.

dls.show_batch(nrows=1,unique=True,ncols=3)
learn = cnn_learner(dls,resnet34, metrics= error_rate)
learn.fine_tune(2)
epoch train_loss valid_loss error_rate time
0 1.530299 0.348531 0.112991 00:42
epoch train_loss valid_loss error_rate time
0 0.494089 0.284470 0.086604 00:52
1 0.339563 0.229090 0.071042 00:53

Model Interpretation

Having trained the model, let's interpret it.

We will use a confusion matrix.

The diagonal shows the images that were classified correctly.

Let's see, for example, how often Siamese is misclassified.

When you have a lot of classes, the confusion matrix becomes hard to read.

Switch to most_confused instead, which picks out the cells with the biggest numbers.

That will tell you which pairs of classes are confused most often.

interpret = ClassificationInterpretation.from_learner(learn)
interpret.plot_confusion_matrix(figsize =(12,12),dpi=20)

Most Confused

Here you can see that 'american_pit_bull_terrier' and 'staffordshire_bull_terrier' are confused 10 times.

I am not a dog and cat expert.

Google it: I found that the two breeds look very similar.

When your model makes the same mistakes a human would make, that's a good sign.

interpret.most_confused(min_val=5)
[('american_pit_bull_terrier', 'staffordshire_bull_terrier', 10),
 ('beagle', 'basset_hound', 7),
 ('Bengal', 'Egyptian_Mau', 6),
 ('staffordshire_bull_terrier', 'american_pit_bull_terrier', 5)]

Improve Model

We will try a couple of different techniques:

  1. Improving the learning rate

  2. Unfreezing and transfer learning

Fine-Tuning the Learning Rate

Last time, if you look back, we used the default learning rate (1e-2).

Unfortunately, when we use a high learning rate we get a higher error.

Let's run it and compare.

Why is this?

See lesson 4, where we discussed how too high a learning rate makes the loss jump around.

Leslie Smith's research will help us find a good learning rate.

Remember: when we do SGD we look at one mini-batch at a time (a batch of images in this case), find the gradient for that mini-batch, and then take a step based on the learning rate and the gradient.

Leslie Smith's idea: start the first mini-batch with a very low learning rate, then increase it a little, say 25 percent higher, and take another step; then 25 percent higher again, and another step, and so on. There are no multiple epochs here, just a single pass over successive mini-batches, and then we can plot a chart.

Look at the plot: at first the steps are very small and the loss barely changes; gradually we reach the point where the steps are big enough to make a difference and the loss comes down, until we get to the point where the learning rate is too high.

We want a point on the steep, downward part of the curve.

In our case, a good range is (minimum point / 10, steepest point).
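Here is a conceptual sketch of that loop, not fastai's actual implementation; it assumes you already have a model, loss_fn, and train_dl defined:

import torch

# Conceptual learning-rate-finder loop (assumed names: model, loss_fn, train_dl).
lr, lrs, losses = 1e-7, [], []
opt = torch.optim.SGD(model.parameters(), lr=lr)
for xb, yb in train_dl:
    loss = loss_fn(model(xb), yb)
    loss.backward()
    opt.step(); opt.zero_grad()
    lrs.append(lr); losses.append(loss.item())
    lr *= 1.25                                # raise the learning rate ~25% each mini-batch
    for g in opt.param_groups: g['lr'] = lr
    if loss.item() > 4 * min(losses): break   # stop once the loss blows up
# Plotting losses against lrs (log x-axis) reproduces the chart described above.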

Question: Is the learning rate plot computed against just one mini-batch?

No, it's not. It is the standard walk through the dataloader, and it is actually training; the only change is that the learning rate is tweaked after each mini-batch.

Question: Is the network reset to its initial state after each trial?

Certainly not: we want to see how it changes as we go. What we are seeing is actual learning happening while the learning rate increases.

Question: Why would an ideal learning rate found with a single pass at the start of training keep being a good learning rate even after several epochs of further loss reduction?

It absolutely would not.

Question: Why the steepest point?

Because the minimum is a point where we won't learn anymore; the weights will just oscillate.

learn = cnn_learner(dls,resnet34, lr=1e-1,metrics= error_rate)
learn.fine_tune(2)
epoch train_loss valid_loss error_rate time
0 1.559530 0.325268 0.107578 00:44
epoch train_loss valid_loss error_rate time
0 0.509674 0.300865 0.100135 00:54
1 0.338566 0.227000 0.077808 00:54
learn = cnn_learner(dls,resnet34,metrics= error_rate)
lr_min, lr_steep=learn.lr_find(suggest_funcs=(minimum, steep))
print(f'Minimum/10 : {lr_min:.2e} , "steepest point":{lr_steep:.2e}')
Minimum/10 : 8.32e-03 , "steepest point":4.37e-03

Let's Choose a Learning Rate

Each time you run it you get slightly different values, so you can pick something like 3e-3.

The learning rate finder plot is on a logarithmic scale.

Now let's experiment.

Summary

It just shows how simple the idea is.

Part of the reason it took a while to appear is that engineers love using lots of computers: they like to run lots of experiments on big clusters to find out which learning rate is best, rather than looking one mini-batch at a time.

fastai was the first library to include a learning rate finder.

learn= cnn_learner(dls,resnet34,metrics=error_rate, lr=3e-3)
learn.fit_one_cycle(3)
epoch train_loss valid_loss error_rate time
0 1.104831 0.339296 0.116373 00:45
1 0.526981 0.215112 0.066982 00:45
2 0.324713 0.208224 0.069689 00:45

Unfreezing

What's Inside Transfer Learning

Let's remind ourselves what transfer learning does.

A model is a bunch of linear layers with activation functions between them, usually ReLU.

Each of those linear layers has a bunch of parameters. After training on ImageNet, those parameters are not random anymore. We also saw that early layers learn general features like edges, while later layers learn sophisticated features like eyes and so on.

We throw away the last layer, because the last layer is the bit that specifically says which of the 1,000 categories (in the case of ImageNet) an image belongs to. We replace it with random weights, sometimes more than one layer, and train that.
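A hedged way to see this in fastai (the exact module layout may vary between versions): a cnn_learner model is a Sequential made of the pretrained body followed by the new head.

learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.model[0]   # pretrained body: ImageNet weights; early layers detect edges and gradients
learn.model[1]   # new head: randomly initialised, ending in a Linear layer with dls.c outputs (one per breed)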

Discussion between Rachel and Jeremy Howard:

I think the learning rate finder, once you learn about it, seems like an obvious approximate method that would work.

I have noticed that a lot of my students at USF have a tendency to jump in and try to account for every possible imperfection from the start, and that is very rarely needed. One of the cool things is to try the easier things first.

This was a big innovation and it is super helpful, yet researchers are ignoring it; a lot of people don't know about the learning rate finder.

Okay, so back to transfer learning: a pretrained network comes with its architecture. Now we want to fine-tune it, in our case for breed classification.

Please run this:

learn.fine_tune??

Steps Inside fine_tune

  1. freeze: only the last layer's weights get stepped.
  2. freeze_epochs is 1 by default.
  3. fit: train the randomly added weights (all layers except the last are pretrained).
  4. Divide the base learning rate by 2.
  5. unfreeze: now all parameters get stepped and gradients are calculated for all of them.
  6. Then we fit for some more epochs.

A rough sketch of these steps is shown below.
We can do it by hand: cnn_learner freezes the model for us by default, so we can just call fit_one_cycle and train the last layer.

 
learn = cnn_learner(dls,resnet34,metrics = error_rate)
learn.fit_one_cycle(3,3e-3)
epoch train_loss valid_loss error_rate time
0 1.157931 0.296587 0.100812 00:44
1 0.548078 0.263024 0.080514 00:44
2 0.327115 0.235120 0.072395 00:44

Recheck the Learning Rate

Now let's unfreeze.

Run lr_find again; you will not see the same rapid drop as before, because the model is already partly trained.

Now we are improving: the error rate comes down to about 0.06 (see the results below).

Can we do better?

  1. At this point we are training the whole model with the same learning rate (1e-5), which doesn't make sense, because the last layer is still not that great: it started at random and has only had three epochs of training, so it probably needs more work.
  2. We know the later layers were probably specialised to ImageNet rather than pet breeds, so we need to train them more, while the early layers (gradients and edges) don't need to change much. What we would like is a small learning rate for the early layers and a bigger one for the later layers.
learn.unfreeze()
learn.lr_find(suggest_funcs=(minimum, steep))
SuggestedLRs(minimum=1.318256749982538e-07, steep=1.0964781722577754e-06)
learn.fit_one_cycle(6,lr_max=1e-5)
epoch train_loss valid_loss error_rate time
0 0.263316 0.216748 0.070365 00:52
1 0.240422 0.206543 0.067659 00:54
2 0.215897 0.193572 0.061570 00:54
3 0.200803 0.197216 0.068336 00:55
4 0.193693 0.193816 0.062923 00:55
5 0.193264 0.192747 0.062246 00:56

Let's Learn About slice

This is something called discriminative learning rates.

Jason Yosinski's research supports this idea.

So we pass a slice as the learning rate.

e.g. slice(1e-6, 1e-4)

The very first layers get 1e-6 and the last layers get 1e-4; the layer groups in between get rates that are equal multiples apart, so they are equally spaced on a log scale (see the sketch below).
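For intuition, here is a hedged sketch (fastai computes the per-group values internally; the choice of 3 groups is just an assumption for illustration):

import numpy as np

# Hypothetical example: 3 parameter groups spread between 1e-6 and 1e-4 on a log scale.
print(np.logspace(np.log10(1e-6), np.log10(1e-4), 3))
# [1.e-06 1.e-05 1.e-04] -- each group's learning rate is 10x the previous one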

We overshoot a bit here (as we will see in the results below).

How fit_one_cycle Works

It is a bit different: it starts at a low learning rate and gradually increases it for roughly the first 1/3 of the batches, until it reaches the highest learning rate, which is why the parameter is called lr_max.

For the remaining 2/3 or so of the batches it gradually decreases the learning rate again. The reason is simply that, empirically, researchers found this works best. It was developed, again, by Leslie Smith, the same person who created the learning rate finder, and it dramatically increased the speed at which we can train networks.
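If you want to see this schedule for yourself after a training run, a hedged one-liner (assuming fastai's Recorder callback is active, as it is by default):

learn.recorder.plot_sched()   # plots the learning rate (and momentum) schedule from the last fit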

The academic community largely ignored it; in fact, the key publication that developed the idea was not even peer reviewed. The reason I mention this now is that we cannot always rely on the peer-reviewed literature to tell us what works best in practice.

So we don't just want to go back and pick the model that was saved at that earlier epoch, because we could probably do better: we really want a model whose final epochs were trained at a low learning rate.

So I would change the number of epochs to 8, because at around epoch 8 we get the best result and after that it gets worse (see the results below).

learn = cnn_learner(dls,resnet34,metrics=error_rate)
learn.fit_one_cycle(3,3e-3)
learn.unfreeze()
learn.fit_one_cycle(12,lr_max=slice(1e-6,1e-4))
epoch train_loss valid_loss error_rate time
0 1.157091 0.375448 0.116373 00:43
1 0.537694 0.227430 0.077808 00:44
2 0.333000 0.212604 0.068336 00:44
epoch train_loss valid_loss error_rate time
0 0.243977 0.204482 0.064953 00:54
1 0.249572 0.199334 0.069689 00:55
2 0.251869 0.192973 0.069012 00:55
3 0.210270 0.182762 0.062246 00:56
4 0.189316 0.180381 0.056834 00:55
5 0.173813 0.183509 0.065629 00:55
6 0.151790 0.176850 0.061570 00:56
7 0.145880 0.178531 0.061570 00:56
8 0.133386 0.174536 0.052774 00:55
9 0.134584 0.179717 0.058187 00:55
10 0.129982 0.174135 0.058863 00:56
11 0.125701 0.175962 0.056157 00:56