20. Playing with Convolutional Neural Networks - keep learning blog

"It doesn't matter how beautiful your theory is, it doesn't matter how smart you are. If it doesn't agree with experiment, it's wrong."

Richard Phillips Feynman

Paper blew the Christmas mood away

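It's the second half of December and everything is decked out for Christmas. Real, freshly cut Christmas trees (felled conifers that look like firs) are sold on the street, and the families I pass and the restaurant staff all seem a little giddy. In Denmark, the period from around December 24 to January 2 is the winter holiday, and universities and workplaces empty out.

Right around then, my lab held its last meeting before the Christmas break, where my professor quietly asked me, "Is the paper coming along well?" My Christmas mood vanished without a trace.

So, turning my back on the festive world, this post is a serious one about machine learning. I build a dog-vs-cat classifier using Kaggle's dog and cat image data. On top of that, I implement a visualization method from a paper published this October and take a peek inside the AI's head (which part of an image it looks at when it classifies).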

f:id:yuki0718:20191223025702j:plain

Convolutional neural networks (CNN)

Previously, over two posts titled "Machine learning basics part 1-2", I introduced the basic ideas behind deep neural networks (DNN).

keep-learning.hatenablog.jp

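The foundations of DNNs boil down to the mathematics I covered in those posts (clumsy as my explanation was): the perceptron, backpropagation, and gradient descent. Today's models are all extensions of these ideas.

The DNN field has by now produced a wide variety of models tailored to the kind of training data, such as images or time series. This time I introduce one of the DNN models that can fairly be called among the most successful: the convolutional neural network (CNN).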

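Origins and mechanics of CNNs

A CNN is an applied DNN model specialized for learning from images, and it is a groundbreaking technique that has also succeeded in natural language processing (translation, for example). It grew out of bringing ideas from neuroscience into image processing (computer vision): the neural mechanism by which humans recognize objects in images is modeled with mathematical tools called convolution filters (kernels) and pooling.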

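Convolution here means a binary operation that slides one function $f(x)$ along and sums its overlap with another function $g(x)$. In mathematics it is defined as the convolution product of two functions, in the following form*1:

$\displaystyle (f \ast g)(x) \stackrel{\mathrm{def}}{\equiv} \int f(x-y)g(y)dy$

In a CNN's convolutional layer, the convolution filter (kernel), which models the simple cells found in the human brain, corresponds to $f(x)$ in the formula above, and the pixel distribution of the image corresponds to $g(x)$. The filter slides across the image, performing the convolution at each position, and builds up local responses over its receptive field, such as "hey, there's a dog's ear here."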
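To make the integral concrete, here is a minimal sketch of its discrete 2-D counterpart in NumPy. The function name `conv2d_valid`, the toy image, and the 2x2 kernel are all made up for illustration; this is not how Keras implements `Conv2D` internally.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Discrete 2-D convolution without padding ("valid" mode)."""
    kh, kw = kernel.shape
    flipped = kernel[::-1, ::-1]  # the f(x - y) term: a true convolution flips the kernel
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the current window by the flipped kernel and sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

# Toy 4x4 "image" with a vertical edge between columns 1 and 2
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

# Hypothetical kernel that responds to horizontal changes in brightness
kernel = np.array([[1, -1],
                   [1, -1]], dtype=float)

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # only the middle column, where the edge sits, is non-zero
```

Deep learning libraries typically skip the kernel flip (strictly speaking they compute a cross-correlation), which makes no practical difference because the kernel values are learned anyway.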

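A convolutional layer is followed by a pooling layer. Pooling models the complex cells found in the human brain, and compresses the local information extracted by the receptive fields down to the important parts. Concretely, it divides the image into small windows, takes the maximum or the average from each window, and discards the rest.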
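The windowing step can be sketched in a few lines of NumPy. This is a minimal illustration of 2x2 max pooling with stride 2; the helper name `max_pool_2x2` and the sample values are hypothetical.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2; an odd trailing row or column is dropped."""
    h, w = feature_map.shape
    h2, w2 = h // 2, w // 2
    trimmed = feature_map[:h2 * 2, :w2 * 2]
    # Group pixels into (h2, 2, w2, 2) blocks and keep the maximum of each block
    return trimmed.reshape(h2, 2, w2, 2).max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]], dtype=float)

pooled = max_pool_2x2(fm)
print(pooled)  # each 2x2 window is reduced to its largest value
```

Replacing `.max(axis=(1, 3))` with `.mean(axis=(1, 3))` gives average pooling instead.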

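By repeating convolutional and pooling layers, a CNN extracts and compresses features from the image over and over until they reach a level at which images can be told apart. That is the heart of a CNN.

Ideally I would formulate the CNN model mathematically as in the earlier posts and work through the partial derivatives. However, explaining the partial-derivative calculations for a CNN is quite a chore, so this time I focus on the implementation.

For an intuitive picture of what happens inside a CNN, the following web page is very easy to follow.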

deepage.net

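Advantages of CNNs

In the wine-origin classification example from the earlier post, I explained that accuracy depends less on the structure of the learning model than on how well you can quantify the features fed into it, such as color, aroma, and acidity, which are grounded in human senses. In conventional machine learning, choosing these features was left to human judgment, and it was something of a craftsman's art.

The arrival of CNNs, however, overturned that good old craftsmanship from the ground up*2. A CNN automatically picks up the features shared across the training images (eyes and nose, for face images), quantifies them, derives from those numbers the probability (score) that the image shows a particular object, and classifies the image.

You can't appreciate this technical advantage, or how staggering it is, until you implement it and try it on a real problem. So I'll keep the theory brief and go straight to the implementation, with an emphasis on application.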

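Building a CNN dog-vs-cat classifier in Python

The goal this time is to train a CNN on the dogs vs cats dataset, 25,000 dog and cat images provided free of charge on Kaggle*3, a machine learning community, and to build a binary classifier that outputs 1 for a dog and 0 for a cat.

Preparing the training data

First, register as a user on the Kaggle website so you can download the data.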

www.kaggle.com

Once you register and log in, you can download 25,000 images like the ones below from that page. Sort them into folders with the obvious names dog and cat, and store them under your Python working folder.

[Figure: sample images from the dataset]

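Setting up TensorFlow and Keras

I implement everything in Python again. That said, writing a CNN's tensor computations from scratch is reckless, or rather pointless*4, so here I use Google's TensorFlow platform (currently ver. 1.14.0), which many engineers rely on, and in particular Keras, its high-level wrapper library, to program the CNN.

Also, as with most DNNs, the matrix computations of a CNN are too much for a CPU like the one in my laptop (it simply takes too long). Fast computation calls for a graphics card of the kind used to render online games, that is, hardware known as a GPU*5.

Running TensorFlow and Keras on a GPU requires, as usual, some environment setup. If you installed Python through Anaconda like I did, the following website should help, although its information is a bit dated.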

dev.infohub.cc

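For beginners like me, this environment setup can be harder than writing the program itself. Libraries frequently throw errors because of version mismatches, so it is common to fail, uninstall, and start over.

In that respect, Anaconda lets you create a virtual environment for machine learning without breaking the original environment (base), so if you install TensorFlow and Keras there you can recover even if something goes wrong.

As an aside, Google now offers Google Colaboratory, a service that lets you use cloud GPUs for free, so as long as you have an internet connection you may not even need your own GPU. What a time to be alive.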

colab.research.google.com

Implementation in Python

Now for the implementation. As always, the source code is below. It's self-taught code, so please forgive how amateurish it looks.

import matplotlib.pyplot as plt
import numpy as np
import os
import time
import glob
import gc
import h5py
import math
from tensorflow.keras import backend
from tensorflow.keras.models import Model, model_from_json
from tensorflow.keras.layers import Input, Dense, GlobalAveragePooling2D
from tensorflow.keras.layers import Conv2D, MaxPooling2D, BatchNormalization
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import LearningRateScheduler
from tensorflow.keras.preprocessing.image import array_to_img, img_to_array, load_img
from datetime import datetime
from sklearn.metrics import accuracy_score, classification_report, roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

#■Function for generating batches in training■
def batch_iter(data, labels, batch_size, shuffle=True):
    
    #The number of batches
    num_batches_per_epoch = int((len(data) - 1) / batch_size) + 1
    
    #Generate dataset for each batch
    def data_generator():
        data_size = len(data)
        while True:
            #Shuffle the data at each epoch
            if shuffle:
                shuffle_indices = np.random.permutation(np.arange(data_size))
                shuffled_data = data[shuffle_indices]
                shuffled_labels = labels[shuffle_indices]
            else:
                shuffled_data = data
                shuffled_labels = labels

            for batch_num in range(num_batches_per_epoch):
                start_index = batch_num * batch_size
                end_index = min((batch_num + 1) * batch_size, data_size)
                X = shuffled_data[start_index: end_index]
                y = shuffled_labels[start_index: end_index]
                yield X, y
    
    return num_batches_per_epoch, data_generator()

#■Function to change the learning rate for each epoch■
def step_decay(x):
    y = learn_rate * 10**(-lr_decay*x)
    return y

#■Function for executing CNN learning■
def CNN_learning(train_x, train_y, test_x, test_y, LR, BS, EP, log_path, mode):
    
    #Path for saving CNN model
    p1 = "./log/model.json"
    p2 = "./log/weights.h5"
    
    #In case of existing pre-learned model
    if os.path.isfile(p1) and os.path.isfile(p2) and mode == 1:
        #Read the pre-learned model
        with open(p1, "r") as f:
            cnn_model = model_from_json(f.read())
            cnn_model.load_weights(p2)
        opt = SGD(lr=LR, momentum=0.9, decay=0.0)
        cnn_model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['acc'])
    
    #In case of learning from the beginning
    else:
        #Get the number of row and column in input images
        row = train_x.shape[1]
        column = train_x.shape[2]
        print("input_data_shape: " + str(train_x.shape) )
        
        #Define the input size(row, column, color)
        image_size = Input(shape=(row, column, 3))
        
        #Construct the CNN model with Functional API by Keras
        x = BatchNormalization()(image_size)
        x = Conv2D(32, (3, 3), padding='same', activation="relu")(x)
        x = BatchNormalization()(x)
        x = Conv2D(32, (3, 3), padding='same', activation="relu")(x)
        x = BatchNormalization()(x)
        x = MaxPooling2D((2, 2), strides=(2, 2))(x)
        x = Conv2D(64, (3, 3), padding='same', activation="relu")(x)
        x = BatchNormalization()(x)
        x = Conv2D(64, (3, 3), padding='same', activation="relu")(x)
        x = BatchNormalization()(x)
        x = MaxPooling2D((2, 2), strides=(2, 2))(x)
        x = Conv2D(128, (3, 3), padding='same', activation="relu")(x)
        x = BatchNormalization()(x)
        x = Conv2D(128, (3, 3), padding='same', activation="relu")(x)
        x = BatchNormalization()(x)
        x = Conv2D(128, (3, 3), padding='same', activation="relu")(x)
        x = BatchNormalization()(x)
        x = GlobalAveragePooling2D()(x)
        x = Dense(256, activation='relu', kernel_regularizer=l2(0.01))(x)
        x = BatchNormalization()(x)
        x = Dense(1, activation='sigmoid')(x)
        
        #Construct the model and display summary
        cnn_model = Model(image_size, x)
        cnn_model.summary() #summary() prints by itself and returns None
        
        #Define the optimizer (SGD with momentum)
        opt = SGD(lr=LR, momentum=0.9, decay=0.0)
        
        #Compile the model
        cnn_model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['acc'])
        
        #Start learning
        lr_decay = LearningRateScheduler(step_decay)
        #hist = cnn_model.fit(train_x, train_y, batch_size=BS, epochs=EP, validation_data=(test_x, test_y), callbacks=[lr_decay], verbose=1)
        train_steps, train_batches = batch_iter(train_x, train_y, BS)
        valid_steps, valid_batches = batch_iter(test_x, test_y, BS)
        hist = cnn_model.fit_generator(train_batches, train_steps, epochs=EP,
                validation_data=valid_batches, validation_steps=valid_steps,
                callbacks=[lr_decay], max_queue_size=5, verbose=1) #max_queue_size is relevant to memory cache
        
        #Save the learned model
        model_json = cnn_model.to_json()
        with open(p1, 'w') as f:
            f.write(model_json)
        cnn_model.save_weights(p2)
        
        #Save the learning history as text file
        loss = hist.history['loss']
        acc = hist.history['acc']
        val_loss = hist.history['val_loss']
        val_acc = hist.history['val_acc']
        with open(log_path, "a") as fp:
            fp.write("epoch\tloss\tacc\tval_loss\tval_acc\n")
            for i in range(len(acc)):
                fp.write("%d\t%f\t%f\t%f\t%f" % (i, loss[i], acc[i], val_loss[i], val_acc[i]))
                fp.write("\n")
        
        #Display the learning history
        plt.rcParams.update({'font.size': 14})
        fig, (axL, axA) = plt.subplots(ncols=2, figsize=(18, 5))
        #Loss function
        axL.plot(hist.history['loss'], label="loss for training")
        axL.plot(hist.history['val_loss'], label="loss for validation")
        axL.set_title('model loss')
        axL.set_xlabel('epoch')
        axL.set_ylabel('loss')
        axL.legend(loc='upper right')
        #Score
        axA.plot(hist.history['acc'], label="accuracy for training")
        axA.plot(hist.history['val_acc'], label="accuracy for validation")
        axA.set_title('model accuracy')
        axA.set_xlabel('epoch')
        axA.set_ylabel('accuracy')
        axA.legend(loc='lower right')
        plt.show()
        #Save the graph
        fig.savefig("./log/loss_accuracy.png")
    
    #Get the score for evaluation data
    proba_y = cnn_model.predict(test_x)
    
    #Restart the session to relieve the GPU memory (to prevent Resource Exhausted Error)
    backend.clear_session()
    #backend.get_session() #less than Tensorflow ver.1.14
    del cnn_model
    gc.collect()

    #Return the predicted score for the evaluation data
    return proba_y

#■Function for calculating AUC(Area Under ROC Curve) and its standard error■
def get_AUC(test_y, proba_y):
    
    #Compute the AUC
    AUC = roc_auc_score(test_y, proba_y)
    
    #Compute the AUC standard error[1]
    #[1] J.A.Hanley and B.J.McNeil, Radiology, 1982
    #https://pubs.rsna.org/doi/pdf/10.1148/radiology.143.1.7063747
    N_posi = sum(test_y == 1)
    N_nega = sum(test_y != 1)
    Q1 = AUC / (2 - AUC)
    Q2 = 2 * AUC**2 / (1 + AUC)
    SE = np.sqrt((AUC*(1-AUC) + (N_posi-1)*(Q1-AUC**2) + (N_nega-1)*(Q2-AUC**2)) / (N_posi*N_nega))
    
    #Plot the ROC curve
    plt.rcParams["font.size"] = 16
    plt.figure(figsize=(12, 8))
    fpr, tpr, thresholds = roc_curve(test_y, proba_y)
    plt.plot([0, 1], [0, 1], linestyle='--')
    plt.plot(fpr, tpr, marker='.')
    plt.title('ROC curve')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.show()
    
    #Return AUC
    return AUC, SE

#■Main■
if __name__ == "__main__":
    
    #Set up
    learn_rate = 1e-2      #Learning rate for CNN training
    lr_decay = 0.1         #Learning rate is according to "learn_rate*10**(-lr_decay*n_epoch)"
    batch_size = 32        #Size of each batch for CNN training
    epoch = 20             #The number of repeat for CNN training
    IMGmode = 1            #0: convert images into numpy array, 1: read local numpy-files
    CNNmode = 1            #0: train from the beginning, 1: read pre-learned model
    
    #Define a parent path for preserving data
    mydata = "path/to/your/python_working_folder"
    
    #In case of calculating the numpy array from images
    if IMGmode == 0:
        #Initialize variable
        x = []
        y = []
        
        #Read the images of cats
        cat_files = glob.glob(mydata + "/data/training/cat/*.jpg")
        for cat_path in cat_files:
            img = img_to_array(load_img(cat_path, target_size=(150,150)))
            x.append(img)
            y.append(0)
        
        #Read the images of dogs
        dog_files = glob.glob(mydata + "/data/training/dog/*.jpg")
        for dog_path in dog_files:
            img = img_to_array(load_img(dog_path, target_size=(150,150)))
            x.append(img)
            y.append(1)
        
        #Save the training data
        np.save(mydata + '/numpy_data/training/x_dogcat', x)
        np.save(mydata + '/numpy_data/training/y_dogcat', y)
    
    #In case of reading the numpy array from local file
    else:
        x = np.load(mydata + '/numpy_data/training/x_dogcat.npy')
        y = np.load(mydata + '/numpy_data/training/y_dogcat.npy')
    
    #Convert into numpy array
    x = np.array(x) / 255.0
    y = np.array(y)
    
    #Split the data (x, y) into training and evaluation
    train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=1, stratify=y)
    
    #Prepare for process-log
    message = "Training for dogcat-detector\n\n"
    log_path = "./log/" + datetime.now().strftime("%Y%m%d_%H%M%S") + ".txt"
    with open(log_path, "w") as fp:
        fp.write(message) #Write the variable, not the literal string "message"
    
    #Get the start time
    start = time.time()
        
    #Call my function for executing CNN learning
    proba_y = CNN_learning(train_x, train_y, test_x, test_y, learn_rate, batch_size, epoch, log_path, CNNmode)
    
    #Call my function for calculating the AUC
    AUC, SE = get_AUC(test_y, proba_y)
    
    #Output the binary accuracy (Detection Rate)
    pred_y = np.where(proba_y < 0.5, 0, 1) #Binary threshold = 0.5
    ACC = accuracy_score(test_y, pred_y)
    print(classification_report(test_y, pred_y))
    
    #Output the result
    finish = time.time() - start
    report = "AUC={:.3f}, Confidence_interval(95%)=±{:.3f}, Detection_rate={:.3f}, Process_time={:.1f}sec\n".format(AUC, 1.96*SE, ACC, finish)
    message = message + report
    print(report)
    with open(log_path, "a") as fp:
        fp.write(message)

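It's long and may be hard to read, so I'll stick to the key points.

First, as for how the 25,000 dog and cat images (12,500 dogs and 12,500 cats) are used: the line "train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=1, stratify=y)" randomly splits the data into 80% training data train_x and 20% test data test_x, with stratify=y keeping the dog/cat ratio the same in both sets.

In other words, the CNN is first trained on 25000 × 80% = 20,000 training images, and the resulting model is then checked against the 5,000 test images to see what percentage it gets right.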
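Besides the accuracy, the script reports an AUC with a 95% confidence interval of ±1.96 × SE, where SE is the Hanley-McNeil standard error computed in get_AUC. A minimal sketch of that arithmetic follows. The AUC of 0.97 is a hypothetical value chosen for illustration, not a result from this post; the 2,500 dogs and 2,500 cats follow from the stratified 20% split.

```python
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of the AUC (Hanley & McNeil, 1982), same formula as get_AUC."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (auc * (1 - auc)
                + (n_pos - 1) * (q1 - auc ** 2)
                + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(variance)

auc = 0.97  # hypothetical AUC, for illustration only
se = hanley_mcneil_se(auc, n_pos=2500, n_neg=2500)
print("AUC = {:.3f} ± {:.4f} (95% confidence interval)".format(auc, 1.96 * se))

# With a tenth of the test images, the interval gets noticeably wider
se_small = hanley_mcneil_se(auc, n_pos=250, n_neg=250)
print("Same AUC with 500 test images: ± {:.4f}".format(1.96 * se_small))
```

The comparison at the end shows why holding out a few thousand test images matters: the uncertainty on the AUC shrinks as the test set grows.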

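Also, a CNN has far more parameters than classical machine learning models, so if you train it too aggressively on this much data you end up with a dud of a model that gets 100% on the training data but can hardly get anything right on test data it never saw during training.

In machine learning jargon, this kind of training failure is called overfitting, or a loss of generalization performance. Overfitting is a major cause of degraded performance, and it is often brought on by biased training data or a poorly chosen layer configuration.

So in this CNN model I place a batch normalization layer before each convolutional layer, reduce the number of parameters by using global pooling rather than Flatten to combine the convolution outputs, and introduce an L2 regularization term on the weights just before the output with "Dense(256, activation='relu', kernel_regularizer=l2(0.01))(x)".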
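To get a feel for how much global pooling saves, here is a back-of-the-envelope count of the parameters in the Dense(256) layer for both choices. It assumes the 150x150 input and the two stride-2 poolings used in the script above; the helper `dense_params` is made up for this illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

# Feature map after the last convolution for a 150x150 input:
# two 2x2 poolings with stride 2 shrink each side 150 -> 75 -> 37,
# and the last convolutional layer has 128 kernels.
side = 150 // 2 // 2
channels = 128

flatten_inputs = side * side * channels  # Flatten keeps every spatial position
gap_inputs = channels                    # GlobalAveragePooling2D keeps one value per kernel

flatten_params = dense_params(flatten_inputs, 256)
gap_params = dense_params(gap_inputs, 256)

print("Dense(256) after Flatten:              {:,} parameters".format(flatten_params))
print("Dense(256) after GlobalAveragePooling: {:,} parameters".format(gap_params))
```

Under these assumptions the Flatten variant needs roughly 45 million parameters in that single layer, against about 33 thousand with global average pooling.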

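All of these are techniques that have become de facto standards in the CNN field, and when used well they have been shown, both empirically and theoretically, to suppress overfitting effectively. The layer configuration of the CNN is modeled on VGGNet, an architecture that achieved excellent results in the 2014 image recognition competition ILSVRC.

If you want to know more, reading the following papers may deepen your understanding*6.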

[1409.1556] Very Deep Convolutional Networks for Large-Scale Image Recognition
[1806.02375] Understanding Batch Normalization
[1312.4400] Network In Network

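Having built the CNN with these overfitting countermeasures, I trained it for about 10 minutes over 20 epochs (passes over the training data) and obtained nearly 100% accuracy on the training data and over 90% on the test data. The evolution of the loss function and the accuracy during training is shown below.

[Figure: model loss and model accuracy per epoch, for training and validation data]

It looks like training has not fully converged yet, so accuracy would probably improve with a few more epochs or a better layer configuration, but I'm not trying to compete on accuracy here, so I stopped after a single training run.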
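When judging convergence, it helps to keep in mind that the learning rate itself shrinks during training: step_decay in the script multiplies it by 10**(-lr_decay) every epoch. The sketch below just prints that schedule with the same settings as the script (learn_rate = 1e-2, lr_decay = 0.1, 20 epochs); whether this schedule is what flattens the curves is my guess, not something measured here.

```python
learn_rate = 1e-2  # initial learning rate (same value as the training script)
lr_decay = 0.1     # decay exponent per epoch (same value as the training script)

def step_decay(epoch):
    """Learning rate used at a given epoch (epochs are counted from 0)."""
    return learn_rate * 10 ** (-lr_decay * epoch)

schedule = [step_decay(epoch) for epoch in range(20)]
for epoch in (0, 5, 10, 15, 19):
    print("epoch {:2d}: learning rate = {:.2e}".format(epoch, schedule[epoch]))
```

With these settings the rate drops by a factor of 10 every 10 epochs, so the last epochs take much smaller steps than the first ones.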

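Building a CNN model is not the end

That wraps up the explanation of programming the CNN. Given dog and cat images, it classified more than 90% of them correctly. That is certainly nice. Ten years ago this accuracy might have taken considerable effort to reach.

These days, however, anyone can easily build a CNN model with the help of TensorFlow and classify dog and cat images or the handwritten digit images known as MNIST. In other words, implementing a CNN is nothing new in itself; it is as simple and clear-cut as Weekly Shonen Jump's "friendship, effort, victory", and it is only the starting point of the story.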

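CNNs are impressive, but using a CNN does not make me impressive

When I started my research in July, the first guidance my professor gave me was a warning: "Don't get carried away by trends."

Apparently, in speech recognition too, the recent DNN boom has led to a scattering of "we got good results, happily ever after" papers along the lines of "Using a DNN improved accuracy!" or "We achieved an accuracy of over ~%!", and the quality of academic papers has dropped markedly.

Fundamentally, science is an activity whose aim is to connect a result ("this is what happened") with a cause ("why it happened") through theories and models, and to explain that connection logically.

A paper that merely lines up results obtained with a fashionable method, without explaining the link between cause and effect, is, in my professor's words, "a worthless piece of writing that might pass as engineering but can never be science" (ouch!).

Being lectured at rapid fire in English was a little scary at first, but I largely agree with this view. As Feynman's words at the top say, scientists should not show off the beauty of a theory or their own cleverness; they must take pains to give scientific explanations based on facts and evidence.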

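In this case, that means I shouldn't be satisfied with merely classifying dogs and cats with a CNN. So can we offer some kind of evidence for why this CNN model is able to tell dogs from cats?

One line of work that sheds light on this question is Class Activation Mapping (CAM). Within it, I focused on the following paper, published in October 2019 (very recent). According to the paper, by extracting information from the trained CNN model obtained above, you can visualize which part of an image the CNN looks at when it decides the image is a dog.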

arxiv.org

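Implementing Score-CAM

Once you understand the paper, implementing Score-CAM is very simple. Keras is a wrapper around TensorFlow, and by calling the backend behind it you can pull out intermediate results of the tensor computation; computing with them as the paper describes yields the CAM map.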
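Stripped of the Keras plumbing, the core of Score-CAM fits in a few lines of NumPy. This is a minimal sketch under simplifying assumptions: the activation maps are already upsampled to the input size, the raw model score is used as the weight of each map, and `toy_score` is a made-up stand-in for a trained CNN.

```python
import numpy as np

def score_cam(score_fn, x, activation_maps):
    """x: (H, W, C) image. activation_maps: (H, W, K), already at input size.
    score_fn maps a batch (N, H, W, C) to N scores."""
    weights = []
    for k in range(activation_maps.shape[-1]):
        a = activation_maps[:, :, k]
        mask = (a - a.min()) / (a.max() - a.min() + 1e-5)  # normalize to [0, 1]
        masked = x * mask[:, :, np.newaxis]                # keep only what map k highlights
        weights.append(score_fn(masked[np.newaxis])[0])    # score of the masked image
    weights = np.array(weights)
    cam = np.maximum(activation_maps @ weights, 0)         # weighted sum, then ReLU
    return cam, weights

# Toy stand-in for a trained CNN (assumption): the "score" is the mean
# brightness of the top-left 2x2 corner of the image.
def toy_score(batch):
    return batch[:, :2, :2, :].mean(axis=(1, 2, 3))

x = np.ones((4, 4, 3))
maps = np.zeros((4, 4, 2))
maps[:2, :2, 0] = 1.0  # map 0 highlights the region the toy model cares about
maps[2:, 2:, 1] = 1.0  # map 1 highlights an irrelevant region

cam, weights = score_cam(toy_score, x, maps)
print(weights)  # map 0 gets a weight close to 1, map 1 gets 0
print(cam)      # only the top-left corner lights up
```

The point of the design is that no gradient is needed: each activation map is judged by how much of the model's score survives when the input is masked by that map.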

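However, Score-CAM requires enlarging the maps that pooling has shrunk, so I installed OpenCV, an image processing library for Python. Setting this up also tends to get you stuck on path issues (I failed once).

For this purpose, the reliable way to install OpenCV is the command "pip install opencv-python". If you also want the extension package containing more advanced (but not guaranteed to work) algorithms, use the command "pip install opencv-contrib-python" instead.

Here is the source code. I don't think it's written all that elegantly, but it's still quite short.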

import matplotlib.pyplot as plt
import numpy as np
import os
import time
import glob
import gc
import h5py
import math
import random
from PIL import Image
import cv2
from tensorflow.keras.preprocessing import image as images
from tensorflow.keras.preprocessing.image import array_to_img, img_to_array, load_img
from tensorflow.keras import backend
from tensorflow.keras.models import Model, model_from_json

#■Function for calculating Grad-CAM■
def Grad_CAM(model, x, layer_name):
    
    #Print the binary classification score
    print("Prediction score: " + str(model.predict(x)[0, 0]))
    
    #Get the original image size
    row = model.input_shape[1] #input_shape is (None, row, column, channel)
    column = model.input_shape[2]
    print("Input_size: " + str((row, column)))
    
    #Function to get final conv_output and gradients
    true_output = model.layers[-1].output #Output of the truly final layer
    mid_output = model.get_layer(layer_name).output  #Output of the final convolutional layer
    grads = backend.gradients(true_output, mid_output)[0]  #Calculate the "gradients(loss, variables)"
    mean_grads = backend.mean(grads, axis=(1, 2))  #Average the gradients
    gradient_function = backend.function([model.input], [mid_output, mean_grads])
    
    #Get the output of final conv_layer and the weight for each kernel (mean gradients)
    conv_output, kernel_weights = gradient_function([x])
    conv_output, kernel_weights = conv_output[0], kernel_weights[0]
    
    #Get the Class Activation Mapping (CAM)
    cam = conv_output @ kernel_weights
    #Caution! cv2-resize-shape is reverse of numpy-shape
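    cam = cv2.resize(cam, (column, row), interpolation=cv2.INTER_LINEAR) #Scale up (3rd positional argument of cv2.resize is dst, so use the keyword)
    cam = np.maximum(cam, 0) #We have no interest in negative value (like ReLu)
    cam = 255 * cam / (cam.max() + 1e-5)  #Normalize (epsilon avoids division by zero)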
    
    #Get together with original image
    original = x[0, :, :, 0][:, :, np.newaxis]  #Use the first (red) channel as a grayscale background
    original = np.uint8(255 * original / original.max())
    heatmap = cv2.applyColorMap(np.uint8(cam), cv2.COLORMAP_OCEAN)  #Add color to heat map
    heatmap = cv2.cvtColor(heatmap, cv2.COLOR_BGR2RGB)  #Convert it into color map
    plusCAM = (np.float32(heatmap)*0.4 + original*0.6)  #Mix original image with heatmap
    plusCAM = np.uint8(255 * plusCAM / plusCAM.max())
    
    #Return the CAM data
    return heatmap, plusCAM

#■Function for calculating Score-CAM■
def Score_CAM(model, x, layer_name):
    
    #Print the binary classification score
    print("Prediction score: " + str(model.predict(x)[0, 0]))
    
    #Get the activation map for each kernel (Amap: A^k in original paper)
    Amap_array = Model(inputs=model.input, outputs=model.get_layer(layer_name).output).predict(x)
    
    #Get the original image size
    row = model.input_shape[1] #input_shape is (None, row, column, channel)
    column = model.input_shape[2]
    print("Input_size: " + str((row, column)))
    
    #Scale up the Amap into the original image size
    Amap_upsample_list = []
    for k in range(Amap_array.shape[3]):
        #Caution! cv2-resize-shape is reverse of numpy-shape
        Amap_upsample_list.append(cv2.resize(Amap_array[0,:,:,k], (column, row), interpolation=cv2.INTER_LINEAR))
    
    #Normalize Amap into the range between 0 and 1
    Amap_norm_list = []
    for Amap_upsample in Amap_upsample_list:
        Amap_norm = (Amap_upsample - np.min(Amap_upsample)) / (np.max(Amap_upsample) - np.min(Amap_upsample) + 1e-5) #Subtract the minimum so the range really is 0 to 1
        Amap_norm_list.append(Amap_norm)
    
    #Project into original image by multiplying the normalized Amap
    mask_input_list = []
    for Amap_norm in Amap_norm_list:
        mask_input = np.zeros_like(x) #initialize
        for c in range(3):
            mask_input[0,:,:,c] = x[0,:,:,c] * Amap_norm
        mask_input_list.append(mask_input)
    mask_input_array = np.concatenate(mask_input_list, axis=0)
    
    #Get the CNN output by masked input as weight for each kernel (S^k in original paper)
    kernel_weights = model.predict(mask_input_array)[:, 0]
    
    #Get the Class Activation Mapping (CAM)
    cam = Amap_array[0,:,:,:] @ kernel_weights
    #Caution! cv2-resize-shape is reverse of numpy-shape
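    cam = cv2.resize(cam, (column, row), interpolation=cv2.INTER_LINEAR) #Scale up (3rd positional argument of cv2.resize is dst, so use the keyword)
    cam = np.maximum(cam, 0) #We have no interest in negative value (like ReLu)
    cam = 255 * cam / (cam.max() + 1e-5) #Normalization (epsilon avoids division by zero)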
    
    #Get together with original image
    original = x[0, :, :, 0][:, :, np.newaxis]  #Use the first (red) channel as a grayscale background
    original = np.uint8(255 * original / original.max())
    heatmap = cv2.applyColorMap(np.uint8(cam), cv2.COLORMAP_OCEAN)  #Add color to heat map
    heatmap = cv2.cvtColor(heatmap, cv2.COLOR_BGR2RGB)  #Convert it into color map
    plusCAM = (np.float32(heatmap)*0.4 + original*0.6)  #Mix original image with heatmap
    plusCAM = np.uint8(255 * plusCAM / plusCAM.max())
    
    #Return the CAM data
    return heatmap, plusCAM

#■Main■
if __name__ == "__main__":
    
    #Load the pre-learned model and its weight
    model_path = "./log/model.json"
    weight_path = "./log/weights.h5"
    if os.path.isfile(model_path) and os.path.isfile(weight_path):
        #Read the pre-learned model
        with open(model_path, "r") as f:
            cnn_model = model_from_json(f.read())
        cnn_model.load_weights(weight_path)
    else:
        print("Model or weight file not found: check model_path and weight_path.")
        raise SystemExit(1)
    
    #Extract the image files randomly
    mydata = "path/to/your/python_working_folder"
    #fpath = mydata + "evaluation"
    fpath = mydata + "/data/training/dog" #Same folder layout as the training script
    files = glob.glob(fpath + "/*.jpg")
    list_i = random.sample(list(range(len(files))), k=min(20, len(files))) #Extract up to 20 samples randomly
    
    #Repeat for each image
    for i in list_i:
        
        #Read the image and convert into array
        img = load_img(files[i], target_size=(150,150))
        x = img_to_array(img) / 255.0
        x = x[np.newaxis, :, :, :]
        
        #Draw the image
        plt.rcParams["font.size"] = 14
        plt.figure(figsize=(15, 8), dpi=100)
        plt.subplot(1,3,1)
        plt.title('Original image')
        plt.imshow(img)
        
        #Call for my function to get CAM image
        Grad_heatmap, plusGradCAM = Grad_CAM(cnn_model, x, "conv2d_7")
        Score_heatmap, plusScoreCAM = Score_CAM(cnn_model, x, "conv2d_7")
        
        #Draw the Grad-CAM heatmap
        #img_heatmap = array_to_img(Grad_heatmap)
        img_heatmap = array_to_img(plusGradCAM)
        plt.subplot(1,3,2)
        plt.title('Gradient based CAM')
        plt.imshow(img_heatmap)
        
        #Draw the Score-CAM heatmap
        #img_heatmap = array_to_img(Score_heatmap)
        img_heatmap = array_to_img(plusScoreCAM)
        plt.subplot(1,3,3)
        plt.title('Score based CAM')
        plt.imshow(img_heatmap)
        plt.show()

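I also implemented Grad-CAM, an older CAM method that the paper presents as a point of comparison. Mapping randomly picked dog images gave the results below.

[Figure: three example dog images, each shown as three panels: Original image, Gradient based CAM, Score based CAM]

The left column is the original image fed to the CNN, the middle column is the Grad-CAM map, and the right column is the Score-CAM map. The pale, bluish-white areas are where the CNN, our dog-vs-cat classifier, thought "hey, this is a dog!", in other words the image regions on which it bases its decision.

For the first image, the older Grad-CAM makes it look as though the CNN decides by looking only at the dog's nose. According to the paper, however, this is not because the CNN really looks only at that part, but an inaccuracy caused by a flaw in the Grad-CAM method.

With Score-CAM from the latest paper, on the other hand, you can see that it properly captures characteristic parts such as the dog's eyes, nose, ears, and legs, much as a human does when recognizing a dog. As the paper claims, Score-CAM appears to visualize the basis of the CNN's decision more accurately than the earlier Grad-CAM.

Next, for the second and third images, Grad-CAM seems to attend to only one dog when several are in the picture, whereas Score-CAM seems to recognize both dogs when deciding whether the image shows a dog. As for which matches our intuition, it is Score-CAM.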

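Whichever method is better, the CNN, without anyone teaching it, seems to have worked out in the course of learning from a huge amount of image data that the defining features of the creature called "dog" lie in its eyes and nose. With 90% accuracy, of course, I knew in theory that it must be capable of that much, but seeing it visualized like this is quite striking.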
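For reference, the Grad-CAM weighting used in the comparison can be sketched in NumPy as well. As in the Grad_CAM function above, the weight of each kernel is the spatial mean of the gradient of the score with respect to that kernel's activation map. The arrays below are hypothetical toy values, not outputs of the trained model.

```python
import numpy as np

def grad_cam(activation_maps, gradients):
    """activation_maps, gradients: (H, W, K) arrays from the last convolutional layer."""
    weights = gradients.mean(axis=(0, 1))           # one weight per kernel
    cam = np.maximum(activation_maps @ weights, 0)  # weighted sum, then ReLU
    return cam, weights

# Hypothetical toy values: two 2x2 activation maps and their gradients
maps = np.zeros((2, 2, 2))
maps[0, 0, 0] = 1.0     # map 0 fires in the top-left pixel
maps[1, 1, 1] = 1.0     # map 1 fires in the bottom-right pixel
grads = np.zeros((2, 2, 2))
grads[:, :, 0] = 0.5    # the score increases with map 0
grads[:, :, 1] = -0.5   # the score decreases with map 1

cam, weights = grad_cam(maps, grads)
print(weights)  # per-kernel weights from averaged gradients
print(cam)      # the negatively weighted region is clipped to zero by the ReLU
```

Compared with Score-CAM, the only difference is where the weights come from: averaged gradients here, scores of masked inputs there.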

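That's it: this time I introduced an implementation of a CNN and an implementation of Score-CAM, a method that lets you peek a little into the AI's head. Either way, Google's platform is powerful. Building models like these from scratch would take forever, but with TensorFlow's help even a beginner like me can easily play around like this.

Thank you for reading my clumsy writing.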

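*1: The reason convolution is called a convolution "product" is the handy property that the Fourier transform of the convolved function equals the product of the Fourier transforms of the original functions. Since the Fourier transform is used constantly in processing audio and digital signals, convolution is an indispensable ace of the signal processing field.

*2: In Silicon Valley parlance, it is a technology that truly deserves to be called a "disruptive innovation".

*3: A community for machine learning and data science practitioners with hundreds of thousands of participants. Through competitions hosted by companies and governments, it has become a huge platform for matching skilled engineers with companies and governments.

*4: You should probably do it once to understand the principles. However, the programming world is built on an accumulation of small refinements, so pouring your passion into building everything from scratch risks "reinventing the wheel".

*5: The lab desk I picked more or less at random on my first day in Denmark happened to have the only GPU-equipped PC in the lab, set up with a GPU board by a master's student who used to sit there. I'm not very good at tinkering with hardware, so this coincidence saved me.

*6: There are parts where I wonder, "Is that really so?", but I found the reasoning more or less convincing.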