２７．機械学習でデータをグループ分けしよう - keep learning blog（キープラーニングブログ）

――さらに試行せよ。何が可能かを知るために。――

マイケル・ファラデー

機械学習でデータをグループ分けしよう

機械学習でデータをグループ分けしよう

もうすぐ4月も終わり、私の留学も残り2ヵ月となりました。この国に来た頃は生活の立ち上げや研究計画のことなど、諸々の不安で夜も眠れなかったなあ、と懐かしく思い返します。

本業の研究の方は、別のトピックでプログラムを組み終わったので、これでしばらく実験します。できればもう一本論文を書きたいです。この一年で学部卒くらいのレベルにはなれたかな、という気がします（たったそれだけ？）。

ところで、以前の記事で、混合ガウスモデル（GMM：Gaussian mixture model）という確率論ベースの教師無し機械学習モデルを使って、データをグループ分け（Clustering）する方法をご紹介しました。

keep-learning.hatenablog.jp

GMMは、確率論の導入として情報系の学生が必ず一度は自力で実装するアルゴリズムではありますが、データのグループ分けに使うアルゴリズムには、もっと初歩的な「K-means」や「Fuzzy c-means」と呼ばれる手法があります。

電磁誘導の法則で有名なファラデーが言うように、どんな手法も試行しなければ何ができるのかは分かりません。今回の記事は、これらの手法を現実のデータに適用し、実際にグループ分けを行ってみたいと思います。

f:id:yuki0718:20200427013206j:plain

K-meansアルゴリズム

K-meansアルゴリズムは、教師無し機械学習の分野では最も有名な、ベーシックかつ高速で使い勝手の良いアルゴリズムです。入力した複数のデータ点をk個の近接グループに分割（Hard clustering）します。

基本的にはグループの重心（Centroid）を計算するRefinement stepと、グループの番号（Label）を割り当てるAssignment stepとを何度も繰り返して収束させる、反復型のアルゴリズムです。

今、 $p$ 次元の特徴量ベクトル（項目が $p$ 個あるデータ）からなる入力データ $\boldsymbol{x}_{i}$ をいくつも用意し、これらを $k$ 個のグループ $Label_{i} (\in 1,2,\cdots,k)$ に分けたいとします。

このとき、重心 $\boldsymbol{c}_{j}$ は以下のように計算されます。

$\displaystyle \boldsymbol{c}_{j}=\frac{\displaystyle \sum_{Label_{i}=j} \boldsymbol{x}_{i}}{\displaystyle \sum_{Label_{i}=j} 1}$

数式で書くとなんだか仰々しいですね。この式は単に $j$ 番目のグループに属するデータに限定して平均をとっているだけです。

続いて、ある入力データ $\boldsymbol{x}_{i}$ に対して適切なグループ番号 $Label_{i}=1,2,\cdots,k$ を割り当てるには、入力データと得られた各重心 $\boldsymbol{c}_{j}$ との絶対距離（L2ノルム）について、最小値探索を行います。

$\displaystyle \newcommand{\argmin}{\mathop{\rm arg~min}\limits} Label_{i}=\begin{equation} \argmin_{j} \| \boldsymbol{x}_{i}-\boldsymbol{c}_{j} \| \end{equation}$

以上のステップを交互に繰り返し、以下の損失関数が変化しなくなったら収束したものとみなして終了します。

$\displaystyle Loss=\sum_{j=1}^{k} \sum_{Label_{i}=j} \| \boldsymbol{x}_{i}-\boldsymbol{c}_{j} \|$

K-meansをPythonで実装

以上のとおり、K-meansは直観的にも理解しやすいシンプルなアルゴリズムなので、numpyライブラリ等を使えばPythonで簡単に実装することができます。

import sys
import time
import numpy as np
from scipy import linalg

#Function for checking empty clusters
def check_empty(label, num_clu):
    
    #Get unique label (which must include all number 0~num_clu-1)
    label = np.unique(label)
    
    #Search empty clusters
    emp_ind = []
    for i in range(num_clu):
        if i not in label:
            emp_ind.append(i)
    
    #Output the indices corresponding to the empty clusters
    return emp_ind

#Function for getting k-means clustering
def get_KMeans(X, num_clu, max_iter, num_init):
    
    #Define the length of input
    N = int(X.shape[0])  #The number of data
    p = int(X.shape[1])  #The length of feature axis
    
    #For a progress bar
    unit = int(np.floor(num_init/10))
    bar = "#" + " " * int(np.floor(num_init/unit))
    start = time.time()
    
    #Repeat for each trial (initialization)
    minloss = np.inf
    for init in range(num_init):
        
        #Display a progress bar
        print("\rk-means:[{0}] {1}/{2} Processing..".format(bar, init, num_init), end="")
        if init % unit == 0:
            bar = "#" * int(np.ceil(init/unit)) + " " * int(np.floor((num_init-init)/unit))
            print("\rk-means:[{0}] {1}/{2} Processing..".format(bar, init, num_init), end="")
        
        #Initialize label and centroid as random numbers
        label = np.round((num_clu-1) * np.random.rand(N))
        center = np.random.rand(num_clu, p)
        loss = np.inf
        
        #Repeat for each iteration
        for rep in range(max_iter):
            
            #Reset loss value
            new_loss = 0
            
            ### Refinement step (update center) ###
            for j in range(num_clu):
                
                #You can use the following code for concise implementation
                #center[j, :] = X[label==j].mean(axis=0)
                
                #Construct data matrix for the j-th cluster
                clu_X = []
                for i in range(N):
                    #If the i-th data belongs to the j-th cluster
                    if label[i] == j:
                        clu_X.append(X[i, :])
                clu_X = np.array(clu_X)
                
                #Update the j-th centroid
                center[j, :] = np.mean(clu_X, axis=0)
            
            ### Assignment step (update label) ###
            #Initialize valuable for new label vector
            new_label = np.zeros(N)
            
            for i in range(N):
                
                #You can use the following code for concise implementation
                #dist = np.linalg.norm(X[i, :]-center, axis=1)
                #j = int(np.argsort(dist)[0])
                #new_label[i] = j
                #new_loss = new_loss + dist[j]
                
                #Define the minimum distance
                mindist = np.inf
                
                #Search the closest centroid for the i-th data
                for j in range(num_clu):
                    
                    #Compute the norm (equivalent to sqrt(sum**2))
                    dist = np.linalg.norm(X[i, :]-center[j, :])
                    
                    #Assign the label corresponding to the minimum distance
                    if dist < mindist:
                        #Update minimum distance
                        mindist = dist
                        new_label[i] = j
                    
                #Get summation of the minimum distances
                new_loss = new_loss + mindist
            
            #Call my function for checking empty clusters
            emp_ind = check_empty(new_label, num_clu)
            if len(emp_ind) > 0:
                for ind in emp_ind:
                    #Assign the same index of data as the one of cluster
                    new_label[ind] = ind
            
            #Get out of the loop if loss and label unchange
            if np.abs(loss-new_loss) < 1e-6 and (new_label == label).all():
                #print("The iteration stopped at {}".format(rep+1))
                break
            
            #Update parameters
            label = np.copy(new_label)
            loss = np.copy(new_loss)
            #print("Loss value: {:.3f}".format(loss))
        
        #Output the result corresponding to minimum loss
        if loss < minloss:
            out_label = np.copy(label)
            out_center = np.copy(center)
            minloss = loss
    
    #Finish the progress bar
    bar = "#" * int(np.ceil(num_init/unit))
    print("\rk-means:[{0}] {1}/{2} {3:.2f}sec Completed!".format(bar, init+1, num_init, time.time()-start), end="")
    print()
    
    #Output the label vector and centroid matrix
    return out_label, out_center, minloss

K-meansのコードはもっと短く書くことも可能です。そういったコンパクトな書き方は「#You can use the following code for concise implementation」の下に数行コメントアウトした部分を参考にしてください（結果は完全一致します）。

このコードが長いのは私がアホであるという以外にも理由がありまして、次回の記事で紹介しようと思っているK-meansの進化版アルゴリズムK-shapeと見比べやすくするためです*1。

詳しくは割愛しますが、K-shapeはK-meansが苦手とする時系列データ（株価推移や音声データなど）のグループ分けに特化した、割と最新のアルゴリズムです。K-meansほどスマートに書けないため、このコードのようにfor文を使って愚直に実装します。

Fuzzy c-meansアルゴリズム

Fuzzy c-meansは、1984年にJ.C. Bezdekらによって発表された論文「FCM: The fuzzy c-means clustering algorithm」の中で紹介されました*2。

Fuzzyという名が示すとおり、このアルゴリズムのポイントは曖昧さです。すなわち、K-meansのように各データにユニークなグループ番号を割り当てることはせず、GMMのように確率的な分布（Soft clustering）を仮定します。

つまり、K-meansのように $Label_{i}$ はデータ番号 $i$ だけに依存するベクトルではなく、グループ番号 $j$ とデータ番号 $i$ の両方に依存する行列 $Label_{j,i}$ になり、その行列要素は $i$ 番データが $j$ 番グループに所属する割合（確率）を示すものと解釈します。

また、Fuzzy c-meansでは曖昧さの程度を表すFuzzy係数 $m (>1.0)$ というパラメータも導入されます。 $m=1$ の極限でK-meansに一致し、1から離れるほど曖昧な所属を許容することになります*3。

Fuzzy c-meansもK-meansと同じで、グループの重心（Centroid）を計算するRefinement stepと、グループ番号への所属割合（Label）を割り当てるAssignment stepとを何度も繰り返して収束させる、反復型のアルゴリズムです。

まず、重心 $\boldsymbol{c}_{j}$ は以下のように計算されます。

$\displaystyle \boldsymbol{c}_{j}=\frac{\displaystyle \sum_{i} Label_{j,i}^{m} \boldsymbol{x}_{i}}{\displaystyle \sum_{i} Label_{j,i}^{m}}$

この式は、所属割合 $Label$ の $m$ 乗で重み付けをして全てのデータ点を平均化した、重み付き重心の計算式になっています*4。

続いて、ある入力データ $\boldsymbol{x}_{i}$ に対してグループ番号 $j$ への所属割合 $Label_{j,i}$ を求めるには、論文中に示されている以下の計算式を用います。

$\displaystyle Label_{j,i}=\biggl\{ \sum_{n=1}^{k} \biggl( \frac{\| \boldsymbol{x}_{i}-\boldsymbol{c}_{j} \|}{\| \boldsymbol{x}_{i}-\boldsymbol{c}_{n} \|} \biggr) ^{2/(m-1)} \biggr\}^{-1}$

以上のステップを交互に繰り返し、以下の損失関数が変化しなくなったら収束したものとみなして終了します。

$\displaystyle Loss=\sum_{j=1}^{k} \sum_{i} Label_{j,i}^{m} \| \boldsymbol{x}_{i}-\boldsymbol{c}_{j} \|^{2}$

Fuzzy c-meansをPythonで実装

Fuzzy c-meansはK-meansにグループ所属の曖昧さを持たせた、正当な拡張アルゴリズムです。そのコードの構造は上述した部分を除いて極めて類似しています。こちらもnumpyライブラリ等を使えばPythonで簡単（？）に実装することができます。

import sys
import time
import numpy as np
from scipy import linalg

#Function for getting c-means clustering
def get_fuzzyCMeans(X, num_clu, max_iter, num_init, m):
    
    #Fuzzy coefficient m must be more than 1
    if m <= 1:
        m = 1 + 1e-5
    
    #Define the length of input
    N = int(X.shape[0])  #The number of data
    p = int(X.shape[1])  #The length of feature axis
    
    #For a progress bar
    unit = int(np.floor(num_init/10))
    bar = "#" + " " * int(np.floor(num_init/unit))
    start = time.time()
    
    #Repeat for each trial (initialization)
    minloss = np.inf
    for init in range(num_init):
        
        #Display a progress bar
        print("\rc-means:[{0}] {1}/{2} Processing..".format(bar, init, num_init), end="")
        if init % unit == 0:
            bar = "#" * int(np.ceil(init/unit)) + " " * int(np.floor((num_init-init)/unit))
            print("\rc-means:[{0}] {1}/{2} Processing..".format(bar, init, num_init), end="")
        
        #Initialize label and centroid as random numbers
        label = np.random.rand(num_clu, N)
        label = label / np.sum(label, axis=0, keepdims=True)
        center = np.random.rand(num_clu, p)
        loss = np.inf
        
        #Repeat for each iteration
        for rep in range(max_iter):
            
            ### Refinement step (update center) ###
            #Calculate cluster centroids by using weights matrix (label)
            center = ((label**m) @ X) / np.sum(label**m, axis=1, keepdims=True)
            
            ### Assignment step (update label) ###
            #Initialize valuable for fuzzy-label
            new_label = np.zeros((num_clu, N))
            
            #Compute fuzzy-label matrix
            for i in range(N):
                for clu in range(num_clu):
                    #Compute norm for numerator
                    numer_dist = np.linalg.norm(X[i, :]-center[clu, :])
                    
                    #Get summation of the ratio between norms
                    for j in range(num_clu):
                        #Compute norm for denominator
                        denom_dist = np.linalg.norm(X[i, :]-center[j, :])
                        if denom_dist < 1e-10:
                            denom_dist = 1e-10
                        new_label[clu, i] = new_label[clu, i] + (numer_dist/denom_dist)**(2/(m-1))
            
            #Avoid zero division
            new_label = np.where(new_label < 1e-10, 1e-10, new_label)
            new_label = new_label**(-1)
            
            #Normalization (it is needed due to error handling)
            new_label = new_label / np.sum(new_label, axis=0, keepdims=True)
            
            #Compute the loss function (generalized mean squares error)
            new_loss = 0
            for i in range(N):
                for j in range(num_clu):
                    #Compute the squared norm as distance
                    dist = linalg.norm(X[i, :]-center[j, :])**2
                    new_loss = new_loss + (new_label[j, i]**m) * dist
            
            #Get out of the loop if loss and label unchange
            if np.abs(loss-new_loss) < 1e-6 and (new_label == label).all():
                #print("The iteration stopped at {}".format(rep+1))
                break
            
            #Update parameters
            label = np.copy(new_label)
            loss = np.copy(new_loss)
            #print("Loss value: {:.3f}".format(loss))
        
        #Output the result corresponding to minimum loss
        if loss < minloss:
            out_label = np.copy(label)
            out_center = np.copy(center)
            minloss = loss
    
    #Finish the progress bar
    bar = "#" * int(np.ceil(num_init/unit))
    print("\rc-means:[{0}] {1}/{2} {3:.2f}sec Completed!".format(bar, init+1, num_init, time.time()-start), end="")
    print()
    
    #Output the label vector and centroid matrix
    return out_label, out_center, minloss

アルゴリズムを理解してコードに落とし込むときにありがちなのですが、割り算をするときに分母が0になることを避ける例外処理が必ず必要になります。前節で示したとおりFuzzy c-meansは $Label_{j,i}$ の計算式に割り算が出てくるので要注意です。

この関数から出力されるlabel_outは、行がグループ番号 $j$ 、列がデータ番号 $i$ の行列なので、もしK-meansのように所属するグループを一意に決定したい場合は、np.argmax(label, axis=0)などを利用して確率が最大になる番号を抜き出しましょう。

実際にデータをグループ分けしてみよう

なんだか最近こればっかりで恐縮なのですが、以下のウェブサイトから入手できる各国のCOVID-19感染者データから、感染者全体に占める死亡率と快復率の2項目を入力データ（p=2次元の特徴量ベクトル）に採用し、K-meansやFuzzy c-meansで国を分類してみます。

www.worldometers.info

このデータをExcelでコピペし、「corona_data.csv」というファイル名でPythonのコードと同じ階層に保存しました。以下のメイン関数からK-meansとFuzzy c-meansの関数を呼び出せば、matplotlibライブラリでグループ分けの結果を視覚化してくれます。

import sys
import time
import matplotlib.pyplot as plt
import numpy as np
from scipy import linalg

#Main
if __name__ == "__main__":
    
    #Setup
    num_clu = 6          #The number of cluster [Default]6
    max_iter = 100       #The number of iteration [Default]100
    num_init = 10        #The number of trial (initialization) [Default]10
    demo = False          #Using demo-data (random numbers) [Default]True
    clu_mode = "kmeans"  #Clustering mode (kmeans or cmeans) [Default]kmeans
    m = 2.0              #Using cshape, specify the fuzzy parameter [Default]2.0 (>1.0)
    
    #Define random seed
    np.random.seed(seed=32)
    
    ### Data preparation step ###
    #Generate demo data by using Gaussian
    if demo == True:
        N = 300
        X = np.concatenate([np.random.multivariate_normal([-2, 0], np.eye(2), round(N/num_clu)),
            np.random.multivariate_normal([3, -3], np.eye(2), round(N/num_clu)),
            np.random.multivariate_normal([4, 3], np.eye(2), round(N/num_clu)),
            np.random.multivariate_normal([-1, -4], np.eye(2), round(N/num_clu)),
            np.random.multivariate_normal([-4, -2], np.eye(2), round(N/num_clu)),
            np.random.multivariate_normal([-2, 4], np.eye(2), round(N/num_clu))])
    
    #Using a data file
    else:
        
        #Read a data source (csv file)
        with open("./corona_data.csv", "r") as f:
            s = f.read().rstrip()
        lines = s.split("\n")
        
        #Define analysis axes
        x_axis, y_axis = 2, 3 #between 2 and 5
        #Analysis_axes = ["Country", "Cases", "Deaths", "Recovered", "Serious", "Tests"]
        Analysis_axes = lines[0].split(",")
        lines = lines[1:]
        print("Horizontal axis: " + str(Analysis_axes[x_axis]) + " / Patients")
        print("Vertical axis: " + str(Analysis_axes[y_axis]) + " / Patients")
        
        #Construct the dataset X and the annotation A
        X, A = [], []
        for i, line in enumerate(lines):
            line = line.split(",")
            #Remove the data if missing
            if line[1]!="" and line[4]!="" and line[x_axis]!="" and line[y_axis]!="":
                x_data = int(line[x_axis]) / int(line[1])
                y_data = int(line[y_axis]) / int(line[1])
                #z_data = int(line[4]) / int(line[1])
                X.append([x_data, y_data])
                A.append(line[0])
        X, A = np.array(X), np.array(A)
    
    #Display graph
    print("Input data shape: {}".format(X.shape))
    plt.rcParams["font.size"] = 16
    plt.figure(figsize=(16, 12))
    plt.scatter(X[:, 0], X[:, 1], c="yellow")
    if demo == True:
        plt.title("Demo-data samples (generated randomly)")
    else:
        #Annotation
        for i, a in enumerate(A):
            plt.annotate(a, xy=(X[i, 0], X[i, 1]), size=10)
        plt.title("COVID-19 data for each country")
        plt.xlabel(Analysis_axes[x_axis] + " / Patients")
        plt.ylabel(Analysis_axes[y_axis] + " / Patients")
    plt.show()
    
    ### Clustering step ###
    #Normalize input data (indispensable for clustering)
    ave = np.mean(X, axis=0, keepdims=True)
    X = X - ave
    std = np.std(X, axis=0, keepdims=True)
    X = X / std
    
    #Call my function for getting either k-means or fuzzy c-means clustering
    if clu_mode == "kmeans":
        #Call my function for getting k-means clustering
        label, center, loss = get_KMeans(X, num_clu, max_iter, num_init)
    
    elif clu_mode == "cmeans":
        #Call my function for getting fuzzy c-maens clustering
        fuzzy_label, center, loss = get_fuzzyCMeans(X, num_clu, max_iter, num_init, m)
        #print("Fuzzy label: {}".format(fuzzy_label))
        
        #Harden the fuzzy label
        label = np.argmax(fuzzy_label, axis=0)
    
    else:
        print("'clu_mode' must be either 'kmeans' or 'cmeans'.")
        sys.exit()
    
    print("Label: {}".format(label))
    #print("Centroid: {}".format(center))
    print("Loss: {}".format(loss))
    
    #Restore the original scale
    X = (X * std) + ave
    center = (center * std) + ave
    
    #Display graph
    plt.rcParams["font.size"] = 16
    plt.figure(figsize=(16, 12))
    plt.scatter(X[:, 0], X[:, 1], c=label)
    plt.scatter(center[:, 0], center[:, 1], s=250, marker='*', c='red')
    if clu_mode == "kmeans":
        plt.title("A clustering result by k-means")
    elif clu_mode == "cmeans":
        plt.title("A clustering result by fuzzy c-means")
    if demo == False:
        #Annotation
        for i, a in enumerate(A):
            plt.annotate(a, xy=(X[i, 0], X[i, 1]), size=10)
        plt.xlabel(Analysis_axes[x_axis] + " / Patients")
        plt.ylabel(Analysis_axes[y_axis] + " / Patients")
    plt.show()

いきなり実データを扱う前に、そもそもK-meansとFuzzy c-meansの実装が上手くいっているかどうか確かめるために、demo=Trueにするとデモデータを乱数で生成するようにしました。いくつか試してちゃんとグループ分けできたので、たぶん大丈夫です。

demo=Falseに設定して実行すると、まずはcorona_data.csvから読み込まれたデータから比率計算等を行い、以下のような散布図グラフが生成されます。

f:id:yuki0718:20200427010550p:plain

matplotlibにはannotationというプロパティで1つ1つのデータに名前を付けることができます。今回はデータが国を表しているのでウェブサイトのとおり英語表記で国名が表示されています。

横軸が感染者数に占める死亡者の割合、縦軸が感染者数に占める快復者の割合です。目安ではありますが、右下に行くほど悲惨な被害状況の国、左上に行くほど軽度な被害状況の国を表します*5。

これに対して、グループ数k=6でK-meansとFuzzy c-meansを実行した結果が以下のとおりです。

f:id:yuki0718:20200427010609p:plain f:id:yuki0718:20200427010621p:plain

グループごとにランダムに異なる色を割り振るようにしました。赤色の星は各グループの重心位置を示しています。このくらいシンプルなデータだとK-meansとFuzzy c-meansとであまり差が出ないようです。

大きな被害が報道されているイタリア、スペイン、スウェーデンといった国々は右下のグループにいます。他方、韓国、台湾、ドイツといった対応がうまくいっている国々は左上のグループにいます。日本は途上国が多く所属する左下のグループです。

今回はグラフで視覚化するために死亡率と快復率という2次元の特徴量ベクトルを入力データにしましたが、他にも重症化率などの情報を付け足してベクトルの次元数を大きくすると、また違ったグループ分けになるかもしれません。

なお、以下のURLから入手できるGIA国際調査によるアンケートにおいて、「自国の政府はコロナウイルスにうまく対処していると思う（単数回答）」の項目を見てみると、興味深いことに上のグループ分けと満足度ランキングが全くリンクしていません。

www.nrc.co.jp

例えば、タイは被害が最も小さいグループに属するにも関わらず、アンケートは最下位で、政府のCOVID-19対応に全く満足していないようです。国民の意識は、実情よりも政府に対する元々の信頼度に依存しているのかもしれません。

以上、教師無し機械学習の基本中の基本であるK-meansとFuzzy c-meansについて、COVID-19の実データを用いてグループ分けをしてみました。やっぱり実際のデータを分析するとアルゴリズムの理解が深まってとても勉強になりますね。

次回は時系列データのグループ分けに特化したK-shapeと呼ばれるクラスタリング手法についてご紹介し、これを音源分離に応用してみたいと思います。

稚文をお読みいただきありがとうございました。

*1:他にも、進捗状況を示すプログレスバーを設置したり、特定グループに所属するデータ数がゼロになってしまった場合のエラー処理を工夫したりして、さらに長くなっています。

*2:引用回数4700超えというスマッシュ・ヒットを達成した必読の論文です。

*3:用途にもよりますが、論文を読んでいると、多くの場合Fuzzy係数は1.25～2.00くらいに設定されるようです。

*4:物理で言うと、剛体力学の回転運動を記述する際に便利な慣性モーメントI（Moment of inertia）の計算に似ています。私の記憶が確かなら、質量を回転軸からの距離の2乗で重み付けして和を取ったはずです。

*5:ただし、快復報告についてはかなり国によって差が出ている（サバ読んでる？）のでちょっと信用できないかもしれません。