2015-09-26

Pythonで実装 PRML 第2章確率分布ノンパラメトリック手法

PRML Python 機械学習

パターン認識と機械学習、第2章からは「2.5.2 ノンパラメトリック手法」を実装します。

取り出してきたデータの数を数え上げて、確率分布を把握する手法が紹介されてます。
この章はあんまり時間かけてもしょうが無いなと感じたので、さくっと実装。
カーネル密度法もスクラッチせず、scipyのパッケージ使ってズルしました。

ヒストグラム・コード

import numpy as np
import matplotlib.pyplot as plt
from pylab import *
from scipy import stats
from scipy.stats.kde import gaussian_kde
import random

def mix_G(x):
    return (0.4 * G1 + 0.6 * G2)

def mix_G_distribution(n):
  ratio = 0.3
  if random.random() <ratio:
    return random.gauss(M1, S1)
  else:
    return random.gauss(M2, S2)

if __name__ == "__main__":
    x = np.linspace(0, 1, 100)

    # Set normal distribution1 
    M1 = 0.3
    S1 = 0.15
    G1 = stats.norm.pdf(x, M1, S1)

    # Set normal distribution1 
    M2 = 0.75
    S2 = 0.1
    G2 = stats.norm.pdf(x, M2, S2)

    N = 50
    Data = [mix_G_distribution(n) for n in range(N)]


    plt.subplot(3, 1, 1)
    plt.hist(Data, bins=1/0.04, normed=True)
    plt.plot(x, mix_G(x), "g-")
    plt.xlim(0, 1)
    plt.ylim(0, 3)
    title("Figure 2.24")

    plt.subplot(3, 1, 2)
    plt.hist(Data, bins=1/0.08, normed=True)
    plt.plot(x, mix_G(x), "g-")
    plt.xlim(0, 1)
    plt.ylim(0, 3)

    plt.subplot(3, 1, 3)
    plt.hist(Data, bins=1/0.25, normed=True)
    plt.plot(x, mix_G(x), "g-")
    plt.xlim(0, 1)
    plt.ylim(0, 3)

結果

f:id:tkzs:20150918181214p:plain

カーネル密度法・コード

if __name__ == "__main__":
    #Karnel density estimation
    from scipy.stats.kde import gaussian_kde

    plt.subplot(3, 1, 1)
    plt.plot(x, gaussian_kde(Data, 0.005)(x))
    plt.plot(x, mix_G(x), "g-")
    plt.xlim(0, 1)
    plt.ylim(0, 3)
    title("Figure 2.25")

    plt.subplot(3, 1, 2)
    plt.plot(x, gaussian_kde(Data, 0.07)(x))
    plt.plot(x, mix_G(x), "g-")
    plt.xlim(0, 1)
    plt.ylim(0, 3)

    plt.subplot(3, 1, 3)
    plt.plot(x, gaussian_kde(Data, 0.2)(x))
    plt.plot(x, mix_G(x), "g-")
    plt.xlim(0, 1)
    plt.ylim(0, 3)

結果

f:id:tkzs:20150918181240p:plain

K近傍法・コード

#k_Neighbourhood
def k_NN(test, train, k):
    train = np.array(train)
    train.sort()
    r = []
    for i in test:
        distance = abs(train - i)
        distance.sort()
        r.append(distance[(k-1)])
    r = np.array(r)
    return k / (2* r * N)



if __name__ == "__main__":
    title("Figure 2.26")
    plt.subplot(3, 1, 1)
    plt.plot(x, k_NN(x, Data, 1))
    plt.plot(x, mix_G(x), "g-")
    plt.xlim(0, 1)
    plt.ylim(0, 3)

    plt.subplot(3, 1, 2)
    plt.plot(x, k_NN(x, Data, 10))
    plt.plot(x, mix_G(x), "g-")
    plt.xlim(0, 1)
    plt.ylim(0, 3)

    plt.subplot(3, 1, 3)
    plt.plot(x, k_NN(x, Data, 30))
    plt.plot(x, mix_G(x), "g-")
    plt.xlim(0, 1)
    plt.ylim(0, 3)

結果

f:id:tkzs:20150918181505p:plain

2015-09-25

Pythonで実装 PRML 第1章ベイズ推定

Python 機械学習 PRML

前回に続いて、PRML1章から「1.2.6 ベイズ推定」の実装です。
図1.17の再現ですが、M=9とした上で、予測分布の平均と±1σの範囲を示しています。

数式追ってるだけだと抽象的で、いまいちピンと来なかったんですが、
実装することで、(1.70)、(1.71)が分布を示すことが具体的に可視化されて腹に落ちました。
この予測精度が、前回の多項式曲線フィッティングとどう異なるのかは今回の例からは判断できませんが、
それぞれどのようなものかを理解する上では大変役立ちました。

実装の大まかな流れ

①求めたいのは予測分布(1.69)。
$p(t| x, {\bf x}, {\bf t}) = N(t| m(x), s^2(x)) (1.69)$

②今回、基底関数は $\phi_i(x) = x^i$ for $i = 0,...M$

③予測分布の平均、分散を求める式は下記の3つなので、これをそれぞれ実装して、予測範囲 [tex:0.0

コード

import numpy as np
from numpy.linalg import inv
import pandas as pd
from pylab import *
import matplotlib.pyplot as plt

#From p31 the auther define phi as following
def phi(x):
    return np.array([x**i for i in xrange(M+1)]).reshape((M+1, 1))

#(1.70) Mean of predictive distribution
def m(x, x_train, y_train, S):
    sum = np.array(zeros((M+1, 1)))
    for n in xrange(len(x_train)):
        sum += np.dot(phi(x_train[n]), y_train[n])
    return Beta * phi(x).T.dot(S).dot(sum)

    
#(1.71) Variance of predictive distribution   
def s2(x, S):
    return 1.0/Beta + phi(x).T.dot(S).dot(phi(x))

#(1.72)
def S(x_train, y_train):
    I = np.identity(M+1)
    Sigma = np.zeros((M+1, M+1))
    for n in range(len(x_train)):
        Sigma += np.dot(phi(x_train[n]), phi(x_train[n]).T)
    S_inv = alpha*I + Beta*Sigma
    S = inv(S_inv)
    return S

if __name__ == "__main__":
    alpha = 0.005
    Beta = 11.1
    M = 9
    
    #Sine curve
    x_real = np.arange(0, 1, 0.01)
    y_real = np.sin(2*np.pi*x_real)
    
    ##Training Data
    N=10
    x_train = np.linspace(0, 1, 10)
    
    #Set "small level of random noise having a Gaussian distribution"
    loc = 0
    scale = 0.3
    y_train =  np.sin(2*np.pi*x_train) + np.random.normal(loc,scale,N)
    
    
    S = S(x_train, y_train)
    
    #Seek predictive distribution corespponding to entire x
    mean = [m(x, x_train, y_train, S)[0,0] for x in x_real]
    variance = [s2(x, S)[0,0] for x in x_real]
    SD = np.sqrt(variance)
    upper = mean + SD
    lower = mean - SD
    
    
    plot(x_train, y_train, 'bo')
    plot(x_real, y_real, 'g-')
    plot(x_real, mean, 'r-')
    fill_between(x_real, upper, lower, color='pink')
    xlim(0.0, 1.0)
    ylim(-2, 2)
    title("Figure 1.17")

結果

f:id:tkzs:20150918151632p:plain

2015-09-18

Pythonで実装 PRML 第1章多項式曲線フィッティング

PRML Python 機械学習

パターン認識と機械学習、第1章から「多項式多項式曲線フィッティング」の図の再現に取り組んでみます。
早速図1.4の再現から行ってみたいと思います。また、データの数を増やすことで、予測の正確性が増すことの確認を図1.6で行っています。

数学的にも実装上もまだまだそんなに難しくないので、淡々と実装してみました。
とりあえず、最初の第一歩。

実装の大まかな流れ

①(1.1)を実装。
$y(x, {\bf w}) = w_0 + w_1 x + w_1 x^2 + ...+ w_M x^M = \sum_{j=1}^M w_j x^M (1.1)$

②(1.122)と(1.123)を満たす値を求めると二乗誤差関数(1.2)を最小化するパラメータ ${\bf w}$ の値が求まる。

$E({\bf w}) = \frac{1}{2} \sum_{n=1}^N \{{y({x_n, \bf x_n}) -t_n}^2 \}(1.2)$

$\sum_{j=0}^M {\bf A}_{ij} w_j = {\bf T}_i(1.122)$

${\bf A}_i = \sum_{n=1}^N (x_n)^{i+j} (1.123)$

${\bf T}_i = \sum_{n=1}^N (x_n)^i t_n (1.123)$

コード

import numpy as np
import pandas as pd
from pylab import *
import matplotlib.pyplot as plt
#(1.1)
def y(x, W, M):
    Y = np.array([W[i] * (x ** i) for i in xrange(M+1)])
    return Y.sum()

#(1.2),(1.122),(1.123)
def E(x, t, M):
    A =np.zeros((M+1, M+1))
    for i in range(M+1):
        for j in range(M+1):
            A[i,j] = (x**(i+j)).sum()
        
    T = np.array([((x**i)*t).sum() for i in xrange(M+1)])
    return  np.linalg.solve(A, T)

if __name__ == "__main__":
    #Sine curve
    x_real = np.arange(0, 1, 0.01)
    y_real = np.sin(2*np.pi*x_real)
    
    ##Training Data
    N=10
    x_train = np.arange(0, 1, 0.1)
    
    #Set "small level of random noise having a Gaussian distribution"
    loc = 0
    scale = 0.3
    y_train =  np.sin(2*np.pi*x_train) + np.random.normal(loc,scale,N)
    
    for M in [0,1,3,9]:
        W = E(x_train, y_train, M)
        print W
    
        y_estimate = [y(x, W, M) for x in x_real]
    

        plt.plot(x_real, y_estimate, 'r-')
        plt.plot(x_train, y_train, 'bo') 
        plt.plot(x_real, y_real, 'g-')
        xlim(0.0, 1.0)
        ylim(-2, 2)
        title("Figure 1.4")

結果

f:id:tkzs:20150918132529p:plain

コード

if __name__ == "__main__":
    M2 = 9

    N2 = 20
    x_train2 = np.arange(0, 1, 0.05)
    y_train2 =  np.sin(2*np.pi*x_train2) + np.random.normal(loc,scale,N2)

    N3 = 100
    x_train3 = np.arange(0,1, 0.01)
    y_train3 =  np.sin(2*np.pi*x_train3) + np.random.normal(loc,scale,N3)


    W2 = E(x_train2, y_train2, M2)
    W3 = E(x_train3, y_train3, M2)
    
    y_estimate2 = [y(x, W2, M2) for x in x_real]
    y_estimate3 = [y(x, W3, M2) for x in x_real]

    plt.subplot(1, 2, 1)
    plt.plot(x_real, y_estimate2, 'r-')
    plt.plot(x_train2, y_train2, 'bo') 
    plt.plot(x_real, y_real, 'g-')
    xlim(0.0, 1.0)
    ylim(-2, 2)
    title("Figure 1.6 left")

    plt.subplot(1, 2, 2)
    plt.plot(x_real, y_estimate3, 'r-')
    plt.plot(x_train3, y_train3, 'bo') 
    plt.plot(x_real, y_real, 'g-')
    xlim(0.0, 1.0)
    ylim(-2, 2)
    title("Figure 1.6 right")

結果

f:id:tkzs:20150918132703p:plain

2015-09-14

PuLPを使った線形計画法(Linear Programming)

Python 機械学習統計

カーネルやら何やらを理解するために、非線形計画法を勉強中。
で、その前に線形計画法で実際に手を動かしたくなったので無償で使えるPuLPを使ってみた。

今回例題として解いたのは「これならわかる最適化数学」の第6章、線形計画法。
似たような例題がいくつか並ぶので、ピックアップしてPuLPでモデル化してみた。

【例題6.2】(P161)

2種類の容器A,Bを作るのに, 機械M1,M2を使う. 容器Aを1個作るのにM1を2分, 機械M2を4分使う必要がある.一方,容器Bを1個作るのに、機械M1を8分,機械M2を4分使う必要がある. 容器A,Bを作る利益は一つあたりそれぞれ29円,45円である. 利益を最大にするにはどのように計画すれば良いか.

利益を目的関数f、容器の個数をそれぞれx,yを問いてモデル化すると式は下記のようになる。
制約条件:
2x + 8y <= 60
4x + 4y <= 60
x >= 0, y >= 0

目的関数：
f = 29x + 45y → max

で、これらの式をそのまんまPuLPに突っ込んで解かせるコードが下記。
今回は容器の個数を求めるので、解は整数であり、変数はIntegerで指定する。

#Creat variables
x = pulp.LpVariable("x", cat = "Integer")
y = pulp.LpVariable("y", cat = "Integer")

#Create object function
problem = pulp.LpProblem("Container", pulp.LpMaximize)
problem += 29*x + 45*y  #(6.6)

#constrained condition
problem += 2*x + 8*y <= 60 #(6.4)
problem += 4*x + 4*y <= 60 #(6.4)
problem += x >= 0 #(6.5)
problem += y >= 0 #(6.5)

status = problem.solve()
print "Status", pulp.LpStatus[status]

print problem

print "Result"
print "x", x.value()
print "y", y.value()

結果は、

Status Optimal
Container:
MAXIMIZE
29*x + 45*y + 0
SUBJECT TO
_C1: 2 x + 8 y <= 60
_C2: 4 x + 4 y <= 60
_C3: x >= 0
_C4: y >= 0

VARIABLES
x free Integer
y free Integer

Result
x 10.0
y 5.0

【例題6.3】(P161)

2種類の金属M1,M2を用いて2種類の合金A,Bを作る. 合金A,Bはそれぞれ1トン当たり30,000円、25,000円の利益がある。合金Aは金属M1,M2を1:1の割合で混合し、合金Bは金属M1,M2を1:3の割合で混合する。金属M1,M2はそれぞれ1日当たり10トン,15トンの供給が可能である. 利益を最大にするのに合金A,Bを作るにはどのように計画すればよいか.

モデル化すると、
制約条件:
0.5x + 0.25y <= 10
0.5x + 0.75y <= 15
x >= 0, y >= 0

目的関数：
f = 30x + 25y → max

今回はコード化にあたり、解は連続値であるので、変数はContinuousで指定する。

#Creat variables
x = pulp.LpVariable("x", cat = "Continuous")
y = pulp.LpVariable("y", cat = "Continuous")

#Create object function
problem = pulp.LpProblem("Alloy", pulp.LpMaximize)
problem += 30*x + 25*y  #(6.9)

#constrained condition
problem += 0.5*x + 0.25*y <= 10 #(6.7)
problem += 0.5*x + 0.75*y <= 15 #(6.7)
problem += x >= 0 #(6.8)
problem += y >= 0 #(6.8)

status = problem.solve()
print "Status", pulp.LpStatus[status]

print problem

print "Result"
print "x", x.value()
print "y", y.value()

結果は、

Status Optimal
Alloy:
MAXIMIZE
30*x + 25*y + 0
SUBJECT TO
_C1: 0.5 x + 0.25 y <= 10
_C2: 0.5 x + 0.75 y <= 15
_C3: x >= 0
_C4: y >= 0

VARIABLES
x free Continuous
y free Continuous

Result
x 15.0
y 10.0

【例題6.7】(P164)

今回の例のように目的関数が発散して、最適解が存在しない場合は下記のように、"Status Unbounded"と表示される。

制約条件:

x - y <= -1
2x + y <= 1

x - 2y <= 1
x >= 0, y >= 0

目的関数：
f = x + y → max

#Creat variables
a = pulp.LpVariable("a", cat = "Continuous")
b = pulp.LpVariable("b", cat = "Continuous")

#Create object function
problem = pulp.LpProblem("Test", pulp.LpMaximize)
problem += a + b  #(6.16)

#constrained condition
problem += -a - b <= -2 #(6.14)
problem += -2*a + b <= 15 #(6.14)
problem += a - 2*b <= 15 #(6.14)
problem += a >= 0 #(6.15)
problem += b >= 0 #(6.15)

status = problem.solve()
print "Status", pulp.LpStatus[status]

print problem

Status Unbounded

2015-07-29

ハッシュとカーネル

Spark 機械学習

最近Spark使ってて特徴量減らしたいよねってなった時に、
Future Hashingを漠然と使ってたけど、気になったかた勉強。

Hash Trickの前にKernel Trick

どっか違う空間に写像するってことで似てるけど、若干用途が違うカーネルトリックから。
線形空間では分離できないデータを高次元特徴空間に写像して分離可能にしたいと。けど、次元が増えると計算大変。
じゃあ、高次元に写像した時と同じような結果を導き出せる関数使って、楽に計算しようってのがカーネルトリック。

入力空間 $R^m$ から特徴空間 $F$ への写像 $\Phi$ を、
$\Phi:R^m → F$ , $x → K(・,x)$
とすると、
$\langle \Phi (x), \Phi (y)\rangle = \langle K(・,x), K(・,y)\rangle = K(x,y)$
となって、特徴空間 $F$ への写像した後の内積がカーネル関数 $K()$ を使って入力空間 $R^m$ で楽に計算できる。
このカーネル関数を上手いこと選びましょうってのはまた別の話で。

Hash Trick

で、本題のハッシュトリック。
カーネル関数が線形分離可能にするために高次元に写像することを目指したのに対して、ハッシュ関数はその特徴行列の性質を損なわずに次元を小さくするために使われます。
次元が大きい疎ベクトルを上手いことぎゅっと凝縮して比較的密なベクトルに変換できれば、計算のためのリソースが少なくて済むよねってのがアイデア。

じゃあどうやるかっていうと、7つの特徴カテゴリがある下記のようなデータセットがあるとします。

A1 =[‘mouse’,’black’,-]
A2 =[‘cat’,’tabby’,’mouse’]

で、でもってインデックスの数を4とし、
ハッシュ関数を

H(Animal,’mouse’)=3
H(Color,’black’)=2
H(Animal,’cat’)=0
H(Color,’tabby’)=0
H(Diet,’mouse’)=2

と定義しておくと、
A1は、ハッシュ値1が0個、2が0個、3が1個、4が1個、
同様にA2はハッシュ値1が2個、2が0個、3が1個、4が0個となり、

A1=[0,0,1,1]
A2=[2,0,1,0]

と変換することができます。
ハッシュ値の衝突を避けるために、特徴値の符号を変えたりなどなど工夫がされます。

2015-07-22

PysparkのMllibでリッジ回帰

Python 機械学習 Spark 分散処理

作ったPysparkの環境でMap関数の練習リッジ回帰やってみました。

元ネタはedXのこの講義↓
Scalable Machine Learning | edX

データはMillionSongDatasetから取得できます。

で、結果から先に書きますと、Default値含めてGridsearchしたにも関わらず、
Default値実施した回帰の方が精度が高くなるという鼻血が出そうな結果になってしまってます。

まずはもろもろimport。

#library import
from pyspark.mllib.regression import LinearRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
from test_helper import Test
import numpy as np

#data load
rawData = sc.textFile('millionsong.txt')

リッジ回帰(デフォルト）

使ったのはLinearRegressionWithSGD。

samplePoints = rawData.take(2)


#Create LabeledPoint
def parsePoint(line):

    tokens = line.split(',')
    label = tokens[0]
    features = tokens[1:]
    return LabeledPoint(label, features)

parsedData = rawData.map(parsePoint)
print parsedData.take(3)

[LabeledPoint(2001.0, [0.884123733793,0.610454259079,0.600498416968,0.474669212493,0.247232680947,0.357306088914,0.344136412234,0.339641227335,0.600858840135,0.425704689024,0.60491501652,0.419193351817]), LabeledPoint(2001.0, [0.854411946129,0.604124786151,0.593634078776,0.495885413963,0.266307830936,0.261472105188,0.506387076327,0.464453565511,0.665798573683,0.542968988766,0.58044428577,0.445219373624]), LabeledPoint(2001.0, [0.908982970575,0.632063159227,0.557428975183,0.498263761394,0.276396052336,0.312809861625,0.448530069406,0.448674249968,0.649791323916,0.489868662682,0.591908113534,0.4500023818])]

次にデータをTraning、Validation,Testに分割。

#Devide rawData into Traning, Validation and Test
weights = [.8, .1, .1]
seed = 50
parsedTrainData, parsedValData, parsedTestData = parsedData.randomSplit(weights, seed)

# Fit the model with default values
fitModel = LinearRegressionWithSGD.train(parsedTrainData)
print  fitModel

(weights=[348.391703677,312.158507888,303.514705245,268.768326368,260.265535915,321.082923267,345.636059404,369.96940298,414.587178279,328.611497772,389.972179858,348.42792115], intercept=0.0)

このモデルを元に、実測値と予測値の比較。
誤差は約2年。

# Prediction 
testPoint = parsedTrainData.take(1)[0]
print testPoint.label

testPrediction = fitModel.predict(testPoint.features)
print testPrediction

2001.0
2003.04838193

次に2乗平均平方根誤差(RMSE)を使ったモデル評価を実施。
さらに、RMSEを評価関数としてグリッドサーチでハイパーパラメータを探していきます。

# Define a formula to caluculate RMSE(Root Mean Square Error)
def squaredError(label, prediction):
    ##Calculates the the squared error for a single prediction.
    return (label - prediction)**2

def calcRMSE(labelsAndPreds):
    ##Calculates the root mean squared error for an `RDD` of (label, prediction) tuples.
    return np.sqrt(labelsAndPreds
                   .map(lambda (label, prediction): squaredError(label, prediction))
                   .mean())

#Create new RDD with actual label and predicted label 
labelsAndPreds = parsedValData.map(lambda lp: (lp.label, fitModel.predict(lp.features)))

#Calculation RMSE
rmseValLR1 = calcRMSE(labelsAndPreds)
print rmseValLR1

126.788570325

リッジ回帰(グリッドサーチ編）

デフォルト値で実施したFittingではRMSEは126程度。
でもって、次にグリッドサーチでバリっとハイパーパラメータを探してみます。

##Grid search
# Values to use when training the linear regression model
miniBatchFrac = 1.0  # miniBatchFraction
regType = 'l2'  # regType
useIntercept = True  # intercept

# Seed of minmum RMSE
min_rmseVal = 10**10

#Fix HyperParameters
modelRMSEs = []

for grid_alpha in [1e-5,1.0,10,100]:
    for grid_numIters in [50,100,500]:
        for grid_reg in [1e-5,0.0, 1.0]:
            model = LinearRegressionWithSGD.train(parsedTrainData,
                                                  iterations=grid_numIters, 
                                                  step=grid_alpha,
                                                  regParam=grid_reg)
            
            labelsAndPreds = parsedValData.map(lambda lp: (lp.label, model.predict(lp.features)))
            rmseVal = calcRMSE(labelsAndPreds)
            
            
            
            if rmseVal < min_rmseVal:
                min_rmseVal = rmseVal
                best_alpha = grid_alpha
                best_numIter =grid_numIters
                best_reg = grid_reg
            
print "best_alpha:{},best_numIter:{}, best_reg:{}, best_rmseVal:{}".format(best_alpha, best_numIter, best_reg, min_rmseVal)
best_alpha:1.0,best_numIter:500, best_reg:1e-05, best_rmseVal:117.120806943

こちらはRMSEが117程度と若干下がって、精度が上がっていることが期待出来そう。
で、早速Fittingして、最初のモデルと同様に実測値と予測値の比較。

#Fitting with HyperParameters fixed by grid search 
Final_model = LinearRegressionWithSGD.train(parsedTrainData,
                                           iterations=best_numIter, 
                                           step=best_alpha, 
                                           regParam =best_reg)

#Labels comparison between Final_model and actual data. 
Final_testPoint = parsedTrainData.take(1)[0]
print Final_testPoint.label

Final_Prediction = Final_model.predict(Final_testPoint.features)
print Final_Prediction

2001.0
1969.7210425

誤差30年近くと大幅に精度が悪くなってますね。なんでやろ。
ちょっとすぐには原因わからないので、やり直したら書き直します。

2015-07-17

Pysparkでテキストマイニング第一歩 on Amazon EMR with Jupyter notebook

SparkをAmazon EMRに乗せてみたので備忘録的に。

といってもめちゃくちゃ簡単だった。

他にはローカルでSpark立ててしまう、AmazonEC2使うとか選択肢あったけど、
価格見ても高くないし、いつか本格的に使うことを想定して、EMRでの予行演習的な位置づけ。

下記ブログの記事を参考にすれば、
Pyspark＋Jupyter Notebook＋AmazonEMRの環境はほとんど何も悩まずに構築可能かと。

<a href="http://estrellita.hatenablog.com/entry/2015/06/26/213927">Spark on EMR 始めました & やめました - INPUTしたらOUTPUT!</a>estrellita.hatenablog.com

もう少し細かく設定したいなどあれば、Amazon EMR Best Practiceを参考に。

で、せっかくなので、PysparkでHelloWorldな感じでWordsCountやってみました。

Words Count

どのチュートリアルもとりあえずWordsCount的なノリだったので、簡単に手を動かしてみました。
今回テキストに使ったのは、シェークスピアのマクベス。
全くもって読んだことないけど、ちょうど良い感じにテキストが公開されていたので採用。
Cacheの"u"をなんとか消したかったけど、いまいちわからん。
結果に影響ないからとりあえず放置。

他の作品も↓ここからダウンロード可能。
Open Source Shakespeare: search Shakespeare's works, read the texts

とりあえず、どんな感じか見てみる。

#Read text
lines = sc.textFile('Macbeth.txt')
lines.take(10)

[u'[Thunder and lightning. Enter three Witches]',
 u'',
 u'    First Witch. When shall we three meet again',
 u'    In thunder, lightning, or in rain? ',
 u'',
 u"    Second Witch. When the hurlyburly's done,",
 u"    When the battle's lost and won. 5",
 u'',
 u'    Third Witch. That will be ere the set of sun. ',
 u'']

#Count Lines
lines.count()
3485

空白行を削除して改めて。

#Count lines after remove empty lines
lines_nonempty = lines.filter( lambda x: len(x) > 0 )
lines_nonempty.count()
2666

文字数えるのに記号が邪魔なのでスペースに変換。
その後、単語ごとに分割。

#Replace symbols to space
a1=lines.map(lambda x: x.replace(',',' ').replace('.',' ').replace('-',' ').replace('?',' ').replace('[',' ').replace(']',' ').lower())


#Devide lines to words
a2 = a1.flatMap(lambda x: x.split())
a2.take(10)
[u'thunder',
 u'and',
 u'lightning',
 u'enter',
 u'three',
 u'witches',
 u'first',
 u'witch',
 u'when',
 u'shall']

数え上げるために、それぞれの単語にフラグ立てて、タプルに。
で、各単語ごとに合計計算してあげる。
さら、KeyとValueを入れ替え。

#Create tuple which has unique words
a3 = a2.map(lambda x: (x, 1))

#Sum of each words
a4 = a3.reduceByKey(lambda x,y: x+y)

#Swap the order
a5 = a4.map(lambda x:(x[1],x[0]))
a5.take(10)

[(1, u'limited'),
 (4, u'glamis'),
 (1, u'pardon'),
 (4, u'child'),
 (99, u'all'),
 (5, u'foul'),
 (1, u'sleek'),
 (50, u'hath'),
 (2, u'protest'),
 (1, u'weal;')]

最後に昇順でSortかけました。

#Sort by the number of words
a6 = a5.sortByKey(ascending=False)
a6.take(10)

[(731, u'the'),
 (565, u'and'),
 (398, u'to'),
 (342, u'of'),
 (314, u'i'),
 (267, u'macbeth'),
 (250, u'a'),
 (225, u'that'),
 (205, u'in'),
 (192, u'my')]

せっかくSpark環境できたので、今後もうちょい複雑なことやっていきます。