First Steps in Text Mining with PySpark on Amazon EMR with Jupyter Notebook

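I got Spark running on Amazon EMR, so here are some notes for future reference.

That said, it was really easy.

I could also have run Spark locally or used Amazon EC2,
but EMR's pricing is not high, and since I expect to use it seriously someday, I treated this as a dry run on EMR.

If you follow the blog post below,
you can set up a PySpark + Jupyter Notebook + Amazon EMR environment with almost no trouble.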

<a href="http://estrellita.hatenablog.com/entry/2015/06/26/213927">Spark on EMR 始めました & やめました - INPUTしたらOUTPUT!</a>

If you want finer-grained configuration, see Amazon EMR Best Practices.

And since I had the environment anyway, I tried a Hello World style word count in PySpark.

Word Count

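Every tutorial seems to start with a word count, so I gave it a quick try myself.
The text I used is Shakespeare's Macbeth.
I have never actually read it, but the text was publicly available in a convenient form, so I went with it.
I wanted to get rid of the "u" prefix in the output (Python 2's marker for unicode strings), but I couldn't figure out how.
It doesn't affect the results, so I'm leaving it for now.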

Other works can be downloaded from here:
Open Source Shakespeare: search Shakespeare's works, read the texts

First, let's see what the text looks like.

#Read text
lines = sc.textFile('Macbeth.txt')
lines.take(10)

[u'[Thunder and lightning. Enter three Witches]',
 u'',
 u'    First Witch. When shall we three meet again',
 u'    In thunder, lightning, or in rain? ',
 u'',
 u"    Second Witch. When the hurlyburly's done,",
 u"    When the battle's lost and won. 5",
 u'',
 u'    Third Witch. That will be ere the set of sun. ',
 u'']
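
The u in front of each string above is only how Python 2 displays unicode strings inside a list. Printing the text itself, rather than the list, shows it without the prefix. A minimal sketch in plain Python, with a few lines copied from the output above (the variable names are only for illustration; on the cluster the list would come from lines.take(10)):

```python
# Sample copied from the take(10) output above.
# On the cluster this list would be the result of lines.take(10).
sample = [
    u'[Thunder and lightning. Enter three Witches]',
    u'',
    u'    First Witch. When shall we three meet again',
]

# Joining the elements (or printing them one by one) shows the text itself
# instead of the list representation, so no u'...' wrapper appears.
shown = u'\n'.join(sample)
print(shown)
```

This only changes how the lines are displayed; the RDD contents are the same either way.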

#Count Lines
lines.count()
3485

Remove the empty lines and count again.

#Count lines after removing empty lines
lines_nonempty = lines.filter( lambda x: len(x) > 0 )
lines_nonempty.count()
2666
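
One caveat: len(x) > 0 keeps lines that consist only of spaces. Whether Macbeth.txt contains such lines was not checked here, so the count may or may not change, but stripping before testing is one way to be safe. A small sketch, where has_text is a hypothetical helper that is not part of the original code:

```python
def has_text(line):
    """Return True if the line contains anything other than whitespace."""
    return len(line.strip()) > 0

# Tiny local check: a normal line, an empty line, and a whitespace-only line.
sample = ['First Witch. When shall we three meet again', '', '   ']
print([line for line in sample if has_text(line)])
```

With Spark, the same function could be passed directly, as in lines.filter(has_text).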

Punctuation gets in the way when counting words, so replace it with spaces.
Then split each line into words.

#Replace symbols with spaces and lowercase (starting from the non-empty lines)
a1 = lines_nonempty.map(lambda x: x.replace(',',' ').replace('.',' ').replace('-',' ').replace('?',' ').replace('[',' ').replace(']',' ').lower())


#Split lines into words
a2 = a1.flatMap(lambda x: x.split())
a2.take(10)
[u'thunder',
 u'and',
 u'lightning',
 u'enter',
 u'three',
 u'witches',
 u'first',
 u'witch',
 u'when',
 u'shall']
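
The replace chain only covers the six symbols listed, so other characters stay attached to words (the token 'weal;' in the swapped output further down is one example). One alternative is a regular expression that keeps only letters and apostrophes. This is a sketch under assumptions, not the method used in this post: tokenize and TOKEN_RE are hypothetical names, and because it also drops digits such as the line number "5" in the text, the counts would differ from the ones shown here.

```python
import re

# Hypothetical helper (not from the original post): a token is a run of
# lowercase letters and apostrophes; everything else acts as a separator.
TOKEN_RE = re.compile(r"[a-z']+")

def tokenize(line):
    """Lowercase a line and return its word tokens as a list."""
    return TOKEN_RE.findall(line.lower())

print(tokenize("    Second Witch. When the hurlyburly's done,"))
print(tokenize("weal;"))
```

With Spark this would replace both steps at once, as in lines.flatMap(tokenize).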

To count them, pair each word with a 1 to make a tuple.
Then sum the values for each word.
After that, swap the key and the value.

#Pair each word with a count of 1
a3 = a2.map(lambda x: (x, 1))

#Sum the counts for each word
a4 = a3.reduceByKey(lambda x,y: x+y)

#Swap the order to (count, word)
a5 = a4.map(lambda x:(x[1],x[0]))
a5.take(10)

[(1, u'limited'),
 (4, u'glamis'),
 (1, u'pardon'),
 (4, u'child'),
 (99, u'all'),
 (5, u'foul'),
 (1, u'sleek'),
 (50, u'hath'),
 (2, u'protest'),
 (1, u'weal;')]

Finally, sort in descending order of count.

#Sort by count, descending
a6 = a5.sortByKey(ascending=False)
a6.take(10)

[(731, u'the'),
 (565, u'and'),
 (398, u'to'),
 (342, u'of'),
 (314, u'i'),
 (267, u'macbeth'),
 (250, u'a'),
 (225, u'that'),
 (205, u'in'),
 (192, u'my')]
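
To see what the pairing, summing, swapping, and sorting steps compute, the same logic can be traced in plain Python on a tiny word list, with no Spark needed. This is only an illustration under assumptions: the word list is made up, and the dictionary loop stands in for what reduceByKey does across the cluster.

```python
from collections import defaultdict

# Made-up input standing in for the words RDD (a2).
words = ["when", "shall", "we", "three", "meet", "when", "the", "three"]

# Step 1: pair each word with 1, like a2.map(lambda x: (x, 1)).
pairs = [(w, 1) for w in words]

# Step 2: add up the values per key, like reduceByKey(lambda x, y: x + y).
totals = defaultdict(int)
for word, n in pairs:
    totals[word] += n

# Step 3: swap to (count, word), like a4.map(lambda x: (x[1], x[0])).
swapped = [(n, word) for word, n in totals.items()]

# Step 4: sort by count, largest first, like sortByKey(ascending=False).
# Unlike sortByKey, sorted() also breaks ties by the word itself.
ranked = sorted(swapped, reverse=True)
print(ranked[:3])
```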

Now that I have a Spark environment set up, I'll move on to somewhat more complex things.