PySpark Learning: WordCount with Sorted Output
Environment:
1. A working Spark cluster.
2. A working Python environment. In the python folder under the Spark installation directory, run python setup.py install to install the pyspark library.
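A quick way to confirm the install worked (a minimal check, not part of the original notes) is to import pyspark from the same interpreter and print its version:

import pyspark              # should import without error once setup.py install has finished
print(pyspark.__version__)  # prints the Spark version the library was built from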
3. Code:
import os
from pyspark.context import SparkContext

# os.environ['JAVA_HOME'] = "/usr/local/java/jdk1.8.0_231"

if __name__ == "__main__":
    sc = SparkContext(appName="WordCount")
    lines = sc.textFile("All along the River.txt")
    output = lines.flatMap(lambda line: line.split(' ')) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b) \
        .sortBy(lambda x: x[1], False) \
        .take(50)
    for (word, count) in output:
        print("%s : %i" % (word, count))
    sc.stop()
Notes:
lines.flatMap(lambda line: line.split(' '))  # split each line of text on spaces
.map(lambda word: (word, 1))  # map each word to a count of 1
.reduceByKey(lambda a, b: a+b)  # add up the values that share the same key
.sortBy(lambda x: x[1], False)  # sort by the value (the count), descending
.take(50)  # take the first 50 entries
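To see what each step produces, here is a minimal sketch of the same pipeline run on a small in-memory list instead of a text file (the sample sentences are made up for illustration):

from pyspark.context import SparkContext

sc = SparkContext(appName="WordCountDemo")
# three made-up sentences in place of the input file
lines = sc.parallelize(["the quick brown fox", "the lazy dog", "the fox"])
counts = lines.flatMap(lambda line: line.split(' ')) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
# ascending=False puts the most frequent words first
print(counts.sortBy(lambda x: x[1], False).take(2))  # [('the', 3), ('fox', 2)]
sc.stop()

If only the top N results are needed, counts.takeOrdered(50, key=lambda x: -x[1]) is an equivalent way to get them without sorting the whole RDD.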
Errors:
- Exception: Java gateway process exited before sending its port number
Fix: add this line:
os.environ['JAVA_HOME'] = "/usr/local/java/jdk1.8.0_231"
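The environment variable has to be set before the SparkContext is created, because the Java gateway is launched at that point. A minimal sketch of the placement (the JDK path is the one from these notes; adjust it to the JDK on your machine):

import os
from pyspark.context import SparkContext

# set JAVA_HOME before any Spark object is created, so the gateway can find the JVM
os.environ['JAVA_HOME'] = "/usr/local/java/jdk1.8.0_231"

sc = SparkContext(appName="WordCount")  # the Java gateway process is started here
# ... rest of the job ...
sc.stop()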