1. Use case
As the business and its data volume grow, the data in a database falls roughly into two categories: operational data and analytical data. Operational data is closely tied to day-to-day business and supports inserts, deletes, updates, and queries, whereas analytical data is usually historical data used for statistical analysis and is only queried, never modified. Moreover, analytical data sometimes has to be derived from the business data through data cleaning. The analytical data can therefore be loaded into the Hive data warehouse, and Spark can then fetch it from Hive on a schedule for analysis. Take city air-quality prediction as an example: air-monitoring points are scattered across the city and upload their readings to the platform at fixed intervals. To predict the city's air quality, the hourly readings of all monitoring points are periodically averaged and stored in Hive, and Spark then periodically pulls the data back out of Hive for prediction.
2. Writing to Hive from Spark
Spark can write into a Hive table in two ways: by calling DF.write.saveAsTable, or by calling hiveContext.sql to insert the data through SQL. First, Spark reads the raw data from the database and cleans it, computing the average over all monitoring points in the city.
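The snippets below assume a SparkSession created with Hive support and with the MongoDB Spark connector on the classpath; the connection strings shown here are placeholders standing in for the real MONITOR_POINT_INFO_URL and INPUT_URL. A minimal setup sketch:

from pyspark.sql import SparkSession

# Session with Hive support; assumes the MongoDB Spark connector jar is available,
# e.g. submitted with --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
spark = SparkSession.builder \
    .appName("airQualityETL") \
    .enableHiveSupport() \
    .getOrCreate()

# Hypothetical MongoDB URIs; replace with the real collection addresses.
MONITOR_POINT_INFO_URL = "mongodb://localhost:27017/air.monitorPoints"
INPUT_URL = "mongodb://localhost:27017/air.airQualityData"

With that session in place, the cleaning code is as follows: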
# Collect the IDs of all monitoring points in the target city.
# matchCity is a MongoDB aggregation pipeline (defined elsewhere) that filters by city.
mpInfoList = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("spark.mongodb.input.uri", MONITOR_POINT_INFO_URL) \
    .option("pipeline", matchCity) \
    .load().select("ID").rdd.map(lambda x: x.ID).collect()
print(mpInfoList)

# Load the raw air-quality readings and keep only rows from those monitoring points.
airQualityData = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("spark.mongodb.input.uri", INPUT_URL) \
    .load()
airQualityData = airQualityData.filter(airQualityData.NodeIdentifier.isin(mpInfoList))

# Average each metric over all points per hour, then restore readable column names.
airQualityData = airQualityData.groupBy('ComponentTime') \
    .avg('pm25', 'temp', 'press', 'humi', 'wind_speed', 'wind_dir') \
    .orderBy('ComponentTime') \
    .withColumnRenamed('avg(pm25)', 'pm25') \
    .withColumnRenamed('avg(temp)', 'temp') \
    .withColumnRenamed('avg(press)', 'press') \
    .withColumnRenamed('avg(humi)', 'humi') \
    .withColumnRenamed('avg(wind_speed)', 'wind_speed') \
    .withColumnRenamed('avg(wind_dir)', 'wind_dir')
隨后,將清洗后的數(shù)據(jù)存入hive中,代碼如下:
# Write the averaged data into the Hive table test.airData, replacing any previous run.
airQualityData.write.mode("overwrite").saveAsTable("test.airData")
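The second route mentioned above, going through hiveContext.sql (spark.sql on Spark 2+), inserts the rows with SQL instead. A minimal sketch, assuming the table test.airData already exists and using an illustrative temporary-view name tmp_air:

# Register the cleaned DataFrame as a temporary view, then insert through SQL.
airQualityData.createOrReplaceTempView("tmp_air")  # tmp_air is a hypothetical name
spark.sql("INSERT OVERWRITE TABLE test.airData SELECT * FROM tmp_air")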
3. Reading from Hive with Spark
Call sparkSession.sql to read the data back from Hive:
data = spark.sql("select * from test.airData")
data1 = data.orderBy(data.ComponentTime)  # sort by hour for the time-series analysis
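From here the ordered hourly averages feed the prediction step. A minimal sketch of handing them to a driver-side model, assuming the table is small enough for toPandas (which collects every row onto the driver):

# Collect the hourly averages into a pandas DataFrame for model training/prediction.
pdf = data1.select('ComponentTime', 'pm25', 'temp', 'press',
                   'humi', 'wind_speed', 'wind_dir').toPandas()
print(pdf.head())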