This article is based on the Spark 1.6.3 source code.
Overview
Spark's scheduling module is a distinctive piece of design. It uses a DAG (directed acyclic graph) to describe the logical relationships within a Spark job, splits the job into multiple stages, and within each stage further splits the work into multiple tasks according to the parallelism; all of these tasks run the same computation logic. The packaged tasks are handed to executors, which run them and produce the results. Dependencies exist between stages as well as inside each stage, and together they form the lineage, which provides good fault tolerance.
Three classes play the leading roles in Spark's scheduling module: DAGScheduler, TaskScheduler, and SchedulerBackend.
DAGScheduler: known as the high-level scheduling layer. It is responsible for splitting a Job into multiple stages along ShuffleDependency boundaries. Each stage contains a group of parallel tasks that run the same computation logic; the metadata of that group is wrapped into a TaskSet and handed to the TaskScheduler to schedule.
TaskScheduler: known as the low-level task scheduler interface, with TaskSchedulerImpl as its main implementation. After receiving the TaskSets sent by the DAGScheduler, it submits them to the cluster, resubmits tasks when problems occur during execution, and finally reports result events back to the DAGScheduler.
SchedulerBackend: the backend of the TaskScheduler. It talks to the cluster manager of the underlying platform and requests resources for the Application. SchedulerBackend has several implementations: if the Application is submitted to YARN for resource management and scheduling, the implementation is YarnSchedulerBackend; under the standalone Deploy mode it is SparkDeploySchedulerBackend.
The source analysis below assumes the Deploy mode. Other modes differ slightly in their SchedulerBackend implementation, but the scheduling principles and implementation are the same.
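To make the walkthrough concrete, here is a minimal driver the rest of the article can be read against. It is only a sketch: the object name, master URL, and input path are placeholders, not taken from the Spark source.
import org.apache.spark.{SparkConf, SparkContext}

object WordCountExample { // hypothetical driver, for illustration only
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountExample").setMaster("spark://master:7077") // placeholder master URL
    val sc = new SparkContext(conf) // instantiates DAGScheduler, TaskScheduler and SchedulerBackend (see below)
    val counts = sc.textFile("hdfs:///tmp/input.txt") // placeholder path; the lineage starts here
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _) // introduces a ShuffleDependency, i.e. a stage boundary
    println(counts.count()) // the action triggers a Job through the scheduling module
    sc.stop()
  }
}
Running this under the Deploy mode exercises exactly the path described below: one ShuffleMapStage feeding one ResultStage.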
Initialization of the three key classes and how they relate
We can start from the initialization of SparkContext. When an Application is submitted, Spark first initializes a SparkContext instance and creates the driver. Here is the code in SparkContext that instantiates the three classes:
val (sched, ts) = SparkContext.createTaskScheduler(this, master)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
TaskScheduler and SchedulerBackend are chosen by pattern matching on the master URL that was passed in, with different implementations for different platforms, while DAGScheduler is constructed directly with new. The DAGScheduler instance holds a reference to the TaskScheduler, as its constructor shows:
def this(sc: SparkContext, taskScheduler: TaskScheduler) = {
this(
sc,
taskScheduler,
sc.listenerBus,
sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster],
sc.env.blockManager.master,
sc.env)
}
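As for the pattern match mentioned above, the branch of SparkContext.createTaskScheduler that handles a standalone master URL looks roughly like the following (an abridged sketch of the 1.6 code, not a verbatim excerpt):
master match {
  // ... local and other cluster-manager branches omitted ...
  case SPARK_REGEX(sparkUrl) => // a spark://host:port master, i.e. the Deploy mode
    val scheduler = new TaskSchedulerImpl(sc)
    val masterUrls = sparkUrl.split(",").map("spark://" + _)
    val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
    scheduler.initialize(backend) // the TaskScheduler holds the SchedulerBackend from here on
    (backend, scheduler)
  // ...
}
So once createTaskScheduler returns, the TaskScheduler already holds its SchedulerBackend, and the DAGScheduler built right afterwards holds the TaskScheduler: the three instances are chained together.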
Submitting a Job
From the source above we know that _schedulerBackend, _taskScheduler, and _dagScheduler are already in place while SparkContext is being instantiated, before any job is submitted. Next, let's see how a Job gets submitted by following the count action:
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
The runJob method in SparkContext ultimately calls dagScheduler.runJob:
def runJob[T, U: ClassTag](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
resultHandler: (Int, U) => Unit): Unit = {
if (stopped.get()) {
throw new IllegalStateException("SparkContext has been shutdown")
}
val callSite = getCallSite
val cleanedFunc = clean(func)
logInfo("Starting job: " + callSite.shortForm)
if (conf.getBoolean("spark.logLineage", false)) {
logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
}
dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
progressBar.foreach(_.finishAll())
rdd.doCheckpoint()
}
In DAGScheduler.runJob, a JobWaiter instance is created to monitor the Job's execution; the Job is marked successful only when all of its tasks have completed successfully:
def runJob[T, U](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
callSite: CallSite,
resultHandler: (Int, U) => Unit,
properties: Properties): Unit = {
val start = System.nanoTime
//Create a JobWaiter instance to monitor the Job; it is marked successful only when every one of its tasks succeeds
val waiter: JobWaiter[U] = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
waiter.awaitResult() match {
case JobSucceeded =>
logInfo("Job %d finished: %s, took %f s".format
(waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
case JobFailed(exception: Exception) =>
logInfo("Job %d failed: %s, took %f s".format
(waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
// SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
val callerStackTrace = Thread.currentThread().getStackTrace.tail
exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
throw exception
}
}
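For context, the core of JobWaiter looks roughly like the sketch below (abridged; the real class also supports cancelling the job). Each successful task bumps a counter and delivers its result to the resultHandler, and awaitResult blocks until either all totalTasks have succeeded or the job has failed:
private[spark] class JobWaiter[T](
    dagScheduler: DAGScheduler,
    val jobId: Int,
    totalTasks: Int,
    resultHandler: (Int, T) => Unit)
  extends JobListener {

  private var finishedTasks = 0
  private var jobResult: JobResult = if (totalTasks == 0) JobSucceeded else null

  override def taskSucceeded(index: Int, result: Any): Unit = synchronized {
    resultHandler(index, result.asInstanceOf[T]) // hand the task result back to the caller
    finishedTasks += 1
    if (finishedTasks == totalTasks) { // last task done: the whole Job succeeds
      jobResult = JobSucceeded
      this.notifyAll()
    }
  }

  override def jobFailed(exception: Exception): Unit = synchronized {
    jobResult = JobFailed(exception)
    this.notifyAll()
  }

  def awaitResult(): JobResult = synchronized {
    while (jobResult == null) { this.wait() }
    jobResult
  }
}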
In submitJob, the JobWaiter instance is created first, and then a JobSubmitted message is posted through eventProcessLoop. This eventProcessLoop listens for the DAGScheduler's own internal messages and is created when the DAGScheduler is instantiated.
def submitJob[T, U](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
callSite: CallSite,
resultHandler: (Int, U) => Unit,
properties: Properties): JobWaiter[U] = {
// Check to make sure we are not launching a task on a partition that does not exist.
val maxPartitions = rdd.partitions.length
partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
throw new IllegalArgumentException(
"Attempting to access a non-existent partition: " + p + ". " +
"Total number of partitions: " + maxPartitions)
}
val jobId = nextJobId.getAndIncrement() // obtain a jobId
if (partitions.size == 0) {
// Return immediately if the job is running 0 tasks
return new JobWaiter[U](this, jobId, 0, resultHandler)
}
assert(partitions.size > 0)
val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
// Create a JobWaiter instance to monitor the Job; it is marked successful only when every one of its tasks succeeds
val waiter: JobWaiter[U] = new JobWaiter(this, jobId, partitions.size, resultHandler)
// DAGSchedulerEventProcessLoop's job is to call back into the DAGScheduler to handle the messages the DAGScheduler posts to it, supervising the Job
eventProcessLoop.post(JobSubmitted(
jobId, rdd, func2, partitions.toArray, callSite, waiter,
SerializationUtils.clone(properties))) // the DAGScheduler posts this Job to eventProcessLoop, whose event loop will eventually handle it
waiter
}
All events posted to eventProcessLoop are eventually handled by its doOnReceive method:
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
//a JobSubmitted event is handled by calling handleJobSubmitted
case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
case MapStageSubmitted(jobId, dependency, callSite, listener, properties) =>
dagScheduler.handleMapStageSubmitted(jobId, dependency, callSite, listener, properties)
...
}
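eventProcessLoop is a DAGSchedulerEventProcessLoop, which builds on Spark's internal EventLoop: post puts the event on a queue that a dedicated daemon thread drains, calling onReceive (and from there doOnReceive) for each event. Below is a simplified sketch of that mechanism, not the verbatim source; the class and member names other than post/onReceive are made up:
import java.util.concurrent.LinkedBlockingDeque

abstract class SimpleEventLoop[E](name: String) { // hypothetical; the real class is org.apache.spark.util.EventLoop
  private val eventQueue = new LinkedBlockingDeque[E]()
  @volatile private var stopped = false

  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (!stopped) {
          onReceive(eventQueue.take()) // block until an event arrives, then dispatch it
        }
      } catch {
        case _: InterruptedException => // interrupted by stop(): exit quietly
      }
    }
  }

  def start(): Unit = eventThread.start()
  def stop(): Unit = { stopped = true; eventThread.interrupt() }
  def post(event: E): Unit = eventQueue.put(event) // what submitJob calls via eventProcessLoop.post(JobSubmitted(...))

  protected def onReceive(event: E): Unit // DAGSchedulerEventProcessLoop routes this to doOnReceive
}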
At this point the Job has been submitted. What follows is the handling of that submission, which is the DAGScheduler's most important job: splitting the stages.
Splitting stages
Let's look at DAGScheduler.handleJobSubmitted and see how it splits the stages, a few pieces at a time.
var finalStage: ResultStage = null
try {
// New stage creation may throw an exception if, for example, jobs are run on a
// HadoopRDD whose underlying HDFS files have been deleted.
// First call newResultStage to create the finalStage
finalStage = newResultStage(finalRDD, func, partitions, jobId, callSite)
} catch {
case e: Exception =>
logWarning("Creating new stage failed due to exception - job: " + jobId, e)
listener.jobFailed(e)
return
}
As we can see, the DAGScheduler first creates the last stage, the finalStage. Let's look at newResultStage:
private def newResultStage( // creates the last stage
rdd: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
jobId: Int,
callSite: CallSite): ResultStage = {
//split the stages by calling getParentStagesAndId with the last RDD and the jobId
val (parentStages: List[Stage], id: Int) = getParentStagesAndId(rdd, jobId)
val stage = new ResultStage(id, rdd, func, partitions, parentStages, jobId, callSite)
stageIdToStage(id) = stage
updateJobIdStageIdMaps(jobId, stage)
stage
}
Creating the finalStage requires its parentStages, which are an essential part of the DAG scheduling plan. Here is how they are obtained:
private def getParentStagesAndId(rdd: RDD[_], firstJobId: Int): (List[Stage], Int) = {
val parentStages: List[Stage] = getParentStages(rdd, firstJobId) // find the parent stages
val id = nextStageId.getAndIncrement() // nextStageId is an AtomicInteger, incremented by one
(parentStages, id) // return the list of parent stages together with the new stage id
}
It calls getParentStages, which performs a recursive traversal and returns a List of Stages:
private def getParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
val parents = new HashSet[Stage] // the parent stages found so far
val visited = new HashSet[RDD[_]] // RDDs that have already been visited
// We are manually maintaining a stack here to prevent StackOverflowError
// caused by recursively visiting
val waitingForVisit = new Stack[RDD[_]] // stack of RDDs still to be processed
def visit(r: RDD[_]) {
if (!visited(r)) { // only process RDDs that have not been visited yet
visited += r
// Kind of ugly: need to register RDDs with the cache here since
// we can't do it in its constructor because # of partitions is unknown
for (dep <- r.dependencies) { // iterate over this RDD's dependencies
dep match {
case shufDep: ShuffleDependency[_, _, _] => // a ShuffleDependency marks a stage boundary
parents += getShuffleMapStage(shufDep, firstJobId) // create (or look up) a stage for it and add it to parents
case _ => // a narrow dependency: push its parent RDD onto the stack to visit later (this is the recursion)
waitingForVisit.push(dep.rdd)
}
}
}
}
waitingForVisit.push(rdd) // push the last RDD onto the stack to start the walk
while (waitingForVisit.nonEmpty) {
visit(waitingForVisit.pop()) // while the stack is not empty, pop an RDD and visit it
}
parents.toList
}
The code above shows that stages are split along ShuffleDependency boundaries. The elegant part is the hand-rolled waitingForVisit stack: by pushing and popping RDDs while following their dependencies, the recursion becomes an explicit loop, which avoids StackOverflowError on long lineages. The same traversal pattern is sketched in isolation below; after that we turn to getShuffleMapStage, the method called whenever a ShuffleDependency is encountered to create a new Stage.
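The sketch below reproduces the pattern with a toy Node type (entirely hypothetical, not Spark code) so the mechanics are easier to see in isolation: a boundary parent is collected, a non-boundary parent is pushed back onto the stack and walked further.
import scala.collection.mutable.{HashSet, Stack}

case class Node(id: Int, parents: Seq[Node], isBoundary: Boolean = false) // toy stand-in for an RDD

def collectBoundaries(last: Node): List[Node] = {
  val found = new HashSet[Node]
  val visited = new HashSet[Node]
  val waitingForVisit = new Stack[Node] // explicit stack instead of recursive calls
  waitingForVisit.push(last)
  while (waitingForVisit.nonEmpty) {
    val n = waitingForVisit.pop()
    if (!visited(n)) {
      visited += n
      n.parents.foreach { p =>
        if (p.isBoundary) found += p // analogous to hitting a ShuffleDependency
        else waitingForVisit.push(p) // analogous to a narrow dependency: keep walking
      }
    }
  }
  found.toList
}
Back to the source: getShuffleMapStage is the method called for every ShuffleDependency found above: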
private def getShuffleMapStage(
shuffleDep: ShuffleDependency[_, _, _],
firstJobId: Int): ShuffleMapStage = {
shuffleToMapStage.get(shuffleDep.shuffleId) match {
case Some(stage) => stage // already registered: reuse it
case None => // not registered yet: create it
// We are going to register ancestor shuffle dependencies
// register the ancestor shuffle dependencies of this RDD; getAncestorShuffleDependencies makes sure the parent stages of this stage have been created
getAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
//iterate over the not-yet-registered shuffle dependencies and register each one into shuffleToMapStage via newOrUsedShuffleStage
shuffleToMapStage(dep.shuffleId) = newOrUsedShuffleStage(dep, firstJobId)
}
// Then register current shuffleDep
val stage = newOrUsedShuffleStage(shuffleDep, firstJobId)
shuffleToMapStage(shuffleDep.shuffleId) = stage
stage
}
}
This method maintains the shuffleToMapStage map, which stores the mapping from shuffleId to ShuffleMapStage. For the shuffleDep passed in, it returns the existing stage if one is registered and creates one otherwise. getAncestorShuffleDependencies finds the ancestor shuffle dependencies whose stages have not yet been registered in shuffleToMapStage; its recursion looks very much like getParentStages. newOrUsedShuffleStage is the method that actually creates a shuffle map stage; let's look at it:
/**
 * Create a shuffle map stage for the RDD of the given ShuffleDependency; the stage will carry the given jobId.
 * If this shuffle already has output registered in the MapOutputTracker, the previously recorded output locations are copied onto the new stage.
 */
private def newOrUsedShuffleStage(
shuffleDep: ShuffleDependency[_, _, _],
firstJobId: Int): ShuffleMapStage = {
val rdd = shuffleDep.rdd
val numTasks = rdd.partitions.length // the number of partitions of this RDD is the number of tasks
val stage = newShuffleMapStage(rdd, numTasks, shuffleDep, firstJobId, rdd.creationSite) // create the stage
if (mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) { // the mapOutputTracker already knows about this shuffle
val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId) // pull out the previously recorded output metadata
val locs = MapOutputTracker.deserializeMapStatuses(serLocs) // deserialize and copy the locations onto the new stage
(0 until locs.length).foreach { i =>
if (locs(i) ne null) {
// locs(i) will be null if missing
stage.addOutputLoc(i, locs(i))
}
}
} else { // otherwise register the shuffle with the mapOutputTracker
// Kind of ugly: need to register RDDs with the cache and map output tracker here
// since we can't do it in the RDD constructor because # of partitions is unknown
logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
}
stage
}
This first calls newShuffleMapStage to create the ShuffleMapStage. Because a ShuffleMapStage involves a shuffle, its intermediate map output is written out, so the shuffle has to be registered with mapOutputTracker, which manages map-side output. newShuffleMapStage itself mirrors newResultStage: it first calls getParentStagesAndId to get the parent stages and then constructs a ShuffleMapStage instance:
private def newShuffleMapStage(
rdd: RDD[_],
numTasks: Int,
shuffleDep: ShuffleDependency[_, _, _],
firstJobId: Int,
callSite: CallSite): ShuffleMapStage = {
val (parentStages: List[Stage], id: Int) = getParentStagesAndId(rdd, firstJobId)
val stage: ShuffleMapStage = new ShuffleMapStage(id, rdd, numTasks, parentStages,
firstJobId, callSite, shuffleDep)
stageIdToStage(id) = stage
updateJobIdStageIdMaps(firstJobId, stage)
stage
}
At the end, updateJobIdStageIdMaps associates the new stage's stageId with the jobId.
In the methods above we first created the finalStage, then followed the dependencies between RDDs with a recursive traversal to discover the finalStage's parentStages and record them in the relevant data structures.
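Applied to the word-count job from the overview (a sketch, with a hypothetical input path), the split looks like this: reduceByKey introduces the only ShuffleDependency, so the count() action yields a finalStage (a ResultStage) whose single parent is a ShuffleMapStage.
val counts = sc.textFile("hdfs:///tmp/input.txt") // placeholder input
  .flatMap(_.split(" "))
  .map(word => (word, 1)) // narrow dependencies: stay inside the same stage
  .reduceByKey(_ + _)     // ShuffleDependency: everything before it becomes a ShuffleMapStage
counts.count()            // the finalStage is a ResultStage with that ShuffleMapStage as its parent
// counts.toDebugString prints the lineage; with spark.logLineage=true, runJob logs it as well.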
Next, let's see how the stages created above get submitted.
Back in handleJobSubmitted, here is the code after the finalStage has been created:
// once we have the finalStage we can create the job
val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
clearCacheLocs() // clear the cached task locations
logInfo("Got job %s (%s) with %d output partitions".format(
job.jobId, callSite.shortForm, partitions.length))
logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
logInfo("Parents of final stage: " + finalStage.parents)
logInfo("Missing parents: " + getMissingParentStages(finalStage))
val jobSubmissionTime = clock.getTimeMillis()
jobIdToActiveJob(jobId) = job // record the jobId -> job mapping
activeJobs += job // add the job to the activeJobs set
finalStage.setActiveJob(job) // point the finalStage's active job at this job
val stageIds: Array[Int] = jobIdToStageIds(jobId).toArray // look up the stage ids belonging to this job
// look up the latestInfo of each of those stages
val stageInfos: Array[StageInfo] = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
listenerBus.post(
SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
submitStage(finalStage) // submit the finalStage
submitWaitingStages() // submit the stages in the waiting queue
A Job instance is created first and the related bookkeeping structures are updated; at the end submitStage is called with the finalStage. Here is submitStage:
/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
val jobId = activeJobForStage(stage) // the jobId this stage belongs to
if (jobId.isDefined) { // if there is one
logDebug("submitStage(" + stage + ")")
// if this stage is not already in the waiting, running, or failed sets
if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
val missing: List[Stage] = getMissingParentStages(stage).sortBy(_.id) // find this stage's missing parent stages
logDebug("missing: " + missing)
if (missing.isEmpty) { // no missing parents: this stage can be submitted directly
logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
submitMissingTasks(stage, jobId.get) // this method does the DAGScheduler's last piece of work
} else {
for (parent <- missing) {
submitStage(parent) // otherwise recurse into the missing parents first
}
waitingStages += stage
}
}
} else {
abortStage(stage, "No active job for stage " + stage.id, None)
}
}
Here we meet the recursion again: for the finalStage passed in, submitStage first checks whether any parent stage has not been submitted yet. If so, the parents are submitted first and the current stage is parked in waitingStages, to be picked up later by submitWaitingStages. Each stage that is actually submitted goes through submitMissingTasks, which finishes the DAGScheduler's work. A toy trace of the submit order follows.
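The snippet below is a toy model (not DAGScheduler code; ToyStage and submit are made up for illustration) of the order in which a chain stage0 <- stage1 <- stage2(final) is processed:
case class ToyStage(id: Int, parents: List[ToyStage])

val waiting = scala.collection.mutable.HashSet[ToyStage]() // plays the role of waitingStages

def submit(stage: ToyStage): Unit = {
  val missing = stage.parents // pretend every parent is still missing
  if (missing.isEmpty) {
    println(s"submitMissingTasks(stage ${stage.id})")
  } else {
    missing.foreach(submit) // recurse into the parents first
    waiting += stage        // the current stage waits for its parents to finish
  }
}

val s0 = ToyStage(0, Nil)
val s1 = ToyStage(1, List(s0))
val s2 = ToyStage(2, List(s1))
submit(s2)
// prints only "submitMissingTasks(stage 0)"; stages 1 and 2 sit in `waiting`
// until submitWaitingStages() picks them up as their parents complete.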
Packaging the Tasks
With the methods above, the finalStage and its parent stages have all been submitted recursively. submitMissingTasks shows what actually happens to each submitted stage. The method is fairly long: it first updates data structures such as runningStages and outputCommitCoordinator for the stage passed in, so we only excerpt the key part:
// here the sequence of Tasks is built
val tasks: Seq[Task[_]] = try {
stage match {
case stage: ShuffleMapStage =>
partitionsToCompute.map { id =>
val locs = taskIdToLocations(id)
val part = stage.rdd.partitions(id)
new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
taskBinary, part, locs, stage.internalAccumulators)
}
case stage: ResultStage =>
val job = stage.activeJob.get
partitionsToCompute.map { id =>
val p: Int = stage.partitions(id)
val part = stage.rdd.partitions(p)
val locs = taskIdToLocations(id)
new ResultTask(stage.id, stage.latestInfo.attemptId,
taskBinary, part, locs, id, stage.internalAccumulators)
}
}
} catch {
case NonFatal(e) =>
abortStage(stage, s"Task creation failed: $e\n${e.getStackTraceS
runningStages -= stage
return
}
The stage is pattern matched: a ResultStage (the finalStage) produces ResultTasks, while a ShuffleMapStage produces ShuffleMapTasks. Then comes the following code:
// if the task sequence is not empty, wrap it into a TaskSet and hand it over; from here on it is the taskScheduler's job
if (tasks.size > 0) {
logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")
stage.pendingPartitions ++= tasks.map(_.partitionId)
logDebug("New pending partitions: " + stage.pendingPartitions)
taskScheduler.submitTasks(new TaskSet(
tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
} else {
// Because we posted SparkListenerStageSubmitted earlier, we should mark
// the stage as completed here in case there are no tasks to run
markStageAsFinished(stage, None)
val debugString = stage match {
case stage: ShuffleMapStage =>
s"Stage ${stage} is actually done; " +
s"(available: ${stage.isAvailable}," +
s"available outputs: ${stage.numAvailableOutputs}," +
s"partitions: ${stage.numPartitions})"
case stage : ResultStage =>
s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
}
logDebug(debugString)
}
As we can see, the Task instances created in the previous step are wrapped into a TaskSet and submitted to the cluster via TaskScheduler.submitTasks. With that, the DAGScheduler's work is essentially complete; what remains is listening, through eventProcessLoop, for the events the TaskScheduler reports back, which is also why the DAGScheduler instance holds a reference to the TaskScheduler.
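For orientation before the next article, the TaskSet handed to submitTasks is roughly the following shape in Spark 1.6 (an abridged sketch, not the full class): just the task array plus enough metadata to identify and prioritize the set.
private[spark] class TaskSet(
    val tasks: Array[Task[_]],
    val stageId: Int,
    val stageAttemptId: Int,
    val priority: Int, // the jobId is passed in here; FIFO scheduling uses it
    val properties: Properties) {
  val id: String = stageId + "." + stageAttemptId
}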
In the next article we will look at what the TaskScheduler does when submitting the Tasks, and at how the SchedulerBackend keeps resource allocation fair. Stay tuned!