Source Code Analysis
Part 1: Preparation
- SparkContext creates the DAGScheduler, TaskScheduler, and SchedulerBackend objects
// Create and start the scheduler
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
_taskScheduler.start()
_applicationId = _taskScheduler.applicationId()
_applicationAttemptId = taskScheduler.applicationAttemptId()
From this we can see that when the SparkContext is created, createTaskScheduler is called to produce the SchedulerBackend and TaskScheduler objects, while the DAGScheduler object is instantiated directly with new at the same time.
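createTaskScheduler itself is essentially a pattern match on the master URL that picks a matching backend/scheduler pair. The sketch below is a simplified, self-contained model of that dispatch; the names here (SchedulerBackend/TaskScheduler stand-in traits, LocalBackend, StandaloneBackend, SimpleTaskScheduler) are invented for illustration and are not Spark's real classes, and the real method in SparkContext handles many more master formats plus validation.

// Toy model of the master-URL dispatch performed by SparkContext.createTaskScheduler.
// The traits below are stand-ins, not Spark's real SchedulerBackend/TaskScheduler.
trait SchedulerBackend { def start(): Unit = () }
trait TaskScheduler { def start(): Unit = () }

case class LocalBackend(threads: Int) extends SchedulerBackend
case class StandaloneBackend(masterUrl: String) extends SchedulerBackend
class SimpleTaskScheduler extends TaskScheduler

object CreateTaskSchedulerSketch {
  private val LocalN = """local\[([0-9]+|\*)\]""".r
  private val SparkUrl = """spark://(.*)""".r

  // Mirrors the shape of SparkContext.createTaskScheduler: match on the master
  // string and return a (backend, scheduler) pair.
  def createTaskScheduler(master: String): (SchedulerBackend, TaskScheduler) = master match {
    case "local" =>
      (LocalBackend(1), new SimpleTaskScheduler)
    case LocalN(n) =>
      val threads = if (n == "*") Runtime.getRuntime.availableProcessors() else n.toInt
      (LocalBackend(threads), new SimpleTaskScheduler)
    case SparkUrl(_) =>
      (StandaloneBackend(master), new SimpleTaskScheduler)
    case other =>
      throw new IllegalArgumentException(s"Could not parse Master URL: '$other'")
  }

  def main(args: Array[String]): Unit = {
    val (backend, scheduler) = createTaskScheduler("local[*]")
    println(s"backend = $backend, scheduler = $scheduler")
  }
}

In the real createTaskScheduler, the local, local[N], spark://, yarn, and other cases each pair a TaskSchedulerImpl (or a subclass of it) with a different SchedulerBackend implementation.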
Part 2: Submitting the Job
1. An RDD action invokes runJob to execute the job
Step 1: an RDD action (for example RDD#count) calls SparkContext#runJob
Step 2: SparkContext#runJob
Step 3: DAGScheduler#runJob
Step 4: DAGScheduler#submitJob
Step 5: DAGSchedulerEventProcessLoop#post
The job is posted to a queue to wait for processing, a typical producer-consumer pattern; these messages are eventually handled by handleJobSubmitted.
Step 6: DAGSchedulerEventProcessLoop#doOnReceive receives the event (DAGSchedulerEventProcessLoop is a subclass of EventLoop)
Step 7: doOnReceive dispatches the JobSubmitted event to DAGScheduler#handleJobSubmitted
Step 8: DAGScheduler#handleJobSubmitted splits the Job into stages, creates an ActiveJob, and generates the tasks
Step 9-1: DAGScheduler#createResultStage
Step 9-2: DAGScheduler#submitStage
Step 10: DAGScheduler#submitMissingTasks
This completes the DAGScheduler's final piece of work: it determines which partitions need to be computed, generates a Task for each of those partitions, wraps the tasks into a TaskSet, and finally submits the TaskSet to the TaskScheduler for processing.
Let's trace the source code using the count action as an example.
Source for step 1: RDD#count
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
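To see the whole chain fire end to end, a minimal driver like the one below is enough (assuming spark-core is on the classpath; the local[2] master, the app name, and the data are arbitrary choices for illustration). The count() call is the action that enters SparkContext#runJob.

import org.apache.spark.{SparkConf, SparkContext}

// Minimal driver: the count() action triggers SparkContext#runJob,
// which hands the job to the DAGScheduler as traced in the steps above.
object CountExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("count-trace-example").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 1000, numSlices = 4)
      val n = rdd.count()   // action: submits a job with one task per partition
      println(s"count = $n")
    } finally {
      sc.stop()
    }
  }
}

With 4 partitions, the job submitted by count() consists of a single ResultStage containing 4 tasks.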
Source for step 2: SparkContext#runJob
/**
 * Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @param partitions set of partitions to run on; some jobs may not want to compute on all
 * partitions of the target RDD, e.g. for operations like `first()`
 * @param resultHandler callback to pass each result to
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
Source for step 3: DAGScheduler#runJob
/**
 * Run an action job on the given RDD and pass all the results to the resultHandler function as
 * they arrive.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @param partitions set of partitions to run on; some jobs may not want to compute on all
 * partitions of the target RDD, e.g. for operations like first()
 * @param callSite where in the user program this job was called
 * @param resultHandler callback to pass each result to
 * @param properties scheduler properties to attach to this job, e.g. fair scheduler pool name
 *
 * @note Throws `Exception` when the job fails
 */
def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  val start = System.nanoTime
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
  ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
  waiter.completionFuture.value.get match {
    case scala.util.Success(_) =>
      logInfo("Job %d finished: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
    case scala.util.Failure(exception) =>
      logInfo("Job %d failed: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      // SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
      val callerStackTrace = Thread.currentThread().getStackTrace.tail
      exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
      throw exception
  }
}
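runJob blocks on the waiter's completionFuture until the whole job finishes. Below is a simplified, self-contained model of what a JobWaiter does (just the counting-promise idea; ToyJobWaiter is invented for this sketch and is not Spark's actual JobWaiter class):

import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Future, Promise}

// Toy model of a JobWaiter: complete a promise once every task of the job has
// reported a result, pushing each result to a caller-supplied handler.
class ToyJobWaiter[U](totalTasks: Int, resultHandler: (Int, U) => Unit) {
  private val finished = new AtomicInteger(0)
  private val promise = Promise[Unit]()

  def completionFuture: Future[Unit] = promise.future

  def taskSucceeded(index: Int, result: U): Unit = {
    resultHandler(index, result)
    if (finished.incrementAndGet() == totalTasks) promise.trySuccess(())
  }

  def jobFailed(e: Exception): Unit = promise.tryFailure(e)
}

// Usage: three "tasks" report in, then the future completes.
object ToyJobWaiterDemo extends App {
  import scala.concurrent.Await
  import scala.concurrent.duration.Duration

  val results = Array.fill(3)(0L)
  val waiter = new ToyJobWaiter[Long](3, (i, r) => results(i) = r)
  (0 until 3).foreach(i => waiter.taskSucceeded(i, (i + 1) * 100L))
  Await.ready(waiter.completionFuture, Duration.Inf)
  println(results.sum) // 600
}

The completion future only finishes after every task has reported, which is exactly what ThreadUtils.awaitReady blocks on in runJob above.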
Source for step 4: DAGScheduler#submitJob
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // Check to make sure we are not launching a task on a partition that does not exist.
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
        "Total number of partitions: " + maxPartitions)
  }

  val jobId = nextJobId.getAndIncrement()
  if (partitions.size == 0) {
    // Return immediately if the job is running 0 tasks
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }

  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}
- Obtain a new jobId.
- Create a JobWaiter, which monitors the execution status of the Job; since a Job consists of multiple Tasks, the Job is only marked as successful once all of its Tasks have completed.
- Finally, call eventProcessLoop.post() to put the Job into a queue to wait for processing, a typical producer-consumer pattern (a minimal sketch of this pattern follows this list). These messages are eventually handled by handleJobSubmitted.
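As a rough illustration of that producer-consumer pattern, the core idea is a blocking queue drained by a single background thread. The sketch below is self-contained and invented for illustration (ToyEventLoop is not Spark's EventLoop class, which adds more error handling and lifecycle management):

import java.util.concurrent.LinkedBlockingDeque

// Minimal producer-consumer loop in the spirit of Spark's EventLoop:
// post() enqueues events, a single background thread dequeues and handles them.
abstract class ToyEventLoop[E](name: String) {
  private val queue = new LinkedBlockingDeque[E]()
  @volatile private var stopped = false

  private val thread = new Thread(name) {
    override def run(): Unit = {
      try {
        while (!stopped) {
          val event = queue.take()   // blocks until an event is posted
          onReceive(event)
        }
      } catch {
        case _: InterruptedException => // stop() interrupts the blocked take()
      }
    }
  }
  thread.setDaemon(true)

  def onReceive(event: E): Unit     // what the subclass overrides

  def post(event: E): Unit = queue.put(event)
  def start(): Unit = thread.start()
  def stop(): Unit = { stopped = true; thread.interrupt() }
}

// Usage: a loop that treats String events as "jobs".
object ToyEventLoopDemo extends App {
  val loop = new ToyEventLoop[String]("toy-dag-event-loop") {
    override def onReceive(event: String): Unit = println(s"handling $event")
  }
  loop.start()
  loop.post("JobSubmitted(jobId=0)")
  Thread.sleep(200)   // give the consumer thread time to drain the queue
  loop.stop()
}

DAGSchedulerEventProcessLoop plays the consumer role here: its single thread drains the queue and hands each event to onReceive, as shown next.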
Let's look at how handleJobSubmitted gets called:
/**
 * The main event loop of the DAG scheduler.
 */
override def onReceive(event: DAGSchedulerEvent): Unit = {
  val timerContext = timer.time()
  try {
    doOnReceive(event)
  } finally {
    timerContext.stop()
  }
}
DAGSchedulerEventProcessLoop is a subclass of EventLoop and overrides EventLoop's onReceive method.
onReceive delegates to doOnReceive, which in turn calls handleJobSubmitted.
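Inside doOnReceive the dispatch is a plain pattern match on the event type, with the JobSubmitted case forwarding to DAGScheduler#handleJobSubmitted. The following self-contained toy mirrors just that shape; all the Toy* names are stand-ins invented for this sketch, not Spark classes.

// Toy model of the dispatch done in DAGSchedulerEventProcessLoop#doOnReceive:
// each event type is a case class, and a single pattern match forwards the
// event to the matching handle* method.
sealed trait ToyDAGEvent
case class ToyJobSubmitted(jobId: Int, numPartitions: Int) extends ToyDAGEvent
case class ToyStageCancelled(stageId: Int) extends ToyDAGEvent

class ToyDAGScheduler {
  def handleJobSubmitted(jobId: Int, numPartitions: Int): Unit =
    println(s"handleJobSubmitted: job $jobId with $numPartitions partitions")
  def handleStageCancellation(stageId: Int): Unit =
    println(s"handleStageCancellation: stage $stageId")
}

object ToyDoOnReceive {
  private val scheduler = new ToyDAGScheduler

  // Mirrors the shape of doOnReceive: one case per event type.
  def doOnReceive(event: ToyDAGEvent): Unit = event match {
    case ToyJobSubmitted(jobId, numPartitions) =>
      scheduler.handleJobSubmitted(jobId, numPartitions)
    case ToyStageCancelled(stageId) =>
      scheduler.handleStageCancellation(stageId)
  }

  def main(args: Array[String]): Unit = {
    doOnReceive(ToyJobSubmitted(jobId = 0, numPartitions = 4))
    doOnReceive(ToyStageCancelled(stageId = 1))
  }
}

In Spark, the real event types (JobSubmitted, StageCancelled, and so on) are case classes extending DAGSchedulerEvent, and doOnReceive forwards each one to the corresponding handle* method on DAGScheduler.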
Stage division
handleJobSubmitted processes the Job that was taken off the eventProcessLoop queue; the first part of that processing is splitting the Job into different stages. handleJobSubmitted does two main things.
The first is dividing the job into stages.
DAGScheduler code:
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: ResultStage = null
  try {
    // New stage creation may throw an exception if, for example, jobs are run on a
    // HadoopRDD whose underlying HDFS files have been deleted.
    finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch {
    case e: Exception =>
      logWarning("Creating new stage failed due to exception - job: " + jobId, e)
      listener.jobFailed(e)
      return
  }

  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  clearCacheLocs()
  logInfo("Got job %s (%s) with %d output partitions".format(
    job.jobId, callSite.shortForm, partitions.length))
  logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
  logInfo("Parents of final stage: " + finalStage.parents)
  logInfo("Missing parents: " + getMissingParentStages(finalStage))

  val jobSubmissionTime = clock.getTimeMillis()
  jobIdToActiveJob(jobId) = job
  activeJobs += job
  finalStage.setActiveJob(job)
  val stageIds = jobIdToStageIds(jobId).toArray
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  submitStage(finalStage)
}
/**
 * Create a ResultStage associated with the provided jobId.
 */
private def createResultStage(
    rdd: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    jobId: Int,
    callSite: CallSite): ResultStage = {
  val parents = getOrCreateParentStages(rdd, jobId)
  val id = nextStageId.getAndIncrement()
  val stage = new ResultStage(id, rdd, func, partitions, parents, jobId, callSite)
  stageIdToStage(id) = stage
  updateJobIdStageIdMaps(jobId, stage)
  stage
}
createResultStage() eventually calls getOrCreateParentStages() through several levels of calls.
Because the stage graph is derived by working backwards from the final stage, this requires computing every stage that the final stage depends on.
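The core idea, walking the dependency graph backwards from the final RDD and cutting a stage boundary at every shuffle dependency, can be sketched with a toy dependency model. Everything below (ToyRdd, NarrowDep, ShuffleDep, shuffleParents) is invented for illustration and is not Spark's RDD/Dependency API:

import scala.collection.mutable

// Toy dependency graph: a node either depends narrowly on its parents
// (same stage) or through a shuffle (stage boundary), mirroring how the
// DAGScheduler walks backwards from the final RDD.
sealed trait ToyDep { def parent: ToyRdd }
case class NarrowDep(parent: ToyRdd) extends ToyDep
case class ShuffleDep(parent: ToyRdd) extends ToyDep

case class ToyRdd(name: String, deps: Seq[ToyDep] = Nil)

object StageBoundaries {
  // Return the RDDs that sit directly behind a shuffle boundary of `rdd`,
  // i.e. the RDDs whose stages must finish before rdd's own stage can run.
  def shuffleParents(rdd: ToyRdd): Set[ToyRdd] = {
    val parents = mutable.Set[ToyRdd]()
    val visited = mutable.Set[ToyRdd]()
    val waiting = mutable.Stack[ToyRdd](rdd)
    while (waiting.nonEmpty) {
      val current = waiting.pop()
      if (visited.add(current)) {
        current.deps.foreach {
          case ShuffleDep(p) => parents += p          // stage boundary: stop here
          case NarrowDep(p)  => waiting.push(p)       // same stage: keep walking back
        }
      }
    }
    parents.toSet
  }

  def main(args: Array[String]): Unit = {
    // textFile -> map --(shuffle)--> reduceByKey -> filter  (final RDD)
    val textFile = ToyRdd("textFile")
    val mapped   = ToyRdd("map", Seq(NarrowDep(textFile)))
    val reduced  = ToyRdd("reduceByKey", Seq(ShuffleDep(mapped)))
    val filtered = ToyRdd("filter", Seq(NarrowDep(reduced)))
    println(shuffleParents(filtered).map(_.name))   // Set(map)
  }
}

In Spark the same walk is performed over the real dependency objects, and each shuffle parent found this way becomes (or reuses) a ShuffleMapStage that is a parent of the ResultStage.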
The second is to create an ActiveJob and generate the tasks, which starts from:
submitStage(finalStage)
submitStage submits finalStage. If some of this stage's parent stages have not yet been submitted, submitStage() is called recursively until all parent stages have been submitted and computed.
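For reference, the recursion looks roughly like the following abridged sketch of DAGScheduler#submitStage (based on Spark 2.x with logging omitted, so details may differ between versions):

// Abridged sketch of DAGScheduler#submitStage: submit missing parents first,
// then run this stage's tasks once no parents are missing.
private def submitStage(stage: Stage): Unit = {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)
      if (missing.isEmpty) {
        // All parent stages are already computed: run this stage's tasks now.
        submitMissingTasks(stage, jobId.get)
      } else {
        // Otherwise submit the missing parents first and park this stage.
        missing.foreach(submitStage)
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}

Stages parked in waitingStages are submitted again later, once their parent stages have finished.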
Part 3: Executing the Job
Once DAGScheduler#submitMissingTasks (mentioned above) has built the tasks for a stage, it executes the following code:
taskScheduler.submitTasks(new TaskSet(
  tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
The execution logic from here onwards is:
(1) TaskSchedulerImpl#submitTasks()
(2) SchedulableBuilder#addTaskSetManager()
(3) CoarseGrainedSchedulerBackend#reviveOffers()
(4) CoarseGrainedSchedulerBackend#makeOffers()
(5) TaskSchedulerImpl#resourceOffers()
(6) CoarseGrainedSchedulerBackend#launchTasks()
(7) executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
Steps (1) and (2) mainly wrap this group of tasks' TaskSet in a TaskSetManager. The TaskSetManager assigns compute resources to tasks according to data locality and monitors task execution status, handling things such as retrying failed tasks and speculative execution.
Steps (3) and (4) are fairly straightforward.
Step (5) assigns concrete resources to each task: its input is a list of executors and its output is a two-dimensional array of TaskDescription. A TaskDescription contains the task ID, the executor ID, the task's execution dependencies, and so on.
Steps (6) and (7) actually send the tasks to the executors for execution and wait for the executors to report their status back.
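Step (5) deserves a small illustration: given a list of executor offers (free cores per executor), the scheduler hands out tasks round-robin across the offers so they spread over executors. The sketch below is a self-contained toy of that offer/assign loop; WorkerOffer, TaskDesc, and cpusPerTask here are simplified stand-ins, and the toy ignores locality preferences and the TaskSetManager layer that the real TaskSchedulerImpl consults.

// Toy model of the resourceOffers idea: one row of assignments per executor
// offer, filled round-robin while the executor still has free cores.
case class WorkerOffer(executorId: String, host: String, freeCores: Int)
case class TaskDesc(taskId: Int, executorId: String)

object ToyResourceOffers {
  val cpusPerTask = 1   // assumption: each task needs one core

  def resourceOffers(offers: Seq[WorkerOffer], pendingTasks: Iterator[Int]): Seq[Seq[TaskDesc]] = {
    val assignments = Array.fill(offers.size)(Seq.newBuilder[TaskDesc])
    val freeCores = offers.map(_.freeCores).toArray
    var assignedSomething = true
    // Keep sweeping over the offers until no executor can take another task.
    while (assignedSomething && pendingTasks.hasNext) {
      assignedSomething = false
      for (i <- offers.indices if pendingTasks.hasNext && freeCores(i) >= cpusPerTask) {
        val taskId = pendingTasks.next()
        assignments(i) += TaskDesc(taskId, offers(i).executorId)
        freeCores(i) -= cpusPerTask
        assignedSomething = true
      }
    }
    assignments.map(_.result()).toSeq
  }

  def main(args: Array[String]): Unit = {
    val offers = Seq(WorkerOffer("exec-1", "hostA", 2), WorkerOffer("exec-2", "hostB", 2))
    val tasks = resourceOffers(offers, Iterator.range(0, 3))
    tasks.foreach(println)  // tasks 0 and 2 land on exec-1, task 1 on exec-2
  }
}

The real implementation also consults each TaskSetManager for data-locality preferences before placing a task, which is why the TaskSetManager built in steps (1) and (2) matters at this point.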