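Spark has three ShuffleWriter implementations: BypassMergeSortShuffleWriter, UnsafeShuffleWriter, and SortShuffleWriter. ShuffleReader, however, has only one implementation: BlockStoreShuffleReader.
As the previous installment showed, by this point Spark has already obtained the shuffle metadata, including each mapId and its location, and passed it to BlockStoreShuffleReader. Let's now look in detail at how BlockStoreShuffleReader is implemented.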
// BlockStoreShuffleReader
override def read(): Iterator[Product2[K, C]] = {
  // [1] Initialize ShuffleBlockFetcherIterator, which fetches shuffle blocks from executors
  val wrappedStreams = new ShuffleBlockFetcherIterator(
    context,
    blockManager.blockStoreClient,
    blockManager,
    mapOutputTracker,
    blocksByAddress,
    ...
    readMetrics,
    fetchContinuousBlocksInBatch).toCompletionIterator
  val serializerInstance = dep.serializer.newInstance()
  // [2] Deserialize the shuffle blocks into a record iterator
  // Create a key/value iterator for each stream
  val recordIter = wrappedStreams.flatMap { case (blockId, wrappedStream) =>
    // Note: the asKeyValueIterator below wraps a key/value iterator inside of a
    // NextIterator. The NextIterator makes sure that close() is called on the
    // underlying InputStream when all records have been read.
    serializerInstance.deserializeStream(wrappedStream).asKeyValueIterator
  }
  // Update the context task metrics for each record read.
  val metricIter = CompletionIterator[(Any, Any), Iterator[(Any, Any)]](
    recordIter.map { record =>
      readMetrics.incRecordsRead(1)
      record
    },
    context.taskMetrics().mergeShuffleReadMetrics())
  // An interruptible iterator must be used here in order to support task cancellation
  val interruptibleIter = new InterruptibleIterator[(Any, Any)](context, metricIter)
  // [3] Reduce-side aggregation: if the map side has already combined, merge the combiners
  // that were read; otherwise aggregate the uncombined <k, v> records.
  val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) {
    if (dep.mapSideCombine) {
      // We are reading values that are already combined
      val combinedKeyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, C)]]
      dep.aggregator.get.combineCombinersByKey(combinedKeyValuesIterator, context)
    } else {
      // We don't know the value type, but also don't care -- the dependency *should*
      // have made sure its compatible w/ this aggregator, which will convert the value
      // type to the combined type C
      val keyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, Nothing)]]
      dep.aggregator.get.combineValuesByKey(keyValuesIterator, context)
    }
  } else {
    interruptibleIter.asInstanceOf[Iterator[Product2[K, C]]]
  }
  // [4] Reduce-side sorting: sort by key if a key ordering is defined. Sort-based shuffle
  // only orders data by partitionId by default; records inside a partition are not sorted,
  // so keyOrdering indicates whether the keys inside a partition need to be sorted.
  // Sort the output if there is a sort ordering defined.
  val resultIter: Iterator[Product2[K, C]] = dep.keyOrdering match {
    case Some(keyOrd: Ordering[K]) =>
      // Create an ExternalSorter to sort the data.
      val sorter =
        new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), serializer = dep.serializer)
      sorter.insertAllAndUpdateMetrics(aggregatedIter)
    case None =>
      aggregatedIter
  }
  // [5] Return the result iterator
  resultIter match {
    case _: InterruptibleIterator[Product2[K, C]] => resultIter
    case _ =>
      // Use another interruptible iterator here to support task cancellation as aggregator
      // or(and) sorter may have consumed previous interruptible iterator.
      new InterruptibleIterator[Product2[K, C]](context, resultIter)
  }
}
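As the code above shows, BlockStoreShuffleReader.read() reads data in five steps:
- [1] Initialize ShuffleBlockFetcherIterator, which fetches shuffle blocks from executors.
- [2] Deserialize the shuffle blocks into a record iterator.
- [3] Reduce-side aggregation: if the map side has already combined, merge the combiners that were read; otherwise aggregate the uncombined <k, v> records.
- [4] Reduce-side sorting: sort by key if a key ordering is defined. Sort-based shuffle only orders data by partitionId by default; records inside a partition are not sorted, so keyOrdering indicates whether the keys inside a partition need to be sorted.
- [5] Return the result iterator.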
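The lazy chaining in steps [2], [4], and [5], plus the record-count metric, can be illustrated with plain Scala iterators. This is only a minimal sketch, not Spark code: `ReadPipelineSketch`, `Metrics`, and the (UTF key, Int value) wire format are hypothetical names and choices made for this example, and the in-memory sort stands in for ExternalSorter.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object ReadPipelineSketch {
  // Hypothetical wire format for this sketch only: repeated (UTF key, Int value) pairs.
  def serialize(records: Seq[(String, Int)]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    records.foreach { case (k, v) => out.writeUTF(k); out.writeInt(v) }
    out.flush()
    bytes.toByteArray
  }

  // Step [2]: turn one fetched block into a lazy record iterator.
  def deserialize(block: Array[Byte]): Iterator[(String, Int)] = {
    val in = new DataInputStream(new ByteArrayInputStream(block))
    Iterator.continually(in).takeWhile(_.available() > 0).map(s => (s.readUTF(), s.readInt()))
  }

  // Stand-in for the task's shuffle read metrics.
  final class Metrics { var recordsRead = 0L }

  def read(
      blocks: Iterator[Array[Byte]],
      metrics: Metrics,
      keyOrdering: Option[Ordering[String]]): Iterator[(String, Int)] = {
    // Flatten all blocks into one record stream and count every record that passes through.
    val records = blocks.flatMap(deserialize).map { r => metrics.recordsRead += 1; r }
    // Step [4]: sort only when an ordering is defined (in memory here, unlike ExternalSorter).
    keyOrdering match {
      case Some(ord) => records.toVector.sortBy(_._1)(ord).iterator
      case None => records
    }
  }
}
```

Nothing is deserialized until the returned iterator is consumed, which mirrors why Spark can wrap the stream in several iterator layers without reading the data more than once.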
Next, let's analyze in detail how ShuffleBlockFetcherIterator fetches data.
How does ShuffleBlockFetcherIterator fetch data?
When the shuffle reader creates a ShuffleBlockFetcherIterator instance, the iterator calls its initialize() method.
// ShuffleBlockFetcherIterator
private[this] def initialize(): Unit = {
  // Add a task completion callback (called in both success case and failure case) to cleanup.
  context.addTaskCompletionListener(onCompleteCallback)
  // Local blocks to fetch, excluding zero-sized blocks.
  val localBlocks = mutable.LinkedHashSet[(BlockId, Int)]()
  val hostLocalBlocksByExecutor =
    mutable.LinkedHashMap[BlockManagerId, Seq[(BlockId, Long, Int)]]()
  val pushMergedLocalBlocks = mutable.LinkedHashSet[BlockId]()
  // [1] Partition the requests by data source: local, host-local, and remote blocks
  // Partition blocks by the different fetch modes: local, host-local, push-merged-local and
  // remote blocks.
  val remoteRequests = partitionBlocksByFetchMode(
    blocksByAddress, localBlocks, hostLocalBlocksByExecutor, pushMergedLocalBlocks)
  // [2] Add the remote requests to the queue in random order
  // Add the remote requests into our queue in a random order
  fetchRequests ++= Utils.randomize(remoteRequests)
  assert((0 == reqsInFlight) == (0 == bytesInFlight),
    "expected reqsInFlight = 0 but found reqsInFlight = " + reqsInFlight +
    ", expected bytesInFlight = 0 but found bytesInFlight = " + bytesInFlight)
  // [3] Send the remote fetch requests
  // Send out initial requests for blocks, up to our maxBytesInFlight
  fetchUpToMaxBytes()
  val numDeferredRequest = deferredFetchRequests.values.map(_.size).sum
  val numFetches = remoteRequests.size - fetchRequests.size - numDeferredRequest
  logInfo(s"Started $numFetches remote fetches in ${Utils.getUsedTimeNs(startTimeNs)}" +
    (if (numDeferredRequest > 0) s", deferred $numDeferredRequest requests" else ""))
  // [4] Fetch the local, host-local, and push-merged-local shuffle blocks
  // Get Local Blocks
  fetchLocalBlocks(localBlocks)
  logDebug(s"Got local blocks in ${Utils.getUsedTimeNs(startTimeNs)}")
  // Get host local blocks if any
  fetchAllHostLocalBlocks(hostLocalBlocksByExecutor)
  pushBasedFetchHelper.fetchAllPushMergedLocalBlocks(pushMergedLocalBlocks)
}
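Inside the shuffle fetch iterator, requesting the data takes the following six steps:
- [1] Partition the blocks by fetch mode: local, host-local, and remote blocks.
- [2] Add the remote requests to the queue in random order.
- [3] Send the remote fetch requests.
- [4] Fetch the local blocks.
- [5] Fetch the host-local blocks.
- [6] Fetch the push-merged local blocks.
Partitioning the requests by data source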
private[this] def partitionBlocksByFetchMode(
    blocksByAddress: Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])],
    localBlocks: mutable.LinkedHashSet[(BlockId, Int)],
    hostLocalBlocksByExecutor: mutable.LinkedHashMap[BlockManagerId, Seq[(BlockId, Long, Int)]],
    pushMergedLocalBlocks: mutable.LinkedHashSet[BlockId]): ArrayBuffer[FetchRequest] = {
  ...
  val fallback = FallbackStorage.FALLBACK_BLOCK_MANAGER_ID.executorId
  val localExecIds = Set(blockManager.blockManagerId.executorId, fallback)
  for ((address, blockInfos) <- blocksByAddress) {
    checkBlockSizes(blockInfos)
    // [1] For push-merged blocks, decide whether they are on this host or need a remote request
    if (pushBasedFetchHelper.isPushMergedShuffleBlockAddress(address)) {
      // These are push-merged blocks or shuffle chunks of these blocks.
      if (address.host == blockManager.blockManagerId.host) {
        numBlocksToFetch += blockInfos.size
        pushMergedLocalBlocks ++= blockInfos.map(_._1)
        pushMergedLocalBlockBytes += blockInfos.map(_._2).sum
      } else {
        collectFetchRequests(address, blockInfos, collectedRemoteRequests)
      }
    // [2] If the executor id is in localExecIds, add the blocks to localBlocks
    } else if (localExecIds.contains(address.executorId)) {
      val mergedBlockInfos = mergeContinuousShuffleBlockIdsIfNeeded(
        blockInfos.map(info => FetchBlockInfo(info._1, info._2, info._3)), doBatchFetch)
      numBlocksToFetch += mergedBlockInfos.size
      localBlocks ++= mergedBlockInfos.map(info => (info.blockId, info.mapIndex))
      localBlockBytes += mergedBlockInfos.map(_.size).sum
    // [3] If the blocks are host-local, add them to hostLocalBlocksByExecutor
    } else if (blockManager.hostLocalDirManager.isDefined &&
        address.host == blockManager.blockManagerId.host) {
      val mergedBlockInfos = mergeContinuousShuffleBlockIdsIfNeeded(
        blockInfos.map(info => FetchBlockInfo(info._1, info._2, info._3)), doBatchFetch)
      numBlocksToFetch += mergedBlockInfos.size
      val blocksForAddress =
        mergedBlockInfos.map(info => (info.blockId, info.size, info.mapIndex))
      hostLocalBlocksByExecutor += address -> blocksForAddress
      numHostLocalBlocks += blocksForAddress.size
      hostLocalBlockBytes += mergedBlockInfos.map(_.size).sum
    // [4] Otherwise the blocks are remote: collect fetch requests. The target size of each
    // request is max(maxBytesInFlight / 5, 1L), which raises concurrency by allowing parallel
    // fetches from up to 5 nodes instead of blocking on the output of a single node.
    } else {
      val (_, timeCost) = Utils.timeTakenMs[Unit] {
        collectFetchRequests(address, blockInfos, collectedRemoteRequests)
      }
      logDebug(s"Collected remote fetch requests for $address in $timeCost ms")
    }
  }
  val (remoteBlockBytes, numRemoteBlocks) =
    collectedRemoteRequests.foldLeft((0L, 0))((x, y) => (x._1 + y.size, x._2 + y.blocks.size))
  val totalBytes = localBlockBytes + remoteBlockBytes + hostLocalBlockBytes +
    pushMergedLocalBlockBytes
  val blocksToFetchCurrentIteration = numBlocksToFetch - prevNumBlocksToFetch
  ...
  this.hostLocalBlocks ++= hostLocalBlocksByExecutor.values
    .flatMap { infos => infos.map(info => (info._1, info._3)) }
  collectedRemoteRequests
}
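- [1] For push-merged blocks, decide whether they are on this host or need a remote request.
- [2] If the executor id is in localExecIds, add the blocks to localBlocks.
- [3] If the blocks are host-local, add them to hostLocalBlocksByExecutor.
- [4] Otherwise the blocks are remote: collect fetch requests. The target size of each request is max(maxBytesInFlight / 5, 1L), which raises concurrency by allowing parallel fetches from up to 5 nodes instead of blocking on the output of a single node, so that the resources of the nodes are used as fully as possible.
Once the requests have been classified, the iterator issues, in order, the remote fetch requests, the local block requests, the host-local block requests, and the push-merged local block requests.
So how is the data actually fetched? Let's look at the fetchUpToMaxBytes() method next.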
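The classification and the request-size cap described above can be sketched in isolation. This is an illustrative simplification, not Spark's implementation: `FetchModeSketch`, `Address`, `classify`, and `groupIntoRequests` are hypothetical names, push-merged blocks and the per-address block limit are left out, and blocks are represented only by their sizes.

```scala
object FetchModeSketch {
  final case class Address(executorId: String, host: String)

  sealed trait FetchMode
  case object Local extends FetchMode
  case object HostLocal extends FetchMode
  case object Remote extends FetchMode

  // Same executor => local; same host (when host-local reads are enabled) => host-local;
  // everything else has to go over the network.
  def classify(me: Address, target: Address, hostLocalReadEnabled: Boolean): FetchMode =
    if (target.executorId == me.executorId) Local
    else if (hostLocalReadEnabled && target.host == me.host) HostLocal
    else Remote

  // Target size of one remote request, as stated in the text: max(maxBytesInFlight / 5, 1L).
  def targetRequestSize(maxBytesInFlight: Long): Long = math.max(maxBytesInFlight / 5, 1L)

  // Greedily groups block sizes into requests, closing a request once it reaches the target.
  def groupIntoRequests(blockSizes: Seq[Long], targetSize: Long): Seq[Seq[Long]] = {
    val requests = Vector.newBuilder[Seq[Long]]
    var current = Vector.empty[Long]
    var currentSize = 0L
    for (size <- blockSizes) {
      current :+= size
      currentSize += size
      if (currentSize >= targetSize) {
        requests += current
        current = Vector.empty
        currentSize = 0L
      }
    }
    if (current.nonEmpty) requests += current
    requests.result()
  }
}
```

With a cap of one fifth of maxBytesInFlight per request, up to five requests can be in flight at once before the byte limit is reached, which is where the parallelism across nodes comes from.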
private def fetchUpToMaxBytes(): Unit = {
  // [1] For deferred requests: send a request if remote blocks are fetchable and the
  // remote address has not reached its in-flight limit
  if (deferredFetchRequests.nonEmpty) {
    for ((remoteAddress, defReqQueue) <- deferredFetchRequests) {
      while (isRemoteBlockFetchable(defReqQueue) &&
          !isRemoteAddressMaxedOut(remoteAddress, defReqQueue.front)) {
        val request = defReqQueue.dequeue()
        logDebug(s"Processing deferred fetch request for $remoteAddress with "
          + s"${request.blocks.length} blocks")
        send(remoteAddress, request)
        if (defReqQueue.isEmpty) {
          deferredFetchRequests -= remoteAddress
        }
      }
    }
  }
  // [2] For regular requests: send directly if fetchable; if the remote address has reached
  // its limit, queue the request as a deferred request for that remoteAddress
  // Process any regular fetch requests if possible.
  while (isRemoteBlockFetchable(fetchRequests)) {
    val request = fetchRequests.dequeue()
    val remoteAddress = request.address
    if (isRemoteAddressMaxedOut(remoteAddress, request)) {
      logDebug(s"Deferring fetch request for $remoteAddress with ${request.blocks.size} blocks")
      val defReqQueue = deferredFetchRequests.getOrElse(remoteAddress, new Queue[FetchRequest]())
      defReqQueue.enqueue(request)
      deferredFetchRequests(remoteAddress) = defReqQueue
    } else {
      send(remoteAddress, request)
    }
  }
}
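Fetching up to the byte limit:
- [1] For deferred requests: send a request if remote blocks are fetchable and the remote address has not reached its in-flight limit.
- [2] For regular requests: send directly if fetchable; if the remote address has reached its limit, queue the request as a deferred request for that remoteAddress.
The method checks whether a request should be treated as deferred. If so, the request is added to deferredFetchRequests. Otherwise, it goes ahead and sends the request through the BlockStoreClient implementation (ExternalBlockStoreClient if the shuffle service is enabled, NettyBlockTransferService otherwise).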
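The defer-or-send decision can be sketched with two small queues. This is a simplified model under stated assumptions, not Spark's code: `DeferredFetchSketch`, `Scheduler`, `Request`, and `complete` are hypothetical names, addresses are plain strings, only a byte limit and a per-address block limit are modeled, and "sending" just records the request.

```scala
import scala.collection.mutable

object DeferredFetchSketch {
  final case class Request(address: String, numBlocks: Int, bytes: Long)

  final class Scheduler(maxBytesInFlight: Long, maxBlocksPerAddress: Int) {
    val pending = mutable.Queue.empty[Request]
    val deferred = mutable.LinkedHashMap.empty[String, mutable.Queue[Request]]
    val sent = mutable.ArrayBuffer.empty[Request]
    private var bytesInFlight = 0L
    private val blocksInFlight = mutable.HashMap.empty[String, Int].withDefaultValue(0)

    // A queue is fetchable if nothing is in flight or its head still fits the byte budget.
    private def fetchable(q: mutable.Queue[Request]): Boolean =
      q.nonEmpty && (bytesInFlight == 0 || bytesInFlight + q.head.bytes <= maxBytesInFlight)

    // An address is maxed out if this request would exceed its in-flight block limit.
    private def maxedOut(r: Request): Boolean =
      blocksInFlight(r.address) + r.numBlocks > maxBlocksPerAddress

    private def send(r: Request): Unit = {
      bytesInFlight += r.bytes
      blocksInFlight(r.address) += r.numBlocks
      sent += r
    }

    // Called when a request finishes, freeing its share of the limits.
    def complete(r: Request): Unit = {
      bytesInFlight -= r.bytes
      blocksInFlight(r.address) -= r.numBlocks
    }

    def fetchUpToMaxBytes(): Unit = {
      // [1] Retry deferred requests first.
      for ((addr, q) <- deferred.toList) {
        while (fetchable(q) && !maxedOut(q.head)) send(q.dequeue())
        if (q.isEmpty) deferred -= addr
      }
      // [2] Then process regular requests, deferring those whose address is maxed out.
      while (fetchable(pending)) {
        val r = pending.dequeue()
        if (maxedOut(r)) deferred.getOrElseUpdate(r.address, mutable.Queue.empty[Request]) += r
        else send(r)
      }
    }
  }
}
```

A deferred request does not block requests for other addresses: the loop keeps draining the regular queue, and the deferred one is retried on the next call after an earlier request to the same address completes.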
// ShuffleBlockFetcherIterator
private[this] def sendRequest(req: FetchRequest): Unit = {
  // ...
  // [1] Create a BlockFetchingListener, which is invoked when the request completes
  val blockFetchingListener = new BlockFetchingListener {
    override def onBlockFetchSuccess(blockId: String, buf: ManagedBuffer): Unit = {
      // ...
      remainingBlocks -= blockId
      results.put(new SuccessFetchResult(BlockId(blockId), infoMap(blockId)._2,
        address, infoMap(blockId)._1, buf, remainingBlocks.isEmpty))
      // ...
    }
    override def onBlockFetchFailure(blockId: String, e: Throwable): Unit = {
      results.put(new FailureFetchResult(BlockId(blockId), infoMap(blockId)._2, address, e))
    }
  }
  // Fetch remote shuffle blocks to disk when the request is too large. Since the shuffle data is
  // already encrypted and compressed over the wire(w.r.t. the related configs), we can just fetch
  // the data and write it to file directly.
  // [2] If the request is larger than the maximum size that may be held in memory, the iterator
  // passes itself as the optional DownloadFileManager when sending the fetch request
  if (req.size > maxReqSizeShuffleToMem) {
    shuffleClient.fetchBlocks(address.host, address.port, address.executorId, blockIds.toArray,
      blockFetchingListener, this)
  } else {
    shuffleClient.fetchBlocks(address.host, address.port, address.executorId, blockIds.toArray,
      blockFetchingListener, null)
  }
}
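sendRequest mainly performs the following two steps:
- [1] Create a BlockFetchingListener, which is invoked when the request completes.
- [2] If the request is larger than the maximum size that may be held in memory, the iterator passes an optional DownloadFileManager when sending the fetch request.
First, ShuffleBlockFetcherIterator creates a BlockFetchingListener that defines the callbacks for success and failure. On success, it first takes the synchronized lock on the iterator and then adds the block data to the results variable. On failure, it likewise takes the synchronized lock first and then adds a marker object indicating that the fetch failed.
Second, ShuffleBlockFetcherIterator calls the fetchBlocks method of BlockStoreClient. Before the call it checks the size of the request; if the size exceeds the threshold, a DownloadFileManager is passed in, which causes the shuffle data to be downloaded to a temporary file.
Now let's see how fetchBlocks is ultimately implemented.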
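The two steps can be sketched as a callback that pushes results into a blocking queue, plus the size check that picks memory or disk. This is a minimal sketch with hypothetical names (`FetchListenerSketch`, `Listener`, `listenerFor`, `shouldFetchToDisk`); it locks on the queue for simplicity, whereas the text above describes locking on the iterator, and it uses byte arrays in place of ManagedBuffer.

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.collection.mutable

object FetchListenerSketch {
  sealed trait FetchResult
  final case class Success(blockId: String, data: Array[Byte], isLastInRequest: Boolean)
    extends FetchResult
  final case class Failure(blockId: String, error: Throwable) extends FetchResult

  // Simplified stand-in for BlockFetchingListener.
  trait Listener {
    def onSuccess(blockId: String, data: Array[Byte]): Unit
    def onFailure(blockId: String, e: Throwable): Unit
  }

  // Step [1]: the listener only records outcomes; the consuming thread reads the queue later.
  def listenerFor(blockIds: Seq[String], results: LinkedBlockingQueue[FetchResult]): Listener =
    new Listener {
      private val remaining = mutable.HashSet.empty[String] ++= blockIds

      def onSuccess(blockId: String, data: Array[Byte]): Unit = results.synchronized {
        remaining -= blockId
        // The last block of a request is flagged so the consumer knows the request is done.
        results.put(Success(blockId, data, remaining.isEmpty))
      }

      def onFailure(blockId: String, e: Throwable): Unit = results.synchronized {
        results.put(Failure(blockId, e))
      }
    }

  // Step [2]: requests above the in-memory threshold are streamed to a temporary file.
  def shouldFetchToDisk(requestSize: Long, maxReqSizeShuffleToMem: Long): Boolean =
    requestSize > maxReqSizeShuffleToMem
}
```

Because callbacks run on network threads while the task thread consumes results, handing data over through a thread-safe queue keeps the two sides decoupled.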
@Override
public void fetchBlocks(
    String host,
    int port,
    String execId,
    String[] blockIds,
    BlockFetchingListener listener,
    DownloadFileManager downloadFileManager) {
  checkInit();
  logger.debug("External shuffle fetch from {}:{} (executor id {})", host, port, execId);
  try {
    // [1] Create the RetryingBlockTransferor.BlockTransferStarter used to load the shuffle files
    int maxRetries = transportConf.maxIORetries();
    RetryingBlockTransferor.BlockTransferStarter blockFetchStarter =
        (inputBlockId, inputListener) -> {
          // Unless this client is closed.
          if (clientFactory != null) {
            assert inputListener instanceof BlockFetchingListener :
              "Expecting a BlockFetchingListener, but got " + inputListener.getClass();
            TransportClient client = clientFactory.createClient(host, port, maxRetries > 0);
            // [2] Create a OneForOneBlockFetcher and use it to download the shuffle data
            new OneForOneBlockFetcher(client, appId, execId, inputBlockId,
              (BlockFetchingListener) inputListener, transportConf, downloadFileManager).start();
          } else {
            logger.info("This clientFactory was closed. Skipping further block fetch retries.");
          }
        };
    ...
    // [3] Start the transfer, which calls OneForOneBlockFetcher's start method
    blockFetchStarter.createAndStart(blockIds, listener);
  }
}
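- [1] Create and initialize the RetryingBlockTransferor starter (the class was named RetryingBlockFetcher in earlier Spark versions), which is used to load the shuffle files.
- [2] Create a OneForOneBlockFetcher and use it to download the shuffle data.
- [3] Start the transfer, which calls OneForOneBlockFetcher's start method.
Downloading shuffle data with OneForOneBlockFetcher
OneForOneBlockFetcher uses RPC to obtain shuffle data from the executors. A brief overview first:
- First, the fetcher sends a FetchShuffleBlocks message to the executor that holds the shuffle files.
- Next, the executor registers a new stream and returns a StreamHandle message, which carries the streamId, to the fetcher.
- After receiving the StreamHandle response, the client streams or loads the data blocks.
- If downloadFileManager is not null, the result is written to a temporary file; in the in-memory case, the shuffle bytes are loaded into an in-memory buffer.
- Finally, whether the data went to a temporary file or to memory, the BlockFetchingListener callback defined in sendRequest is invoked.
The fetched shuffle data is put into a LinkedBlockingQueue[FetchResult], from which the iterator's next() method takes results. Once all available block data has been consumed, the iterator runs fetchUpToMaxBytes() again, as described earlier.
After ShuffleBlockFetcherIterator is initialized
With ShuffleBlockFetcherIterator initialized, let's look at the remaining work: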
private class ShuffleFetchCompletionListener(var data: ShuffleBlockFetcherIterator)
  extends TaskCompletionListener {
  override def onTaskCompletion(context: TaskContext): Unit = {
    if (data != null) {
      data.cleanup()
      data = null
    }
  }
  def onComplete(context: TaskContext): Unit = this.onTaskCompletion(context)
}
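Once ShuffleBlockFetcherIterator is initialized, it is converted into a CompletionIterator, whose main job is to release resources. A deserializer then turns the shuffle blocks into a record iterator, which is wrapped as metricIter to update the task's metrics and then wrapped again as an InterruptibleIterator. Each time hasNext is called, the interruptible iterator checks the task's state and, if the task has been marked as killed, terminates the task hosting the iterator. This mainly matters when speculative execution is enabled.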
val interruptibleIter = new InterruptibleIterator[(Any, Any)](context, metricIter)
// InterruptibleIterator
def hasNext: Boolean = {
  // TODO(aarondav/rxin): Check Thread.interrupted instead of context.interrupted if interrupt
  // is allowed. The assumption is that Thread.interrupted does not have a memory fence in read
  // (just a volatile field in C), while context.interrupted is a volatile in the JVM, which
  // introduces an expensive read fence.
  context.killTaskIfInterrupted()
  delegate.hasNext
}
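The two wrappers can be illustrated with small stand-alone classes. This is a sketch, not Spark's CompletionIterator or InterruptibleIterator: `CompletionIter` and `InterruptibleIter` are hypothetical names, an AtomicBoolean stands in for the task context's kill flag, and InterruptedException stands in for the exception Spark raises when a task is killed.

```scala
import java.util.concurrent.atomic.AtomicBoolean

object IteratorWrappersSketch {
  // Runs `onComplete` exactly once, when the wrapped iterator is first found exhausted.
  final class CompletionIter[A](delegate: Iterator[A], onComplete: () => Unit)
    extends Iterator[A] {
    private var completed = false
    def hasNext: Boolean = {
      val more = delegate.hasNext
      if (!more && !completed) {
        completed = true
        onComplete() // e.g. release buffers or merge metrics
      }
      more
    }
    def next(): A = delegate.next()
  }

  // Checks the kill flag on every hasNext call, so a long-running read can be cancelled.
  final class InterruptibleIter[A](killed: AtomicBoolean, delegate: Iterator[A])
    extends Iterator[A] {
    def hasNext: Boolean = {
      if (killed.get()) throw new InterruptedException("task was killed")
      delegate.hasNext
    }
    def next(): A = delegate.next()
  }
}
```

Checking the flag in hasNext rather than next means cancellation is noticed between records, at the point where the consumer asks for more data.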
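Next come the reduce-side aggregation and sorting. Note that these require aggregator and keyOrdering to be defined in the ShuffleDependency, which is done by the operations in PairRDDFunctions.
Spark SQL, however, uses ShuffleExchangeExec, which defines neither aggregator nor keyOrdering. So how does Spark SQL implement aggregation and sorting?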
val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) {
  ...
} else {
  interruptibleIter.asInstanceOf[Iterator[Product2[K, C]]]
}
val resultIter: Iterator[Product2[K, C]] = dep.keyOrdering match {
  case Some(keyOrd: Ordering[K]) =>
    val sorter =
      new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), serializer = dep.serializer)
    sorter.insertAllAndUpdateMetrics(aggregatedIter)
  case None =>
    aggregatedIter
}
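In fact, the execution plan shows the answer: Spark SQL inserts Sort operators into the plan to implement sorting for aggregation.
That completes a rough pass through the shuffle reader. Many important details were not explored in depth, so here is a more detailed summary of the overall flow: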
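Preparation before the fetch
- The shuffle reader is invoked mainly from ShuffledRDD and ShuffledRowRDD; different partition specs result in different arguments being passed to getReader.
- In getReader, mapOutputTracker is first used to look up the locations of the shuffle files for each mapId, and the shuffle fetch is then carried out by BlockStoreShuffleReader, the only reader implementation.
- On the driver, the mapping from mapId to file location is maintained by MapOutputTrackerMaster: the shuffleId is registered with the master tracker when the shuffle map stage is created, and the locations of the mapIds under that shuffleId are updated when the map stage completes. On executors, locations are read from MapOutputTrackerWorker; if they are not available there, the worker sends a message to the master tracker and syncs the information over.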
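The register, update, and lookup flow between the two trackers can be sketched as follows. This is a toy model with hypothetical names (`MapOutputTrackerSketch`, `Master`, `Worker`, `remoteLookups`): locations are plain strings, and a direct method call stands in for the RPC an executor would send to the driver.

```scala
import scala.collection.mutable

object MapOutputTrackerSketch {
  // Driver side: one slot per map task of each registered shuffle.
  final class Master {
    private val statuses = mutable.HashMap.empty[Int, Array[Option[String]]]

    // Called when the shuffle map stage is created.
    def registerShuffle(shuffleId: Int, numMaps: Int): Unit = {
      statuses(shuffleId) = Array.fill[Option[String]](numMaps)(None)
    }

    // Called when a map task finishes and its output location is known.
    def registerMapOutput(shuffleId: Int, mapIndex: Int, location: String): Unit = {
      statuses(shuffleId)(mapIndex) = Some(location)
    }

    def getStatuses(shuffleId: Int): Seq[Option[String]] = statuses(shuffleId).toSeq
  }

  // Executor side: answers from a local cache and asks the master only on a miss.
  final class Worker(master: Master) {
    private val cache = mutable.HashMap.empty[Int, Seq[Option[String]]]
    var remoteLookups = 0

    def getLocations(shuffleId: Int): Seq[Option[String]] =
      cache.getOrElseUpdate(shuffleId, {
        remoteLookups += 1 // stands in for the message sent to the master tracker
        master.getStatuses(shuffleId)
      })
  }
}
```

Caching on the worker means that many reduce tasks on one executor share a single lookup per shuffle instead of each contacting the driver.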
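Handling the fetch requests
- When BlockStoreShuffleReader performs the fetch, it first creates a ShuffleBlockFetcherIterator and splits the fetch into local, host-local, and remote modes. The fetch is also subject to limits, including the number of in-flight fetch requests per executor and whether the fetched shuffle data exceeds the allotted memory. If the requested data exceeds the memory limit, it is written to a temporary file instead. If too much data is already in flight over the network, the fetcher stops issuing requests and resumes when the next shuffle block is needed.
- The final fetch is done by OneForOneBlockFetcher: the fetcher sends a FetchShuffleBlocks message to the executor that holds the shuffle files, the executor registers a new stream and returns a StreamHandle message to the fetcher, and the client then loads the data blocks. At the end, the BlockFetchingListener callback is invoked.
Processing after the fetch
- Reduce-side aggregation: if the map side has already combined, merge the combiners that were read; otherwise aggregate the uncombined <k, v> records.
- Reduce-side sorting: sort by key if a key ordering is defined. Sort-based shuffle only orders data by partitionId by default; records inside a partition are not sorted, so keyOrdering indicates whether the keys inside a partition need to be sorted.
- Also note that Spark SQL does not set sorting or aggregation on the ShuffleDependency; instead, rules insert Sort operators into the query plan.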
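The difference between the two reduce-side aggregation paths can be shown with a small in-memory aggregator. The three function names mirror the fields of Spark's Aggregator, but everything else is an assumption of this sketch: `ReduceSideAggregationSketch` is a hypothetical name, the methods take no task context, and a LinkedHashMap replaces the spillable map Spark uses.

```scala
import scala.collection.mutable

object ReduceSideAggregationSketch {
  final case class Aggregator[K, V, C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C) {

    // No map-side combine: the reducer sees raw values and builds combiners itself.
    def combineValuesByKey(iter: Iterator[(K, V)]): Iterator[(K, C)] = {
      val acc = mutable.LinkedHashMap.empty[K, C]
      iter.foreach { case (k, v) =>
        acc(k) = acc.get(k).map(c => mergeValue(c, v)).getOrElse(createCombiner(v))
      }
      acc.iterator
    }

    // Map-side combine already happened: the reducer only merges partial combiners.
    def combineCombinersByKey(iter: Iterator[(K, C)]): Iterator[(K, C)] = {
      val acc = mutable.LinkedHashMap.empty[K, C]
      iter.foreach { case (k, c) =>
        acc(k) = acc.get(k).map(prev => mergeCombiners(prev, c)).getOrElse(c)
      }
      acc.iterator
    }
  }
}
```

For a sum, both paths yield the same totals; the map-side variant simply ships fewer, pre-merged records across the network.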
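Having worked through the shuffle reader, here are some questions to think about:
- Why does the call to getReader pass different arguments depending on the partition specs? What is the main purpose?
- What is the biggest difference between a remote fetch and a local fetch?
- What are the roles of InterruptibleIterator and CompletionIterator?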