In Spark Source Code Reading: Shuffle Module ①, we went through the evolution of shuffle across Spark releases and introduced the two main shuffle strategies, hash-based shuffle and sort-based shuffle, analyzing how each works and how its shuffle write path is implemented. The step in between, namely how the results of a ShuffleMapTask are handled, was already covered in Spark Source Code Reading: Executor Module ③. This article continues with the downstream shuffle read process. All source code in this article is based on Spark 1.6.3.
shuffle read
Shuffle read begins when a downstream reducer reads the intermediate files that were written to disk. Apart from tasks that read from external storage or from RDDs that have already been cached or checkpointed, an ordinary task starts its reduce journey with the shuffle read of a ShuffledRDD.
Let's first look at ShuffledRDD's compute() method:
override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
  val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
  SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
    .read()
    .asInstanceOf[Iterator[(K, C)]]
}
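For context, here is a minimal driver-side sketch of how a shuffle read gets triggered (assuming an existing SparkContext named sc, which is not part of the source above): reduceByKey introduces a ShuffleDependency, so the tasks of the downstream stage end up in ShuffledRDD.compute() and therefore in the shuffle read path discussed here.

// Hypothetical example: any job over a ShuffledRDD exercises the shuffle read path.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)
val reduced = pairs.reduceByKey(_ + _)  // builds a ShuffledRDD[String, Int, Int]
reduced.collect()                       // the reduce-stage tasks call compute(), which calls read()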
compute() calls ShuffleManager's getReader method to obtain a reader. As mentioned before, ShuffleManager has two implementations, HashShuffleManager and SortShuffleManager, corresponding to the two strategies. During shuffle read, however, both getReader methods create the same BlockStoreShuffleReader object, which means the shuffle read process is identical for both. So let's step into BlockStoreShuffleReader's read() method:
// Core implementation of shuffle read: fetch the map output and aggregate it
override def read(): Iterator[Product2[K, C]] = {
  val blockFetcherItr = new ShuffleBlockFetcherIterator(
    context,
    blockManager.shuffleClient,
    blockManager,
    mapOutputTracker.getMapSizesByExecutorId(handle.shuffleId, startPartition, endPartition),
    // Note: we use getSizeAsMb when no suffix is provided for backwards compatibility
    SparkEnv.get.conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024)

  // Wrap the streams for compression based on configuration
  val wrappedStreams = blockFetcherItr.map { case (blockId, inputStream) =>
    blockManager.wrapForCompression(blockId, inputStream) // wrap the input with compression if configured
  }

  val ser = Serializer.getSerializer(dep.serializer)
  val serializerInstance = ser.newInstance() // obtain a serializer instance

  // Create a key/value iterator for each stream
  val recordIter = wrappedStreams.flatMap { wrappedStream =>
    // Note: the asKeyValueIterator below wraps a key/value iterator inside of a
    // NextIterator. The NextIterator makes sure that close() is called on the
    // underlying InputStream when all records have been read.
    serializerInstance.deserializeStream(wrappedStream).asKeyValueIterator // deserialize the input into a key/value iterator
  }

  // Update the context task metrics for each record read.
  val readMetrics = context.taskMetrics.createShuffleReadMetricsForDependency()
  val metricIter = CompletionIterator[(Any, Any), Iterator[(Any, Any)]](
    recordIter.map(record => {
      readMetrics.incRecordsRead(1)
      record
    }),
    context.taskMetrics().updateShuffleReadMetrics())

  // An interruptible iterator must be used here in order to support task cancellation
  val interruptibleIter = new InterruptibleIterator[(Any, Any)](context, metricIter) // an iterator that can be cancelled

  val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) { // aggregation required
    if (dep.mapSideCombine) { // the map side has already combined the data
      // We are reading values that are already combined
      val combinedKeyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, C)]]
      dep.aggregator.get.combineCombinersByKey(combinedKeyValuesIterator, context)
    } else { // only reduce-side aggregation is needed
      // We don't know the value type, but also don't care -- the dependency *should*
      // have made sure its compatible w/ this aggregator, which will convert the value
      // type to the combined type C
      val keyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, Nothing)]]
      dep.aggregator.get.combineValuesByKey(keyValuesIterator, context)
    }
  } else { // no aggregation needed
    require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
    interruptibleIter.asInstanceOf[Iterator[Product2[K, C]]]
  }

  // Sort the output if there is a sort ordering defined.
  dep.keyOrdering match { // check whether sorting is required
    case Some(keyOrd: Ordering[K]) => // sorting is required
      // Create an ExternalSorter to sort the data. Note that if spark.shuffle.spill is disabled,
      // the ExternalSorter won't spill to disk.
      val sorter =
        new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), serializer = Some(ser))
      sorter.insertAll(aggregatedIter)
      context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled)
      context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled)
      context.internalMetricsToAccumulators(
        InternalAccumulator.PEAK_EXECUTION_MEMORY).add(sorter.peakMemoryUsedBytes)
      CompletionIterator[Product2[K, C], Iterator[Product2[K, C]]](sorter.iterator, sorter.stop())
    case None =>
      aggregatedIter
  }
}
The code above is annotated; it breaks down into three functional blocks:
- Deserialize the fetched streams into a key/value iterator and update the task context's shuffle read metrics
- Aggregate the data if the incoming ShuffleDependency defines an aggregator
- Sort the data if the dependency defines a key ordering
The aggregator and keyOrdering correspond to the same parameters used during shuffle write; their implementations are fairly straightforward, so we won't analyze them in detail here. Our main focus is how the downstream side actually fetches the data, which, together with the previous article, closes the loop on the whole shuffle process.
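Before moving on, here is a purely illustrative sketch (not the Spark implementation) of what reduce-side aggregation does conceptually: fold each incoming value into a per-key combiner. Spark's Aggregator performs the same fold through an ExternalAppendOnlyMap so that it can spill to disk under memory pressure; the function and parameter names below are made up for illustration.

def combineValuesByKeySketch[K, V, C](
    records: Iterator[(K, V)],
    createCombiner: V => C,
    mergeValue: (C, V) => C): Iterator[(K, C)] = {
  val combiners = scala.collection.mutable.HashMap.empty[K, C]
  records.foreach { case (k, v) =>
    combiners(k) = combiners.get(k) match {
      case Some(c) => mergeValue(c, v)   // key seen before: fold the new value into its combiner
      case None    => createCombiner(v)  // first value for this key: create the initial combiner
    }
  }
  combiners.iterator
}

// e.g. summing values per key:
// combineValuesByKeySketch(Iterator(("a", 1), ("a", 2), ("b", 3)), (v: Int) => v, (c: Int, v: Int) => c + v)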
block fetch
In the first block of read(), a ShuffleBlockFetcherIterator object is created. It behaves as an iterator of (BlockId, InputStream) pairs and is responsible for fetching multiple blocks of the intermediate files. During construction it first calls its initialize() method, whose source is shown below:
private[this] def initialize(): Unit = {
  // Add a task completion callback (called in both success case and failure case) to cleanup.
  context.addTaskCompletionListener(_ => cleanup())

  // Split local and remote blocks.
  // blocks located on other nodes have to be fetched over the network
  val remoteRequests: ArrayBuffer[FetchRequest] = splitLocalRemoteBlocks()
  // Add the remote requests into our queue in a random order
  fetchRequests ++= Utils.randomize(remoteRequests)

  // Send out initial requests for blocks, up to our maxBytesInFlight
  // the data in flight is capped at maxBytesInFlight (48MB by default), split across up to 5 parallel requests
  fetchUpToMaxBytes()

  val numFetches = remoteRequests.size - fetchRequests.size
  logInfo("Started " + numFetches + " remote fetches in" + Utils.getUsedTimeMs(startTime))

  // Get Local Blocks
  // blocks stored locally can simply be read directly
  fetchLocalBlocks()
  logDebug("Got local blocks in " + Utils.getUsedTimeMs(startTime))
}
There are two kinds of blocks to fetch: remote blocks and local blocks. Data that is not on the local node has to be pulled over the network, which consumes bandwidth, so the system constrains remote fetches with the two rules summarized after the next listing. The split itself is implemented in the splitLocalRemoteBlocks method:
private[this] def splitLocalRemoteBlocks(): ArrayBuffer[FetchRequest] = {
  // Make remote requests at most maxBytesInFlight / 5 in length; the reason to keep them
  // smaller than maxBytesInFlight is to allow multiple, parallel fetches from up to 5
  // nodes, rather than blocking on reading output from one node.
  // i.e. read from at most 5 nodes in parallel,
  // and each request carries at most one fifth of maxBytesInFlight
  val targetRequestSize = math.max(maxBytesInFlight / 5, 1L)
  logDebug("maxBytesInFlight: " + maxBytesInFlight + ", targetRequestSize: " + targetRequestSize)

  // Split local and remote blocks. Remote blocks are further split into FetchRequests of size
  // at most maxBytesInFlight in order to limit the amount of data in flight.
  val remoteRequests = new ArrayBuffer[FetchRequest]

  // Tracks total number of blocks (including zero sized blocks)
  var totalBlocks = 0
  for ((address, blockInfos) <- blocksByAddress) {
    totalBlocks += blockInfos.size
    if (address.executorId == blockManager.blockManagerId.executorId) {
      // Filter out zero-sized blocks
      // zero-sized local blocks are skipped
      localBlocks ++= blockInfos.filter(_._2 != 0).map(_._1)
      numBlocksToFetch += localBlocks.size
    } else { // blocks that have to be fetched remotely
      val iterator = blockInfos.iterator
      var curRequestSize = 0L
      var curBlocks = new ArrayBuffer[(BlockId, Long)]
      while (iterator.hasNext) {
        val (blockId, size) = iterator.next()
        // Skip empty blocks
        if (size > 0) {
          curBlocks += ((blockId, size))
          remoteBlocks += blockId
          numBlocksToFetch += 1
          curRequestSize += size
        } else if (size < 0) {
          throw new BlockException(blockId, "Negative block size " + size)
        }
        if (curRequestSize >= targetRequestSize) {
          // Add this FetchRequest
          remoteRequests += new FetchRequest(address, curBlocks)
          curBlocks = new ArrayBuffer[(BlockId, Long)]
          logDebug(s"Creating fetch request of $curRequestSize at $address")
          curRequestSize = 0
        }
      }
      // Add in the final request
      if (curBlocks.nonEmpty) {
        remoteRequests += new FetchRequest(address, curBlocks)
      }
    }
  }
  logInfo(s"Getting $numBlocksToFetch non-empty blocks out of $totalBlocks blocks")
  remoteRequests
}
From this logic we can derive the strategy used when fetching blocks over the network:
- Data is read from at most 5 nodes in parallel at any given time
- Each request asks for no more than one fifth of spark.reducer.maxSizeInFlight (48MB by default)
The purpose is twofold: to avoid saturating the bandwidth of any single node, and to parallelize the requests so that the overall fetch time is shorter.
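The grouping policy can be illustrated with a simplified, standalone sketch (not the Spark code itself; the Block and Request types below are made up): blocks destined for one remote executor are packed into requests whose accumulated size reaches roughly maxBytesInFlight / 5, so that up to five requests can be in flight in parallel.

case class Block(id: String, size: Long)
case class Request(blocks: Seq[Block]) { def size: Long = blocks.map(_.size).sum }

def groupIntoRequests(blocks: Seq[Block], maxBytesInFlight: Long): Seq[Request] = {
  val targetRequestSize = math.max(maxBytesInFlight / 5, 1L)
  val requests = scala.collection.mutable.ArrayBuffer.empty[Request]
  var current = scala.collection.mutable.ArrayBuffer.empty[Block]
  var currentSize = 0L
  blocks.filter(_.size > 0).foreach { b =>   // zero-sized blocks are skipped, as in the real code
    current += b
    currentSize += b.size
    if (currentSize >= targetRequestSize) {  // close the current request once it reaches the target
      requests += Request(current.toSeq)
      current = scala.collection.mutable.ArrayBuffer.empty[Block]
      currentSize = 0L
    }
  }
  if (current.nonEmpty) requests += Request(current.toSeq)  // don't forget the final partial request
  requests.toSeq
}

// e.g. ten 2MB blocks with maxBytesInFlight = 48MB are packed into two requests of about 10MB each:
// groupIntoRequests((1 to 10).map(i => Block(s"block_$i", 2L << 20)), 48L << 20)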
With the requests split up, the next step is to send them by calling fetchUpToMaxBytes():
private def fetchUpToMaxBytes(): Unit = {
  // Send fetch requests up to maxBytesInFlight
  while (fetchRequests.nonEmpty &&
    (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
    sendRequest(fetchRequests.dequeue())
  }
}
As long as the bytes already in flight plus the size of the next queued request do not exceed maxBytesInFlight, requests keep being dequeued and sent via sendRequest():
private[this] def sendRequest(req: FetchRequest) {
  logDebug("Sending request for %d blocks (%s) from %s".format(
    req.blocks.size, Utils.bytesToString(req.size), req.address.hostPort))
  bytesInFlight += req.size

  // so we can look up the size of each blockID
  val sizeMap = req.blocks.map { case (blockId, size) => (blockId.toString, size) }.toMap
  val blockIds = req.blocks.map(_._1.toString)
  val address = req.address

  // the shuffleClient used to fetch blocks over the network is a NettyBlockTransferService,
  // which extends the abstract BlockTransferService
  shuffleClient.fetchBlocks(address.host, address.port, address.executorId, blockIds.toArray,
    new BlockFetchingListener {
      override def onBlockFetchSuccess(blockId: String, buf: ManagedBuffer): Unit = {
        // Only add the buffer to results queue if the iterator is not zombie,
        // i.e. cleanup() has not been called yet.
        if (!isZombie) {
          // Increment the ref count because we need to pass this to a different thread.
          // This needs to be released after use.
          buf.retain()
          results.put(new SuccessFetchResult(BlockId(blockId), address, sizeMap(blockId), buf))
          shuffleMetrics.incRemoteBytesRead(buf.size)
          shuffleMetrics.incRemoteBlocksFetched(1)
        }
        logTrace("Got remote block " + blockId + " after " + Utils.getUsedTimeMs(startTime))
      }

      override def onBlockFetchFailure(blockId: String, e: Throwable): Unit = {
        logError(s"Failed to get block(s) from ${req.address.host}:${req.address.port}", e)
        results.put(new FailureFetchResult(BlockId(blockId), address, e))
      }
    }
  )
}
Blocks are pulled through a ShuffleClient instance. ShuffleClient has several implementations; the one that fetches blocks over the network is NettyBlockTransferService, a concrete subclass of the abstract BlockTransferService. Its fetchBlocks method takes the target host, port, and executorId, and then retrieves the data using the Netty protocol.
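The fetch is asynchronous: the listener's callbacks run on Netty's threads, while the reading task consumes results from a blocking queue. The following conceptual sketch (not the Spark class itself; all names are made up) shows the producer/consumer hand-off that ShuffleBlockFetcherIterator builds around its internal results queue.

import java.util.concurrent.LinkedBlockingQueue

sealed trait FetchResult
case class SuccessResult(blockId: String, bytes: Array[Byte]) extends FetchResult
case class FailureResult(blockId: String, cause: Throwable) extends FetchResult

class FetchResultQueue {
  private val results = new LinkedBlockingQueue[FetchResult]()

  // called from Netty callback threads, playing the role of onBlockFetchSuccess / onBlockFetchFailure
  def onSuccess(blockId: String, bytes: Array[Byte]): Unit = results.put(SuccessResult(blockId, bytes))
  def onFailure(blockId: String, cause: Throwable): Unit = results.put(FailureResult(blockId, cause))

  // called from the task thread: the iterator's next() blocks here until a result arrives
  def nextResult(): FetchResult = results.take()
}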
Next, let's look at how local blocks are fetched:
private[this] def fetchLocalBlocks() {
  val iter = localBlocks.iterator
  while (iter.hasNext) {
    val blockId = iter.next()
    try {
      val buf = blockManager.getBlockData(blockId)
      shuffleMetrics.incLocalBlocksFetched(1)
      shuffleMetrics.incLocalBytesRead(buf.size)
      buf.retain()
      results.put(new SuccessFetchResult(blockId, blockManager.blockManagerId, 0, buf))
    } catch {
      case e: Exception =>
        // If we see an exception, stop immediately.
        logError(s"Error occurred while fetching local blocks", e)
        results.put(new FailureFetchResult(blockId, blockManager.blockManagerId, e))
        return
    }
  }
}
As we can see, local blocks are read directly through the BlockManager's getBlockData method. For blocks produced by the shuffle, getBlockData has two implementations, one per strategy:
The Hash strategy is implemented by FileShuffleBlockResolver.
The Sort strategy is implemented by IndexShuffleBlockResolver.
The difference is that the Sort strategy's getBlockData first has to consult the index file to locate the FileSegment that holds the data, whereas the Hash strategy can map the blockId directly to a file.
Here is IndexShuffleBlockResolver's getBlockData method:
override def getBlockData(blockId: ShuffleBlockId): ManagedBuffer = {
  // The block is actually going to be a range of a single map output file for this map, so
  // find out the consolidated file, then the offset within that from our index
  val indexFile = getIndexFile(blockId.shuffleId, blockId.mapId)

  val in = new DataInputStream(new FileInputStream(indexFile))
  try {
    ByteStreams.skipFully(in, blockId.reduceId * 8) // skip to this block's entry in the index
    val offset = in.readLong()     // start offset within the data file
    val nextOffset = in.readLong() // end offset within the data file
    new FileSegmentManagedBuffer(
      transportConf,
      getDataFile(blockId.shuffleId, blockId.mapId),
      offset,
      nextOffset - offset)
  } finally {
    in.close()
  }
}
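The index file format that getBlockData relies on is simply numPartitions + 1 longs, where entry i is the byte offset of reduce partition i within the single consolidated data file. Below is a minimal, self-contained sketch (with made-up file names and offsets) of writing and reading such an index.

import java.io.{DataInputStream, DataOutputStream, File, FileInputStream, FileOutputStream}

object IndexFileSketch {
  def main(args: Array[String]): Unit = {
    // write a fake index for a map output with 3 reduce partitions: numPartitions + 1 offsets
    val indexFile = File.createTempFile("shuffle_0_0_0", ".index")
    val out = new DataOutputStream(new FileOutputStream(indexFile))
    Seq(0L, 120L, 300L, 450L).foreach(out.writeLong)
    out.close()

    // locate the segment for reduceId = 1, just like getBlockData does
    val reduceId = 1
    val in = new DataInputStream(new FileInputStream(indexFile))
    try {
      in.skipBytes(reduceId * 8)          // each index entry is one 8-byte long
      val offset = in.readLong()          // start of partition 1's segment in the data file
      val nextOffset = in.readLong()      // start of the next segment
      println(s"partition $reduceId -> bytes [$offset, $nextOffset), length ${nextOffset - offset}")
    } finally {
      in.close()
    }
  }
}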
Performance tuning
After these two articles on the architecture and source-level implementation of shuffle, it should be clear that shuffle is one of the more complex modules in Spark Core and has a significant impact on performance. Here is a summary of the configuration options in the shuffle module that affect performance:
spark.shuffle.manager
This parameter selects the shuffle mechanism: Hash or Sort. Since Spark 1.2 the default has changed from Hash to Sort, and as of 2.0 the Hash mechanism has been removed entirely. The choice between Hash and Sort depends on memory, sorting cost, file handling, and other factors; if the number of intermediate files produced is not large, Hash mode may be the better choice because it avoids unnecessary sorting.
spark.shuffle.sort.bypassMergeThreshold
The default value is 200. When the number of reducer partitions is below this threshold, Sort Based Shuffle does not merge-sort the data internally; instead, each partition is written directly to its own file. This can be seen as a compromise in which Sort Based Shuffle falls back to something close to Hash Based Shuffle when the shuffle volume is small. It therefore also suffers from the problem of too many intermediate files, so if GC pressure or memory usage is high, lowering this value appropriately can help.
spark.shuffle.compress and spark.shuffle.spill.compress
Both default to true. The former controls whether the final shuffle output written to the file system is compressed; the latter controls whether data spilled to external storage during the shuffle is compressed.
spark.shuffle.compress
If the network IO of downstream tasks reading upstream output becomes a bottleneck, enabling compression can reduce the amount of data on the wire; if the computation is CPU-bound, setting this option to false is more appropriate.
spark.shuffle.spill.compress
If spilling intermediate results to local disk causes heavy disk IO, keeping this set to true to enable compression is probably the right call; if the local disks are SSDs, false is usually a better fit.
In short, you need to weigh the CPU time spent on compression and decompression against disk and network IO in your own workload; it is a case-by-case trade-off.
spark.reducer.maxSizeInFlight
This parameter limits the maximum amount of memory a reducer task uses for shuffle data requested from other executors; the default is 48MB. If network bandwidth is tight, consider lowering it; on a 10-gigabit network, consider increasing it.
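To close, here is a hedged example of wiring these knobs up through SparkConf (the values are only illustrative, not recommendations; the defaults discussed above are usually a reasonable starting point):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("shuffle-tuning-example")
  .set("spark.shuffle.manager", "sort")                   // "hash" is still available in 1.6
  .set("spark.shuffle.sort.bypassMergeThreshold", "200")  // bypass merge-sort below this many reduce partitions
  .set("spark.shuffle.compress", "true")                  // compress the shuffle output files
  .set("spark.shuffle.spill.compress", "true")            // compress data spilled during the shuffle
  .set("spark.reducer.maxSizeInFlight", "48m")            // per-reducer budget for in-flight fetches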