In our company's wind-power big data project, we ran into an error like the following:
Job aborted due to stage failure: Task 2 in stage 111.0 failed 4 times, most recent failure: Lost task 2.3 in stage 111.0 (TID 1270, 194.232.55.23, executor 2): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {winSink-2=161803385}
To achieve exactly-once semantics, we chose to manage offsets ourselves: each time the Spark application is submitted, it resumes consuming from the position where the previous run stopped. This approach, however, can run into exactly the OffsetOutOfRangeException shown above, which is usually caused by Kafka's retention expiration.
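For context, here is a minimal sketch of that pattern with the spark-streaming-kafka-0-10 direct API, resuming from offsets read back out of our own store. The broker address, group id and the `loadStoredOffsets` helper are hypothetical placeholders; only the topic and offset are taken from the error above.

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object ResumeFromStoredOffsets {
  // Hypothetical helper: read back the offsets persisted at the end of the previous run.
  def loadStoredOffsets(): Map[TopicPartition, Long] =
    Map(new TopicPartition("winSink", 2) -> 161803385L)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("win-sink-stream")
    val ssc  = new StreamingContext(conf, Seconds(30))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",          // placeholder broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "win-sink-group",        // placeholder group id
      // We commit offsets ourselves, not via the consumer.
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val fromOffsets = loadStoredOffsets()

    // Start the direct stream exactly where the previous run stopped.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Seq("winSink"), kafkaParams, fromOffsets)
    )

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the batch, then persist `ranges` back to our own store ...
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The important detail is that `Subscribe` is given an explicit offsets map, so consumption starts exactly at the stored positions rather than at whatever the group last committed.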
In Kafka's consumer configuration, the setting to pay attention to is this one (quoted from Kafka's ConsumerConfig source):
public static final String AUTO_OFFSET_RESET_CONFIG = "auto.offset.reset";
public static final String AUTO_OFFSET_RESET_DOC = "What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server (e.g. because that data has been deleted): <ul><li>earliest: automatically reset the offset to the earliest offset<li>latest: automatically reset the offset to the latest offset</li><li>none: throw exception to the consumer if no previous offset is found for the consumer's group</li><li>anything else: throw exception to the consumer.</li></ul>";
When you access Kafka directly, for example through the Kafka client, even if you specify an offset that does not exist, i.e. one above the upper bound or below the lower bound, Kafka will reset your offset according to this setting, e.g. to earliest or latest.
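A minimal sketch of that client-side behaviour, assuming a reachable broker (the broker address and group id are placeholders; the topic is the one from the error above): with auto.offset.reset=earliest, seeking past the retained range does not fail, the next poll() simply resets the position.

```scala
import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

object AutoOffsetResetDemo {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")   // placeholder broker
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "offset-reset-demo")       // placeholder group id
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    // With "earliest" (or "latest") an out-of-range position is silently reset;
    // with "none" the next poll() throws OffsetOutOfRangeException instead.
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

    val consumer = new KafkaConsumer[String, String](props)
    val tp = new TopicPartition("winSink", 2)
    consumer.assign(java.util.Collections.singletonList(tp))

    // Deliberately seek far beyond the partition's current end offset.
    consumer.seek(tp, Long.MaxValue)

    // poll() triggers the reset policy; the position jumps to the earliest
    // retained offset rather than failing.
    consumer.poll(Duration.ofSeconds(5))
    println(s"position after reset: ${consumer.position(tp)}")

    consumer.close()
  }
}
```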
But when you hand Spark Streaming an out-of-range initial offset, Spark does not honour your auto.offset.reset; instead you get the error quoted at the top of this post: Offsets out of range with no configured reset policy for partitions.
This is discussed in SPARK-19680; an excerpt:
The issue here is likely that you have lost data (because of retention expiration) between the time the batch was defined on the driver, and the time the executor attempted to process the batch. Having executor consumers obey auto offset reset would result in silent data loss, which is a bad thing.
There's a more detailed description of the semantic issues around this for kafka in KAFKA-3370 and for structured streaming kafka in SPARK-17937
If you've got really aggressive retention settings and are having trouble getting a stream started, look at specifying earliest + some margin on startup as a workaround. If you're having this trouble after a stream has been running for a while, you need more retention or smaller batches.
If you got an OffsetOutOfRangeException after a job had been running for 4 days, it likely means that your job was not running fast enough to keep up, and retention expired records that had not yet been processed.
Therefore, to keep this from happening, the application should validate the current offset bounds in Kafka when it initializes: if a stored offset is below the smallest offset still retained, it should be adjusted up to that minimum. For how to query the bounds, see my other post: Fetch Offset range in Kafka; a sketch is also given below.
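As a rough illustration, a startup check along these lines could look as follows, assuming the stored offsets have already been loaded into a `Map[TopicPartition, Long]` (the object name and group id are made up for the example). It asks the brokers for the currently retained range via `beginningOffsets`/`endOffsets` and clamps each stored value into it before the stream is created.

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

object OffsetBoundsCheck {
  // Clamp our stored offsets into the range currently retained by the brokers,
  // so createDirectStream never starts from an expired position.
  def clampToRetainedRange(
      stored: Map[TopicPartition, Long],
      bootstrapServers: String): Map[TopicPartition, Long] = {

    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers)
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "offset-bounds-check") // placeholder group id
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)

    val consumer = new KafkaConsumer[String, String](props)
    try {
      val partitions = stored.keys.toSeq.asJava
      val earliest = consumer.beginningOffsets(partitions).asScala
      val latest   = consumer.endOffsets(partitions).asScala

      stored.map { case (tp, offset) =>
        val lo = earliest(tp).longValue()
        val hi = latest(tp).longValue()
        // Below the retained range: data has expired, fall back to the earliest offset.
        // Above the range (should not normally happen): fall back to the latest.
        tp -> math.min(math.max(offset, lo), hi)
      }
    } finally {
      consumer.close()
    }
  }
}
```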
Of course, this kind of data loss should not normally happen at all; when it does, it should at least be logged, and ideally prevented.
- For managing offsets in your own store, see your-own-data-store
- Questions and discussion about Flume, Kafka, Spark, and TSDB are welcome