Example:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// start a checkpoint every 1000 ms
env.enableCheckpointing(1000);
// advanced options:
// set mode to exactly-once (this is the default)
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// checkpoints have to complete within one minute, or are discarded
env.getCheckpointConfig().setCheckpointTimeout(60000);
// make sure 500 ms of progress happen between checkpoints
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// enable externalized checkpoints which are retained after job cancellation
env.getCheckpointConfig().enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
// This determines if a task will be failed if an error occurs in the execution of the task’s checkpoint procedure.
env.getCheckpointConfig().setFailOnCheckpointingErrors(true);
Call StreamExecutionEnvironment.enableCheckpointing to turn on checkpointing. Two overloads are available: StreamExecutionEnvironment.enableCheckpointing(long interval) and StreamExecutionEnvironment.enableCheckpointing(long interval, CheckpointingMode mode).
interval specifies how often a checkpoint is triggered, in milliseconds. CheckpointingMode defaults to CheckpointingMode.EXACTLY_ONCE and can also be set to CheckpointingMode.AT_LEAST_ONCE, either through enableCheckpointing or through StreamExecutionEnvironment.getCheckpointConfig().setCheckpointingMode. CheckpointingMode.AT_LEAST_ONCE suits applications that need very low latency (on the order of a few milliseconds); most other applications can stay on CheckpointingMode.EXACTLY_ONCE.
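checkpointTimeout specifies the maximum time a checkpoint may take, in milliseconds. A checkpoint that has not completed within this time is aborted.
minPauseBetweenCheckpoints specifies the minimum time that must pass after the previous checkpoint completes before the next checkpoint can be triggered. Setting this parameter implies that maxConcurrentCheckpoints is 1.
maxConcurrentCheckpoints specifies the maximum number of checkpoints that may be in progress at the same time. If minPauseBetweenCheckpoints is set, this parameter has no effect (values greater than 1 are ignored).
enableExternalizedCheckpoints enables external persistence of checkpoints. Externalized checkpoints are not cleaned up automatically when the job fails, so their state has to be removed manually. ExternalizedCheckpointCleanup specifies what happens to the externalized checkpoint when the job is cancelled: with DELETE_ON_CANCELLATION the externalized state is deleted automatically on cancellation but is still kept if the job ends in the FAILED state; with RETAIN_ON_CANCELLATION the externalized checkpoint state is kept on cancellation as well.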
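The interaction between the checkpoint interval and minPauseBetweenCheckpoints described above can be illustrated with a little arithmetic. The sketch below is a simplified model written for this article, not Flink's actual scheduling code: it assumes one checkpoint at a time, and that the next checkpoint starts at the later of "previous trigger + interval" and "previous completion + minimum pause". The class name MinPauseSketch and the method triggerTimes are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class MinPauseSketch {

    /**
     * Returns the trigger time (ms since job start) of each checkpoint under the
     * simplified model; durationsMs[i] is how long the i-th checkpoint takes.
     */
    static List<Long> triggerTimes(long intervalMs, long minPauseMs, long[] durationsMs) {
        List<Long> triggers = new ArrayList<>();
        long trigger = intervalMs; // assumption: the first checkpoint fires one interval after start
        for (long duration : durationsMs) {
            triggers.add(trigger);
            long completed = trigger + duration;
            // The next checkpoint has to respect both the periodic schedule and the minimum pause.
            trigger = Math.max(trigger + intervalMs, completed + minPauseMs);
        }
        return triggers;
    }

    public static void main(String[] args) {
        // Interval 1000 ms, minimum pause 500 ms; the second checkpoint is slow (900 ms).
        System.out.println(triggerTimes(1000, 500, new long[] {100, 900, 100}));
        // prints [1000, 2000, 3400]
    }
}
```

In this model the slow second checkpoint finishes at 2900 ms, so the third one is pushed from 3000 ms to 3400 ms to leave the 500 ms pause.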
failOnCheckpointingErrors specifies whether a task should fail when an error occurs while it takes a checkpoint. The default is true; if it is set to false, the task declines the checkpoint and keeps running.
Related settings in flink-conf.yaml:
#==============================================================================
# Fault tolerance and checkpointing
#==============================================================================
# The backend that will be used to store operator state checkpoints if
# checkpointing is enabled.
#
# Supported backends are 'jobmanager', 'filesystem', 'rocksdb', or the
# <class-name-of-factory>.
#
# state.backend: filesystem
# Directory for checkpoints filesystem, when using any of the default bundled
# state backends.
#
# state.checkpoints.dir: hdfs://namenode-host:port/flink-checkpoints
# Default target directory for savepoints, optional.
#
# state.savepoints.dir: hdfs://namenode-host:port/flink-checkpoints
# Flag to enable/disable incremental checkpoints for backends that
# support incremental checkpoints (like the RocksDB state backend).
#
# state.backend.incremental: false
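- state.backend: the backend used to store checkpoint state; default none.
- state.backend.async: whether the backend should take snapshots asynchronously; default true. State backends that do not support async snapshots, or that only support async snapshots, may ignore this option.
- state.backend.fs.memory-threshold: size threshold for state stored in files; default 1024. State smaller than this value is stored in the root checkpoint metadata file instead.
- state.backend.incremental: whether to use incremental checkpoints; default false. Backends that do not support incremental checkpoints ignore this option.
- state.backend.local-recovery: default false.
- state.checkpoints.dir: directory in which the data files and metadata of checkpoints are stored; default none. The directory must be visible to all participating TaskManagers and JobManagers.
- state.checkpoints.num-retained: number of completed checkpoints to retain; default 1.
- state.savepoints.dir: default directory for savepoints; default none.
- taskmanager.state.local.root-dirs: default none.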
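To make the meaning of state.checkpoints.num-retained concrete, here is a minimal sketch of "keep only the N most recent completed checkpoints". It is an illustration written for this article, not Flink's own checkpoint store; the class RetainedCheckpointsSketch and its methods are hypothetical, and checkpoints are represented by their IDs only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class RetainedCheckpointsSketch {
    private final int numRetained;
    private final Deque<Long> completed = new ArrayDeque<>();

    RetainedCheckpointsSketch(int numRetained) {
        if (numRetained < 1) {
            throw new IllegalArgumentException("numRetained must be >= 1");
        }
        this.numRetained = numRetained;
    }

    /** Records a completed checkpoint ID and returns the IDs dropped to stay within the limit. */
    List<Long> addCompleted(long checkpointId) {
        completed.addLast(checkpointId);
        List<Long> discarded = new ArrayList<>();
        while (completed.size() > numRetained) {
            discarded.add(completed.removeFirst()); // the oldest checkpoint goes first
        }
        return discarded;
    }

    /** Returns the retained checkpoint IDs, oldest first. */
    List<Long> retained() {
        return new ArrayList<>(completed);
    }

    public static void main(String[] args) {
        RetainedCheckpointsSketch store = new RetainedCheckpointsSketch(2);
        store.addCompleted(1);
        store.addCompleted(2);
        System.out.println(store.addCompleted(3)); // prints [1]
        System.out.println(store.retained());      // prints [2, 3]
    }
}
```

With the default value of 1, every newly completed checkpoint replaces the previous one; a larger value keeps older checkpoints around as additional restore points at the cost of more storage.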
Summary:
- Checkpointing is enabled with StreamExecutionEnvironment.enableCheckpointing, using either enableCheckpointing(long interval) or enableCheckpointing(long interval, CheckpointingMode mode).
- Among the advanced checkpoint options, enableExternalizedCheckpoints enables external persistence of checkpoints. The externalized checkpoint state is not cleaned up automatically when the job fails, but for a cancelled job you can configure whether the state is deleted or retained.
- flink-conf.yaml also contains checkpoint-related settings, mainly for the state backend, such as state.backend.async, state.backend.incremental, state.checkpoints.dir and state.savepoints.dir.
Java configuration example:
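/**
* Whether to configure the restart strategy and checkpointing
*/
private static boolean replayFlag = true;
/**
* Number of restart attempts; must be assigned before use (null would fail on unboxing)
*/
private static Integer replayTimes;
/**
* Delay between restart attempts, in seconds; must be assigned before use
*/
private static Integer replaySeconds;
/**
* Checkpoint interval, in milliseconds; must be assigned before use
*/
private static Long checkPointTime;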
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
if (replayFlag) {
    // Restart up to replayTimes times, waiting replaySeconds seconds between attempts
    env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
        replayTimes,
        Time.of(replaySeconds, TimeUnit.SECONDS)
    ));
    CheckpointConfig config = env.getCheckpointConfig();
    //env.setStateBackend(new FsStateBackend(checkPointDir));
    // Keep checkpoint data when the job is cancelled or fails, so the job can be restored from a chosen checkpoint
    config.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
    // Checkpoint interval: start a checkpoint every checkPointTime ms (for example 1000)
    config.setCheckpointInterval(checkPointTime);
    // Set the mode to exactly-once
    config.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
    // Minimum pause: leave at least 500 ms between checkpoints
    config.setMinPauseBetweenCheckpoints(500);
    // Timeout: a checkpoint must complete within one minute or it is discarded
    // (the original passed checkPointTime here, which would make the timeout equal to the interval)
    config.setCheckpointTimeout(60000);
    // Allow only one checkpoint to be in progress at a time
    config.setMaxConcurrentCheckpoints(1);
}