Flume user guide: http://flume.apache.org/FlumeUserGuide.html#exec-source
Reference on source, sink, and channel: https://www.iteblog.com/archives/948
Flume overview:
Flume is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating, and transporting large volumes of log data. In a log collection pipeline, Flume lets you plug in custom data senders to receive data, apply simple processing to it, and write it out to the various data receivers.
Architecture:
Flume has three main components: source, channel, and sink. The source is the entry point and collects the logs; the channel is the pipe that transports and temporarily buffers them; the sink is the destination where the collected logs are persisted.
These three components are combined as needed into a complete agent, which handles the transport of the logs. Note: the agent is the unit in which Flume processes messages; data flows through agents.
Configuration:
(A JDK must be installed before the Flume service can be started.)
source: available types include {Avro, Thrift, Exec, Spooling Directory, ...}
sink: available types include {HDFS, Hive, Avro, Thrift, Logger, ...}
channel: available types include {Memory, JDBC, Kafka, File, ...}
Note: consult the Flume user guide for the full set of options for each type.
A minimal Flume configuration:
Sample example:
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1        # define the source (data entry point)
a1.sinks = k1          # define the sink (data exit point)
a1.channels = c1       # define the channel
# Describe/configure the source
a1.sources.r1.type = netcat    # source type
a1.sources.r1.bind = localhost # address to listen on
a1.sources.r1.port = 44444     # port to listen on
# Describe the sink
a1.sinks.k1.type = logger      # sink type: log events to the console
# Use a channel which buffers events in memory
a1.channels.c1.type = memory   # buffer events in memory
a1.channels.c1.capacity = 1000 # maximum number of events held in the channel
a1.channels.c1.transactionCapacity = 100   # maximum number of events per transaction
# Bind the source and sink to the channel
a1.sources.r1.channels = c1    # wire the source to the channel
a1.sinks.k1.channel = c1       # wire the sink to the channel
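With the above saved as example.conf, the agent can be started from the Flume installation directory and exercised with telnet; this is a minimal sketch following the user guide, and the file location and directory layout are assumptions:
bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console   # start agent a1 and log to the console
telnet localhost 44444   # in a second terminal; every line typed here shows up in the agent log via the logger sink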
Source configuration (common types):
1. Avro source: requires the channels, type, bind, and port parameters.
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro      # component type: avro; the Avro source receives data over RPC, so a port is required
a1.sources.r1.channels = c1    # bind to the channel defined on this agent
a1.sources.r1.bind = 0.0.0.0   # address to listen on; 0.0.0.0 accepts connections from any IP
a1.sources.r1.port = 4141      # listen port (must match the Avro sink of the sending Flume client)
Note that Avro has to be configured on both ends: an Avro sink on the sending agent and an Avro source on the receiving agent.
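For a quick end-to-end test, the flume-ng avro-client that ships with Flume can send a file to this source; the file path below is only an example:
bin/flume-ng avro-client --conf conf -H localhost -p 4141 -F /tmp/test.log   # send each line of /tmp/test.log to the Avro source listening on port 4141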
2. Exec source: reads logs through a specified command. An exec source is configured with a shell command whose output is fed into Flume continuously; a common choice is tail -F, which follows the end of a log file.
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec      # component type: exec reads data from the output of a command
a1.sources.r1.channels = c1    # bind to the channel defined on this agent
a1.sources.r1.command = tail -F /var/log/secure   # follow the end of the log file and forward new lines
3. Spooling Directory source: reads log files from a directory. You point it at a directory and every file placed there is ingested; files must not be opened or edited after they have been dropped into the directory, and the spool directory must not contain subdirectories.
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = spooldir  # component type: spooldir for the Spooling Directory source
a1.sources.r1.channels = c1    # bind to the channel defined on this agent
a1.sources.r1.spoolDir = /home/hadoop/flume/logs   # directory to watch for new files
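Because files may not change once they are inside the spool directory, a common pattern is to stage the file elsewhere and move it in with a single mv (atomic on the same filesystem); the staging path and file names here are only examples:
cp /var/log/app.log /home/hadoop/flume/staging/app.log.1   # stage a snapshot of the log outside the spool directory
mv /home/hadoop/flume/staging/app.log.1 /home/hadoop/flume/logs/app.log.1   # move it into the spool directory in one step
By default, Flume renames a file to <name>.COMPLETED once it has been fully ingested.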
4. Syslog TCP source: listens on a TCP port for syslog messages and uses them as the data source.
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = syslogtcp # component type: syslogtcp listens for syslog messages on a TCP port
a1.sources.r1.channels = c1    # bind to the channel defined on this agent
a1.sources.r1.port = 5140      # port to listen on
a1.sources.r1.host = localhost # host name or IP address to bind to
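To test the source, a syslog-style line can be pushed to the port with nc; the priority value <13> and the message text are arbitrary examples:
echo "<13>Oct 11 22:14:15 myhost app: hello flume" | nc localhost 5140   # send one RFC 3164-formatted syslog message over TCP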
5. HTTP source: receives Flume events via HTTP POST (and GET); a handler class converts the incoming requests into events.
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = http      # component type: http
a1.sources.r1.port = 5140      # port to listen on (required)
a1.sources.r1.channels = c1    # bind to the channel defined on this agent
a1.sources.r1.handler = org.example.rest.RestHandler   # handler class that turns requests into events
a1.sources.r1.handler.nickname = random.props          # extra parameter passed to the handler
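As a hedged example of posting events: if the handler lines are omitted, the source falls back to the default JSONHandler, which accepts a JSON array of events with headers and body fields:
curl -X POST -H "Content-Type: application/json" -d '[{"headers": {"host": "web01"}, "body": "hello flume"}]' http://localhost:5140   # post one event to the HTTP source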
Sink configuration (common types):
A sink consumes events from the channel and hands them to an external system or to the source of the next agent.
1. Hive sink: streams events containing delimited text or JSON data directly into a Hive table or partition.
Main Hive sink parameters:
type: component type; set to hive;
hive.metastore: Hive metastore URI;
hive.database: Hive database name;
hive.table: Hive table name;
The target Hive table has to be created first:
create table weblogs ( id int , msg string )
partitioned by (continent string, country string, time string)
clustered by (id) into 5 buckets
stored as orc;
a1.channels = c1
a1.channels.c1.type = memory
a1.sinks = k1
a1.sinks.k1.type = hive
a1.sinks.k1.channel = c1
a1.sinks.k1.hive.metastore = thrift://127.0.0.1:9083
a1.sinks.k1.hive.database = logsdb
a1.sinks.k1.hive.table = weblogs
a1.sinks.k1.hive.partition = asia,%{country},%y-%m-%d-%H-%M
a1.sinks.k1.useLocalTimeStamp = false
a1.sinks.k1.round = true
a1.sinks.k1.roundValue = 10
a1.sinks.k1.roundUnit = minute
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = "\t"
a1.sinks.k1.serializer.serdeSeparator = '\t'
a1.sinks.k1.serializer.fieldnames = id,,msg
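As an illustration of the serializer settings: with DELIMITED, delimiter "\t", and fieldnames = id,,msg, the event body is split on tabs and mapped position by position to the listed columns, with the empty name skipping a field. A hypothetical input line
1<TAB>ignored<TAB>hello flume
is written as id = 1, msg = 'hello flume'. The partition spec asia,%{country},%y-%m-%d-%H-%M uses a fixed continent, the country taken from the event header, and a timestamp rounded down to 10 minutes (per roundValue/roundUnit).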
2. Logger sink: logs events at INFO level, typically for testing and debugging; it is the only sink that does not require the extra configuration described in the user guide's "Logging raw data" section.
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
3. Avro sink:
Requires hostname and port; Avro has to be set up on both ends, since this sink acts as an Avro RPC client sending to an Avro source on the remote agent:
type: component type; set to avro;
hostname and port: address and port of the remote Avro source;
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545
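For this sink to deliver anything, the remote host (10.10.10.10 in this example) must run an agent whose Avro source listens on the same port; a minimal sketch of that receiving side, with the agent name a2 assumed for illustration:
a2.sources = r2
a2.channels = c2
a2.sources.r2.type = avro      # receiving end of the Avro RPC connection
a2.sources.r2.bind = 0.0.0.0
a2.sources.r2.port = 4545
a2.sources.r2.channels = c2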
4. Kafka sink:
a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = mytopic
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.sinks.k1.kafka.producer.compression.type = snappy
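Once the agent is running, the events written by the sink can be checked with the standard Kafka console consumer (run from the Kafka installation directory):
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic mytopic --from-beginning   # print every event Flume has published to the topic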
Channel configuration (common types):
1. Memory channel:
Events are buffered in memory:
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
2. JDBC channel: events are kept in a persistent store backed by a database (currently embedded Derby):
a1.channels = c1
a1.channels.c1.type = jdbc
3. Kafka channel: events are stored in a Kafka topic, which provides a durable buffer:
a1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.channel1.kafka.bootstrap.servers = kafka-1:9092,kafka-2:9092,kafka-3:9092
a1.channels.channel1.kafka.topic = channel1
a1.channels.channel1.kafka.consumer.group.id = flume-consumer
4. File channel: events are persisted to disk using a checkpoint directory and one or more data directories:
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data