The Hadoop Distributed Filesystem
1. Why HDFS?
When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines.
HDFS is just one kind of distributed file management system.
Moving Computation is Cheaper than Moving Data
2. The Design of HDFS
2.1 Advantages
- HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
- Very large files (well suited to big-data workloads)
- Streaming data access
    - HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern.
    - A dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time. (Files are not modified in place.)
- Commodity hardware
    - Hadoop doesn't require expensive, highly reliable hardware. It's designed to run on clusters of commodity hardware, relying on replication across cheap machines for availability.
2.2 Limitations
- Not suited to low-latency data access: retrieving stored data in milliseconds is not achievable.
- Cannot store large numbers of small files efficiently
    - Every file, directory, and block consumes NameNode memory to hold its metadata, and NameNode memory is finite, so storing many small files is impractical.
    - A common interview question: how can HDFS be optimized for storing small files?
        * For a small file, the seek time exceeds the read time, which violates HDFS's design goal.
        * Uploading a tiny file takes only a few seconds, so a disproportionately long seek time is wasteful (throughput is best when transfer time dominates seek time).
- No concurrent writes or random file modification
    - A file can have only one writer at a time; multiple threads may not write to it concurrently.
    - Only appends are supported; random modification of file contents is not.
2.3 Blocks (a common interview topic)
A block is 128 MB by default. Like in a filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are stored as independent units. (The default block size is 128 MB in Hadoop 2.x; older versions used 64 MB, which is also the size used when running locally.)
Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage.
Benefits of the block abstraction
The first benefit is the most obvious: a file can be larger than any single disk in the network.
Second, making the unit of abstraction a block rather than a file simplifies the storage subsystem.
Furthermore, blocks fit well with replication for providing fault tolerance and availability.
- Each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.
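The block arithmetic above can be sketched in a few self-contained lines. `BlockMath` and its method names are illustrative only, not part of any Hadoop API:

```java
// Sketch: how a file is split into 128 MB blocks, and why a small file
// does not waste a full block of underlying storage.
public class BlockMath {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // HDFS default in Hadoop 2.x

    // Number of block-sized chunks a file of the given length occupies.
    static long blockCount(long fileLength) {
        return (fileLength + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    // Bytes actually stored for the last block (all blocks but the last are full).
    static long lastBlockLength(long fileLength) {
        long rem = fileLength % BLOCK_SIZE;
        return (rem == 0 && fileLength > 0) ? BLOCK_SIZE : rem;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGb));     // a 1 GB file spans 8 full blocks
        System.out.println(blockCount(1024));      // a 1 KB file still occupies one block slot...
        System.out.println(lastBlockLength(1024)); // ...but only 1 KB of underlying storage
    }
}
```

Note how a 1 KB file counts as one block in the namespace (which is why many small files strain the NameNode) while consuming only 1 KB on the datanode disks.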
HDFS Architecture
3. Namenodes and Datanodes
- An HDFS cluster has two types of nodes operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers).
Namenode
The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log.
Without the namenode, the filesystem cannot be used: all the files on the filesystem would be lost, since there would be no way of knowing how to reconstruct the files from the blocks on the datanodes.
A client accesses the filesystem on behalf of the user by communicating with the namenode and datanodes, so the user code does not need to know about the namenode and datanodes to function.
The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
Datanode
- Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.
4. The File System Namespace
- The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
Hadoop provides two mechanisms for protecting against namenode failure:
- The first way is to back up the files that make up the persistent state of the filesystem metadata (for example, to local disk and to a remote NFS mount).
- It is also possible to run a secondary namenode, which despite its name does not act as a namenode.
    - Its main role is to periodically merge the namespace image with the edit log, to prevent the edit log from becoming too large.
5. Data Replication
All blocks in a file except the last block are the same size
- An application can specify the number of replicas of a file.
- The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once (except for appends and truncates) and have strictly one writer at any time.
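As a sketch of how this is configured, the cluster-wide default replication factor is set in hdfs-site.xml (the value shown is the standard default):

```xml
<!-- hdfs-site.xml: cluster-wide default replication factor -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

The factor can also be changed for an existing file from the command line, e.g. `hdfs dfs -setrep -w 2 /path/to/file` (the path here is hypothetical).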
The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster.
A Blockreport contains a list of all blocks on a DataNode
Block Replication (figure)
5.1 Replica Placement: The First Baby Steps
For the common case, when the replication factor is three, HDFS's placement policy is to put one replica on the local machine if the writer is on a datanode (otherwise on a random datanode), another replica on a node in a different (remote) rack, and the last on a different node in the same remote rack.
The chance of rack failure is far less than that of node failure;
With this policy, the replicas of a file do not distribute evenly across the racks: one third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks.
5.2 Replica Selection
- To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader.
5.3 Safemode
- On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur when the NameNode is in the Safemode state.
5.3.1 The Persistence of File System Metadata
The HDFS namespace is stored by the NameNode.
The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this. Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog.
The NameNode uses a file in its local host OS file system to store the EditLog.
The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode's local file system too.
The NameNode keeps an image of the entire file system namespace and file Blockmap in memory.
The purpose of a checkpoint is to make sure that HDFS has a consistent view of the file system metadata by taking a snapshot of the file system metadata and saving it to FsImage.
Even though it is efficient to read a FsImage, it is not efficient to make incremental edits directly to a FsImage. Instead of modifying FsImage for each edit, we persist the edits in the EditLog.
When a DataNode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to each of these local files, and sends this report to the NameNode. The report is called the Blockreport.
A checkpoint can be triggered at a given time interval (dfs.namenode.checkpoint.period) expressed in seconds, or after a given number of filesystem transactions have accumulated (dfs.namenode.checkpoint.txns). If both of these properties are set, the first threshold to be reached triggers a checkpoint.
1. dfs.namenode.checkpoint.period (default 3600): the number of seconds between two periodic checkpoints.
2. dfs.namenode.checkpoint.txns (default 1000000): the Secondary NameNode or CheckpointNode will create a checkpoint of the namespace every 'dfs.namenode.checkpoint.txns' transactions, regardless of whether 'dfs.namenode.checkpoint.period' has expired.
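A minimal hdfs-site.xml fragment setting both thresholds (the values shown are the defaults; whichever threshold is reached first triggers a checkpoint):

```xml
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value> <!-- seconds between periodic checkpoints -->
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value> <!-- or checkpoint after this many transactions -->
</property>
```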
5.4 Data Disk Failure, Heartbeats and Re-Replication
- Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them.
5.5 Metadata Disk Failure
The FsImage and the EditLog are central data structures of HDFS.
Another option to increase resilience against failures is to enable High Availability using multiple NameNodes either with a shared storage on NFS or using a distributed edit log (called Journal). The latter is the recommended approach.
6. Block Caching
- Normally a datanode reads blocks from disk, but for frequently accessed files the blocks may be explicitly cached in the datanode's memory, in an off-heap block cache. By default, a block is cached in only one datanode's memory.
7. HDFS Federation
Under federation, each namenode manages a namespace volume, which is made up of the metadata for the namespace, and a block pool containing all the blocks for the files in the namespace. Namespace volumes are independent of each other, which means namenodes do not communicate with one another, and furthermore the failure of one namenode does not affect the availability of the namespaces managed by other namenodes.
The namenode keeps a reference to every file and block in the filesystem in memory, which means that on very large clusters with many files, memory becomes the limiting factor for scaling.
8. HDFS High Availability
Even with persistent metadata backups and a secondary namenode, the namenode remains a single point of failure for the filesystem: if it is lost, the filesystem cannot serve requests until a new namenode is brought online.
Hadoop therefore added support for High Availability (HA).
8.1 How HA works
In this implementation, there are a pair of namenodes in an active-standby configuration. In the event of the failure of the active namenode, the standby takes over its duties to continue servicing client requests without a significant interruption.
The namenodes must use highly available shared storage to share the edit log. When a standby namenode comes up, it reads up to the end of the shared edit log to synchronize its state with the active namenode.
There are two choices for the highly available shared storage: an NFS filer, or a quorum journal manager (QJM). The QJM is a dedicated HDFS implementation, designed for the sole purpose of providing a highly available edit log, and it is the recommended choice for most HDFS installations.
Datanodes must send block reports to both namenodes, because the block mappings are stored in a namenode's memory, and not on disk.
The secondary namenode's role is subsumed by the standby, which takes periodic checkpoints of the active namenode's namespace.
If the active namenode fails, the standby can take over very quickly (in a few tens of seconds) because it has the latest state available in memory: both the latest edit log entries and an up-to-date block mapping.
Failover from the active namenode to the standby is managed by a new entity in the system, the failover controller. There are various failover controllers, but the default implementation uses ZooKeeper, which also ensures that only one namenode is active. Each namenode runs a lightweight failover controller process whose job is to monitor its namenode for failure and trigger a failover when one occurs.
The HA implementation goes to great lengths to ensure that the previously active namenode is prevented from doing any damage; this method is known as fencing.
9. The Java Interface
Reading Data from a Hadoop URL
- See FileSystem.md
10. Data Flow: Anatomy of a File Read
step 1
- The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem.
step 2
- DistributedFileSystem calls the namenode, using remote procedure calls (RPCs), to determine the locations of the first few blocks in the file.
- For each block, the namenode returns the addresses of the datanodes that have a copy of that block. Furthermore, the datanodes are sorted according to their proximity to the client.
- If the client is itself a datanode, the client will read from the local datanode if that datanode hosts a copy of the block (again the nearest-first principle).
- The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.
step 3
- The client then calls read() on the stream.
step 4
- DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, then connects to the first (closest) datanode for the first block in the file. Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream.
step 5
- When the end of the block is reached, DFSInputStream will close the connection to the datanode, then find the best datanode for the next block. This happens transparently to the client, which from its point of view is just reading a continuous stream.
- Blocks are read in order, with the DFSInputStream opening new connections to datanodes as the client reads through the stream. It will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed.
step 6
- When the client has finished reading, it calls close() on the FSDataInputStream.
Notes
- During reading, if the DFSInputStream encounters an error while communicating with a datanode, it will try the next closest one for that block. It will also remember datanodes that have failed so that it doesn't needlessly retry them for later blocks.
- The DFSInputStream also verifies checksums for the data transferred to it from the datanode. If a corrupted block is found, the DFSInputStream attempts to read a replica of the block from another datanode; it also reports the corrupted block to the namenode.
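Tying the steps together, the entire read path sits behind a few lines of client code. This is a minimal sketch, assuming the Hadoop client libraries on the classpath and a reachable cluster; the namenode address and file path are hypothetical:

```java
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://namenode:8020/user/demo/part-00000"; // hypothetical path
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            // step 1: open() returns an FSDataInputStream; the namenode RPCs
            // of step 2 happen behind this call.
            in = fs.open(new Path(uri));
            // steps 3-5: repeated read()s stream the blocks in order.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in); // step 6: close the stream
        }
    }
}
```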
11. Network Topology and Hadoop
How does Hadoop decide that two nodes in the network are close to each other?
Hadoop takes a simple approach in which the network is represented as a tree and the distance between two nodes is the sum of their distances to their closest common ancestor.
For example, imagine a node n1 on rack r1 in data center d1. This can be represented as /d1/r1/n1. Using this notation, here are the distances for the four scenarios:
- distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)
- distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)
- distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)
- distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)
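The distance rule above can be sketched as a small self-contained function. The class and method names are illustrative, not Hadoop's actual NetworkTopology API:

```java
// Sketch of Hadoop's tree-distance rule: the distance between two nodes is
// the sum of their distances to their closest common ancestor. A location
// such as /d1/r1/n1 is treated as a (data center, rack, node) path.
public class NetworkDistance {
    static int distance(String a, String b) {
        String[] pa = a.substring(1).split("/");
        String[] pb = b.substring(1).split("/");
        int common = 0;
        while (common < pa.length && common < pb.length && pa[common].equals(pb[common])) {
            common++;
        }
        // Each side contributes one hop per level below the common ancestor.
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0: same node
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2: same rack
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4: same data center
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6: different data centers
    }
}
```

Running this reproduces the four distances listed above.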
12. Data Flow: Anatomy of a File Write
- We’re going to consider the case of creating a new file, writing data to it, then closing the file.
step 1 && step 2
- The client creates the file by calling create() on DistributedFileSystem. DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it.
- The namenode performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create the file.
    - A related issue: without the right permissions, the program fails with "Permission denied".
    - One workaround is to set the property dfs.permissions.enabled to false in hdfs-site.xml, which disables the permission check.
- If these checks pass, the namenode makes a record of the new file.
- The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to. FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and namenode.
step 3
- As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas.
- The list of datanodes forms a pipeline.
step 4
- The DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode in the pipeline. Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline.
step 5
- The DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline.
- If any datanode fails while data is being written to it, then the following actions are taken: first, the pipeline is closed, and any packets in the ack queue are added to the front of the data queue so that datanodes that are downstream from the failed node will not miss any packets.
- As long as dfs.namenode.replication.min replicas (which defaults to 1) are written, the write will succeed, and the block will be asynchronously replicated across the cluster until its target replication factor is reached (dfs.replication, which defaults to 3).
step 6
- When the client has finished writing data, it calls close() on the stream. This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete.
- The namenode already knows which blocks the file is made up of (because the DataStreamer asks for block allocations).
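The write path above is likewise hidden behind a few lines of client code. A minimal sketch, assuming the Hadoop client libraries and a reachable cluster; the namenode address and file path are hypothetical:

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteExample {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://namenode:8020/user/demo/output.txt"; // hypothetical path
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        // steps 1-2: create() triggers the namenode RPC, the existence check,
        // and the permission check, and returns an FSDataOutputStream.
        FSDataOutputStream out = fs.create(new Path(uri));
        try {
            // steps 3-5: writes are split into packets and streamed down
            // the datanode pipeline.
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        } finally {
            // step 6: flush remaining packets, wait for acks, notify the namenode.
            out.close();
        }
    }
}
```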
Replica Placement
- Hadoop's default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random).
- The second replica is placed on a different rack from the first (off-rack), chosen at random.
- The third replica is placed on the same rack as the second, but on a different node chosen at random.
- Once the replica locations have been chosen, a pipeline is built, taking network topology into account.
- In this way the three replicas are stored across two racks.
Replica placement since Hadoop 2.7.2
- The first replica goes on the client's local node, or on a randomly chosen node.
- The second goes on a different node in the same rack as the first.
- The third goes on a random node in a different rack.
The idea is unchanged: three replicas spread across two racks.
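Either variant can be sketched as a small placement function. Below is the earlier default strategy (first replica on the writer's node, second off-rack, third on the second's rack); the location format, node names, and class are hypothetical illustrations, not Hadoop's actual BlockPlacementPolicy:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the default placement policy described above: first replica on
// the writer's node, second on a different rack, third on the same rack as
// the second but a different node. Locations use the format "/rackId/nodeId".
public class PlacementSketch {
    static String rackOf(String location) {
        return location.substring(0, location.lastIndexOf('/'));
    }

    static List<String> place(String writer, List<String> nodes, Random rng) {
        List<String> replicas = new ArrayList<>();
        replicas.add(writer); // first replica: the writer's local node

        // second replica: random node on a different rack (off-rack)
        List<String> offRack = new ArrayList<>();
        for (String n : nodes) {
            if (!rackOf(n).equals(rackOf(writer))) offRack.add(n);
        }
        String second = offRack.get(rng.nextInt(offRack.size()));
        replicas.add(second);

        // third replica: same rack as the second, different node
        List<String> sameRack = new ArrayList<>();
        for (String n : nodes) {
            if (rackOf(n).equals(rackOf(second)) && !n.equals(second)) sameRack.add(n);
        }
        replicas.add(sameRack.get(rng.nextInt(sameRack.size())));
        return replicas; // three replicas across exactly two racks
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("/r1/n1", "/r1/n2", "/r2/n3", "/r2/n4", "/r3/n5", "/r3/n6");
        System.out.println(place("/r1/n1", nodes, new Random(42)));
    }
}
```

Whatever the random choices, the result always satisfies the invariants: replica one is local, replica two is off-rack, and replica three shares a rack with replica two.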