Distributed hash (DHT)
Two key point: each node maintains only part of the route; each node stores only a part of the data. In order to achieve the network addressing and storage.
DHT is just a concept, put forward such a network model. And that it is very good for distributed storage. But how to achieve specific, DHT is not the scope.
Consistency of the hash:
An implementation of DHT. Nature or a hash algorithm. Recall that we usually do load balancing, according to the signature on the back-end node querystring modulus is the simplest and most commonly used algorithms, but the nodes of the obvious problems caused by additions and deletions: the original request down less than almost all on the same machine. That it is carp optimization algorithm so that only 1 / n of the data affected.
Consistency of the hash, it seems that first proposed in the distributed cache inside, so that when the shock node, with minimal impact. But now has been used in distributed storage and p2p systems inside.
Consistency of the hash is only made four concepts and principles, and no mention of specific implementation:
1, balance: the hash result as the average distributed to each node, so that each node can be fully utilized.
2, Monotonicity: The above also said, if it is signed modulus algorithm change will make the entire network node mapping changes. If carp, will make the 1 / n of the mapping changes. Consistent hash goal is to change the node will not change the network mapping.
3, spread: the same data, stored in different nodes, in other words, the system redundancy. Consistency of the hash is committed to reduce system redundancy degree.
4, load: load dispersion, and the balance is about the same meaning, but here is the data stored in more equilibrium, balance is the balance of the visit.
There are several consistent hashing algorithm, the key question is how to define the data partitioning strategy and fast query nodes.
chord as one of the classic implementation. cassandra in the DHT, is basically a simplified version of chord.
Network, each node is assigned a unique id, the mac address of the machine can do sha1, is the basis for network discovery.
Suppose there are N nodes in the network and the network is a ring. The distance between two nodes is defined as each node stores a routing table (finger table), in accordance with the table clockwise from the nodes 2,4,8,16,32. ... ... 2i log2N a selected distance from the other node ip information to record.
Storage: The data is cut according to certain rules, each data has a separate id (query key), and the range and node id is the same. Then find the node, as if the node id and data id, the data will exist on the node; if not, then the stored data from the closest node id. Meanwhile, in order to ensure the reliability of data, will find the K clockwise down redundant nodes, storing the data. Generally believed that the K = 3 is necessary.
Queries: start their own routing table, find the nearest one and data id, and the survival of the nodes in the network next. If the node id and data id equal coincidence, then I congratulate you. If not equal, to the next recursive lookup. General or need to go through multiple queries to find data where the nodes, and this number is less than or equal log2N can prove the.
In this process of query routing tables on the selection reflects the advantage, in fact implements a binary search, to observe the network from each node are divided into log2N block the network, the largest one which has N / 2 nodes . Routing table which is recorded every one of the first node. So that each time a query, at least half of the nodes excluded. Ensure log2N times to find the target node.
Add a node i, need to know in advance the survival of the network has a node j, then node j and interaction, and other nodes update their routing tables. And needs to be away from their nearest node in the copy over the data to provide data services.
Loss of a node, the routing algorithm will automatically skip this node, and rely on data to continue to provide redundant services.
KAD algorithm (Kademlia)
Personally feel, kad algorithm is optimized to do in the chord. Two main points:
1, with binary (32/64/128) that a node id, node id of two XOR to get the distance between nodes.
2, each node to maintain the routing information is more abundant, as is the whole network is divided into log2N were in accordance with the chord, is to keep log2N a routing node, but in the kad which is preserved log2N queues. Values for the configuration of each queue length K, the corresponding network node records the number of nodes in the region, and according to the time of these active nodes swapped out.
The first point is easy to divide the network, each node according to a binary 0 or 1 bit into a binary tree.