Hash database design

2010-08-18  来源:本站原创  分类:Database  人气:250 

Hash database design

Table of Contents

  • A string with a hash of the key instead of the index to improve query efficiency.
    • 1.1 What is a hash
    • 1.2 How to use the hash in the database design
    • 1.3 using the computed column
    • 1.4 How to use multiple hash table associated
  • 2 library tables hash
    • 2.1 ORA HASH function
    • 2.2 Example 1 hash split
    • Split Example 2 2.3 hash
    • 2.4 In addition to leaving the balance law
    • 2.5 split Library

A hash of the key instead of the string with the index to improve query efficiency.

Index one of the most common way to improve query efficiency, but in a long string field index less effective, if more of these fields on the effect of the joint index is worse. For example,

column name data type
name1 varchar (50)
name2 varchar (100)
other column varchar (32)

If the name1 and name2 on regular Xu Yong inquiries, such as regular implementation

select * from table where name1=? and name2=?

this field will need to add these two indexes,

create index myindex on table1 (name1,name2)

Index to improve performance, but when such a huge amount of data on 100 million, query performance will be reduced quickly, but the index also take up storage space, large amount of data will also make the index space expansion. For example, in name1 and name2 row to establish a joint index would use at least 150 characters of space, huge eating space can imagine the post. According to an article that "index, the system takes up about 1.2 times the table hard disk and memory space to save the index"

Therefore, we use hash key

1.1 What is a hash

Simply put, that is, to a string, through an algorithm that returns an integer. Return integer and equal to the given string has a corresponding relationship. There are a variety of hash algorithms, see here http://en.wikipedia.org/wiki/Hash_function # Hash_function_algorithms

SQLSever provided checksum () function to generate the hash key. For example:

select checksum('abc','Tom and Jack')
 Returns  :1094199073
 On the test results in SQLServer2005  ( Following the same  )

Using this hash function, you can map out the long strings up to a hash value, this value is unique, if changes to the contents of the string, even one letter, the new hash value will be different. In fact, you can map non-string column, because the hash algorithm is essentially arbitrary length binary mapping algorithm, all data types can be expressed in binary.

Note that the actual application, there may be two different binary operations were hashed, even generate a hash with the key, Zhezhong little chance This is called a hash collision. Its appearance Yuanyin Yebufuza mainly due to the practical application of the hash output are generally fixed maximum length, over the length of the hash output may be to cut other operations. This makes it possible should not have the same hash key has been cut into equal. Specific information can be found here will not elaborate, but we need to be careful the hash value in practice this does not necessarily unique.

1.2 How to use the hash in the database design

Let us continue to the beginning of the table. When the data burst for a long time, built on the name1 and name2 field performance of the index have a problem. Well, according to hash principle, we add a Lie Haxi out this form, and then increase the index of this column, rather than on the name1 and name2 increase in the index, as such,

column name data type
name1 varchar (50)
name2 varchar (100)
other column varchar (32)
hashkey integer
create index hashkeyindex on table(hashkey)

hashkey content is name1 and name2 hash key, the checksum can be achieved through the database.

update table set hashkey=checksum(name1,name)

Above, this statement will be assigned to all hash out. Certainly a better approach is to establish the trigger, every time you insert a new record time on the hash out through the checksum (name1, name2) calculation and assignment.

In this way, based on the index on two fields into a field based on the index, more importantly, hashkey only a few byte size than the original name1 + name2 = 150 is the size of characters , index, storage is greatly reduced. In fact, the index itself will take some searching, and the index itself is smaller, the search must also faster. To query by name, for example, has such a query can hash out after

select * from table where hashkey=checksum(' John ', ' where is fly flying geese  ')

As we mentioned earlier are not necessarily unique hash key, that number may be entered into with an output operation (ie, hash key), conversely, a hash key may correspond to multiple records, So the query above may be found on more than one record, how? Very simple

select * from table where hashkey=checksum(' John ', ' where is fly flying geese  ') and name1=' John ' and name2 = ' place is fly flying geese  '

where clause in the hash key conditions have been greatly reduced to check the number of rows returned, and then filtered through the real conditions, soon found you really want. Some may feel this good, but after all, an increase of one database, but also to write a trigger each time to the new hash value is added in Gengxin. Some trouble. It can calculate out here to solve this problem.

Yet why the database is not stored directly in the joint index hash value it? Users do not have less trouble in it?

1.3 using the computed column

What is a computed column? Is derived by calculating to the column, the column is not real (but can be used to calculate out the actual PERSISTED keywords stored in the table). For example the third column and the last two columns, then the third column can be designed to calculate the column, check out the third column for each value is actually calculated out in the first two and the resulting. This design can reduce the extra storage, reduce DB size.

Design in the hash, you can create a hash calculation of the column, and then set out to create an index of the hash calculation, so it's stored in the table do not really realize just used to increase the efficiency of the hash column query business table completely separated.

Create index on computed column needs some conditions, such as computed column expression must be certainty, precision, can not string out and so on, but hash key are easy to satisfy.

CREATE TABLE mytable(
    name1 varchar(50),
    name2 varchar(150),
    myhashkey AS checksum(name1,name2)
)
Go
CREATE INDEX hashkeyIndex ON mytable(myhashkey)

1.4 How to use multiple hash table associated

Multi-table association can use hash join, whether to use the hash join is usually a database system to decide. Can also manually specify their own hand in the manner specified as follows

SELECT a.au_id FROM A a JOIN B b
ON a.name = b.name OPTION (HASH JOIN)

2 library tables hash

When the database table when a large degree, such as large as hundreds of millions or hundreds of million records, the query performance will plummet, even if the correct use of the index did not help. The need to use the library when the hash table approach to deal with, essentially split on the table, and then split according to some strategy on the table queries and other processing. Here I will focus on hashing strategy.

Hashing strategy is through the hash calculation, direct positioning table. For example: the user table, a huge amount of data needs to split. To the user's id as the hash calculation of keywords, such as userid is 32-bit string. SDDKSL3KDKSLDKE2FKQKSLEJF9ELAPEK. The simplest strategy is to use the userid on the left hash the first letter of the suffix as the new show. In this way, there will be a USER S table. 26 letters + 10 digits, so the user table can be split into 36 tables. If the query userid = UDKLSK3DKSLKDE2FKQSKLEJ9FELAPEKC the record, you can quickly navigate to the USER U table up. This is just to cite a simple example to illustrate the database hash table, the actual user name is often used to check, for example according to Joe Smith, John Doe to check, which requires the database to quickly locate Joe Smith, John Doe at which Zhang table. For uniform distribution of multiple users in different tables, you need a good hash function, we can not own design, and the use of ora hash function oracle instance.

2.1 ORA HASH function

Function usage  :
select orahash(' John  ',100,0) from dual

 Results  :79

Meaning: oracle hash function to calculate 'seating' of the hash value, and then put it into the bucket 79. 100 means that only allow a maximum of 100 barrels. 0 is the seed, used to calculate the hash value and the arrangements for the different barrels. Return value is the number of barrels allocated. The so-called barrel, nothing is set, that is a different hash values into different homogeneous groups in the.

Standard description  :orahash(exp,maxbucket,seedvalue)

2.2 Example 1 hash split

With this ready-made function, hash table dismantled relatively simple. If you want to split into 1000 form, it would only need to max bucket can be set to 1000.

First of all, to build user table 1000, namely user 000, user 001, user 002, user 003 & hellip;. User 999. When the user inserts, first calculate the user name you want to insert the table. Such as the user name is: 'Tiger'.

ora_hash ('Tiger', 1000,0) the result is 159, then the records need to be inserted into the user 159 which the table.

insert into user159 (name) values (' King Tiger  ')  -Are there other fields in the actual

query time, first locate the table, for example to find the user named 'Tiger', the record of empathy through the ora hash function first calculated by 159, then run:

select * from user 159 where name = 'Tiger' and

Other such databases are not ora hash function, so to their own design or to use other hash functions. In fact ora hash oracle hash function may be regional (hash partition) feature uses. The oracle hash partition is a large table of data to divide and conquer approach. A large table that is divided into multiple partitions well. Therefore, oracle database design, a compliance would be no need to split into multiple small tables, but directly to a large table split into multiple partitions, and then let oracle automatically to management on the line. But if you do distributed database, so a lot of work still need to do it by yourself.

Split Example 2 2.3 hash

Another way is to insert the data every time when it is carried out to determine whether the new hash table. Example, a call record to be inserted Wang Gang, pseudo code is as follows:

String userHashName = yourHashFun(' Wang gang  ');
String tableName = "user_"+userHashName;
if (!exists(tableName)){
    create table...
}
insert into ...

To do so, because of the need to do to judge whether the table already exists, so there is some performance loss.

2.4 In addition to leaving the balance law

orahash function well, but is oracle proprietary, left the oracle, sometimes we need to design your own hash function. In addition to leaving the balance method is simple and easy to use.

H(k)=k mod p  p<=m

If the userid is a number, can be carried out directly on the userid of this method. For example you want to split the 10 tables, to take p = 10. Such as userid = 11,11 / 10 than 1, then put userid = 11 records assigned to the table user1. If userid = 23,23 / 10 I 3, then record the userid = 23 assigned to the table user3. In addition to the benefits of leaving remainder method is one, simple; 2, could be relatively uniform distribution of the data to different sub-tables to (provided that userid own uniform).

If the userid is not a number, then hash calculation for the first of the userid (you can use ready-made hash calculation, such as md5, you can own design), by numeric value, then the digital value except to stay remainder method. Is uniform, but also on whether the uniform hash calculation results.

It has a drawback, the original list of 10 removed, the system needs to run a post-split into 50 tables will not be easy to achieve.

2.5 split Library

As mentioned above the table in addition to the demolition operation, database operations can be dismantled. Is to split a large table into multiple sub-tables into different databases. Such as user data in the table can be placed in five DB, respectively, the difference is DB1, DB2, DB3, DB4, DB5, each has a user table in the DB. 5 DB in the user record together is that all user records. demolition hash table library and removed the same way. Understand the split table, removed library also understand. The difference is, location library and location of the table means are different. Library may be distributed in different machines, so they need multiple sources.

A sample of the code below to illustrate this

public void test(){
    String userid = "123456";
    String hashvalue = hash(userid);
    String db_connection = getConnection(hashvalue);
}
// The users themselves can implement this function  ( This function can be very complicated to do, the data source can showcase  )
public abstract Connection getConnection(String db_mark);

// For example, following a
public Connection getConnection(String db_mark){
    String username="admin";
    String password="admin";
    String ip = "localhost";
    String dbname = "db"+db_mark; // Patchwork  db name.
    Class.forName(driveString).newInstance();
    conn = DriverManager.getConnection("jdbc:inetdae7:localhost:1433", username, password);
    conn.setCatalog(dbname);
    return conn;
}

To do with how to achieve, as long as you can think of a way to find the database you want to locate on the line. This process is the DB route. Routing strategy based on the actual business situation, ie we can do is very simple, you can do Gongnengqiangtai, infinitely more complex (Yin Wei to load-Jun Heng, to read and write Fen Li, to statistical purposes, to a summary Biao, to a Huanchong table, want this to that, more and more elements of the design is getting complicated).

Hash is to use a lot of little things that mess things (a number) to represent. This one small significant relationship between the corresponding reduction in the computing space, it greatly improves the efficiency of operations. This can be seen as a kind of compression, as a kind of small Mio (imagine moving your fingers, the index finger across the omnipotent one move, how spectacular). hash key are after, just as is a huge world to streamline the image. This is just for the billions of large data processing. The database is representative of this phenomenon everywhere.

  1. id. Each table has an id or similar id the unique identifier. Representative of a row of data, reduce the computing space. Various operations on the id of the record than in the entire operation faster and more data.
  2. Index. Reduce the query computing space.
  3. Hash. Can be achieved as you can span multiple tables Duokudesuo lead, to reduce query computation space.
    Small Mio, the use of the wonderful, Bottom of Heart.

[email protected]

相关文章
  • Hash database design 2010-08-18

    Hash database design Table of Contents A string with a hash of the key instead of the index to improve query efficiency. 1.1 What is a hash 1.2 How to use the hash in the database design 1.3 using the computed column 1.4 How to use multiple hash tabl

  • Database design - Restricted Category 2010-10-17

    Constraint modeling: First, the classification of constraints can be divided into: 1, the key (key) is the entity that uniquely identifies an entity focused on property or property set. There are two entities do not constitute the keys for all of its

  • "Let Oracle run faster 2 - mass data based on database design and optimization> 2011-08-02

    "Let the Oracle run faster 2-- mass data based on database design and optimization." Editor's Choice To the first author of the work experience of 10 years to build massive data-based database design and optimization of books Basic information A

  • Database design specifications 2010-11-03

    Database design specifications ¶ First, the naming ¶ 1 database, table, field, alias specification ¶ Identifier maximum length (in bytes) allowed the character database 64 [a-z_] (all characters are lower case, words separated by _ Split) Table 64 [a

  • Database design skills of 14 2009-04-20

    1. Original documents and the relationship between entities can be one-on-one, one-to-many, many-to-many relationship. Under normal circumstances, they are one-on-one relationship: the one and only original documents corresponding to the correspondin

  • Flashback database design skills (on) 2009-04-28

    Speaking of databases, in my opinion can not but talk about data structures. In 1996, I joined the University of studying computer programming, when teachers told us that: a computer program = data structure + algorithm. Even though the current proce

  • 14 database design skills 2009-05-05

    1. Original documents and the relationship between entities can be one-on-one, one-to-many, many-to-many relationship. Under normal circumstances, they are one-on-one relationship: the one and only original documents corresponding to the correspondin

  • Optimize the database design, program 2009-05-12

    This article first discusses the paradigm based on the basic design of database tables, set up focused on the primary key and index strategies and programs, and then expand the table from the database design and database table in terms of placing the

  • JAVA database design skills 2009-06-01

    The following 14 techniques are lot of people at many database analysis and design practice, summed up gradually. For the use of these experiences, the reader can not help bowls of Health, by rote, and to digest the understanding, seek truth from fac

  • SQL database design experience 2009-06-19

    SQL database design experience 2009-03-21 16:47 SQL database design experience A successful management system is: [50% + 50% of business software] formed, and 50% of the successful software and [25% database + 25% of the procedures] which consists of

  • Database design skills 2009-07-06

    Reprint: http://blog.csdn.net/hedylin/archive/2007/04/03/1550088.aspx Speaking of databases, I think the data structure can not but talk about. In 1996, I joined the University in computer programming, when the teacher told us that: a computer progra

  • Database design techniques in the 14 2009-07-15

    Database design techniques in the 14 Published: 2007-12-03 16:55 | Author: kider | Source: MySQL Community portal Author: maXiaoKe, Source: IT Experts Net / Javereserarch, Editor: LI Shu-qin, 2007-11-23 09:22 The following 14 techniques, many people

  • On database design skills 2009-10-22

    Said Database , I think the data structure can not but talk about. In 1996, I joined the University in computer programming, when the teacher told us that: a computer program = data structure + algorithm. Although the procedure has been developed mai

  • java database design techniques in the 14 2009-11-06

    The following 14 techniques, many people in a large number of database analysis and design practice, summed up gradually. For the use of these experiences, the reader can not help bowls of Health, by rote, and to digest the understanding, seek truth

  • Object-Oriented Database Design 2010-03-29

    Source: http://blog.csdn.net/coffeewoo/archive/2010/02/05/5291582.aspx Has recently received a number of User Object-Oriented Analysis on how the design of the database design questions. In the elephant-thinking in uml a book, I detail the object-ori

  • Database design, some doubts 2010-03-07

    1: The data in the database should be deleted using the logical or physical deleted? Quote Best not to physically remove, it should be clear that the removal of the so-called business and technology is not a point on the deletion 2: Change the databa

  • 3 large database design paradigm 2010-03-07

    Database design paradigm is needed to meet database design specifications that meet these specifications database is simple, clear structure, at the same time, does not occur insert (insert), delete (delete) and update (update) operation exception. O

  • Simple. Clear! Database design, analysis of the three major paradigms Application 2010-03-12

    Introduction Database design paradigm is needed to meet database design specifications that meet these specifications database is simple, clear structure, at the same time, does not occur insert (insert), delete (delete) and update (update) operation

  • Database design mode decomposition 2010-03-13

    Database design to make the user to delete, update, insert more convenient, to avoid data redundancy, abnormal phenomena; generally follow the three paradigms, in short, 1NF: property can not be divided, that is not in the table set the table 2NF: th

  • Database design techniques in 14 2010-03-13

    1. Original documents and the relationship between entities can be one to one, one to many, many to many relationship. In general, they are one to one relationship: the one and only original documents corresponding to a corresponding entity. In excep