1.1 Concurrency Control
In a clustered environment, critical data is usually kept on shared storage, such as a shared disk. Because every node has the same access to that data, some mechanism must control how the nodes access it. Oracle RAC uses the DLM (Distributed Lock Management) mechanism for concurrency control among multiple instances.
1.2 Amnesia
A cluster does not keep its configuration file in one central place; instead, each node holds a local copy. While the cluster is running normally, a user can change the configuration on any node, and the change is automatically synchronized to the other nodes.
There is a special case: node A is shut down normally, the configuration is changed on node B, node B is then shut down, and node A is started. Node A still has the old copy, so the configuration change is lost. This is called amnesia.
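The amnesia sequence can be replayed with a small Python sketch. The `Node` class and the `apply_change` helper are hypothetical names used only for illustration; they model "each node has a local copy, and changes are synchronized to the nodes that are currently running", not any real cluster software.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.running = True
        self.config = {"version": 1}   # each node's local copy of the configuration


def apply_change(nodes, key, value):
    """Apply a change and synchronize it to every node that is currently running."""
    for node in nodes:
        if node.running:
            node.config[key] = value


a, b = Node("A"), Node("B")
a.running = False                    # node A is shut down normally
apply_change([a, b], "version", 2)   # the change is made while only B is up
b.running = False                    # node B is shut down
a.running = True                     # node A starts with its old copy
print(a.config["version"], b.config["version"])   # prints: 1 2
```

Node A comes back with version 1 even though the cluster had already moved to version 2, which is exactly the lost change described above.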
1.3 Split Brain
In a cluster, the nodes learn about each other's health through a heartbeat mechanism, which keeps all nodes coordinated. Suppose only the heartbeat fails while every node keeps running. Each node then believes that the other nodes are down, that it is the sole survivor of the cluster, and that it should take control of the whole cluster. Because the storage devices are shared, this can corrupt data. This situation is called "split brain".
The usual way to solve this problem is a voting algorithm (Quorum Algorithm). It works as follows:
Each node uses the heartbeat mechanism to report its health to the others; assume that every report a node receives counts as one vote. In a three-node cluster running normally, each node has three votes. When node A's heartbeat fails but node A itself is still running, the cluster splits into two partitions: node A is one, and the remaining two nodes are the other. One of the partitions must be removed to keep the cluster healthy.
In the three-node cluster, after A's heartbeat fails, B and C form a partition with 2 votes, while A has only 1 vote. Under the voting algorithm, B and C take control of the cluster and A is removed.
With only two nodes, the voting algorithm stops working, because each node has only 1 vote. A third device must then be introduced: the Quorum Device. The Quorum Device is usually a shared disk, also called the Quorum Disk, and it also represents one vote. When the heartbeat between the two nodes fails, both nodes race for the Quorum Disk's vote, and the request that arrives first is served first. The node that reaches the Quorum Disk first therefore holds 2 votes, and the other node is removed.
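The vote counting described above can be sketched in Python. This is a minimal illustration, not Oracle's implementation: the function name `surviving_partition` and the `quorum_disk_winner` parameter are assumptions, and each partition is simply given one vote per member plus one vote if it won the race for the quorum disk.

```python
def surviving_partition(partitions, quorum_disk_winner=None):
    """Return the partition (a set of node names) that keeps control.

    partitions: list of sets of nodes that can still hear each other.
    quorum_disk_winner: hypothetical parameter naming the node that reached
    the quorum disk first; only needed to break a tie such as 1 vs 1.
    """
    def votes(partition):
        # One vote per member, plus one if this partition holds the quorum disk.
        extra = 1 if quorum_disk_winner in partition else 0
        return len(partition) + extra

    ranked = sorted(partitions, key=votes, reverse=True)
    if len(ranked) > 1 and votes(ranked[0]) == votes(ranked[1]):
        raise ValueError("tie: a quorum device is needed to break it")
    return ranked[0]


# Three nodes, A loses its heartbeat: {B, C} has 2 votes, {A} has 1.
print(sorted(surviving_partition([{"A"}, {"B", "C"}])))   # prints: ['B', 'C']
# Two nodes: each has 1 vote, so the quorum disk decides.
print(sorted(surviving_partition([{"A"}, {"B"}], quorum_disk_winner="B")))   # prints: ['B']
```

Calling the function for two single-node partitions without a quorum disk winner raises `ValueError`, which mirrors why a two-node cluster needs the extra vote.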
1.4 IO Isolation (Fencing)
When split brain occurs in a cluster, the voting algorithm settles which partition gets control of the cluster. That alone is not enough: we must also make sure that the evicted nodes can no longer operate on the shared data. This is the problem IO Fencing solves.
IO Fencing can be implemented in two ways, in software and in hardware:
Software approach: for storage devices that support the SCSI Reserve/Release commands, fencing can be implemented with the SG commands. A healthy node uses the SCSI Reserve command to "lock" the storage device. When the faulty node finds the storage device locked, it knows that it has been evicted from the cluster, that is, that it is in an abnormal state, and it restarts itself in order to recover. This mechanism is called suicide. Sun and Veritas use this mechanism.
Hardware approach: STONITH (Shoot The Other Node in the Head) operates the power switch directly. When a node fails, another node that detects the failure sends a command through the serial port to the power switch of the faulty node, which restarts that node by briefly cutting and then restoring its power. This approach requires hardware support.
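The software approach can be illustrated with a toy model in Python. The `SharedDisk` class and the `try_write` function are hypothetical stand-ins; they only imitate the reserve/release idea and do not issue real SCSI commands.

```python
class SharedDisk:
    """Toy stand-in for a storage device with SCSI-style reserve/release."""

    def __init__(self):
        self.reserved_by = None

    def reserve(self, node):
        # Succeeds only if nobody else holds the reservation.
        if self.reserved_by in (None, node):
            self.reserved_by = node
            return True
        return False

    def release(self, node):
        if self.reserved_by == node:
            self.reserved_by = None


def try_write(disk, node):
    """Return 'written' on success, or 'restart' if the node finds itself fenced."""
    if disk.reserved_by not in (None, node):
        # The disk is locked by another node: this node was evicted and
        # should restart itself (the "suicide" behaviour).
        return "restart"
    return "written"


disk = SharedDisk()
disk.reserve("B")            # the surviving node locks the device
print(try_write(disk, "B"))  # prints: written
print(try_write(disk, "A"))  # prints: restart
```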
2 RAC cluster
In a stand-alone environment, Oracle runs on top of the OS kernel. The OS kernel manages the hardware devices and provides the hardware access interface. Oracle does not operate the hardware directly; instead it calls the OS kernel, which carries out the hardware requests on its behalf.
In a cluster environment, the storage device is shared. The OS kernel is designed for a single machine and can only arbitrate access among multiple processes on one host. Relying on OS kernel services alone therefore cannot guarantee coordination among multiple hosts, and an additional control mechanism is needed. In RAC, this mechanism is Clusterware, which sits between Oracle and the OS kernel. It intercepts a request before it reaches the OS kernel, negotiates with Clusterware on the other nodes, and finally completes the upper layer's request.
Prior to Oracle 10g, RAC depended on Clusterware from hardware vendors such as Sun, HP, and Veritas. Starting with Oracle 10.1, Oracle shipped its own cluster product, Cluster Ready Service (CRS), and since then RAC no longer depends on any vendor's cluster software. In Oracle 10.2, the product was renamed Oracle Clusterware.
So an RAC cluster actually contains two clusters: one formed by the Clusterware software, and one formed by the Database.
2.2 Clusterware Components
Oracle Clusterware is a separate installation package. Once installed, it starts automatically on each node. The Oracle Clusterware runtime environment consists of two disk files (OCR and Voting Disk), a number of processes, and network elements.
2.2.1 Disk files
Clusterware requires two files during operation: the OCR and the Voting Disk. Both files must be stored on shared storage. The OCR is used to solve the amnesia problem, and the Voting Disk is used to solve the split brain problem. Oracle recommends storing the two files on raw devices, one raw device per file; about 100 MB per raw device is enough.
The amnesia problem arises because each node holds its own copy of the configuration and changes are not synchronized between the copies. Oracle's solution is to put the configuration file on shared storage; this file is the OCR Disk.
The OCR stores the cluster configuration information in "Key-Value" form. Before Oracle 10g, this file was called the Server Manageability Repository (SRVM). In Oracle 10g this part was redesigned and renamed OCR. During the Oracle Clusterware installation, the installer prompts the user for the OCR location. The location the user specifies is recorded in the /etc/oracle/ocr.loc (Linux) or /var/opt/oracle/ocr.loc (Solaris) file. In Oracle 9i RAC, the equivalent is the srvConfig.loc file. At startup, Oracle Clusterware reads the OCR content from the location recorded in this file.
1). OCR key
The OCR content is a tree structure with three major branches: SYSTEM, DATABASE, and CRS. Each branch has many smaller branches beneath it. The recorded information can only be modified by the root user.
2). OCR process
Oracle Clusterware stores the cluster configuration in the OCR, so the OCR content is very important, and every operation on the OCR must preserve its integrity. For that reason, not every node may operate on the OCR Disk while Oracle Clusterware is running.
Each node keeps a copy of the OCR content in memory, called the OCR Cache. Each node has an OCR Process that reads and writes the OCR Cache, but only one node's OCR Process reads and writes the OCR Disk; that node is called the OCR Master node. Its OCR Process is responsible for updating the OCR Cache on the local node and on the other nodes.
Other processes that need OCR content, such as OCSSD and EVM, are called Client Processes. They do not access the OCR Cache directly; instead they send a request to the OCR Process, which accesses the content for them. To modify OCR content, the local node's OCR Process submits a request to the Master node's OCR Process; the Master OCR Process performs the physical read and write and synchronizes the OCR Cache on all nodes.
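The read and write paths described above can be modelled in a few lines of Python. This is a simplified sketch under stated assumptions: the `OcrCluster` class, its method names, and the key `SYSTEM.example.key` are hypothetical, and inter-node messaging is replaced by direct method calls.

```python
class OcrCluster:
    """Toy model: one shared OCR disk, one cache per node, one master node."""

    def __init__(self, nodes, master):
        self.disk = {}                         # the shared OCR Disk
        self.caches = {n: {} for n in nodes}   # per-node OCR Cache
        self.master = master
        self.disk_writers = []                 # records which node touched the disk

    def read(self, node, key):
        # A client asks its local OCR process, which answers from the local cache.
        return self.caches[node].get(key)

    def write(self, node, key, value):
        # The local OCR process forwards the change to the master,
        # which is the only one that writes to the disk.
        self._master_write(key, value)

    def _master_write(self, key, value):
        self.disk[key] = value
        self.disk_writers.append(self.master)
        for cache in self.caches.values():     # synchronize every node's cache
            cache[key] = value


cluster = OcrCluster(["rac1", "rac2"], master="rac1")
cluster.write("rac2", "SYSTEM.example.key", "42")   # change requested on rac2
print(cluster.read("rac1", "SYSTEM.example.key"))   # prints: 42
print(cluster.disk_writers)                         # prints: ['rac1']
```

Even though the change was requested on rac2, only the master rac1 wrote to the disk, and both caches end up with the same value.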
2.2.1.2 Voting Disk
The Voting Disk mainly records the state of the member nodes. In case of split brain, it decides which partition gets control; the other partitions are removed from the cluster. The Clusterware installer also prompts for the Voting Disk location. After installation, the Voting Disk location can be viewed with the following command:
$ crsctl query css votedisk
2.2.2 Clusterware background processes
Clusterware consists of a number of processes, the three most important being CRSD, CSSD, and EVMD. In the final stage of the Clusterware installation, the root.sh script must be run on each node. The script adds entries for these three processes at the end of the /etc/inittab file, so that Clusterware starts automatically at every system boot. If EVMD or CRSD fails, the system automatically restarts that process; if CSSD fails, the system reboots immediately.
OCSSD is the most critical Clusterware process; if it fails, the system reboots. This process provides the CSS (Cluster Synchronization Service) service. CSS monitors the cluster state in real time through several heartbeat mechanisms and provides the basic split brain protection for the cluster.
CSS has two heartbeat mechanisms: the Network Heartbeat over the private network, and the Disk Heartbeat through the Voting Disk.
Each of the two heartbeats has a maximum delay. For the Disk Heartbeat, this delay is called IOT (I/O Timeout); for the Network Heartbeat, it is called MC (Misscount). Both parameters are in seconds, and by default IOT is greater than MC. Both are determined automatically by Oracle by default, and adjusting them is not recommended. The parameter values can be viewed with the following commands:
$ crsctl get css disktimeout
$ crsctl get css misscount
Note: Besides Clusterware, the OCSSD process is also needed in a single-node environment if ASM is used; there it supports the communication between the ASM Instance and the RDBMS Instance. Installing RAC on a node that already uses ASM raises a problem: a RAC node needs only one OCSSD process, and it should be the one running under the $CRS_HOME directory. In that case, stop ASM first and remove the earlier inittab entry with $ORACLE_HOME/bin/localconfig.sh delete. Before installing ASM, the same script is used to start OCSSD: $ORACLE_HOME/bin/localconfig.sh add.
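How the two CSS heartbeat limits might be applied can be sketched in Python. The function `heartbeat_status` is hypothetical, and the values 30 and 200 are placeholders chosen only so that the disk timeout is larger than the misscount; they are not presented as Oracle's defaults.

```python
def heartbeat_status(seconds_since_network_hb, seconds_since_disk_hb,
                     misscount, disktimeout):
    """Classify a node from the age of its last two heartbeats (all in seconds)."""
    problems = []
    if seconds_since_network_hb > misscount:
        problems.append("network heartbeat missed")
    if seconds_since_disk_hb > disktimeout:
        problems.append("disk heartbeat timed out")
    return problems or ["healthy"]


# Placeholder limits: misscount=30 (MC), disktimeout=200 (IOT).
print(heartbeat_status(5, 10, misscount=30, disktimeout=200))
# prints: ['healthy']
print(heartbeat_status(45, 10, misscount=30, disktimeout=200))
# prints: ['network heartbeat missed']
```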
CRSD is the main process for achieving high availability (HA); the service it provides is called CRS (Cluster Ready Service).
Oracle Clusterware is a cluster-level component that provides "high availability services" for application-layer resources (CRS Resources). To do so, Oracle Clusterware must monitor these resources and intervene when they behave abnormally, including shutting them down, restarting processes, or relocating services. The CRSD process provides these services.
Every component that requires high availability is registered in the OCR as a CRS Resource when it is installed and configured. The CRSD process uses the OCR content to decide which processes to monitor, how to monitor them, and what to do when a problem occurs. In other words, the CRSD process monitors the running state of CRS Resources and starts, stops, monitors, and fails over these resources. By default, CRS automatically tries to restart a resource 5 times; if the resource still fails, CRS stops trying.
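The bounded restart behaviour can be sketched as a small Python loop. The `supervise` function and its `start_resource` callable are hypothetical; only the limit of 5 attempts comes from the text above.

```python
def supervise(start_resource, max_attempts=5):
    """Try to start a resource up to max_attempts times.

    start_resource: hypothetical callable that returns True once the resource is up.
    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        if start_resource():
            return attempt
    return None


results = iter([False, False, True])      # a resource that comes up on the third try
print(supervise(lambda: next(results)))   # prints: 3
print(supervise(lambda: False))           # prints: None
```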
CRS Resources include GSD (Global Service Daemon), ONS (Oracle Notification Service), VIP, Database, Instance, and Service. These resources fall into two categories:
GSD, ONS, VIP, and Listener belong to the Nodeapps class.
Database, Instance, and Service belong to the Database-Related Resource class.
We can read it this way: Nodeapps are resources of which one per node is enough, for example one Listener per node. Database-Related Resources are tied to a database rather than to a node; for example, one node can have multiple instances, and each instance can have multiple Services.
GSD, ONS, and VIP are created and registered in the OCR when VIPCA runs at the end of the Clusterware installation. Database, Listener, Instance, and Service are registered in the OCR, automatically or manually, when each of them is configured.
EVMD is responsible for publishing the events generated by CRS. Events can be delivered to clients in two ways: ONS and callout scripts. Users can write their own callout scripts and place them in a specific directory; when certain events occur, EVMD automatically scans that directory and calls the user's scripts. The call itself is carried out by the racgevt process.
Besides publishing events, EVMD is also the bridge between the CRSD and CSSD processes: communication between the CRS and CSS services goes through EVMD.
RACGIMON checks the health of the database and is responsible for starting, stopping, and failing over Services. It keeps a persistent connection to the database and regularly checks specific information in the SGA, which the PMON process updates periodically.
OPROCD is also called the Process Monitor Daemon. You will see this process on non-Linux platforms when no third-party cluster software is used. It checks for a processor hang (CPU hang) on the node: if the scheduling delay exceeds 1.5 seconds, it considers the CPU to be malfunctioning and restarts the node. In other words, this process also provides "IO isolation". Its service name on the Windows platform, OraFenceService, hints at this role. On the Linux platform, the hangcheck-timer module is used to implement "IO isolation".
2.3 VIP Principles and Characteristics
Oracle's TAF is built on VIP technology. The difference between an IP and a VIP is that an IP relies on TCP-layer timeouts, while a VIP gives an immediate response at the application layer. A VIP is a floating IP: when a node has a problem, the VIP automatically moves to another node.
Suppose a 2-node RAC in which each node has one VIP during normal operation: VIP1 and VIP2. When node 2 fails, for example by shutting down abnormally, RAC does the following:
1). CRS detects the failure of node 2 and triggers a Clusterware reconfiguration; node 2 is eventually removed from the cluster, and node 1 forms the new cluster.
2). RAC's failover mechanism moves node 2's VIP to node 1. The PUBLIC NIC of node 1 then has 3 IP addresses: VIP1, VIP2, and PUBLIC IP1.
3). User connection requests to VIP2 are routed to node 1 at the IP layer.
4). Because node 1 now holds the VIP2 address, all packets pass through the routing layer, network layer, and transport layer.
5). However, node 1 listens only on VIP1 and PUBLIC IP1, not on VIP2, so no application is there to receive the packets at the application layer, and the error is detected immediately.
6). The client receives this error immediately and reissues its connection request to VIP1.
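The benefit of the failover steps above, immediate refusal instead of a TCP timeout, can be shown with a toy client in Python. The `connect` function, the three address states, and the `TCP_TIMEOUT` value of 60 seconds are all assumptions made for illustration; no real networking is involved.

```python
TCP_TIMEOUT = 60.0   # placeholder value, in seconds


def connect(addresses, state):
    """Try each address in order; return (address, seconds_waited).

    state maps an address to 'listening', 'refused' (the address is up on some
    node but no listener is bound to it) or 'down' (nothing answers at all).
    """
    waited = 0.0
    for addr in addresses:
        if state[addr] == "listening":
            return addr, waited
        if state[addr] == "down":
            waited += TCP_TIMEOUT   # the client must wait for the TCP-layer timeout
        # 'refused' fails immediately, so no waiting time is added
    return None, waited


# Node 2 failed and VIP2 moved to node 1, where nothing listens on it.
print(connect(["VIP2", "VIP1"], {"VIP2": "refused", "VIP1": "listening"}))
# prints: ('VIP1', 0.0)
# Without a VIP, node 2's plain IP simply stops answering.
print(connect(["IP2", "IP1"], {"IP2": "down", "IP1": "listening"}))
# prints: ('IP1', 60.0)
```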
The characteristics of a VIP are:
1). A VIP is created by the VIPCA script.
2). A VIP is registered in the OCR as a CRS Resource of the Nodeapps type, and CRS maintains its state.
3). A VIP is bound to the node's public network card, so the public network card has two addresses.
4). When a node fails, CRS moves the failed node's VIP to another node.
5). The Listener on each node listens on both the public IP and the VIP of the public network card.
6). The client's tnsnames.ora is normally configured to point to the nodes' VIPs.
2.4 Clusterware log system
Oracle Clusterware can only be diagnosed from its logs and traces, and its log system is fairly complicated.
$ORA_CRS_HOME/log/hostname/alert.log: this is the first file to check.
Clusterware daemon logs:
crsd.log: $ORA_CRS_HOME/log/hostname/crsd/crsd.log
ocssd.log: $ORA_CRS_HOME/log/hostname/cssd/ocssd.log
evmd.log: $ORA_CRS_HOME/log/hostname/evmd/evmd.log
Nodeapps logs:
$ORA_CRS_HOME/log/hostname/racg/
This directory holds the Nodeapps logs, including those of ONS and VIP, for example: ora.rac1.ons.log
Tool execution logs:
$ORA_CRS_HOME/log/hostname/client/
Clusterware provides a number of command-line tools, for example ocrcheck, ocrconfig, ocrdump, oifcfg, and clscfg; these tools write their logs to this directory.
Related logs are also found in $ORACLE_HOME/log/hostname/client/ and $ORACLE_HOME/log/hostname/racg.
Note: This article is compiled from Zhang Xiaoming's book "大话 Oracle RAC".
This article comes from the CSDN blog; when reproducing it, please cite the source: http://blog.csdn.net/tianlesoftware/archive/2010/02/27/5331067.aspx