CN103995901A - Method for determining data node failure - Google Patents
Method for determining data node failure
- Publication number
- CN103995901A CN103995901A CN201410254980.8A CN201410254980A CN103995901A CN 103995901 A CN103995901 A CN 103995901A CN 201410254980 A CN201410254980 A CN 201410254980A CN 103995901 A CN103995901 A CN 103995901A
- Authority
- CN
- China
- Prior art keywords
- data node
- node
- application
- connect
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
Abstract
The invention discloses a method for determining data node failure in a distributed database. The method comprises: when any application node accessing the distributed database cannot connect to a data node of the distributed database, it broadcasts to the other application nodes that the data node cannot be connected; after receiving the broadcast, each other application node sends a connection request to the data node to determine whether it can be connected; when the number of application nodes that cannot connect to the data node reaches a set threshold, the data node is determined to have failed. Because the application nodes have different IP addresses, the method avoids the impact that network fluctuation on a single IP has when synchronization requests are sent to the data node from the same IP, and therefore judges the cause of a data node failure more accurately.
Description
Technical field
The present invention relates to the field of distributed databases, and in particular to a method for determining data node failure.
Background technology
With the development of network technology, requirements on data storage and access keep rising, and distributed databases have emerged in response. Their high scalability and high availability solve a hard problem for the many web sites that must run without interruption.
A distributed database is composed of sub-databases distributed across multiple computer nodes; each sub-database on a computer node is called a data node. The data nodes are logically related and equal in status. To keep the whole distributed database running normally, the operating state of every data node must be known promptly, so that it can be determined whether a node still provides service, i.e. whether the data node is valid. Network fluctuation, hardware faults, and similar causes can all make a data node fail: network fluctuation typically causes a temporary failure, while a hardware fault causes a permanent one. An effective means is therefore needed to determine whether the current data node has failed.
Cassandra is an open-source distributed NoSQL database system. Thanks to its good scalability it has been adopted by many well-known web sites and has become a popular distributed structured-data storage solution. In Cassandra, node failure is judged by accrual failure detection, i.e. detection based on a suspicion level. The basic idea is that, in a distributed environment, whether a data node has failed is judged by a value representing the suspicion of failure: within a given time window, synchronization requests are sent to the data node continuously, and each time the data node fails to respond to a synchronization message, the node's suspicion value is incremented by 1; once the suspicion value reaches a set threshold, the node is determined to have failed permanently.
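The suspicion-count mechanism described above can be sketched as follows. This is a minimal illustration, not Cassandra's actual implementation (the real accrual detector computes a continuous phi value from heartbeat inter-arrival statistics rather than a plain counter); all class and method names here are assumptions.

```python
class SuspicionDetector:
    """Simplified stand-in for suspicion-based failure detection:
    a plain counter instead of Cassandra's real phi value (an assumption)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.suspicion = 0

    def on_sync_result(self, responded):
        # Each missed synchronization response raises suspicion by 1.
        if not responded:
            self.suspicion += 1

    def is_failed(self):
        return self.suspicion >= self.threshold


detector = SuspicionDetector(threshold=3)
# A burst of packet loss on the single probing IP looks exactly like a
# dead node -- the weakness the following paragraph criticizes.
for responded in [False, False, False]:
    detector.on_sync_result(responded)
assert detector.is_failed()
```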
Because this suspicion-based detection sends the synchronization requests to the data node from a single IP, it cannot shield the requests from network fluctuation affecting that IP. For a period of time, fluctuation may cause loss of the synchronization requests and/or of the data node's responses, so the node's suspicion value climbs sharply and may even reach the set threshold, causing the node to be judged permanently failed, although once the fluctuation passes the node is in fact still available rather than genuinely failed. The existing suspicion-based detection may therefore misjudge data node failure.
Summary of the invention
In view of this, the present invention provides a method for determining data node failure that accurately distinguishes a temporary failure caused by the network from a permanent failure caused by hardware.
The technical solution of the present application is achieved as follows:
A method for determining data node failure, used in a distributed database, comprises:
Among all application nodes accessing the distributed database, when any application node cannot connect to a data node of the distributed database, it broadcasts to the other application nodes that the data node cannot be connected;
After receiving the broadcast, each other application node sends a connection request to the data node to determine whether it can connect to the data node;
When the number of application nodes that cannot connect to the data node reaches a set threshold, the data node is determined to have failed.
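The three claimed steps above can be sketched in miniature. The `AppNode` class and `try_connect` helper are hypothetical stand-ins; the patent does not specify a transport, so the broadcast and per-node connection attempts are simulated with direct calls.

```python
class AppNode:
    """Stand-in for one application node on its own IP (name assumed)."""

    def __init__(self, can_reach):
        self.can_reach = can_reach

    def try_connect(self, data_node):
        # Stand-in for a real connection request to the data node.
        return self.can_reach


def data_node_failed(app_nodes, data_node, threshold):
    """Count the application nodes that cannot connect to the data node
    and declare failure when the count reaches the threshold."""
    failures = sum(1 for a in app_nodes if not a.try_connect(data_node))
    return failures >= threshold


# Three of four nodes (each notionally on a different IP) cannot connect,
# which meets the suggested threshold of half the application nodes.
nodes = [AppNode(False), AppNode(False), AppNode(True), AppNode(False)]
assert data_node_failed(nodes, "dn-j", threshold=len(nodes) / 2)
```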
Further, among all application nodes accessing the distributed database, any one application node is selected as an arbitration node to count the number of application nodes that cannot connect to the data node.
Further:
A decision value is set in the arbitration node and initialized to 0;
After sending a connection request to the data node, each of the other application nodes reports to the arbitration node whether it can connect to the data node;
The arbitration node receives these reports from all application nodes, and each time it receives a report that an application node cannot connect to the data node, it adds 1 to the decision value;
After the arbitration node has received the reports from all application nodes:
If the decision value has reached the set threshold, the data node is determined to have failed;
If the decision value has not reached the set threshold, the data node is determined to be valid.
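The arbitration-node bookkeeping described above can be sketched as follows, with direct calls in place of network messages (an assumption; the patent leaves the transport open) and illustrative names throughout.

```python
class ArbitrationNode:
    """Illustrative arbiter; reports arrive as direct calls rather than
    network messages (an assumption)."""

    def __init__(self, total_app_nodes):
        self.total = total_app_nodes
        self.threshold = total_app_nodes / 2  # half of all application nodes
        self.decision = 0                     # decision value, initialized to 0
        self.reports = 0

    def report(self, connected):
        # Each "cannot connect" report adds 1 to the decision value.
        self.reports += 1
        if not connected:
            self.decision += 1

    def verdict(self):
        # Decide only after every application node has reported.
        if self.reports < self.total:
            return None
        failed = self.decision >= self.threshold
        if not failed:
            self.decision = 0  # node is valid: restore the initial value 0
        return failed


arb = ArbitrationNode(total_app_nodes=4)
for connected in [False, True, False, False]:
    arb.report(connected)
assert arb.verdict() is True  # 3 >= 2: the data node is declared failed
```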
Further, the threshold is half of the number of all application nodes accessing the distributed database.
Further, after the data node is determined to have failed, the method also comprises:
Deleting the data node from the distributed database;
Enabling the backup node of the data node.
Further, after the data node is determined to be valid, the method also comprises:
Restoring the decision value to its initial value 0;
The application node that cannot connect to the data node periodically sends connection requests to the data node, waiting for the connection to recover.
Further, when any application node cannot connect to a data node of the distributed database, that application node masks its connection to the data node.
Further, each application node has a different IP address.
It can be seen from the above scheme that, in the method of the present invention, when an application node cannot connect to a data node, multiple application nodes send connection requests to that data node to determine whether it can be connected, and hence whether it has failed. Because each application node has a different IP address, the method avoids the impact of network fluctuation on a single IP that arises in the prior art, where synchronization requests are sent to the data node from the same IP. The present invention therefore distinguishes a temporary failure caused by the network from a permanent failure caused by hardware more accurately than the prior art.
Brief description of the drawings
Fig. 1 is a flowchart of the method for determining data node failure according to the present invention;
Fig. 2 is a flowchart of an embodiment of the present invention.
Embodiment
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
The method for determining data node failure of the present invention is used in a distributed database and, as shown in Fig. 1, comprises:
Among all application nodes accessing the distributed database, when any application node cannot connect to a data node of the distributed database, it broadcasts to the other application nodes that the data node cannot be connected;
After receiving the broadcast, each other application node sends a connection request to the data node to determine whether it can connect to the data node;
When the number of application nodes that cannot connect to the data node reaches a set threshold, the data node is determined to have failed.
The number of application nodes that cannot connect to the data node is counted in an arbitration node, which is selected as follows: among all application nodes accessing the distributed database, any one application node is chosen as the arbitration node.
The arbitration node counts the application nodes that cannot connect to the data node as follows:
A decision value is set in the arbitration node and initialized to 0;
After sending a connection request to the data node, each of the other application nodes reports to the arbitration node whether it can connect to the data node;
The arbitration node receives these reports from all application nodes, and each time it receives a report that an application node cannot connect to the data node, it adds 1 to the decision value;
After the arbitration node has received the reports from all application nodes:
If the decision value has reached the set threshold, the data node is determined to have failed;
If the decision value has not reached the set threshold, the data node is determined to be valid.
Unlike the prior art, in the method of the present invention, when an application node cannot connect to a data node, multiple application nodes send connection requests to that data node to determine whether it can be connected, and hence whether it has failed. Each application node has a different IP address, so the method avoids the impact of network fluctuation on a single IP that arises in the prior art, where synchronization requests are sent to the data node from the same IP, and therefore distinguishes a temporary failure caused by the network from a permanent failure caused by hardware more accurately than the prior art.
In the above method, after the data node is determined to have failed, the method also comprises:
Deleting the data node from the distributed database;
Enabling the backup node of the data node.
The failed data node is thereby replaced.
After the data node is determined to be valid, the method of the present invention also comprises:
Restoring the decision value to its initial value 0;
The application node that cannot connect to the data node periodically sends connection requests to the data node, waiting for the connection to recover.
In real network applications, the number of application nodes accessing a distributed database is large, each application node has a different IP address, and the distributed database contains a large number of data nodes. The method of the present invention is described below with a specific embodiment. Suppose N application nodes (N>1) access the distributed database, which contains M data nodes (M>1), and that application node i (1≤i≤N) cannot connect to data node j (any one of the M data nodes). As shown in Fig. 2, this embodiment comprises the following steps:
Step 1: arbitrarily select one of the N application nodes as the arbitration node, set a decision value in it initialized to "0", and set the threshold to N/2; then go to step 2.
Step 2: when application node i cannot connect to data node j in the distributed database, it broadcasts to the other application nodes that data node j cannot be connected; then go to step 3.
When any application node cannot connect to a data node of the distributed database, it may additionally mask its connection to that data node. For example, in step 2, when application node i cannot connect to data node j, node i masks its connection to data node j, avoiding the network resource overhead of repeatedly initiating connections to data node j that cannot succeed.
Step 3: after receiving the broadcast that data node j cannot be connected, the other application nodes send connection requests to data node j; then go to step 4.
Step 4: the other application nodes all report to the arbitration node whether they can connect to data node j; then go to step 5.
Step 5: the arbitration node receives these reports from all application nodes, and each time it receives a report that an application node cannot connect to data node j, it adds 1 to the decision value; then go to step 6.
Step 6: the arbitration node judges whether the accumulated decision value has reached the threshold N/2: if so, data node j is determined to have failed, and the procedure goes to step 7; otherwise, data node j is determined to be valid, and the procedure goes to step 9.
Step 7: delete data node j from the distributed database; then go to step 8.
Step 8: enable the backup node j' of data node j to replace data node j.
Step 9: the arbitration node restores the decision value to its initial value 0 and notifies application node i that data node j is valid; then go to step 10.
Step 10: after receiving the message from the arbitration node that data node j is valid, application node i periodically sends connection requests to data node j, waiting for the connection to recover.
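Steps 1 through 10 above can be simulated end to end as follows. The `Cluster` class and its methods are illustrative assumptions (as is counting the reporter's own failed attempt toward the decision value); a real deployment would replace the in-process probes with network connection attempts from distinct IPs.

```python
class Cluster:
    """Illustrative in-process model of the N application nodes, the
    arbitration node, and the data nodes (all names assumed)."""

    def __init__(self, n_app_nodes, data_nodes):
        self.n = n_app_nodes
        self.data_nodes = set(data_nodes)
        self.backups = {dn: dn + "-backup" for dn in data_nodes}
        self.reachable = {dn: True for dn in data_nodes}  # simulated network state
        self.masked = set()  # (app_node_index, data_node) pairs

    def try_connect(self, app, dn):
        # Stand-in for a connection attempt from app's own IP.
        return self.reachable.get(dn, False)

    def handle_connect_failure(self, reporter, dn):
        # Step 2: the reporter masks its connection and broadcasts.
        self.masked.add((reporter, dn))
        # Steps 3-5: the other nodes probe dn and report to the arbiter;
        # the reporter's own failed attempt is counted too (an assumption).
        decision = 1
        for app in range(self.n):
            if app != reporter and not self.try_connect(app, dn):
                decision += 1
        # Step 6: compare the decision value against the threshold N/2.
        if decision >= self.n / 2:
            # Steps 7-8: delete the node and enable its backup.
            self.data_nodes.discard(dn)
            self.data_nodes.add(self.backups[dn])
            return "failed"
        # Steps 9-10: the node is valid; the reporter retries periodically.
        return "valid"


cluster = Cluster(n_app_nodes=5, data_nodes={"dn-1", "dn-2"})
cluster.reachable["dn-2"] = False  # simulate a hardware failure on dn-2
assert cluster.handle_connect_failure(reporter=0, dn="dn-2") == "failed"
assert "dn-2-backup" in cluster.data_nodes and "dn-2" not in cluster.data_nodes
```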
With the method for determining data node failure of the present invention, when an application node cannot connect to a data node, multiple application nodes send connection requests to that data node to determine whether it can be connected, and hence whether it has failed. Because each application node has a different IP address, the method avoids the impact of network fluctuation on a single IP that arises in the prior art, where synchronization requests are sent to the data node from the same IP. The present invention therefore distinguishes a temporary failure caused by the network from a permanent failure caused by hardware more accurately than the prior art.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit it; any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (8)
1. A method for determining data node failure, used in a distributed database, the method comprising:
among all application nodes accessing the distributed database, when any application node cannot connect to a data node of the distributed database, broadcasting to the other application nodes that the data node cannot be connected;
after receiving the broadcast, each other application node sending a connection request to the data node to determine whether it can connect to the data node;
when the number of application nodes that cannot connect to the data node reaches a set threshold, determining that the data node has failed.
2. The method for determining data node failure according to claim 1, wherein:
among all application nodes accessing the distributed database, any one application node is selected as an arbitration node to count the number of application nodes that cannot connect to the data node.
3. The method for determining data node failure according to claim 2, wherein:
a decision value is set in the arbitration node and initialized to 0;
after sending a connection request to the data node, each of the other application nodes reports to the arbitration node whether it can connect to the data node;
the arbitration node receives these reports from all application nodes, and each time it receives a report that an application node cannot connect to the data node, it adds 1 to the decision value;
after the arbitration node has received the reports from all application nodes:
if the decision value has reached the set threshold, the data node is determined to have failed;
if the decision value has not reached the set threshold, the data node is determined to be valid.
4. The method for determining data node failure according to claim 1, wherein the threshold is half of the number of all application nodes accessing the distributed database.
5. The method for determining data node failure according to claim 1, wherein, after the data node is determined to have failed, the method further comprises:
deleting the data node from the distributed database;
enabling the backup node of the data node.
6. The method for determining data node failure according to claim 3, wherein, after the data node is determined to be valid, the method further comprises:
restoring the decision value to its initial value 0;
the application node that cannot connect to the data node periodically sending connection requests to the data node, waiting for the connection to recover.
7. The method for determining data node failure according to claim 1, wherein, when any application node cannot connect to a data node of the distributed database, that application node masks its connection to the data node.
8. The method for determining data node failure according to claim 1, wherein each application node has a different IP address.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201410254980.8A (granted as CN103995901B) | 2014-06-10 | 2014-06-10 | Method for determining data node failure
Publications (2)
Publication Number | Publication Date
---|---
CN103995901A | 2014-08-20
CN103995901B | 2018-01-12
Family
ID=51310066
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102231681A (en) * | 2011-06-27 | 2011-11-02 | 中国建设银行股份有限公司 | High availability cluster computer system and fault treatment method thereof |
US20120101987A1 (en) * | 2010-10-25 | 2012-04-26 | Paul Allen Bottorff | Distributed database synchronization |
CN102882792A (en) * | 2012-06-20 | 2013-01-16 | 杜小勇 | Method for simplifying internet propagation path diagram |
US20130246608A1 (en) * | 2012-03-15 | 2013-09-19 | Microsoft Corporation | Count tracking in distributed environments |
US20130297976A1 (en) * | 2012-05-04 | 2013-11-07 | Paraccel, Inc. | Network Fault Detection and Reconfiguration |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105306545A (en) * | 2015-09-28 | 2016-02-03 | 浪潮(北京)电子信息产业有限公司 | Failover method and system for external service node of cluster |
CN105306545B (en) * | 2015-09-28 | 2018-09-07 | 浪潮(北京)电子信息产业有限公司 | A kind of method and system of the external service node Takeover of cluster |
CN105975212A (en) * | 2016-04-29 | 2016-09-28 | 深圳市永兴元科技有限公司 | Failure detection processing method and device for distributed data system |
CN108616566A (en) * | 2018-03-14 | 2018-10-02 | 华为技术有限公司 | Raft distributed systems select main method, relevant device and system |
CN112860799A (en) * | 2021-02-22 | 2021-05-28 | 浪潮云信息技术股份公司 | Management method for data synchronization of distributed database |
CN113783735A (en) * | 2021-09-24 | 2021-12-10 | 小红书科技有限公司 | Method, device, equipment and medium for identifying fault node in Redis cluster |
Legal Events
Code | Title
---|---
C06 | Publication
PB01 | Publication
C10 | Entry into substantive examination
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant