CN103995901A - Method for determining data node failure - Google Patents

Publication number
CN103995901A
Authority
CN
China
Prior art keywords
data node
node
application
connect
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410254980.8A
Other languages
Chinese (zh)
Other versions
CN103995901B (en)
Inventor
赵晓平
唐超
马丽伟
秦波
王�锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2014-06-10
Filing date
2014-06-10
Publication date
2014-08-20
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201410254980.8A
Publication of CN103995901A
Application granted
Publication of CN103995901B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Abstract

The invention discloses a method for determining data node failure, for use in a distributed database. The method comprises: among all application nodes that access the distributed database, when any application node cannot connect to a data node in the distributed database, that application node broadcasts to the other application nodes that the data node cannot be connected; after receiving the broadcast, the other application nodes each send a connection request to the data node to determine whether it can be connected; and when the number of application nodes that cannot connect to the data node reaches a set threshold, the data node is determined to have failed. Because the application nodes belong to different IPs, the method avoids the influence of network fluctuation on a single IP that arises when synchronization requests are sent to the data node from the same IP, so the cause of the data node failure can be judged more accurately.

Description

A method for determining data node failure
Technical field
The present invention relates to the field of distributed databases, and in particular to a method for determining data node failure.
Background art
With the development of network technology, the demands on data storage and access keep rising, and distributed databases have emerged in response. The high scalability and high availability of distributed databases solve a hard problem for many websites that must run without interruption.
A distributed database is composed of sub-databases distributed across multiple computer nodes. Each sub-database on a computer node is called a data node; the data nodes are logically related and have equal status. To keep the whole distributed database running normally, the running state of each data node must be known promptly in order to determine whether it can provide service normally, that is, whether the data node is valid. Network fluctuation, hardware faults and other causes can all make a data node fail: for example, network fluctuation causes temporary failure of a data node, whereas a hardware fault causes permanent failure. An effective means of determining whether a data node has currently failed is therefore needed.
Cassandra is an open-source distributed NoSQL database system. Owing to its good scalability, Cassandra has been adopted by many well-known websites and has become a popular distributed structured data storage scheme. In Cassandra, node failure is judged by suspicion-based detection (Accrual Failure Detection). The basic idea is to judge, in a distributed environment, whether a data node has failed by means of a value that represents the degree of suspicion of failure. Within a certain time window, synchronization requests are sent continuously to the data node; each time the data node fails to respond to a synchronization message, its failure suspicion value is incremented by 1, and once the value reaches a set threshold the data node is determined to have failed permanently.
Because the above suspicion-based detection sends its synchronization requests to the data node from a single IP, it cannot adequately avoid the effect of network fluctuation on those requests. During a period of network fluctuation, the synchronization requests and/or the data node's responses to them may be lost, so the failure suspicion value of the data node may rise sharply within that period and may even reach the set threshold, causing the data node to be judged permanently failed. Once that period has passed, however, the data node may in fact still be available and not permanently failed at all. The existing suspicion-based detection may therefore misjudge data node failure in use.
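The misjudgment described above can be illustrated with a minimal sketch. It models only the simplified counter described in this background section (one observer, add 1 per unanswered synchronization request); it is not Cassandra's actual detector, and the class name `SuspicionCounter` and the threshold of 3 are assumptions made for illustration.

```python
class SuspicionCounter:
    """Single-observer failure counter, as the background section describes it."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold  # hypothetical value chosen by the operator
        self.suspicion = 0

    def record(self, responded: bool) -> None:
        # Each unanswered synchronization request adds 1 to the suspicion value.
        if not responded:
            self.suspicion += 1

    @property
    def failed(self) -> bool:
        # The node is declared permanently failed once the threshold is reached.
        return self.suspicion >= self.threshold


# A short burst of loss on the observer's own link, after which the node answers again.
counter = SuspicionCounter(threshold=3)
for responded in [True, False, False, False, True, True]:
    counter.record(responded)

print(counter.failed)  # the node is reachable again, yet it stays marked as failed
```

Because every sample comes from the same vantage point, the counter cannot tell a fault on the observer's own link from a fault on the data node.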
Summary of the invention
In view of this, the invention provides a method for determining data node failure, so as to judge accurately whether a data node has failed temporarily because of the network or permanently because of hardware.
The technical scheme of the application is realized as follows:
A method for determining data node failure, for use in a distributed database, the method comprising:
Among all application nodes that access the distributed database, when any application node cannot connect to a data node in the distributed database, sending to the other application nodes a broadcast that this data node cannot be connected;
After receiving the broadcast, the other application nodes send a connection request to this data node to determine whether it can be connected;
When the number of application nodes that cannot connect to this data node reaches a set threshold, determining that this data node has failed.
Further, among all application nodes that access the distributed database, any one application node is selected as an arbitration node to count the application nodes that cannot connect to this data node.
Further:
A decision value is set in the arbitration node and initialized to 0;
After the other application nodes have sent a connection request to this data node, each of them sends the arbitration node information on whether it can connect to this data node;
The arbitration node receives the information on whether this data node can be connected from all application nodes, and each time it receives a message from an application node that this data node cannot be connected, it increments the decision value by 1;
After the arbitration node has received the information on whether this data node can be connected from all application nodes:
If the decision value reaches the set threshold, this data node is determined to have failed;
If the decision value does not reach the set threshold, this data node is determined to be valid.
Further, the threshold is half the number of all application nodes that access the distributed database.
Further, after this data node is determined to have failed, the method also comprises:
Deleting this data node from the distributed database;
Enabling the backup node of this data node.
Further, after this data node is determined to be valid, the method also comprises:
Restoring the decision value to its initial value 0;
The application node that could not connect to this data node periodically sends connection requests to it, waiting for the connection to recover.
Further, when any application node cannot connect to a data node in the distributed database, that application node's connection to this data node is masked.
Further, each application node belongs to a different IP.
As can be seen from the above scheme, in the method of the present invention, after an application node fails to connect to a data node, multiple application nodes send connection requests to that data node to determine whether it can be connected, and from this it is determined whether the data node has failed. Because each application node belongs to a different IP, the method avoids the effect that network fluctuation on a single IP has in the prior art, where synchronization requests are sent to the data node from the same IP. Compared with the prior art, the present invention judges more accurately whether a data node has failed temporarily because of the network or permanently because of hardware.
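The text does not say how an application node issues a connection request. A minimal sketch, assuming the data node listens on a TCP port and that a successful TCP handshake counts as "can be connected", could look like the following; the function name `can_connect` and the one-second default timeout are illustrative assumptions.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    Assumption: reachability of the data node is judged by a plain TCP connect.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Each application node would run such a probe from its own IP after receiving the broadcast, which is what gives the method its independent vantage points.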
Brief description of the drawings
Fig. 1 is a flow chart of the method for determining data node failure according to the present invention;
Fig. 2 is a flow chart of an embodiment of the present invention.
Detailed description of the embodiments
To make the object, technical scheme and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
The method for determining data node failure of the present invention is used in a distributed database and, as shown in Fig. 1, comprises:
Among all application nodes that access the distributed database, when any application node cannot connect to a data node in the distributed database, sending to the other application nodes a broadcast that this data node cannot be connected;
After receiving the broadcast, the other application nodes send a connection request to this data node to determine whether it can be connected;
When the number of application nodes that cannot connect to this data node reaches a set threshold, determining that this data node has failed.
The number of application nodes that cannot connect to this data node is counted in an arbitration node. The arbitration node is selected as follows: among all application nodes that access the distributed database, any one application node is chosen as the arbitration node.
The arbitration node counts the application nodes that cannot connect to this data node as follows:
A decision value is set in the arbitration node and initialized to 0;
After the other application nodes have sent a connection request to this data node, each of them sends the arbitration node information on whether it can connect to this data node;
The arbitration node receives the information on whether this data node can be connected from all application nodes, and each time it receives a message from an application node that this data node cannot be connected, it increments the decision value by 1;
After the arbitration node has received the information on whether this data node can be connected from all application nodes:
If the decision value reaches the set threshold, this data node is determined to have failed;
If the decision value does not reach the set threshold, this data node is determined to be valid.
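The counting procedure above can be sketched as a small class. The names `Arbiter`, `report` and `verdict` are hypothetical, and the sketch assumes the arbitration node knows in advance how many reports to expect.

```python
from typing import Optional


class Arbiter:
    """Tallies connectivity reports for one data node (illustrative sketch)."""

    def __init__(self, expected_reports: int, threshold: float) -> None:
        self.expected_reports = expected_reports
        self.threshold = threshold
        self.decision_value = 0  # initialized to 0, as in the description
        self.received = 0

    def report(self, can_connect: bool) -> None:
        # Every "cannot connect" message increments the decision value by 1.
        self.received += 1
        if not can_connect:
            self.decision_value += 1

    def verdict(self) -> Optional[str]:
        # Decide only after all application nodes have reported.
        if self.received < self.expected_reports:
            return None
        return "failed" if self.decision_value >= self.threshold else "valid"


arbiter = Arbiter(expected_reports=5, threshold=5 / 2)
for ok in [False, False, False, True, True]:
    arbiter.report(ok)
print(arbiter.verdict())  # three of five nodes cannot connect, which reaches 2.5
```

Holding the verdict back until every report has arrived mirrors the description, which evaluates the threshold only after information from all application nodes has been received.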
Unlike the prior art, in the method of the present invention, after an application node fails to connect to a data node, multiple application nodes send connection requests to that data node to determine whether it can be connected, and from this it is determined whether the data node has failed. Because each application node belongs to a different IP, the method avoids the effect that network fluctuation on a single IP has in the prior art, where synchronization requests are sent to the data node from the same IP, and so judges more accurately than the prior art whether a data node has failed temporarily because of the network or permanently because of hardware.
In the above method of the present invention, after this data node is determined to have failed, the method also comprises:
Deleting this data node from the distributed database;
Enabling the backup node of this data node.
The failed data node is thereby replaced.
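The replacement step can be sketched as a pure function over the set of active data nodes. The data structures here (a set of node names and a mapping from each data node to its backup) are assumptions; the text does not specify how the cluster membership is stored.

```python
from typing import Dict, Set


def replace_failed_node(active: Set[str], backups: Dict[str, str], failed: str) -> Set[str]:
    """Return the active data nodes after swapping a failed node for its backup."""
    updated = set(active)
    updated.discard(failed)       # delete the failed data node from the database
    backup = backups.get(failed)
    if backup is not None:
        updated.add(backup)       # enable the backup node in its place
    return updated


print(replace_failed_node({"j", "k"}, {"j": "j_backup"}, "j"))
```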
After this data node is determined to be valid, the method of the present invention also comprises:
Restoring the decision value to its initial value 0;
The application node that could not connect to this data node periodically sends connection requests to it, waiting for the connection to recover.
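The periodic reconnection can be sketched as a bounded retry loop. The retry interval and the cap on attempts are assumptions added so that the example terminates; the text only says that requests are sent at timed intervals until the connection recovers. The `probe` argument stands for whatever connection check the application node uses.

```python
import time
from typing import Callable


def wait_for_recovery(probe: Callable[[], bool], interval: float, max_attempts: int) -> bool:
    """Call probe() at fixed intervals until it succeeds or attempts run out."""
    for attempt in range(max_attempts):
        if probe():
            return True               # the data node is reachable again
        if attempt < max_attempts - 1:
            time.sleep(interval)      # wait before the next connection request
    return False


# Simulated probe: the link recovers on the third attempt.
answers = iter([False, False, True])
print(wait_for_recovery(lambda: next(answers), interval=0.0, max_attempts=5))
```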
In a real network deployment, the number of application nodes that access the distributed database is very large, each application node has a different IP address, and the distributed database contains a large number of data nodes. The method of the present invention is described below with reference to a specific embodiment. In this embodiment, suppose that N application nodes access the distributed database (N > 1) and that the distributed database contains M data nodes (M > 1). Suppose further that application node i among the N application nodes (1 ≤ i ≤ N) cannot connect to data node j in the distributed database, where data node j is any one of the M data nodes. As shown in Fig. 2, this embodiment comprises the following steps:
Step 1: select any one of the N application nodes as the arbitration node, set a decision value in the arbitration node and initialize it to 0, set a threshold of N/2, then proceed to step 2.
Step 2: when application node i cannot connect to data node j in the distributed database, it sends the other application nodes a broadcast that data node j cannot be connected, then proceeds to step 3.
When any application node cannot connect to a data node in the distributed database, the method may further comprise masking that application node's connection to the data node. For example, in step 2, when application node i cannot connect to data node j, application node i masks its connection to data node j. This avoids the network resource overhead of application node i repeatedly initiating connections to data node j that cannot succeed.
Step 3: after receiving the broadcast that data node j cannot be connected, the other application nodes send a connection request to data node j, then proceed to step 4.
Step 4: the other application nodes each send the arbitration node information on whether they can connect to data node j, then proceed to step 5.
Step 5: the arbitration node receives the information on whether data node j can be connected from all application nodes, and each time it receives a message from an application node that data node j cannot be connected, it increments the decision value by 1, then proceeds to step 6.
Step 6: the arbitration node judges whether the accumulated decision value reaches the set threshold N/2. If it does, data node j is determined to have failed, and the method proceeds to step 7; if it does not, data node j is determined to be valid, and the method proceeds to step 9.
Step 7: delete data node j from the distributed database, then proceed to step 8.
Step 8: enable the backup node j' of data node j to replace data node j.
Step 9: the arbitration node restores the decision value to its initial value 0 and notifies application node i that data node j is valid, then proceeds to step 10.
Step 10: after receiving the arbitration node's message that data node j is valid, application node i periodically sends connection requests to data node j, waiting for the connection to recover.
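Steps 1 to 10 can be traced in a small in-memory simulation. This is a sketch under stated assumptions, not the patented implementation: real network probes are replaced by a `reachable` dictionary, the arbitration node is simply the first application node listed, node i's own failure is counted as one report, and all names (`run_detection`, `a1` to `a5`) are hypothetical.

```python
from typing import Dict


def run_detection(reachable: Dict[str, bool], origin: str) -> Dict[str, object]:
    """Simulate the embodiment for one data node j.

    reachable maps each application node to whether it can connect to j;
    origin is application node i, which noticed the problem first.
    """
    app_nodes = list(reachable)
    arbiter = app_nodes[0]                  # step 1: any application node may arbitrate
    decision_value = 0                      # step 1: decision value starts at 0
    threshold = len(app_nodes) / 2          # step 1: threshold is N/2

    masked = {origin}                       # step 2: node i masks its connection to j
    reports = {origin: False}               # assumption: i's failure counts as a report
    for node in app_nodes:                  # steps 3-4: the others probe j and report
        if node != origin:
            reports[node] = reachable[node]

    for ok in reports.values():             # step 5: add 1 per "cannot connect"
        if not ok:
            decision_value += 1

    failures = decision_value
    failed = decision_value >= threshold    # step 6: compare with N/2
    if failed:
        action = "delete j and enable backup j'"    # steps 7-8
    else:
        decision_value = 0                           # step 9: reset the decision value
        action = "node i retries j periodically"     # step 10
    return {"arbiter": arbiter, "failures": failures, "failed": failed,
            "masked_by": sorted(masked), "action": action}


flaky_link = {"a1": False, "a2": True, "a3": True, "a4": True, "a5": True}
real_outage = {"a1": False, "a2": False, "a3": False, "a4": True, "a5": True}
print(run_detection(flaky_link, origin="a1")["action"])   # 1 of 5 reports, below 2.5
print(run_detection(real_outage, origin="a1")["action"])  # 3 of 5 reports, reaches 2.5
```

With five application nodes the threshold is 2.5, so a fault seen only from node a1's IP leaves data node j in service, while a fault seen from three different IPs triggers the replacement.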
Adopt the method for specified data node failure of the present invention, when a certain application node does not connect after certain back end, send connection request to determine whether connecting this back end by multiple application nodes to this back end, and then determine whether this back end lost efficacy, because each application node belongs to different IP, and then the impact due to network fluctuation, this single IP being caused can avoid in prior art sending synchronization request by same IP to back end time.The present invention judges that than prior art back end is the temporary inefficacy causing because of network more accurately, or the permanent failure that causes of hardware reason.
The foregoing is only preferred embodiment of the present invention, in order to limit the present invention, within the spirit and principles in the present invention not all, any amendment of making, be equal to replacement, improvement etc., within all should being included in the scope of protection of the invention.

Claims (8)

1. A method for determining data node failure, for use in a distributed database, the method comprising:
Among all application nodes that access the distributed database, when any application node cannot connect to a data node in the distributed database, sending to the other application nodes a broadcast that this data node cannot be connected;
After receiving the broadcast, the other application nodes each send a connection request to this data node to determine whether it can be connected;
When the number of application nodes that cannot connect to this data node reaches a set threshold, determining that this data node has failed.
2. The method for determining data node failure according to claim 1, characterized in that:
Among all application nodes that access the distributed database, any one application node is selected as an arbitration node to count the application nodes that cannot connect to this data node.
3. The method for determining data node failure according to claim 2, characterized in that:
A decision value is set in the arbitration node and initialized to 0;
After the other application nodes have sent a connection request to this data node, each of them sends the arbitration node information on whether it can connect to this data node;
The arbitration node receives the information on whether this data node can be connected from all application nodes, and each time it receives a message from an application node that this data node cannot be connected, it increments the decision value by 1;
After the arbitration node has received the information on whether this data node can be connected from all application nodes:
If the decision value reaches the set threshold, this data node is determined to have failed;
If the decision value does not reach the set threshold, this data node is determined to be valid.
4. The method for determining data node failure according to claim 1, characterized in that the threshold is half the number of all application nodes that access the distributed database.
5. The method for determining data node failure according to claim 1, characterized in that, after this data node is determined to have failed, the method also comprises:
Deleting this data node from the distributed database;
Enabling the backup node of this data node.
6. The method for determining data node failure according to claim 3, characterized in that, after this data node is determined to be valid, the method also comprises:
Restoring the decision value to its initial value 0;
The application node that could not connect to this data node periodically sends connection requests to it, waiting for the connection to recover.
7. The method for determining data node failure according to claim 1, characterized in that, when any application node cannot connect to a data node in the distributed database, that application node's connection to this data node is masked.
8. The method for determining data node failure according to claim 1, characterized in that each application node belongs to a different IP.
CN201410254980.8A 2014-06-10 2014-06-10 Method for determining data node failure Active CN103995901B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410254980.8A CN103995901B (en) 2014-06-10 2014-06-10 Method for determining data node failure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410254980.8A CN103995901B (en) 2014-06-10 2014-06-10 Method for determining data node failure

Publications (2)

Publication Number Publication Date
CN103995901A (en) 2014-08-20
CN103995901B (en) 2018-01-12

Family

ID=51310066

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410254980.8A Active CN103995901B (en) 2014-06-10 2014-06-10 Method for determining data node failure

Country Status (1)

Country Link
CN (1) CN103995901B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120101987A1 (en) * 2010-10-25 2012-04-26 Paul Allen Bottorff Distributed database synchronization
CN102231681A (en) * 2011-06-27 2011-11-02 中国建设银行股份有限公司 High availability cluster computer system and fault treatment method thereof
US20130246608A1 (en) * 2012-03-15 2013-09-19 Microsoft Corporation Count tracking in distributed environments
US20130297976A1 (en) * 2012-05-04 2013-11-07 Paraccel, Inc. Network Fault Detection and Reconfiguration
CN102882792A (en) * 2012-06-20 2013-01-16 杜小勇 Method for simplifying internet propagation path diagram

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105306545A (en) * 2015-09-28 2016-02-03 浪潮(北京)电子信息产业有限公司 Failover method and system for external service node of cluster
CN105306545B (en) * 2015-09-28 2018-09-07 浪潮(北京)电子信息产业有限公司 A kind of method and system of the external service node Takeover of cluster
CN105975212A (en) * 2016-04-29 2016-09-28 深圳市永兴元科技有限公司 Failure detection processing method and device for distributed data system
CN108616566A (en) * 2018-03-14 2018-10-02 华为技术有限公司 Raft distributed systems select main method, relevant device and system
CN112860799A (en) * 2021-02-22 2021-05-28 浪潮云信息技术股份公司 Management method for data synchronization of distributed database
CN113783735A (en) * 2021-09-24 2021-12-10 小红书科技有限公司 Method, device, equipment and medium for identifying fault node in Redis cluster

Also Published As

Publication number Publication date
CN103995901B (en) 2018-01-12

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant