CN103995901A - Method for determining data node failure

Info

Publication number: CN103995901A
Application number: CN201410254980.8A
Authority: CN
Inventors: 赵晓平; 唐超; 马丽伟; 秦波; 王�锋
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2014-06-10
Filing date: 2014-06-10
Publication date: 2014-08-20
Anticipated expiration: 2034-06-10
Also published as: CN103995901B

Abstract

The invention discloses a method for determining data node failure in a distributed database. Among all application nodes that access the distributed database, when any application node cannot connect to a data node, it broadcasts to the other application nodes that the data node cannot be connected. After receiving the broadcast, each of the other application nodes sends a connection request to the data node to determine whether it can connect. When the number of application nodes that cannot connect to the data node reaches a set threshold, the data node is determined to have failed. Because the application nodes have different IP addresses, the method avoids the effect that network fluctuation on a single IP has when synchronization requests are sent to the data node from that one IP, so the cause of a data node failure can be judged more accurately.

Description

A method for determining data node failure

Technical field

The present invention relates to the field of distributed databases, and in particular to a method for determining data node failure.

Background technology

With the development of network technology, the requirements for data storage and access keep rising, and distributed databases have emerged in response. The high scalability and high availability of distributed databases solve a difficult problem for many websites that must run without interruption.

A distributed database is composed of sub-databases distributed over multiple computer nodes. Each sub-database on a computer node is called a data node; the data nodes are logically related and are peers of equal status. To ensure that the whole distributed database runs normally, the running state of each data node must be known promptly, so as to determine whether it can provide service normally, that is, whether the data node is valid. Network fluctuation, hardware faults, and other causes can all make a data node fail: for example, network fluctuation can cause a temporary failure, whereas a hardware fault causes a permanent failure. An effective means is therefore needed to determine whether a data node has currently failed.

Cassandra is an open-source distributed NoSQL database system. Because of its good scalability, Cassandra has been adopted by many well-known websites and has become a popular distributed structured data storage solution. In Cassandra, node failure is judged by suspicion-level-based detection (Accrual Failure Detection). The basic idea is that, in a distributed environment, a value representing the suspicion level of failure is used to judge whether a data node has failed. Within a certain time window, synchronization requests are sent to the data node continuously; each time the data node fails to respond to a synchronization message, its failure suspicion value is incremented by 1, and when that value reaches a set threshold, the data node is determined to have failed permanently.

Because the suspicion-level-based detection described above sends synchronization requests to the data node from a single IP, it cannot adequately guard against the effect of network fluctuation on those requests. During a period of network fluctuation, synchronization requests and/or the data node's responses to them may be lost, so the failure suspicion value of the data node may rise sharply within that period and may even reach the set threshold, causing the data node to be judged permanently failed. In fact, once the period has passed, the data node may still be available and not truly permanently failed. The existing suspicion-level-based detection may therefore misjudge a data node as failed.
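
The misjudgment described above can be illustrated with a minimal sketch. It models only the simplified counter described in this background section (one increment per missed response) and is not Cassandra's actual implementation; the function name `judged_failed` and the sample window are hypothetical.

```python
def judged_failed(responses, threshold):
    """Return True if missed synchronization responses reach the threshold.

    responses: booleans for one time window, True when the data node
    answered a synchronization request, False when no answer arrived.
    (Hypothetical model of the counter described in the text.)
    """
    suspicion = 0
    for answered in responses:
        if not answered:
            suspicion += 1  # each missed response raises the suspicion value by 1
        if suspicion >= threshold:
            return True
    return False


# A short burst of loss on the single probing IP: the data node answers
# before and after the burst, yet the counter still reaches the threshold.
window = [True, True, False, False, False, True, True]
print(judged_failed(window, threshold=3))  # True
```

The node in this window is reachable again after the burst, which is exactly the case the text calls a misjudgment.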

Summary of the invention

In view of this, the invention provides a method for determining data node failure, so as to judge accurately whether a data node has failed temporarily because of the network or permanently because of hardware.

The technical solution of the application is realized as follows:

A method for determining data node failure, for use with a distributed database, the method comprising:

Among all application nodes that access the distributed database, when any application node cannot connect to a data node in the distributed database, sending to the other application nodes a broadcast indicating that this data node cannot be connected;

After receiving the broadcast, the other application nodes send connection requests to this data node to determine whether they can connect to it;

When the number of application nodes that cannot connect to this data node reaches a set threshold, determining that this data node has failed.

Further, among all application nodes that access the distributed database, any one application node is selected as an arbitration node to count the application nodes that cannot connect to this data node.

Further:

A decision value is set in the arbitration node and initialized to 0;

After sending the connection request to this data node, each of the other application nodes sends the arbitration node information on whether it can connect to this data node;

The arbitration node receives this information from all application nodes, and each time it receives a message from an application node that cannot connect to this data node, it increments the decision value by 1;

After the arbitration node has received the information on whether this data node can be connected from all application nodes:

If the decision value reaches the set threshold, this data node is determined to have failed;

If the decision value does not reach the set threshold, this data node is determined to be valid.
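
The arbitration node's bookkeeping can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `ArbitrationNode`, its method names, and the string verdicts are hypothetical, and message transport between nodes is left out entirely.

```python
class ArbitrationNode:
    """Counts application nodes that cannot connect to one data node."""

    def __init__(self, total_app_nodes, threshold):
        self.total_app_nodes = total_app_nodes
        self.threshold = threshold
        self.decision_value = 0      # initialized to 0, as in the text
        self.reports_received = 0

    def report(self, can_connect):
        """Record one application node's connection result."""
        self.reports_received += 1
        if not can_connect:
            self.decision_value += 1  # add 1 per "cannot connect" message

    def verdict(self):
        """Return None until every report is in, then 'failed' or 'valid'."""
        if self.reports_received < self.total_app_nodes:
            return None
        if self.decision_value >= self.threshold:
            return "failed"
        return "valid"


arbiter = ArbitrationNode(total_app_nodes=5, threshold=2.5)
for result in [False, True, False, False, True]:
    arbiter.report(result)
print(arbiter.verdict())  # failed (3 of 5 nodes cannot connect)
```

Deferring the verdict until all reports have arrived follows the wording of the text; a real deployment would presumably also need a timeout for application nodes that never report, which the source does not discuss.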

Further, the threshold is half the number of all application nodes that access the distributed database.

Further, after this data node is determined to have failed, the method also comprises:

Deleting this data node from the distributed database;

Enabling the backup node of this data node.

Further, after this data node is determined to be valid, the method also comprises:

Resetting the decision value to its initial value 0;

The application node that cannot connect to this data node periodically sends connection requests to it, to wait for the connection to this data node to recover.
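
The periodic reconnection can be sketched as a simple retry loop. The function name `wait_for_recovery`, the `try_connect` callable, and the `max_attempts` bound are assumptions made so that the example terminates; the source does not state a retry interval or an upper limit.

```python
import time


def wait_for_recovery(try_connect, interval_s, max_attempts, sleep=time.sleep):
    """Retry the connection at a fixed interval until it succeeds.

    try_connect: hypothetical callable returning True once the data node
    accepts a connection again.
    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        if try_connect():
            return attempt
        sleep(interval_s)  # wait before sending the next connection request
    return None


# Simulated data node that becomes reachable on the third request.
outcomes = iter([False, False, True])
print(wait_for_recovery(lambda: next(outcomes), interval_s=0.01, max_attempts=5))  # 3
```

Passing `sleep` as a parameter keeps the loop testable without real delays.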

Further, when any application node cannot connect to a data node in the distributed database, the connection of this application node to this data node is masked.
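
One way to picture masking is a per-node set of data nodes that are skipped without any network attempt. The sketch below is an assumption about how masking might be kept, not a description of the patented system; `AppNode`, `connect`, `unmask`, and the `try_connect` callable are hypothetical names.

```python
class AppNode:
    """Application node that masks data nodes it failed to connect to."""

    def __init__(self, name):
        self.name = name
        self.masked = set()  # data nodes this application node will not dial

    def connect(self, data_node, try_connect):
        """Attempt a connection unless the data node is masked."""
        if data_node in self.masked:
            return False  # masked: no network attempt is made
        if try_connect(data_node):
            return True
        self.masked.add(data_node)  # first failure masks the data node
        return False

    def unmask(self, data_node):
        """Lift the mask, for example once the data node is reachable again."""
        self.masked.discard(data_node)


attempts = []


def always_down(data_node):
    attempts.append(data_node)
    return False


node = AppNode("app-i")
node.connect("data-j", always_down)
node.connect("data-j", always_down)  # skipped because data-j is masked
print(len(attempts))  # 1
```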

Further, each application node has a different IP.

As can be seen from the above scheme, in the method of the invention, after an application node fails to connect to a data node, multiple application nodes send connection requests to this data node to determine whether it can be connected, and hence whether it has failed. Because each application node has a different IP, the method avoids the effect of network fluctuation on a single IP that arises in the prior art when synchronization requests are sent to the data node from one IP. Compared with the prior art, the invention judges more accurately whether a data node has failed temporarily because of the network or permanently because of hardware.

Brief description of the drawings

Fig. 1 is a flowchart of the method for determining data node failure according to the invention;

Fig. 2 is a flowchart of an embodiment of the invention.

Embodiment

To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings and an embodiment.

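The method for determining data node failure according to the invention is for use with a distributed database. As shown in Fig. 1, the method comprises:

Among all application nodes that access the distributed database, when any application node cannot connect to a data node in the distributed database, sending to the other application nodes a broadcast indicating that this data node cannot be connected;

After receiving the broadcast, the other application nodes send connection requests to this data node to determine whether they can connect to it;

When the number of application nodes that cannot connect to this data node reaches a set threshold, determining that this data node has failed.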

The number of application nodes that cannot connect to this data node is counted in an arbitration node. The arbitration node is selected as follows: among all application nodes that access the distributed database, any one application node is chosen as the arbitration node.

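The arbitration node counts the application nodes that cannot connect to this data node as follows:

A decision value is set in the arbitration node and initialized to 0;

After sending the connection request to this data node, each of the other application nodes sends the arbitration node information on whether it can connect to this data node;

Each time the arbitration node receives a message from an application node that cannot connect to this data node, it increments the decision value by 1;

After the arbitration node has received this information from all application nodes, this data node is determined to have failed if the decision value reaches the set threshold, and to be valid otherwise.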

Unlike the prior art, in the method of the invention, after an application node fails to connect to a data node, multiple application nodes send connection requests to this data node to determine whether it can be connected, and hence whether it has failed. Each application node has a different IP, so the method avoids the effect of network fluctuation on a single IP that arises in the prior art when synchronization requests are sent to the data node from one IP. It therefore judges more accurately than the prior art whether a data node has failed temporarily because of the network or permanently because of hardware.

In the above method of the invention, after this data node is determined to have failed, the method also comprises:

Deleting this data node from the distributed database;

Enabling the backup node of this data node.

The failed data node is thereby replaced.
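
The replacement of a failed data node can be sketched with plain collections. This is an illustrative model only: the function name `replace_failed_node`, the set of active nodes, and the backup mapping are hypothetical, and a real distributed database would also have to move or re-replicate data, which the source does not describe.

```python
def replace_failed_node(active_nodes, backups, failed):
    """Remove a failed data node and enable its backup, if one exists.

    active_nodes: set of data node names currently in the database.
    backups: dict mapping a data node name to its backup node name.
    Returns the name of the backup that was enabled, or None.
    """
    active_nodes.discard(failed)       # delete the failed data node
    backup = backups.get(failed)
    if backup is not None:
        active_nodes.add(backup)       # enable its backup node
    return backup


active = {"data-j", "data-k"}
enabled = replace_failed_node(active, {"data-j": "data-j-backup"}, "data-j")
print(enabled, sorted(active))  # data-j-backup ['data-j-backup', 'data-k']
```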

After this data node is determined to be valid, the method of the invention also comprises:

Resetting the decision value to its initial value 0;

In real network applications, the number of application nodes that access a distributed database is very large, the IP address of each application node is different, and the distributed database contains a large number of data nodes. The method of the invention is described below with a specific embodiment. In this embodiment, assume that N application nodes (N>1) access the distributed database, that the distributed database contains M data nodes (M>1), and that application node i (1≤i≤N) among the N application nodes cannot connect to data node j in the distributed database (data node j being any one of the M data nodes). As shown in Fig. 2, this embodiment comprises the following steps:

Step 1: select any one of the N application nodes as the arbitration node, set a decision value in the arbitration node and initialize it to 0, set a threshold of N/2, then go to step 2.

Step 2: when application node i cannot connect to data node j in the distributed database, it sends the other application nodes a broadcast indicating that data node j cannot be connected, then go to step 3.

When any application node cannot connect to a data node in the distributed database, the method may further comprise masking the connection of this application node to this data node. For example, in step 2, when application node i cannot connect to data node j, application node i masks its connection to data node j, which avoids the network resource overhead of application node i repeatedly initiating connections to data node j that cannot succeed.

Step 3: after receiving the broadcast that data node j cannot be connected, the other application nodes send connection requests to data node j, then go to step 4.

Step 4: each of the other application nodes sends the arbitration node information on whether it can connect to data node j, then go to step 5.

Step 5: the arbitration node receives the information on whether data node j can be connected from all application nodes, and each time it receives a message from an application node that cannot connect to data node j, it increments the decision value by 1, then go to step 6.

Step 6: the arbitration node judges whether the accumulated decision value reaches the set threshold N/2. If it does, data node j is determined to have failed; go to step 7. If it does not, data node j is determined to be valid; go to step 9.

Step 7: delete data node j from the distributed database, then go to step 8.

Step 8: enable the backup node j' of data node j to replace data node j.

Step 9: the arbitration node resets the decision value to its initial value 0 and notifies application node i that data node j is valid, then go to step 10.

Step 10: after receiving the message from the arbitration node that data node j is valid, application node i periodically sends connection requests to data node j, to wait for the connection to data node j to recover.
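
Steps 1 to 6 above can be traced with a small simulation. It is a sketch under stated assumptions: the function name `run_detection` is hypothetical, and the dictionary of connection results stands in for the real broadcast, connection requests, and reports, none of which are modeled.

```python
def run_detection(can_connect, i):
    """Trace steps 1 to 6 for one data node j.

    can_connect: dict mapping each application node name to True if that
    node can reach data node j (hypothetical stand-in for real probes).
    i: name of the application node that first failed to connect.
    Returns (decision_value, failed).
    """
    n = len(can_connect)
    threshold = n / 2          # step 1: threshold is N/2
    decision_value = 0         # step 1: decision value starts at 0
    if can_connect[i]:
        raise ValueError("step 2 assumes node i cannot connect to data node j")
    # Steps 2 to 5: node i broadcasts, the others probe data node j, and the
    # arbitration node adds 1 for every "cannot connect" result it receives.
    for reachable in can_connect.values():
        if not reachable:
            decision_value += 1
    failed = decision_value >= threshold  # step 6: compare against N/2
    return decision_value, failed


# Only node "a1" is affected, e.g. by fluctuation on its own network path.
print(run_detection({"a1": False, "a2": True, "a3": True, "a4": True}, "a1"))   # (1, False)
# Most nodes cannot connect, so data node j is judged failed.
print(run_detection({"a1": False, "a2": False, "a3": True, "a4": False}, "a1"))  # (3, True)
```

With an even N, a decision value of exactly N/2 counts as reaching the threshold in this sketch, following the wording "reaches" in step 6.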

With the method for determining data node failure according to the invention, after an application node fails to connect to a data node, multiple application nodes send connection requests to this data node to determine whether it can be connected, and hence whether it has failed. Because each application node has a different IP, the method avoids the effect of network fluctuation on a single IP that arises in the prior art when synchronization requests are sent to the data node from one IP. Compared with the prior art, the invention judges more accurately whether a data node has failed temporarily because of the network or permanently because of hardware.

The foregoing is only a preferred embodiment of the invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A method for determining data node failure, for use with a distributed database, the method comprising:

After receiving the broadcast, the other application nodes all send connection requests to this data node to determine whether they can connect to it;

2. The method for determining data node failure according to claim 1, characterized in that:

Among all application nodes that access the distributed database, any one application node is selected as an arbitration node to count the application nodes that cannot connect to this data node.

3. The method for determining data node failure according to claim 2, characterized in that:

4. The method for determining data node failure according to claim 1, characterized in that the threshold is half the number of all application nodes that access the distributed database.

5. The method for determining data node failure according to claim 1, characterized in that, after this data node is determined to have failed, the method also comprises:

Deleting this data node from the distributed database;

Enabling the backup node of this data node.

6. The method for determining data node failure according to claim 3, characterized in that, after this data node is determined to be valid, the method also comprises:

Resetting the decision value to its initial value 0;

7. The method for determining data node failure according to claim 1, characterized in that, when any application node cannot connect to a data node in the distributed database, the connection of this application node to this data node is masked.

8. The method for determining data node failure according to claim 1, characterized in that each application node has a different IP.