Cognitive interference decision method for air defense missile fuze based on reinforcement learning

2024-03-20
Defence Technology, 2024, Issue 2

Dingkun Huang, Xiaopeng Yan, Jian Dai, Xinwei Wang, Yangtian Liu

Science and Technology on Electromechanical Dynamic Control Laboratory, School of Mechatronical Engineering, Beijing Institute of Technology, Beijing 100081, China

Keywords: Cognitive radio; Interference decision; Radio fuze; Reinforcement learning; Interference strategy optimization

ABSTRACT To solve the problem of the low interference success rate against air defense missile radio fuzes caused by the single, uniform interference form of traditional fuze interference systems, an interference decision method based on the Q-learning algorithm is proposed. First, the distance between the missile and the target is divided into multiple states to increase the number of states in the state space. Second, a multidimensional action space whose search range changes with the missile-target distance is used to select parameters and to reduce the number of ineffective interference parameters. The interference effect is determined by detecting whether the fuze signal disappears. Finally, a weighted reward function is used to determine the reward value based on the distance state, output power, and number of parameters of the interference form. The effectiveness of the proposed method in selecting the parameter range of the action space and in designing the discrimination of the reward function has been verified through offline experiments covering the full range of the missile-target rendezvous. The optimal interference form for each distance state has been obtained. Compared with the single-interference decision method, the proposed decision method can effectively improve the interference success rate.

1. Introduction

The radio fuze used for air defense missiles has variable modulation parameters, complex systems, and a strong anti-interference ability. A traditional fuze interference system can only implement a fixed interference strategy and cannot adapt automatically to the environment. Its interference efficiency against a radio fuze with a variable system and parameters is therefore low, and it may miss the interference time window. It can also easily produce ineffective interference. The interference system must therefore choose the right interference strategy to improve the success rate of fuze interference.

Fuze interference decision-making is a process in which the interference system selects the interference form according to the working scene and fuze system [1,2]. The commonly used interference types for radio fuzes include suppression, deception, guidance, and forwarding. Different interference forms have their own advantages and characteristics, and the interference efficiency against a fuze also changes with the working scene, missile-target distance, interference parameters, fuze system, and other factors. If multiple interference forms are reasonably combined during the missile-target rendezvous, their advantages can be used to the maximum extent, and interference efficiency improves relative to a single interference form. To find efficient interference forms for the interference system under different conditions and thereby optimize the interference decision, a reinforcement learning algorithm can be used to carry out multiple interactive experiments.

Reinforcement learning has been widely used in automatic control [3,4], radar interference decision-making [5,6], and other fields. It obtains strategies through repeated interaction between the system behavior and the environment and can be applied online or offline. In fuze interference, the reinforcement learning model is tested many times through real-time parameter search, interference release, feedback, and reward function evaluation. Interference forms and parameters with high success rates at different distances are then screened out. This conforms to the workflow of the cognitive interference system and requires no data sets to be produced, which facilitates model updates.

In the application of reinforcement learning, Jiang's evaluation model introduced the concepts of target threat degree and missile interception effectiveness, fully considered both offensive and defensive factors, and adopted an intelligent allocation strategy based on reinforcement learning [7]. Guo et al. successfully applied a reinforcement learning algorithm to radar interference decision-making by referring to the working environment and the system's action space [8-10]. George, T. et al. expounded on the application of deep Q-learning (DQN) and an improved Q-learning algorithm in cognitive radar [11-15]. According to a literature search and technical analysis, applying reinforcement learning to fuze interference requires solving problems such as the small number of states in the state space, the long time required to search the action space parameters, and the design of the reward function for interference forms.

To apply reinforcement learning to fuze countermeasures so that the interference system makes the optimal decision, this paper adopts the Q-learning algorithm from reinforcement learning, takes the segmented missile-target distance as the state space, and randomly selects interference forms and parameters in each state through a multidimensional action space whose search range changes with the state. The number of invalid interference parameter combinations is thereby reduced. A weighted reward function is designed to comprehensively evaluate the performance of the interference form according to intuitive factors such as whether the fuze is started, the missile-target distance, the transmit power, and the number of interference parameters. A large number of effective interference forms can be obtained through several offline missile-target rendezvous experiments, and the statistics of the reward values yield effective interference forms at different distances. During actual decision-making, the interference form indicated by the offline results is applied to the incoming missile according to the missile-target distance. Compared with a single unified interference strategy, the interference success rate against the fuze can be significantly improved.

2. Cognitive interference process of the radio fuze and the reinforcement learning-based Q-learning algorithm

2.1. Cognitive interference process of the radio fuze

A radio fuze uses the echo of the electromagnetic wave it emits to extract target information, including the target azimuth, distance, and speed, so that it detonates the warhead at the best detonation point relative to the target. To make the fuze misjudge the starting distance and detonate early, an interference signal must pass through the signal processing stages of mixing, filtering, and detection and satisfy the fuze's starting conditions, such as the initiation signal threshold. Because fuze systems and specific design parameters differ, the signal processing steps and initiation conditions vary widely [16-18].

Fig. 1 shows the cognitive interference process of radio fuzes [19,20]. A single workflow of the cognitive interference system includes several steps: signal reconnaissance, interference decision-making, and interference effect evaluation. Before releasing the interference signal, the system must detect the fuze system and signal parameters and adopt the appropriate interference form according to this information. Real-time detection then reveals whether the fuze has started. If the interference is successful, the fuze is detonated in advance, and the system stores the interference information in the database so that it can be invoked directly in subsequent interference. If the interference fails, the system proceeds to the next interference process.

Due to the high speed of an air defense missile, the interference system must react quickly to the incoming fuze. However, in the actual working environment, it is impossible to ensure that every interference operation successfully starts the fuze. The smaller the missile-target distance is, the less reaction time and interference operation space are left for the interference system. Therefore, to ensure the safety of the equipment, the optimization objective should be to select interference forms with lower transmit power and fewer parameters [21-23]. The interference forms and parameters selected for the interference system are shown in Table 1.

When the interference system executes an instruction, it needs to synthesize a specific interference form according to the parameters in Table 1. The forms and parameters of the interference system compose the action space of the reinforcement learning algorithm. The more parameters there are, the higher the dimension of the action space and the longer the search phase; however, an interference form with a high-dimensional action space can accurately match the working state of the fuze and make fine adjustments according to the environment so that more interference energy reaches the inside of the fuze. The lower the dimension of the action space is, the faster the instruction is delivered. As a commonly used suppressive interference method, noise amplitude modulation occupies fewer hardware resources, but it cannot accurately match the working environment and fuze type [24]. It can therefore be applied at the end of the missile-target rendezvous, where an interference form must be released quickly, which makes better use of the advantages of suppressive interference.

2.2. Q-learning algorithm for fuze interference scenarios

The Q-learning algorithm is a decision-making algorithm based on the Markov decision process. It interacts with the environment by randomly selecting actions from the action space, which in turn changes the state of the environment. After many interactions, the algorithm learns the relationship between actions and states and can then choose the appropriate action for its goal.

Fig. 1. Cognitive interference process of the radio fuze.

Table 1 Interference form parameters selected in this paper.

Fig. 2. The action-state matrix of radio fuze interference processes.

The Markov process designed for the Q-learning algorithm needs the number of states N of the environment, which form the state space $S = \{S_1, S_2, \ldots, S_N\}$. The M behaviors of the system form the action space $A = \{A_1, A_2, \ldots, A_M\}$, and an action-state matrix, also known as the Q-table, is established based on A and S, as shown in Fig. 2.

The matrix in Fig. 2 comprises the system's optional interference behaviors and the fuze states. Taking $Q_{12}$ as an example, the value of Q in the table is the expectation of the maximum future reward obtained by taking action 2 in the current state 1. At the beginning of training, actions are selected randomly, and the immediate reward obtained by the action is calculated through the reward function, which is adjusted according to the design purpose of the model. Finally, the maximum reward that can be obtained after executing the current action is estimated by the action-value function, and the Q-table is updated. The action-value function is as follows:
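For reference, the textbook Q-learning update, written with the symbols that the following paragraph defines for Eq. (1), takes the form below. It is the standard rule under those symbol definitions and is given as an assumption about the form of Eq. (1), not as the authors' typeset equation.

$$
Q(s,a) \leftarrow Q(s,a) + \alpha\left[R(s,a) + \gamma \max_{a'} Q'(s',a') - Q(s,a)\right] \tag{1}
$$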

In Eq. (1), $Q(s,a)$ is the value of the last action, $R(s,a)$ is the immediate reward of the last action, $\gamma \max Q'(s',a')$ is the product of the discount factor and the expectation of the maximum reward obtainable from the current action, and $\alpha$ is the learning rate.

After many interactions between actions and the environment, the action-value function fills in the Q-table and updates it continuously until it converges. After training, the model selects the maximum value in the Q-table and acts according to the maximum Q value to follow the path to the target. Therefore, the Q-learning algorithm is often used in path search, strategy search, and other fields.
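The update and the greedy selection described above can be sketched for a small Q-table as follows. This is a minimal illustration of generic tabular Q-learning, not the decision model of this paper; the toy table size, the reward of 1.0, and the values of `alpha` and `gamma` are hypothetical.

```python
def q_update(q_table, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update to q_table[s][a] in place."""
    best_next = max(q_table[s_next])           # max over a' of Q(s', a')
    td_target = reward + gamma * best_next     # R(s, a) + gamma * max Q(s', a')
    q_table[s][a] += alpha * (td_target - q_table[s][a])
    return q_table[s][a]


def greedy_action(q_table, s):
    """After training, pick the action with the largest Q value in state s."""
    row = q_table[s]
    return row.index(max(row))


# Hypothetical toy problem: N = 3 states, M = 2 actions, Q-table starts at zero.
q = [[0.0, 0.0] for _ in range(3)]
q_update(q, s=0, a=1, reward=1.0, s_next=1)    # Q[0][1] moves from 0 towards the reward
```

Because every update of `Q[s][a]` depends on the best value in the following state, the rule presupposes that actions form a sequence.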

The core step of the Q-learning algorithm is updating the action-value function, that is, calculating the Q value. The Q value differs from the reward value: it is the total reward the model can expect to obtain by the time the target is reached after taking the current action. In other words, actions are executed step by step, the current action affects the subsequent action, and the reward obtained by the subsequent action depends on the previous action. By contrast, a radio fuze can only be activated or nonactivated, and these two states do not change with the order in which interference forms are released. The success of interference depends only on whether the single interference form synthesized from the parameters selected in the current state can directly start the fuze. Without an ordering of actions, an action-value function cannot be constructed and the Q-table cannot be updated. If the Q-learning algorithm were applied directly to fuze interference, its advantages could not be used effectively, and its function and workflow would not match the working environment of the fuze interference system.

To adapt to the fuze interference scenario, the decision model proposed in this paper retains the core steps of the Q-learning algorithm, such as parameter search in the action space, stepping through the state space, and searching for the best outcome with the reward function, while abandoning the action-value function. Fig. 3 shows the cognitive interference decision model for the air defense missile fuze based on reinforcement learning, which is mainly composed of the action space, state space, reward function, and feedback mechanism. The model takes as input the main parameters of the current fuze signal from the parameter estimation results, determines the current distance between the fuze and the interference system, and builds the state space from the missile-target distance. The action space then sets the parameter search range according to the state and randomly selects the interference form and the corresponding interference parameters. After the interference is released, the receiver determines whether the fuze signal disappears, and the reward function calculates the score of the current interference form from its distance state, transmit power, and number of parameters. After several offline experiments, the interference form with the highest score at each distance is obtained, and releasing the corresponding form according to the distance gives an interference strategy with a high success rate.
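The final selection step, keeping the highest-scoring interference form for each distance state after the offline experiments, might be sketched as below. The record layout `(state, form, reward)` and the choice to sum rewards over repeated experiments are assumptions made for illustration; the text only states that the form with the highest score at each distance is retained.

```python
from collections import defaultdict


def best_form_per_state(records):
    """Pick the best-scoring interference form for every distance state.

    records: iterable of (state, form, reward) tuples gathered offline
             (hypothetical layout).
    Returns {state: form with the highest accumulated reward in that state}.
    """
    totals = defaultdict(float)
    for state, form, reward in records:
        totals[(state, form)] += reward        # assumption: rewards are summed

    best = {}
    for (state, form), total in totals.items():
        if state not in best or total > totals[(state, best[state])]:
            best[state] = form
    return best


# Hypothetical reward values, for illustration only.
records = [(1, "DRFM", 1.9), (1, "FM", 1.2), (1, "FM", 1.0), (2, "AM", 0.8)]
strategy = best_form_per_state(records)
```

The resulting dictionary plays the role of a lookup table from distance state to interference form during actual decision-making.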

Fig. 3. Cognitive interference decision model for the air defense missile fuze based on reinforcement learning.

3. Q-learning algorithm-based anti-aircraft missile radio fuze interference decision model

3.1. State space design based on missile distance

The state space design based on the missile-target distance is the basis of the method proposed in this paper. It is assumed that the fuze starts working when it is 1000 m from the target and that the missile speed is Mach 4, so the full-range missile-target rendezvous lasts approximately 720 ms. As the missile approaches the target, the missile-target distance decreases continuously. In this paper, the missile-target distance is divided into 10 distance states, each covering 100 m with a time window of 72 ms, as shown in Fig. 4.
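The arithmetic behind these figures can be checked with a short sketch. The speed of sound used here (347.2 m/s) is an assumption, chosen so that Mach 4 over 1000 m gives roughly the stated 720 ms; the function names and the convention that each state covers a half-open interval such as (900, 1000] m are likewise illustrative.

```python
SPEED_OF_SOUND_M_S = 347.2   # assumption: makes Mach 4 over 1000 m take about 720 ms


def time_window_ms(state_width_m, mach):
    """Time the missile spends inside one distance state, in milliseconds."""
    return 1000.0 * state_width_m / (mach * SPEED_OF_SOUND_M_S)


def distance_state(distance_m, start_m=1000.0, state_width_m=100.0):
    """Map a missile-target distance to a state index: 1 is farthest, N is closest."""
    if not 0.0 < distance_m <= start_m:
        raise ValueError("distance outside the fuze working range")
    n_states = round(start_m / state_width_m)
    # (900, 1000] m -> state 1, ..., (0, 100] m -> state 10
    index = int((start_m - distance_m) // state_width_m) + 1
    return min(index, n_states)


window = time_window_ms(100.0, 4.0)   # about 72 ms per 100 m state
total = 10 * window                   # about 720 ms for the full 1000 m
```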

In Fig. 4, the interference system has to complete the closed-loop process of interference, feedback, and reward calculation in each distance state. Because there are many distance states, the 10 distance states are grouped into three stages for ease of description, marked as the green, yellow, and red areas in Fig. 4.

Green area: The safe distance is defined as a missile-target distance of 1000-600 m. Within the safe distance, the interference system has the largest space for selecting interference parameters. An interference form with more parameters and a higher-dimensional action space is preferred, and the interference parameters have wider ranges. The purpose is to explore the effectiveness of interference forms under many parameter combinations so that the model can start the fuze at long range with less output power.

Yellow area: The warning distance is defined as 600-200 m. Once the fuze enters the warning distance, the space for parameter selection shrinks rapidly so that the interference parameters fit the main frequency and other essential parameters of the fuze, and the interference power is increased. The purpose of the interference form synthesized within the warning distance is to start the fuze quickly and ensure the interference effect.

Fig. 4. Design diagram of the missile distance state space.

Red area: The dangerous distance is defined as the last 200 m. The interference system adopts high-power suppression interference or aiming interference with prefabricated parameters to quickly concentrate all interference resources and ensure a high success rate of fuze interference. As the last means of interference, suppression and prefabricated aiming interference require the fewest parameters and can be deployed quickly.
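The three stages can be expressed as a simple lookup on the missile-target distance. This is a minimal sketch; assigning the boundary values 600 m and 200 m to the closer stage is an assumption, since the text does not say which stage owns the boundary.

```python
def stage_of(distance_m):
    """Return 'green', 'yellow' or 'red' for a missile-target distance in metres."""
    if not 0.0 < distance_m <= 1000.0:
        raise ValueError("distance outside the 1000 m working range")
    if distance_m > 600.0:
        return "green"    # safe distance, 1000-600 m
    if distance_m > 200.0:
        return "yellow"   # warning distance, 600-200 m
    return "red"          # dangerous distance, last 200 m


stages = [stage_of(d) for d in (800.0, 400.0, 150.0)]
```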

As shown in Fig. 5, if interference succeeds in a given distance state, the missile-target rendezvous stops in that state; otherwise, it continues. The interference form does not decide which state is entered next; it only determines the state in which the rendezvous stops. The 10 distance states allow the interference form to be fine-tuned, and the three integrated stages control the search range and optimization direction of the parameters. This design integrates the working characteristics of the fuze interference system into the model and ensures fast convergence while the model combines various interference forms and optimizes parameters.

3.2. Action space design

The action space and state space are the core of the design of tabular reinforcement learning methods such as Q-learning. A general path search model has only four actions: up, down, left, and right. In the radio fuze interference system, each interference behavior requires selecting an interference form and then determining the specific parameters that the form requires. Therefore, the action space of the interference system is layered. The model first determines the interference form required by the current distance state, selects the parameters through the parameter space matrices of that form, and synthesizes the interference. To simplify the model and explore the performance of different interference forms against specific fuzes, the action space is mainly used to select and optimize the interference parameters.

This paper takes digital radio frequency memory (DRFM) interference as an example to illustrate the design of the action space. As seen from Table 1, 8 parameters need to be configured for DRFM interference, so 8 parameter matrices are constructed to form the multidimensional action space of DRFM interference.

As shown in Fig. 6, each row of a matrix corresponds to a distance state, represented by green, yellow, or red in the figure. When the missile-target rendezvous enters distance state 1, the model randomly selects a parameter from the first row of each parameter matrix, thus obtaining 8 parameters and synthesizing the DRFM interference form. Each row has 10 candidate parameters. If the model chooses DRFM as the interference form for the current distance state, there are $10^8$ different combinations of DRFM interference parameters. It is difficult to test so many combinations in a limited number of offline missile-target rendezvous experiments, and it is difficult to design a reward function that distinguishes the performance of many similar parameter combinations.
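The row-wise random selection and the size of the resulting search space can be illustrated as follows. The matrices hold placeholder column labels rather than real DRFM parameter values, and all names are hypothetical.

```python
import random

N_PARAMS = 8      # DRFM parameters, one matrix each
N_STATES = 10     # rows: one per distance state
N_OPTIONS = 10    # columns: candidate entries per parameter and state

# Placeholder matrices: entry [state][column] is just the column label.
matrices = [[list(range(N_OPTIONS)) for _ in range(N_STATES)]
            for _ in range(N_PARAMS)]


def select_drfm_parameters(state, rng=random):
    """Pick one candidate from row `state` (1-based) of each parameter matrix."""
    return [rng.choice(matrix[state - 1]) for matrix in matrices]


combinations_per_state = N_OPTIONS ** N_PARAMS          # 10**8 combinations
params = select_drfm_parameters(state=1, rng=random.Random(0))
```

With ten candidates in each of eight matrices, exhaustive testing of a single state would already require one hundred million trials, which motivates restricting the search range.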

Fig. 6. DRFM interference parameter matrix.

To reduce the number of interference parameters, this paper presents a search range configuration method based on the distance state, taking the first parameter of DRFM interference as an example. Fig. 7 shows a schematic diagram of the parameter search range, which changes dynamically with the distance state. When a fuze signal is detected, the model selects common interference parameters according to the main frequency of the fuze signal and places them in the middle position of each parameter matrix, marked as the blue area in the figure.

Method of setting central parameters:

Main frequency: The center frequency of the fuze carrier and modulation bandwidth is measured by the receiving system of the jammer and placed at the center position of the main-frequency submatrix of the interference matrix.

Doppler frequency: The Doppler frequency is calculated from the center frequency obtained from the receiver, the missile velocity measured by the radar, and certain prior knowledge.

Special interference parameters: This paper integrates various interference forms, such as AM, FM, DRFM, and sweep frequency, each requiring different parameters, such as modulation depth in AM, modulation bandwidth in FM, relay time in DRFM, and dwell time in sweep frequency. Except for the modulation bandwidth, these parameters cannot be adjusted in real time through the parameter estimation or signal recognition modules in the front end. To set the special interference parameters, this paper combines prefabricated interference waveform settings with prior knowledge, such as ensuring that the sweep bandwidth covers the working bandwidth of the fuze and setting the DRFM storage duration longer than the modulation period of the fuze.

Fig. 5. State transfer process of missile rendezvous.

Fig. 7. Schematic diagram of the dynamic policy search scope based on the distance state.

The selection of these special interference parameters is based on the common prefabricated interference waveform parameters of the jammer. For example, when synthesizing sweep interference, the jammer generally offers dwell time options of 1 ms, 2 ms, 4 ms, and 8 ms. Similarly, the storage duration of DRFM interference is generally offered as 10 ms, 20 ms, or 30 ms, and the storage depth is sufficient to exceed the working period of the fuze. Theoretical analysis and experimental verification show that the interference forms synthesized from these parameters can successfully interfere with the fuze.

The purpose of defining the central parameter is to give the action space a reference during random parameter selection so that each set of interference parameters does not deviate too far from the central parameter. The central parameters are chosen in combination with the commonly used parameter ranges of prefabricated interference waveforms. Adjusting dynamically around this reference achieves the maximum interference success rate in the actual working environment. If only the central parameter is selected, the fuze can be jammed successfully in some cases, but simulation experiments show that the interference success rate is lower.

The parameter search range is designed according to the stage. When the fuze is within the safe distance, the range of optional parameters is large, as indicated by the white box in the figure. When the fuze is within the warning distance, the model rapidly shrinks the range of parameter selection. When the fuze is within the dangerous distance, all action space parameters conform to the central parameters.

The search range is largest within the safe distance, which maximizes the model's ability to explore unknown interference parameter combinations. The degree of freedom within the warning distance is low, and the model can only select interference parameters close to the central parameters of the matrix. Because only the prefabricated suppression interference is synthesized within the dangerous distance, the parameter selection range coincides with the central parameter of the matrix.

The dynamic degree-of-freedom design proposed in this paper eliminates most invalid interference parameter combinations. Based on the purposes of the 3 stages, the model balances exploration efficiency and interference efficiency.
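A stage-dependent search range of this kind could be sketched as a window of matrix columns around the central parameter. The half-widths below are hypothetical, and the layout of 11 columns centered on column 6 is an assumption borrowed from the x-axis of Fig. 9; the text does not give the exact widths.

```python
# Hypothetical half-widths (in columns) around the central parameter.
HALF_WIDTH = {"green": None, "yellow": 1, "red": 0}   # None means the whole row


def candidate_columns(stage, n_columns=11, center=6):
    """Return the 1-based column indices the model may sample in a given stage."""
    half = HALF_WIDTH[stage]
    if half is None:
        return list(range(1, n_columns + 1))   # safe distance: full search range
    low = max(1, center - half)
    high = min(n_columns, center + half)
    return list(range(low, high + 1))          # narrower window near the center


green_cols = candidate_columns("green")
yellow_cols = candidate_columns("yellow")
red_cols = candidate_columns("red")
```

Under these assumed widths, the number of candidates per parameter falls from 11 in the safe distance to 3 in the warning distance and 1 in the dangerous distance.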

To increase the randomness of the parameters during synthesis, an offset range for the interference parameters is added to the design of the action space. The central parameter is placed in the middle position of each distance state as a specific value, and the remaining entries are offset ranges relative to the central parameter, as shown in Fig. 8.

To preserve real-time performance while widening the space of possible parameter matches, and to improve the randomness of the interference system during testing, the entries selected by the model are not specific values but offset ranges relative to the central parameter, and the output parameter is drawn randomly within the selected range.

For example, the offset ranges of 4 parameter matrices are extracted in Fig. 8. Assume that the main frequencies estimated by the front end of the algorithm are 10 GHz and 12 GHz. Since the estimation error cannot be calculated in real time and the fuze frequency may drift, the model randomly selects an offset range of -5 MHz or -10 MHz relative to 10 GHz or 12 GHz to control the output frequency. If -10 MHz is selected for the 10 GHz estimate, the frequency is drawn randomly between 9990 MHz and 10 GHz.
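The two-step draw in this example, first an offset range and then a frequency inside it, can be sketched as follows. A uniform distribution within the range is an assumption; the text only says the value is selected randomly. The function name is hypothetical.

```python
import random


def sample_frequency_hz(center_hz, offset_hz, rng=random):
    """Draw a frequency between the center and center + offset (offset may be negative)."""
    low, high = sorted((center_hz, center_hz + offset_hz))
    return rng.uniform(low, high)    # assumption: uniform within the offset range


rng = random.Random(1)
offset = rng.choice([-5e6, -10e6])            # offset ranges from the example
f = sample_frequency_hz(10e9, offset, rng)    # main frequency estimate of 10 GHz
g = sample_frequency_hz(10e9, -10e6, rng)     # always within 9990 MHz to 10 GHz
```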

Fig. 8. Schematic diagram of the parameter offset range based on prior knowledge.

Fig. 9. Offset fitting curve of the action space.

If specific parameter values were set, the randomness of the model would be lost and the scale of the action space matrix would increase dramatically. The design based on parameter offset ranges balances randomness and real-time performance. The range of parameter offset is not linear. In most cases, a small linear offset in the region closest to the central parameter is enough to meet the conditions for fuze detonation. In a few cases, however, the fuze parameters drift over a large range, or the interference form synthesized from common parameters has difficulty breaking through the anti-interference measures of the fuze. If the interference fails within the safe distance, the model abandons the nonlinear region and quickly forces the search range to converge to the linear region close to the prior knowledge. Combining the characteristics of the linear and nonlinear regions, a large number of numerical simulation experiments were carried out, and a fitted curve was used to construct the range of the random parameter offset in Fig. 9. After polynomial fitting, the offset equation of the action space is given as

In Eq. (2), $y_{de}$ is the offset, and x is the column index of the matrix at which the current action space entry is located.

Fig. 9 shows the fitted curve of the offset range of the action space. To improve the fitting accuracy, 4th-power terms are added to the equation. The central position of the X-axis (x = 6 in the figure) corresponds to the prior knowledge, and the fitted curve is symmetric about this center.

3.3. Design of the reward function

In practical applications, the airborne self-defense interference system obtains the range of the interference object from the airborne target indication radar and updates the distance state in real time. Once the distance information is obtained, the interference system outputs the interference form, and the feedback system repeatedly observes the fuze initiation state. If the fuze signal disappears, the system determines that the fuze has started; otherwise, the interference has failed. When the interference fails, the model enters the next state according to the current radar distance and interferes again. These steps are repeated until the dangerous distance is reached, at which point the jammer directly adopts the suppression interference form. Within a single state, the interference system continuously releases the same interference form.
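The closed loop described here can be sketched as a walk through the distance states. The callbacks `select_form` and `fuze_started` are hypothetical stand-ins for the decision model and the feedback receiver, and treating states 9 and 10 as the dangerous distance is an assumption that follows from the 100 m states and the 200 m red area.

```python
def run_rendezvous(select_form, fuze_started, n_states=10, red_states=(9, 10),
                   suppression_form="prefabricated suppression"):
    """Walk through the distance states until the fuze is started.

    select_form(state)        -> form chosen for that state (hypothetical callback)
    fuze_started(state, form) -> True if the fuze signal disappears (hypothetical)
    Returns (state, form) of the successful interference, or None if all states fail.
    """
    for state in range(1, n_states + 1):
        # Inside the dangerous distance the jammer falls back to suppression.
        form = suppression_form if state in red_states else select_form(state)
        if fuze_started(state, form):
            return state, form          # the rendezvous stops in this state
    return None


# Hypothetical fuze that is only started by the form released in state 3.
result = run_rendezvous(lambda s: f"form-{s}", lambda s, form: s == 3)
```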

The weighted reward function designed in this paper follows these principles. First, based on the interference characteristics of the radio fuze, the reward function calculates the reward for interference forms that meet the starting conditions. The largest weight is assigned to the interference distance, so that the form that succeeds at the greatest distance scores highest. The second weight favors interference forms with lower transmit power, and the form with the fewest parameters receives the highest parameter score. In this way, the interference forms with the greatest interference distance, the lowest transmit power, and the fewest parameters can be comprehensively screened.

During the missile-target rendezvous, the interference system obtains the number of parameters C, interference distance R, power value P, and starting value I of the 10 interference forms and calculates the reward value of each form. The optimization objective of the interference system is to start the fuze at the farthest distance with the minimum power and the fewest action space dimensions.

The interference distance corresponds to the distance state. The reward function is designed according to the three stages.

R is the score of the distance state in which the current interference form is output; it decreases in steps of 0.1 from 1 at the greatest distance to 0.1 at the smallest distance.

P is the power score of the current interference form. After a missile-target rendezvous experiment, the model ranks the output powers of all interference forms. The maximum power is assigned 0.1 points, and the score increases by 0.1 for each step from the largest power to the smallest, so the minimum power receives the highest score of 1 point.

C is the parameter-count score of the current interference form, that is, the score for its number of action space dimensions. DRFM has 8 parameters and scores 0.2 points, and the score increases by 0.1 points for each parameter fewer. Five interference forms are adopted in this paper; under this definition, the C values of AM, FM, and sweep interference are 0.5, and the C value of noise amplitude modulation interference is 0.8. Because a prefabricated interference form must also be synthesized by the interference system hardware and differs only in that its parameter values are fixed, its waveform complexity score is calculated in the same way as for the random interference forms proposed in this paper.

I is the starting score used to screen successful interference forms; if the fuze is started, the value is 2 points.

Eq. (3) is divided into the safe distance, warning distance, and dangerous distance. Across stages, reducing the weighting coefficient significantly lowers the reward value, which increases the differentiation of the reward values. Within the same stage, interference forms can also be compared in finer detail in terms of distance, power, and number of parameters through Eq. (3).
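The component scores R, P, C, and I follow directly from the rules above and can be sketched as below. The way they are combined is not recoverable from the text, so the weights `W_R`, `W_P`, `W_C`, the stage coefficients in `STAGE_COEFF`, and the multiplicative use of I are hypothetical placeholders chosen only to respect the stated priority of distance over power over parameter count; Eq. (3) may differ.

```python
def distance_score(state):
    """R: 1.0 in the farthest state (1), dropping by 0.1 per state to 0.1 in state 10."""
    return round(1.0 - 0.1 * (state - 1), 1)


def power_scores(powers):
    """P: rank the output powers; the largest gets 0.1, each smaller one 0.1 more."""
    order = sorted(range(len(powers)), key=lambda i: powers[i], reverse=True)
    scores = [0.0] * len(powers)
    for rank, i in enumerate(order):
        scores[i] = round(0.1 * (rank + 1), 1)
    return scores


def parameter_score(n_params):
    """C: 0.2 for the 8-parameter DRFM form, plus 0.1 for each parameter fewer."""
    return round(0.2 + 0.1 * (8 - n_params), 1)


# Hypothetical weights and stage coefficients; Eq. (3) may use different values.
W_R, W_P, W_C = 0.5, 0.3, 0.2
STAGE_COEFF = {"green": 1.0, "yellow": 0.5, "red": 0.25}


def reward(state, stage, p_score, n_params, started):
    """Weighted reward of one interference form; zero unless the fuze was started."""
    i_score = 2.0 if started else 0.0          # I: starting score
    weighted = (W_R * distance_score(state) + W_P * p_score
                + W_C * parameter_score(n_params))
    return i_score * STAGE_COEFF[stage] * weighted
```

With these placeholder values, a form that starts the fuze in state 1 always outscores the same form succeeding in a yellow or red state, which mirrors the intended differentiation between stages.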

3.4. Strategies for modifying the model to accommodate different projectile speeds and fuze startup distances

The following two situations are discussed:

(1) When the projectile speed is higher than Mach 4 and the startup distance of the fuze is less than 1000 m:

In this case,the interference distance becomes shorter,and the higher projectile velocity puts forward higher requirements for the real-time performance of the interference system.Although the total interference distance becomes shorter, the number of state spaces can be reduced, and the distance covered by a single state space can be increased.Taking the distance partition in Fig.10 as an example, if the projectile velocity is 6Mach and the covering distance is 150 m, the time window of a single state space is about 72 ms,which is consistent with the experimental parameters set in this paper.Real-time performance can be ensured on the premise of constant interference resources.However, only 4 interference forms can be generated in one missile rendezvous process, which may reduce the success rate of interference.More experiments of missile rendezvous are needed to obtain the optimal interference forms and parameters.

Fig.10.State space division when the startup distance of the fuze is 600 m.

In the action space, the range of parameter selection can be reduced and interference forms with a high hardware resource occupancy rate can be removed to further improve real-time performance.

(2) When the projectile speed is lower than Mach 4 and the startup distance of the fuze is greater than 1000 m:

Compared with the conditions set in this paper, the distance over which interference is possible becomes longer, and the lower projectile velocity gives the interference system more reaction time. In this case, the number of state spaces can be increased, and the distance covered by each state space can be reduced. Taking the state space in Fig.11 as an example, there are 16 state spaces, each covering 75 m. If the missile velocity is Mach 3.5, the time window of a single state space is 61 ms. Increasing the number of state spaces yields more interference forms in a single missile rendezvous experiment and a higher interference success rate in practical applications.

Provided that real-time performance is ensured, the range of parameter selection in the action space can be increased appropriately to further improve the optimization of the interference forms.
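The trade-off between the number of state spaces and the time window of each state can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the speed of sound of 340 m/s is an assumption, since the paper does not state the value it uses. The results therefore only approximate the 72 ms and 61 ms quoted above.

```python
from typing import List, Tuple


def state_window_ms(covered_distance_m: float, mach: float,
                    speed_of_sound_mps: float = 340.0) -> float:
    """Time the projectile spends inside one state space, in milliseconds.

    speed_of_sound_mps is an assumed value, not taken from the paper.
    """
    velocity_mps = mach * speed_of_sound_mps
    return 1000.0 * covered_distance_m / velocity_mps


def partition(startup_distance_m: int, covered_distance_m: int) -> List[Tuple[int, int]]:
    """Split the startup distance into equal state spaces, farthest state first."""
    n_states = startup_distance_m // covered_distance_m
    return [
        (startup_distance_m - k * covered_distance_m,
         startup_distance_m - (k + 1) * covered_distance_m)
        for k in range(n_states)
    ]


if __name__ == "__main__":
    # Case (1): 600 m startup distance, 150 m per state, Mach 6.
    print(len(partition(600, 150)), "states,", state_window_ms(150, 6.0), "ms each")
    # Case (2): 16 states of 75 m (1200 m in total), Mach 3.5.
    print(len(partition(1200, 75)), "states,", state_window_ms(75, 3.5), "ms each")
```

Keeping the per-state window roughly constant while changing the number of states is what allows the same interference resources to be reused at different projectile speeds.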

4.Simulation verification

4.1.Experimental environment settings

This paper uses MATLAB simulation software to simulate missile rendezvous, test the model performance, and verify the effectiveness of the interference strategy on a fuze. The fuze system is a frequency-modulation fuze, which is commonly used in air defense missiles because it is a radio fuze with strong anti-interference ability. The starting condition of the frequency-modulation fuze is relatively strict for all interference forms, so the effectiveness of each interference form can be determined clearly. To simulate the deviation of the working parameters of different fuze products in actual situations, the center frequency parameter of the fuze is offset by 10 MHz, and the modulation bandwidth parameter is offset by 2 MHz. The experimental environment parameters are set as shown in Table 2.

In this paper, the steps for simulation experiment verification are as follows:

(1) Conduct a simulation experiment with full-range missile rendezvous to verify the function of the model, and verify the effectiveness of the reward function by calculating and comparing the interference performance of each interference form on the fuze.

(2) Conduct 20 simulation experiments with full-range missile-target rendezvous to verify the correspondence between the parameter selection range of the action space and the missile-target distance.

(3) Conduct 1000 simulation experiments with full-range missile rendezvous. Based on the maximum value of the reward function designed in this paper, the purpose is to verify the differentiation of the reward function for a large number of interference forms.

(4) Based on the experimental results of step (3), obtain the optimal interference forms at different distances and release the corresponding interference forms as the distance steps down, yielding various interference strategies.

Table 2 Parameter settings of the simulation experiment of missile rendezvous.

Table 3 Single full-range interference form parameters.

4.2.Simulation experiment with single full-range missile rendezvous

A full-range missile rendezvous interference simulation experiment for a frequency-modulated fuze was carried out. The parameters of the model under different search ranges were selected as shown in Table 3. Within the safe distance, the offset of the modulation depth and transmitting waveform amplitude was large, leading to interference failure.

Fig.12 shows the starting condition of the frequency modulation fuze, where x1-x10 are the distance states, x1 is the 1000 m-900 m distance, and so on. The horizontal axis is the number of time sampling points corresponding to the different projectile distances. The fuze is started 5 times, which shows that the parameter configuration of the model is effective.

Fig.13 shows the range distribution of the frequency dimension of the 10 interference waveforms. The radio fuze has frequency errors arising from the frequency offset of the fuze itself and from the parameter estimation error. Taking sweep-frequency interference as an example, the frequency bandwidth should be set slightly larger than the frequency bandwidth of the fuze itself to offset the frequency error, but too large a bandwidth reduces the interference power at a single frequency point. If aiming interference is selected within the exploration distance, the energy is relatively concentrated, but it is difficult to align with the main frequency position, which is one of the reasons why AM aiming interference has difficulty starting the FM fuze in some cases. The DRFM interference signal is the same as the fuze signal, but the storage time and slicing time need to be configured reasonably. Employing the AM-modulated DRFM interference waveform within the warning distance can cause more energy to enter the fuze and improve the startup probability.

Fig.12.Initiation of the frequency modulation fuze.

The reward value is calculated according to Eq.(3). An interference form that does not start the fuze receives 0 points, and the results are shown in Table 4.

Fig.13.Distribution range of the frequency parameters.

Table 4 Interference reward assignment.

It can be seen from Table 4 that the reward values of the safe distance, warning distance, and dangerous distance are clearly distinguished, and the waveform scores of the first and fourth interference forms are high. The optimal interference form should combine long distance, low power, and few parameters. The first interference form and its parameter combination obtain the highest score in this missile rendezvous experiment. It is worth noting that the fourth interference form also achieves a high score. Compared with the first interference form, the fourth uses lower interference power, and its distance is only 300 m shorter. In contrast, although interference forms 7 and 8 also successfully started the fuze, their scores are clearly separated from those obtained within the safe distance. This shows that the reward function designed in this paper can effectively distinguish the interference forms in each state and has a certain comprehensive evaluation ability.

4.3.Verification of the action space parameter selection range

Using the model designed in this paper, simulation experiments for missile rendezvous were conducted 20 times. The mapping between the finally selected parameter position in the action space and the distance between the missile and the target was obtained, as shown in Fig.14 for the main frequency parameter.

In the figure, the values 6-10 on the horizontal axis represent the safe distance. Within the safe distance, the selection range of the action space is large and the selected parameters are widely scattered, which increases the diversity of parameter combinations. The values 3-6 represent the warning distance, where the selection range of the action space is relatively narrow and fluctuates around the central parameter. The values 1-2 represent the dangerous distance, where the selection range of the action space closely fits the central parameter. The stage distance states set in this paper thus change the parameter selection range according to the projectile distance.
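The stage-dependent search range can be illustrated with a small sampling routine. This is a sketch under assumptions rather than the model used in the paper: the relative half-widths of 20%, 5%, and 1% are hypothetical placeholders chosen only to show a wide, a narrow, and a nearly fixed range, and the uniform sampling is likewise an assumption.

```python
import random
from typing import Tuple

# Hypothetical relative half-widths of the search range for each stage.
STAGE_HALF_WIDTH = {"safe": 0.20, "warning": 0.05, "danger": 0.01}


def search_range(center: float, stage: str) -> Tuple[float, float]:
    """Return (low, high) bounds around the central parameter for a stage."""
    half_width = STAGE_HALF_WIDTH[stage] * abs(center)
    return center - half_width, center + half_width


def sample_parameter(center: float, stage: str, rng: random.Random) -> float:
    """Draw one parameter value uniformly from the stage's search range."""
    low, high = search_range(center, stage)
    return rng.uniform(low, high)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    for stage in ("safe", "warning", "danger"):
        samples = [sample_parameter(100.0, stage, rng) for _ in range(5)]
        print(stage, [round(s, 2) for s in samples])
```

Narrowing the range as the projectile approaches trades diversity of parameter combinations for a lower share of ineffective interference forms, which is the behavior described for Fig.14.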

Fig.14.Mapping between the parameter selection range and projectile distance.

4.4.Differentiation verification of the reward function in multiple interference experiments

To verify how well the reward function distinguishes the optimal interference patterns across multiple interference experiments, 1000 missile rendezvous experiments were conducted, and 10,000 interference patterns were synthesized. The distribution of the reward values obtained at the different projectile distances is shown in Fig.15.

Each point in the figure is the reward value of the corresponding interference form, and the red line connects equal reward values. Fig.16 shows that the reward function designed in this paper can effectively distinguish the interference forms at multiple distances. Within each distance state, the reward values are distributed fairly uniformly between an upper limit and a lower limit, and 13-20 different levels of interference forms can be distinguished.

4.5.Multi-interference-form strategy

Based on the results of the offline simulation experiments in this paper, interference forms with high reward values within each distance state can be extracted. These forms can be combined and released according to the projectile's distance to improve the success rate of interference. The interference forms at different distances extracted by the reward function are shown in Table 5.
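Extracting a distance-based strategy from offline results amounts to keeping the highest-reward form for each distance state and then looking it up by the current projectile distance. The sketch below shows one way this could be organized; the `Trial` record, the function names, and the form labels in the usage example are all hypothetical, and the 10 states of 100 m follow the 1000 m partition used in this paper.

```python
from typing import Dict, Iterable, NamedTuple


class Trial(NamedTuple):
    state: int      # distance state index, 1 = farthest (1000 m-900 m)
    form: str       # label of the interference form and its parameters
    reward: float   # reward value obtained in the offline experiment


def build_strategy(trials: Iterable[Trial]) -> Dict[int, str]:
    """Keep, for each distance state, the form with the highest positive reward."""
    best: Dict[int, Trial] = {}
    for trial in trials:
        if trial.reward <= 0:
            continue  # forms that did not start the fuze score 0 and are skipped
        if trial.state not in best or trial.reward > best[trial.state].reward:
            best[trial.state] = trial
    return {state: trial.form for state, trial in best.items()}


def state_from_distance(distance_m: float, max_distance_m: float = 1000.0,
                        step_m: float = 100.0) -> int:
    """Map a projectile distance to its distance state index (1 = farthest)."""
    if not 0 <= distance_m <= max_distance_m:
        raise ValueError("distance outside the interference range")
    n_states = int(max_distance_m // step_m)
    index = int((max_distance_m - distance_m) // step_m) + 1
    return min(index, n_states)


if __name__ == "__main__":
    # Placeholder records for illustration only, not results from the paper.
    trials = [Trial(1, "form-A", 1.2), Trial(1, "form-B", 2.5), Trial(2, "form-C", 0.7)]
    strategy = build_strategy(trials)
    print(strategy.get(state_from_distance(950.0)))
```

A state with no successful form simply has no entry in the returned table, so the caller can decide on a fallback instead of releasing a form that is known to fail.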

Fig.17 compares the number of successful startups of the frequency-modulated fuze caused by the interference strategy proposed in this paper with that caused by other single-type interference strategies. One hundred full-range interference simulation experiments were conducted for each interference strategy. It can be seen from the figure that the interference pattern optimization model designed in this paper interferes with the frequency modulation fuze more efficiently. The interference success rate of a single interference mode against the FM radio fuze is low, which makes it difficult to ensure the safety of the aircraft carrying the interference system. The distance-based diversified interference strategy can significantly improve the starting probability of the fuze. Using specific interference forms in different distance states increases interference efficiency, makes reasonable use of the advantages and characteristics of the different interference forms, and optimizes interference resources.

Fig.15.Distribution diagram of reward values for multirange missile rendezvous.

Fig.16.Ladder diagram of maximum values of the reward function.

Fig.17.Comparison of the number of fuze startups between the interference strategy proposed in this paper and other single-type interference strategies.

5.Conclusions

To improve the success rate of interfering with air defense missile radio fuzes in a complex battlefield environment and to ensure that the interference system can adapt to various working environments, this paper adopts the Q-learning reinforcement learning algorithm to obtain a decision-making method for multiple radio fuze interference forms. The method is based on optimal patterns obtained from offline experiments and is implemented online.

Table 5 Optimal distance-based interference form combination strategies.

First, the advantages of the Q-learning algorithm applied to fuze interference and the problems to be solved are analyzed. Second, to address these problems, a state space based on the missile's distance is designed, and the three stages of safe, warning, and dangerous distances are set for 10 distance states. A multidimensional action space is designed according to the quantity of parameters to be set for different interference forms, and the parameters are searched randomly in the action space. To reduce the proportion of ineffective interference forms, different parameter search ranges are set based on the stage state. Finally, a reward function weighted by distance, transmitting power, and the quantity of interference pattern parameters is designed. The highest scores are given to the interference form with the greatest interference distance, the least interference power, and the fewest parameters.

The simulation experiments verify that the interference decision method proposed in this paper can obtain an interference form with a high success rate at different distances. For a given fuze system and working environment, the optimal interference forms at various distances can be obtained by means of multiple offline interference experiments. In this way, the interference system can handle most working environments and yield effective interference. Compared with the single interference mode, the method proposed in this paper can greatly improve the efficiency of interfering with a fuze.

Funding

Supported Project: National Natural Science Foundation of China (61973037); National 173 Program Project (2019-JCJQ-ZD-324).

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.