STATISTICS-BASED PREDICTIONS OF CORONAVIRUS EPIDEMIC SPREADING IN MAINLAND CHINA

Background. The epidemic outbreak caused by coronavirus COVID-19 is of great interest to researches because of the high rate of the infection spread and the significant number of fatalities. A detailed scientific analysis of the phenomenon is yet to come, but the public is already interested in the questions of the epidemic duration, the expected number of patients and deaths. Long-time predictions require complicated mathematical models that need a lot of effort to identify and calculate unknown parameters. This article will present some preliminary estimates. Objective. Since the long-time data are available only for mainland China, we will try to predict the epidemic characteristics only in this area. We will estimate some of the epidemic characteristics and present the dependencies for victim numbers, infected and removed persons versus time. Methods. In this study we use the known SIR model for the dynamics of an epidemic, the known exact solution of the linear differential equations and statistical approach developed before for investigation of the children disease, which occurred in Chernivtsi (Ukraine) in 1988–1989. Results. The optimal values of the SIR model parameters were identified with the use of statistical approach. The numbers of infected, susceptible and removed persons versus time were predicted and compared with the new data obtained after February 10, 2020, when the calculations were completed. Conclusions. The simple mathematical model was used to predict the characteristics of the epidemic caused by coronavirus in mainland China. Unfortunately, the number of coronavirus victims is expected to be much higher than that predicted on February 10, 2020, since 12289 new cases (not previously included in official counts) have been added two days later. Further research should focus on updating the predictions with the use of up-to-date data and using more complicated mathematical models.


Introduction
Here, we consider the development of an epidemic outbreak caused by coronavirus COVID-19 (the previous name was 2019-nCoV) (see e.g., [1][2][3]). Since long-term data are available only for mainland China, we will try to predict the number of coronavirus victims V (number of persons who caught the infection and got sick) only in this area. The first estimations of V(t) exponential growth versus time t, typical for the initial stages of every epidemic (see e.g., [4]) have been done in [3]. For long-time predictions, more complicated mathematical models are necessary. For example, a susceptible-exposed-infectious-recovered (SEIR) model was used in [2]. Nevertheless, complicated models need more effort for unknown parameters identification. This procedure may be especially difficult if reliable data are limited.
In this study, we use the known SIR model for the dynamics of an epidemic [4][5][6][7][8]. For the parameter identification, we will use the exact solution of the SIR set of linear equations and statistical approach developed in [4] (tested also in [9]). These methods were applied for investigation of the children disease, which occurred in Chernivtsi (Ukraine) in 1988-1989. We will estimate some of the epidemic characteristics and present the dependencies for victim numbers, infected and removed persons versus time.

Data
We shall analyze the daily data for the number of confirmed cases in mainland China, which origins from the National Health Commission of the People's Republic of China [1]. A part of the official diagram (its version, presented on February 15, 2020) is shown in Fig. 1. For calculations, we have used the data for the period of time from January 16 to February 9, 2020. The numbers shown after February 9 were used for verification of predictions.

Exact solution of SIR equations
The SIR model for an infectious disease can be written as follows [6,7]: , The number of susceptible persons is S, infected (persons who are sick and spread the infection) -I, removed (persons who do not spread the infection anymore, this number is the sum of isolated, recovered and dead people) -R; the infection and immunization rates are  and  respectively.
It follows from (1) and (2) that Integration of (5) with the initial conditions (4) yields: Function I has a maximum at S  and tends to zero at infinity, see [6,7]. In comparison, the number of susceptible persons at infinity 0, S   and can be calculated with the use of (6) from a non-linear equation yields: Thus, for every set of parameters N, ,  ,  0 t and a fixed value of V the integral (10) can be calculated and the corresponding moment of time can be determined from (9). Then I can be calculated from (6) by putting S = N  V and function R from Statistical approach for parameter identification. Linear regression As in paper [4], we shall use the fact that the random function 1 ( , , ) F V N  has a linear distribution (see (9)). Then we can apply the linear regression (see [10]) for every pair of parameters N and  and calculate the corresponding values of 0 t and .  The optimal (the most reliable) values of N and  correspond to the maximum value of the correlation coefficient r (see [4,9]).

Results
Since we did not know and still don't know the values of Qj before February 12, 2020, we supposed that Vj = Wj and have done the calculations with the use of data for the time period from January 16 to February 9, 2020. The optimal values of the parameters are: The corresponding correlation coefficient is very high r = 0.997966487046645. The solution of (7) yields the value 45579.
The corresponding number of infected I, susceptible S and removed R persons versus time (starting from January 16, 2020) were calculated and shown in Fig. 2. The blue line represents the number of victims V = I + R and is in good agreement with "tested confirmed cases" Wj, reported by the National Health Commission of the People's Republic of China [1] (blue markers).

Discussion
Unfortunately, many cases have not been included in the official counts and have appeared in the official Table from [1] only on February 12 as "clinically diagnosed cases" Qj (see Fig. 1). Since the National Health Commission of the the People's Republic of China has proposed two different ways of registration of the same disease [1], Vj must be the sum of Wj and Qj , i.e. Vj = Wj + Qj (provided that no new methods of registering the same disease would appear). Values Wj after February 9 are shown in Fig. 3 by "stars". "Crosses" represent the sum Wj + Qj .
Since the optimal curve was obtained only with the use of Wj and the difference between Wj and Vj is very big (e.g., it was 12 289 persons on February 12, 2020), the predictions shown in Fig. 2 and reported in [11] are no longer relevant. To have better predictions, it is necessary to have exact Qjdata for the period before February 12. Blue markers show the "tested confirmed cases" W j , reported by the National Health Commission of the People's Republic of China [1]. The "circles" correspond to the points used for calculations (it was supposed that V j = W j ); "stars" -to the points used only for verification Circles" show the "tested confirmed cases" W j for the period from January 16 to February 9, 2020, [2]. These points were used to calculate the prediction curve. "Stars" correspond to the "tested confirmed cases" W j for the period from February 10 to February 14, 2020, [1]. "Crosses" represent the sum W j + Q j from [1]

Conclusions
The simple mathematical model was used to predict the characteristics of the epidemic caused by coronavirus in mainland China. The numbers of infected, susceptible, and removed persons versus time were predicted and compared with the new data obtained after February 10, 2020, when the calculations were completed. Unfortunately, many cases have not been included in the official counts and have appeared on February 12 only. It makes the predictions reported on February 10, 2020, no longer relevant. Further research should focus on updating the predictions with the use of corrected data and more complicated mathematical models.