MODELING COUNT DATA WITH OVER-DISPERSION USING GENERALIZED POISSON REGRESSION: A CASE STUDY OF LOW BIRTH WEIGHT IN INDONESIA

Poisson regression is commonly used to model count data in various research fields. It rests on an essential assumption: the mean and variance of the response must be equal, a property called equi-dispersion. This assumption is often violated because, in many data sets, the variance of the response is greater than its mean, a condition called over-dispersion. Fitting a Poisson regression model to over-dispersed data produces an invalid model that under-estimates standard errors and leads to misleading inference about the regression parameters. Therefore, an approach is needed to overcome the over-dispersion problem in Poisson regression, and generalized Poisson regression is one such approach. This study aims to obtain the generalized Poisson regression model and the factors affecting low birth weight in Indonesia in 2021. The results show that, based on the generalized Poisson regression model, the factors affecting low birth weight in Indonesia were: poverty rate, percentage of households with access to appropriate sanitation, percentage of pregnant women at risk of chronic energy deficiency receiving additional food, percentage of pregnant women who received blood-boosting tablets, and percentage of antenatal care.


INTRODUCTION
Poisson regression is widely used in modelling count data. Count data are a type of statistical data that record the number of events over a particular period and can take only non-negative integer values [1]. Poisson regression modelling rests on an essential assumption: the mean and variance of the response must be equal, a property called equi-dispersion [2]. This assumption is often violated because, in many data sets across research fields, the variance is greater than the mean, a condition called over-dispersion. Fitting Poisson regression to over-dispersed data yields an invalid model that underestimates standard errors and leads to misleading inferences about the regression parameters [1]. Therefore, an approach is needed to overcome the over-dispersion problem in Poisson regression. Generalized Poisson regression is an alternative approach for handling it [2]. Several studies have modelled count data with over-dispersion using generalized Poisson regression. The maximum likelihood and moment methods were used to estimate the generalized Poisson regression model parameters, whereas the significance of the parameters was tested using the likelihood ratio test [3]. The restricted generalized [...]
A baby's weight at birth is the most crucial determinant of its chances of survival, growth, and development. Mothers who continually maintain their health by consuming nutritious food and adopting a good lifestyle tend to give birth to healthy babies. In contrast, mothers who experience nutritional deficiencies are at risk of giving birth to babies with low body weight. Low birth weight reflects the health and nutrition situation and indicates the level of survival and psychosocial development [7]. Babies with low birth weight have a higher risk of death, growth retardation, and impaired development during childhood than babies with normal birth weight [8]. Some of the factors that cause low birth weight are chronic energy deficiency in pregnant women, poor antenatal care, poverty, and poor sanitation [9], [10], [11].
This study aims to obtain the generalized Poisson regression model for count data with over-dispersion and to apply it to identify the factors affecting low birth weight in Indonesia in 2021. Following [3], [12], [13], [14], the Poisson regression and generalized Poisson regression models can be obtained by the maximum likelihood and Fisher-scoring methods, and the significance of the parameters of both models can be tested using the likelihood ratio test and the Wald test.

LITERATURE REVIEW

Low Birth Weight
Low birth weight refers to a birth weight of less than 2,500 grams. It is divided into low birth weight (1,500-2,499 grams), very low birth weight (1,000-1,499 grams), and extremely low birth weight (< 1,000 grams). Between 60 and 80 percent of infant mortality is attributed to low birth weight. Babies with low birth weight have a greater risk of morbidity and mortality than babies born with normal weight. A gestation period of less than 37 weeks can cause complications in the baby because the organs have not fully developed. The lower the baby's weight, the more crucial it is to monitor its development in the weeks after birth. Low birth weight can be caused by two factors: premature birth and intrauterine growth restriction (IUGR), commonly called impaired fetal growth [7].

Poisson Regression
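Poisson regression is a nonlinear parametric regression model. The response $Y$ of the Poisson regression model follows the Poisson distribution, which has the probability mass function [2]:

$$f(y; \mu_1) = \frac{e^{-\mu_1}\mu_1^{y}}{y!}, \quad y = 0, 1, 2, \ldots \tag{1}$$

where $\mu_1$ is the parameter and $\mu_1 > 0$. $E(Y) = \mu_1$ and $\mathrm{Var}(Y) = \mu_1$ are, respectively, the mean and variance of the Poisson distribution. Suppose there are $p$ covariates, namely $X_1, X_2, \ldots, X_p$; then the Poisson regression model can be written as follows [2]:

$$\ln \mu_1(\mathbf{x}) = \mathbf{x}^{T}\boldsymbol{\beta}_1 = \beta_{10} + \beta_{11}X_1 + \cdots + \beta_{1p}X_p \tag{2}$$

where $\mathbf{x} = [1 \; X_1 \; X_2 \; \cdots \; X_p]^{T}$ is the vector of covariates, $\boldsymbol{\beta}_1 = [\beta_{10} \; \beta_{11} \; \cdots \; \beta_{1p}]^{T}$ is the vector of regression parameters, $p$ is the number of covariates, and $\mu_1(\mathbf{x})$ is the mean of the response, which depends on the covariates through the log link function [15]. The Poisson regression model in Equation (2) can be obtained by estimating the model's parameters using the maximum likelihood method. For a sample of $n$ observations $(y_i, \mathbf{x}_i)$, $i = 1, 2, \ldots, n$, the estimation begins with obtaining the likelihood and log-likelihood functions as follows:

$$L(\boldsymbol{\beta}_1) = \prod_{i=1}^{n} \frac{\exp\{-\exp(\mathbf{x}_i^{T}\boldsymbol{\beta}_1)\}\,\{\exp(\mathbf{x}_i^{T}\boldsymbol{\beta}_1)\}^{y_i}}{y_i!} \tag{3}$$

$$\ell(\boldsymbol{\beta}_1) = \sum_{i=1}^{n} \left[ y_i\,\mathbf{x}_i^{T}\boldsymbol{\beta}_1 - \exp(\mathbf{x}_i^{T}\boldsymbol{\beta}_1) - \ln(y_i!) \right] \tag{4}$$

The log-likelihood function in Equation (4) is maximized by taking its first partial derivative with respect to the parameters and equating it to zero:

$$\frac{\partial \ell(\boldsymbol{\beta}_1)}{\partial \boldsymbol{\beta}_1} = \sum_{i=1}^{n} \left[ y_i - \exp(\mathbf{x}_i^{T}\boldsymbol{\beta}_1) \right] \mathbf{x}_i = \mathbf{0} \tag{5}$$

Equation (5) is an implicit function of the parameters. Therefore, a numerical approach is needed to obtain the maximum likelihood estimator of the Poisson regression model parameters. One numerical approach is the Fisher-scoring method [12]. The Fisher-scoring algorithm for obtaining the maximum likelihood estimator of the Poisson regression model parameters is as follows:
1) Determine the initial value for $\hat{\boldsymbol{\beta}}_1$, namely $\hat{\boldsymbol{\beta}}_1^{(0)}$.
2) Determine the tolerance value, symbolized by $\varepsilon$, for stopping the iteration process.
3) Start the iteration process using the following formula:

$$\hat{\boldsymbol{\beta}}_1^{(m+1)} = \hat{\boldsymbol{\beta}}_1^{(m)} + \mathbf{I}^{-1}\big(\hat{\boldsymbol{\beta}}_1^{(m)}\big)\, \mathbf{g}\big(\hat{\boldsymbol{\beta}}_1^{(m)}\big), \quad m = 0, 1, 2, \ldots \tag{6}$$

where $\mathbf{g}(\boldsymbol{\beta}_1)$ is the gradient vector, which has the elements in Equation (5), and $\mathbf{I}(\boldsymbol{\beta}_1)$ is the information matrix, expressed as $\mathbf{I}(\boldsymbol{\beta}_1) = -E\big[\partial^{2}\ell(\boldsymbol{\beta}_1) / \partial\boldsymbol{\beta}_1 \partial\boldsymbol{\beta}_1^{T}\big]$. The second partial derivative of the log-likelihood function with respect to the parameters is:

$$\frac{\partial^{2}\ell(\boldsymbol{\beta}_1)}{\partial\boldsymbol{\beta}_1 \partial\boldsymbol{\beta}_1^{T}} = -\sum_{i=1}^{n} \exp(\mathbf{x}_i^{T}\boldsymbol{\beta}_1)\, \mathbf{x}_i \mathbf{x}_i^{T}$$

4) The iteration process stops at the $m$-th iteration when it has converged, namely when $\big\| \hat{\boldsymbol{\beta}}_1^{(m+1)} - \hat{\boldsymbol{\beta}}_1^{(m)} \big\| \le \varepsilon$.

The significance test on the Poisson regression model parameters aims to identify the covariates affecting the response simultaneously and partially. The likelihood ratio test method is applied to the simultaneous test using the hypotheses:

$$H_0: \beta_{11} = \beta_{12} = \cdots = \beta_{1p} = 0 \quad \text{versus} \quad H_1: \text{at least one } \beta_{1k} \ne 0,\; k = 1, 2, \ldots, p \tag{7}$$

The test statistic used to test the hypothesis in Equation (7) is Wilks' lambda statistic, which can be obtained by the likelihood ratio test method and is formulated as follows [13]:

$$G_1^{2} = -2\left[ \ell(\hat{\omega}_1) - \ell(\hat{\Omega}_1) \right] \tag{8}$$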
where $\ell(\hat{\omega}_1)$ is the maximum value of the log-likelihood function for the set of model parameters under the null hypothesis ($H_0$) and $\ell(\hat{\Omega}_1)$ is the maximum value of the log-likelihood function for the set of model parameters under the population. Both are obtained by evaluating the log-likelihood function in Equation (4) at the maximum likelihood estimates under the null hypothesis (intercept only) and under the population (all $p + 1$ parameters), respectively. Wilks' lambda statistic in Equation (8) is asymptotically chi-square distributed [16]. Therefore, at the significance level $\alpha$, the null hypothesis in Equation (7) is rejected if $G_1^{2} > \chi^{2}_{(\alpha, v_1)}$ or, equivalently, if the $p$-value is less than $\alpha$. Here $v_1$ is the degrees of freedom, the difference between the number of model parameters under the population and under the null hypothesis, namely $v_1 = (p + 1) - 1 = p$ [13].
The parameter hypothesis testing carried out after the simultaneous test is a partial test. The hypothesis used for the partial test is:

$$H_0: \beta_{1k} = 0 \quad \text{versus} \quad H_1: \beta_{1k} \ne 0, \quad k = 1, 2, \ldots, p \tag{9}$$

The test statistic for testing the hypothesis in Equation (9) is Wald's statistic, formulated as follows [13]:

$$Z_1 = \frac{\hat{\beta}_{1k}}{SE(\hat{\beta}_{1k})} \tag{10}$$

where $\hat{\beta}_{1k}$ is the maximum likelihood estimate of the parameter of the Poisson regression model obtained by the Fisher-scoring method in Equation (6), and $SE(\hat{\beta}_{1k})$ is the maximum likelihood standard error estimate of that parameter, obtained as the square root of the corresponding main diagonal element of the variance-covariance matrix $\widehat{\mathrm{Var}}(\hat{\boldsymbol{\beta}}_1) = \mathbf{I}^{-1}(\hat{\boldsymbol{\beta}}_1)$. Wald's statistic in Equation (10) is asymptotically standard normal distributed [16], so that, at the significance level $\alpha$, the null hypothesis in Equation (9) is rejected if $|Z_1| > Z_{\alpha/2}$ or, equivalently, if the $p$-value is less than $\alpha$.
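The Fisher-scoring iteration and the Wald statistic described above can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names (`solve`, `score_and_information`, `fisher_scoring_poisson`), the toy data, the starting value (intercept set to the log of the sample mean), and the default tolerance are all assumptions made for illustration. The toy data use a single binary covariate, for which the maximum likelihood fit reproduces the two group means.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def score_and_information(X, y, beta):
    """Gradient vector g(beta) and information matrix I(beta) under the log link."""
    d = len(beta)
    mu = [math.exp(sum(b * v for b, v in zip(beta, xi))) for xi in X]
    g = [sum((yi - mi) * xi[j] for xi, yi, mi in zip(X, y, mu)) for j in range(d)]
    info = [[sum(mi * xi[j] * xi[k] for xi, mi in zip(X, mu)) for k in range(d)]
            for j in range(d)]
    return g, info

def fisher_scoring_poisson(X, y, tol=1e-10, max_iter=100):
    """Return (beta_hat, standard_errors); each row of X starts with 1 (intercept)."""
    d = len(X[0])
    # Initial value (an assumption): intercept-only fit, slopes at zero.
    beta = [math.log(sum(y) / len(y))] + [0.0] * (d - 1)
    for _ in range(max_iter):
        g, info = score_and_information(X, y, beta)
        step = solve(info, g)                      # I^{-1}(beta) g(beta)
        beta = [b + s for b, s in zip(beta, step)]
        if math.sqrt(sum(s * s for s in step)) <= tol:
            break
    # Standard errors: square roots of the diagonal of I^{-1}(beta_hat).
    _, info = score_and_information(X, y, beta)
    se = []
    for j in range(d):
        e_j = [1.0 if k == j else 0.0 for k in range(d)]
        se.append(math.sqrt(solve(info, e_j)[j]))
    return beta, se

# Toy data (hypothetical): intercept column plus one binary covariate.
X = [[1, 0], [1, 0], [1, 1], [1, 1]]
y = [2, 4, 9, 15]
beta_hat, se = fisher_scoring_poisson(X, y)
wald_z = [b / s for b, s in zip(beta_hat, se)]     # Wald statistic per parameter
print(beta_hat, se, wald_z)
```

With a binary covariate the fitted means equal the group means (3 and 12 here), so the estimates can be checked by hand against ln 3 and ln 4.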

Over-dispersion
Over-dispersion is one of the most common problems in Poisson regression. Poisson regression assumes that the count data have a variance equal to their mean (equi-dispersion) [5]. Count data often exhibit over-dispersion, in which the variance is greater than the mean, namely $\mathrm{Var}(Y) > E(Y)$. Over-dispersion arises from unobserved sources of variability in the data or from the effect of other variables that make the probability of an event depend on previous events. Over-dispersion leads to underestimated standard errors, so that the significance of the covariate effects is over-estimated. Over-dispersion in Poisson regression can be detected from the deviance divided by its degrees of freedom: a value greater than one indicates over-dispersion [1].
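The deviance-based check can be sketched as follows, assuming the usual Poisson deviance $D = 2\sum_i [y_i \ln(y_i/\hat{\mu}_i) - (y_i - \hat{\mu}_i)]$ and degrees of freedom equal to the number of observations minus the number of estimated parameters. The function names and the toy counts are hypothetical; the fitted means come from an intercept-only model, for which every fitted value equals the sample mean.

```python
import math

def poisson_deviance(y, mu):
    """Poisson deviance; the term y*ln(y/mu) is taken as 0 when y = 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += log_term - (yi - mi)
    return 2.0 * total

def dispersion_ratio(y, mu, n_params):
    """Deviance divided by its degrees of freedom (n - number of parameters)."""
    return poisson_deviance(y, mu) / (len(y) - n_params)

# Toy counts (hypothetical) with an intercept-only fit.
y = [0, 1, 3, 10, 25, 60]
mu_hat = [sum(y) / len(y)] * len(y)
ratio = dispersion_ratio(y, mu_hat, n_params=1)
print(ratio, "over-dispersed" if ratio > 1 else "no evidence of over-dispersion")
```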

Generalized Poisson Regression
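Generalized Poisson regression is a development of the Poisson regression model. The generalized Poisson regression model can deal with under-dispersion and over-dispersion problems in Poisson regression [2]. The response $Y$ of the generalized Poisson regression model has a generalized Poisson distribution with the probability mass function defined as follows [17]:

$$f(y; \mu_2, \phi) = \left( \frac{\mu_2}{1 + \phi\mu_2} \right)^{y} \frac{(1 + \phi y)^{y-1}}{y!} \exp\left\{ -\frac{\mu_2 (1 + \phi y)}{1 + \phi\mu_2} \right\}, \quad y = 0, 1, 2, \ldots \tag{11}$$

where $\mu_2$ and $\phi$ are the parameters, for $\mu_2 > 0$ and $\phi > 0$. $E(Y) = \mu_2$ and $\mathrm{Var}(Y) = \mu_2 (1 + \phi\mu_2)^{2}$ are, respectively, the mean and variance of the generalized Poisson distribution.
Based on Equation (11), the generalized Poisson regression model can be written as follows [3]:

$$\ln \mu_2(\mathbf{x}) = \mathbf{x}^{T}\boldsymbol{\beta}_2 = \beta_{20} + \beta_{21}X_1 + \cdots + \beta_{2p}X_p \tag{12}$$

where $\mu_2(\mathbf{x})$ is the mean of the response, which depends on the covariates through the log link function, $\boldsymbol{\beta}_2 = [\beta_{20} \; \beta_{21} \; \cdots \; \beta_{2p}]^{T}$ is the parameter vector, and $\mathbf{x} = [1 \; X_1 \; X_2 \; \cdots \; X_p]^{T}$ is the vector of covariates. The generalized Poisson regression model in Equation (12) can be obtained by estimating the model parameters using the maximum likelihood method [7]. The initial step is forming the likelihood and log-likelihood functions. Suppose $\boldsymbol{\theta} = [\boldsymbol{\beta}_2^{T} \; \phi]^{T}$ and $\mu_{2i} = \exp(\mathbf{x}_i^{T}\boldsymbol{\beta}_2)$ for $i = 1, 2, \ldots, n$; then the likelihood and log-likelihood functions are formulated as follows:

$$L(\boldsymbol{\theta}) = \prod_{i=1}^{n} \left( \frac{\mu_{2i}}{1 + \phi\mu_{2i}} \right)^{y_i} \frac{(1 + \phi y_i)^{y_i - 1}}{y_i!} \exp\left\{ -\frac{\mu_{2i}(1 + \phi y_i)}{1 + \phi\mu_{2i}} \right\} \tag{13}$$

$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ y_i \ln\left( \frac{\mu_{2i}}{1 + \phi\mu_{2i}} \right) + (y_i - 1)\ln(1 + \phi y_i) - \frac{\mu_{2i}(1 + \phi y_i)}{1 + \phi\mu_{2i}} - \ln(y_i!) \right] \tag{14}$$

The next step is maximizing the log-likelihood function in Equation (14) by taking its first partial derivatives with respect to the parameters and equating them to zero:

$$\frac{\partial \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\beta}_2} = \sum_{i=1}^{n} \frac{y_i - \mu_{2i}}{(1 + \phi\mu_{2i})^{2}}\, \mathbf{x}_i = \mathbf{0} \tag{15}$$

$$\frac{\partial \ell(\boldsymbol{\theta})}{\partial \phi} = \sum_{i=1}^{n} \left[ -\frac{y_i \mu_{2i}}{1 + \phi\mu_{2i}} + \frac{y_i (y_i - 1)}{1 + \phi y_i} - \frac{\mu_{2i}(y_i - \mu_{2i})}{(1 + \phi\mu_{2i})^{2}} \right] = 0 \tag{16}$$

Equations (15) and (16) are implicit functions of the parameters. Therefore, the maximum likelihood estimator cannot be obtained explicitly and requires a numerical approach. As in the Poisson regression model, the Fisher-scoring method is used to obtain the maximum likelihood parameter estimator of the generalized Poisson regression model. The Fisher-scoring algorithm used is as follows:
1) Determine the initial value for $\hat{\boldsymbol{\theta}}$, namely $\hat{\boldsymbol{\theta}}^{(0)}$.
2) Determine the tolerance value $\varepsilon$ for stopping the iteration process.
3) Start the iteration process using the following formula:

$$\hat{\boldsymbol{\theta}}^{(m+1)} = \hat{\boldsymbol{\theta}}^{(m)} + \mathbf{I}^{-1}\big(\hat{\boldsymbol{\theta}}^{(m)}\big)\, \mathbf{g}\big(\hat{\boldsymbol{\theta}}^{(m)}\big), \quad m = 0, 1, 2, \ldots \tag{17}$$

where $\mathbf{g}(\boldsymbol{\theta})$ is the gradient vector, which has the elements in Equations (15) and (16), and $\mathbf{I}(\boldsymbol{\theta})$ is the information matrix, defined as

$$\mathbf{I}(\boldsymbol{\theta}) = -E\left[ \frac{\partial^{2}\ell(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\, \partial\boldsymbol{\theta}^{T}} \right] \tag{18}$$

where $\partial^{2}\ell(\boldsymbol{\theta}) / \partial\boldsymbol{\theta}\, \partial\boldsymbol{\theta}^{T}$ is the second partial derivative of the log-likelihood function with respect to the parameters, with blocks as follows:

$$\begin{aligned}
\frac{\partial^{2}\ell(\boldsymbol{\theta})}{\partial\boldsymbol{\beta}_2\, \partial\boldsymbol{\beta}_2^{T}} &= -\sum_{i=1}^{n} \frac{\mu_{2i}\,(1 + 2\phi y_i - \phi\mu_{2i})}{(1 + \phi\mu_{2i})^{3}}\, \mathbf{x}_i \mathbf{x}_i^{T}, \\
\frac{\partial^{2}\ell(\boldsymbol{\theta})}{\partial\boldsymbol{\beta}_2\, \partial\phi} &= -\sum_{i=1}^{n} \frac{2\mu_{2i}\,(y_i - \mu_{2i})}{(1 + \phi\mu_{2i})^{3}}\, \mathbf{x}_i, \\
\frac{\partial^{2}\ell(\boldsymbol{\theta})}{\partial\phi^{2}} &= \sum_{i=1}^{n} \left[ \frac{y_i \mu_{2i}^{2}}{(1 + \phi\mu_{2i})^{2}} - \frac{y_i^{2}(y_i - 1)}{(1 + \phi y_i)^{2}} + \frac{2\mu_{2i}^{2}(y_i - \mu_{2i})}{(1 + \phi\mu_{2i})^{3}} \right].
\end{aligned} \tag{19}$$

4) The iteration process stops when the convergence condition is met, namely $\big\| \hat{\boldsymbol{\theta}}^{(m+1)} - \hat{\boldsymbol{\theta}}^{(m)} \big\| \le \varepsilon$, where $\varepsilon$ is a small positive number. The maximum likelihood parameter estimator of the generalized Poisson regression model is obtained from the last iteration.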
Once the maximum likelihood parameter estimator of the generalized Poisson regression model has been obtained, parameter hypothesis testing can be carried out. This testing consists of a simultaneous test and a partial test. The simultaneous test is used to determine the effect of the covariates on the response jointly, whereas the partial test is used to determine the effect of each covariate on the response individually. The hypothesis for the simultaneous test is:

$$H_0: \beta_{21} = \beta_{22} = \cdots = \beta_{2p} = 0 \quad \text{versus} \quad H_1: \text{at least one } \beta_{2k} \ne 0,\; k = 1, 2, \ldots, p \tag{20}$$

The test statistic used to test the hypothesis in Equation (20) is Wilks' lambda statistic ($G_2^{2}$), which can be obtained by the likelihood ratio test method and is formulated as follows [2]:

$$G_2^{2} = -2\left[ \ell(\hat{\omega}_2) - \ell(\hat{\Omega}_2) \right] \tag{21}$$

where $\ell(\hat{\omega}_2)$ and $\ell(\hat{\Omega}_2)$ are the maximum values of the log-likelihood function under the null hypothesis and under the population, respectively. They are obtained by evaluating the log-likelihood function in Equation (14) at the maximum likelihood estimates under the null hypothesis (intercept and dispersion parameter only) and under the population (all $p + 2$ parameters). Wilks' lambda statistic ($G_2^{2}$) in Equation (21) is asymptotically chi-square distributed [16]. Therefore, at the significance level $\alpha$, the null hypothesis in Equation (20) is rejected if $G_2^{2} > \chi^{2}_{(\alpha, v_2)}$ or if the $p$-value is less than $\alpha$, where $v_2$ is the degrees of freedom, namely $v_2 = (p + 2) - 2 = p$.
The next test is the partial test, for which the Wald test method is used. The hypothesis is:

$$H_0: \beta_{2k} = 0 \quad \text{versus} \quad H_1: \beta_{2k} \ne 0, \quad k = 1, 2, \ldots, p \tag{22}$$

The test statistic for testing the hypothesis in Equation (22) is the Wald statistic, formulated as

$$Z_2 = \frac{\hat{\beta}_{2k}}{SE(\hat{\beta}_{2k})} \tag{23}$$

where $\hat{\beta}_{2k}$ is the maximum likelihood estimate of the parameter of the generalized Poisson regression model obtained by the Fisher-scoring method in Equation (17), and $SE(\hat{\beta}_{2k})$ is its estimated standard error. The Wald statistic ($Z_2$) in Equation (23) is asymptotically standard normal distributed [16], so that, at the significance level $\alpha$, the null hypothesis in Equation (22) is rejected when $|Z_2| > Z_{\alpha/2}$ or when the $p$-value is less than $\alpha$.
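As a hedged illustration of the generalized Poisson distribution behind this model, the sketch below codes a log probability mass function with mean `mu` and dispersion `phi`, chosen so that the variance equals mu(1 + phi·mu)^2 as stated in the text, together with the corresponding regression log-likelihood under a log link. The function names, the symbol `phi`, the parameter values, and the truncation of the support at 400 are assumptions made for this example only.

```python
import math

def gp_logpmf(y, mu, phi):
    """Log pmf of a generalized Poisson distribution with mean mu, dispersion phi."""
    return (y * math.log(mu / (1 + phi * mu))
            + (y - 1) * math.log(1 + phi * y)
            - mu * (1 + phi * y) / (1 + phi * mu)
            - math.lgamma(y + 1))            # lgamma(y + 1) = ln(y!)

def gp_loglik(beta, phi, X, y):
    """Log-likelihood of the regression model with log link mu_i = exp(x_i' beta)."""
    total = 0.0
    for xi, yi in zip(X, y):
        mu_i = math.exp(sum(b * v for b, v in zip(beta, xi)))
        total += gp_logpmf(yi, mu_i, phi)
    return total

# Numerical check of the stated moments for illustrative parameter values.
mu, phi = 4.0, 0.1
probs = [math.exp(gp_logpmf(k, mu, phi)) for k in range(400)]
total_prob = sum(probs)
mean = sum(k * p for k, p in enumerate(probs))
var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
print(total_prob, mean, var, mu * (1 + phi * mu) ** 2)
```

Setting `phi = 0` in `gp_logpmf` recovers the ordinary Poisson log pmf, which is one way to see the Poisson model as a special case.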

Data Sources
The data in this study are secondary data obtained from the Ministry of Health of the Republic of Indonesia [18] and the Central Statistics Agency of the Republic of Indonesia [19]. The units of observation are all 34 provinces of Indonesia in 2021.

Research Variables
The research variables used in this study consist of the response ($Y$) and the covariates ($X_k$), for $k = 1, 2, \ldots, 8$, which are presented in Table 1.

Data Analysis Techniques
The techniques of data analysis in this study are as follows:

Statistical Descriptive Analysis
Analyzing and modeling low birth weight in Indonesia using generalized Poisson regression begins with a descriptive statistical analysis of the research variables. The results are shown in Table 2. Table 2 shows that the average number of low birth weight cases per province in Indonesia in 2021 was 3,286, with a standard deviation of 4,775. The highest and lowest numbers, 22,574 and 177, were found in West Java Province and North Sulawesi Province, respectively. One reason for the higher number of low birth weight cases in West Java Province compared to North Sulawesi Province is its larger population. The distribution of low birth weight in Indonesia in 2021 is visualized in Figure 1.

Detecting Multicollinearity
Multicollinearity, namely strong correlation among the covariates, is a problem in generalized Poisson regression modeling. In this study, multicollinearity is detected using the Variance Inflation Factor (VIF) [20]. The generalized Poisson regression model has a multicollinearity problem when the VIF value of a covariate is greater than 10. Table 3 shows that all covariates have a VIF value of less than 10. Therefore, there is no multicollinearity, and all covariates can be used to model low birth weight with the generalized Poisson regression model.
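One way to compute the VIF values, sketched here under the assumption that each VIF is taken as the corresponding diagonal element of the inverse of the covariates' correlation matrix (equivalent to 1 / (1 - R²) from regressing one covariate on the others), is shown below. The helper names and the three toy covariates are hypothetical and unrelated to the study data.

```python
import math

def correlation_matrix(cols):
    """Pearson correlation matrix of a list of equally long columns."""
    n = len(cols[0])
    centered = [[v - sum(c) / n for v in c] for c in cols]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    p = len(cols)
    return [[sum(a * b for a, b in zip(centered[j], centered[k])) / (norms[j] * norms[k])
             for k in range(p)] for j in range(p)]

def invert(A):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        d = M[c][c]
        M[c] = [v / d for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M]

def vif(cols):
    """VIF of each covariate: diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(cols))
    return [inv[j][j] for j in range(len(cols))]

# Toy covariates (hypothetical): x1 and x2 are strongly correlated.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
x3 = [5, 3, 8, 1, 9, 2, 7, 4]
vif_values = vif([x1, x2, x3])
flagged = [j + 1 for j, v in enumerate(vif_values) if v > 10]   # rule of thumb from the text
print(vif_values, flagged)
```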

Modeling Low Birth Weight Using Poisson Regression
The modeling of low birth weight in Indonesia in 2021 using Poisson regression begins with estimating and testing the significance of the Poisson regression model parameters; the results are displayed in Table 4. Based on Table 4, the Poisson regression model obtained can be written as follows:

$$\ln \hat{\mu}_1(\mathbf{x}) = 1.7986 + 0.0524X_1 + 0.0067X_2 - 0.0011X_3 + 0.0433X_4 - 0.0573X_5 - 0.0266X_6 + 0.0585X_7 + 0.0446X_8 \tag{24}$$

The simultaneous test was carried out using Wilks' lambda statistic in Equation (8). The hypothesis was formulated as follows:

$$H_0: \beta_{11} = \beta_{12} = \cdots = \beta_{18} = 0 \quad \text{versus} \quad H_1: \text{at least one } \beta_{1k} \ne 0,\; k = 1, 2, \ldots, 8$$

The value of Wilks' lambda statistic was 80,105.23, and the value of $\chi^{2}_{(\alpha, v_1)}$ was 13.3616, with a $p$-value of less than 0.001. Therefore, the null hypothesis was rejected, and it can be concluded that the poverty rate, percentage of households occupying livable houses, percentage of food processing places that meet the requirements according to the standard, percentage of households that have access to safe drinking water, percentage of households that have access to proper sanitation, percentage of pregnant women at risk of chronic energy deficiency receiving additional food, percentage of pregnant women who received blood-boosting tablets, and percentage of antenatal care simultaneously had a significant influence on low birth weight in Indonesia.
The partial test was used to identify the covariates that significantly influence low birth weight in Indonesia. This test used the Wald statistic in Equation (10), with the following hypotheses:

$$H_0: \beta_{1k} = 0 \quad \text{versus} \quad H_1: \beta_{1k} \ne 0, \quad k = 1, 2, \ldots, 8$$

Based on Table 4, the Wald statistic value ($|Z_1|$) for all parameters was greater than the value of $Z_{\alpha/2}$, and the $p$-value for all parameters was less than $\alpha$. Therefore, the null hypothesis was rejected for every covariate, and it can be concluded that the poverty rate, percentage of households occupying livable houses, percentage of food processing places that meet the requirements according to the standard, percentage of households that have access to safe drinking water, percentage of households that have access to proper sanitation, percentage of pregnant women at risk of chronic energy deficiency receiving additional food, percentage of pregnant women who received blood-boosting tablets, and percentage of antenatal care each had a significant partial influence on low birth weight in Indonesia.

Detecting Over-dispersion
Over-dispersion is detected in two ways: by comparing the variance of the response with its mean, and by dividing the deviance of the Poisson regression model by its degrees of freedom. Over-dispersion occurs if the variance of the response is greater than its mean, or if the deviance divided by the degrees of freedom is greater than 1. Based on the descriptive statistical analysis in Table 2, the variance of low birth weight was 22,797,386 and the mean was 3,286. Since the variance is greater than the mean, over-dispersion occurs. Based on the results of the Poisson regression modeling, the deviance was 58,108.76 with 25 degrees of freedom, so the deviance divided by the degrees of freedom was about 2,324.35, far greater than 1. Both results indicate over-dispersion in the Poisson regression model.
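The two diagnostics can be reproduced with simple arithmetic from the figures reported in this section. The snippet below is only a worked check; the variable names are hypothetical, and the count of 9 estimated parameters (8 covariates plus an intercept) is an assumption that happens to be consistent with the reported 25 degrees of freedom for 34 provinces.

```python
# Figures reported in the text for low birth weight in 34 provinces.
variance, mean = 22_797_386, 3_286
deviance, degrees_of_freedom = 58_108.76, 25

variance_to_mean = variance / mean               # much larger than 1
deviance_per_df = deviance / degrees_of_freedom  # much larger than 1

# Assumption: 34 observations minus 9 parameters (8 covariates + intercept).
n_provinces, n_params = 34, 9
implied_df = n_provinces - n_params

print(round(variance_to_mean, 2), round(deviance_per_df, 2), implied_df)
```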

Modeling Low Birth Weight Using Generalized Poisson Regression
Since there is over-dispersion in the Poisson regression, low birth weight in Indonesia in 2021 cannot be adequately modeled using Poisson regression. Generalized Poisson regression is therefore an appropriate alternative. The results of modeling low birth weight in Indonesia in 2021 using generalized Poisson regression are presented in Table 5. Based on the parameter estimates in Table 5, the generalized Poisson regression model is expressed as follows:

$$\ln \hat{\mu}_2(\mathbf{x}) = 4.4497 + 0.0494X_1 - 0.0014X_2 + 0.0012X_3 + 0.0246X_4 - 0.0340X_5 - 0.0193X_6 + 0.0374X_7 + 0.0291X_8 \tag{25}$$

The simultaneous test was carried out using Wilks' lambda statistic in Equation (21). The hypothesis was formulated as follows:

$$H_0: \beta_{21} = \beta_{22} = \cdots = \beta_{28} = 0 \quad \text{versus} \quad H_1: \text{at least one } \beta_{2k} \ne 0,\; k = 1, 2, \ldots, 8$$

The value of Wilks' lambda statistic was 15.0237, and the value of $\chi^{2}_{(\alpha, v_2)}$ was 13.3616, with a $p$-value of 0.0587. Because 15.0237 is greater than 13.3616, the null hypothesis was rejected, and it can be concluded that the poverty rate, percentage of households occupying livable houses, percentage of food processing places that meet the requirements according to the standard, percentage of households that have access to safe drinking water, percentage of households that have access to proper sanitation, percentage of pregnant women at risk of chronic energy deficiency receiving additional food, percentage of pregnant women who received blood-boosting tablets, and percentage of antenatal care simultaneously had a significant influence on low birth weight in Indonesia.
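The reported p-value and critical value can be cross-checked with a short sketch. For an even number of degrees of freedom, the chi-square upper-tail probability has the closed form exp(-x/2) · Σ_{j<df/2} (x/2)^j / j!, which avoids any external library. The function name is hypothetical. Under this check, the critical value 13.3616 with 8 degrees of freedom corresponds to an upper-tail probability of about 0.10, which suggests, as an inference rather than something stated in the text, that the tests were run at a 10% significance level.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) of a chi-square variable with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 0.0
    for j in range(df // 2):
        total += term            # adds (x/2)^j / j!
        term *= half / (j + 1)
    return math.exp(-half) * total

df = 8                                          # v2 = p = 8 covariates
p_value = chi2_sf_even_df(15.0237, df)          # reported likelihood ratio statistic
alpha_implied = chi2_sf_even_df(13.3616, df)    # reported critical value
print(round(p_value, 4), round(alpha_implied, 4))
```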
The partial test was used to identify the covariates that significantly influence low birth weight in Indonesia. The Wald statistic in Equation (23) was applied in this test, with the following hypothesis:

$$H_0: \beta_{2k} = 0 \quad \text{versus} \quad H_1: \beta_{2k} \ne 0, \quad k = 1, 2, \ldots, 8$$

The Wald statistic values ($|Z_2|$) of the parameters $\beta_{21}$, $\beta_{25}$, $\beta_{26}$, $\beta_{27}$, and $\beta_{28}$ in Table 5 were greater than the value of $Z_{\alpha/2}$, with $p$-values less than $\alpha$. Therefore, the null hypothesis was rejected for these parameters, and it can be concluded that the poverty rate, percentage of households that have access to proper sanitation, percentage of pregnant women at risk of chronic energy deficiency receiving additional food, percentage of pregnant women who received blood-boosting tablets, and percentage of antenatal care each had a significant partial influence on low birth weight in Indonesia.
Finally, the generalized Poisson regression model in Equation (25) is interpreted for the significant covariates as follows, in each case holding the other covariates fixed:
1) If the poverty rate ($X_1$) increases by one percentage point, the average number of low birth weight cases is multiplied by exp(0.0494) = 1.0506, an increase of about 5.1%.
2) If the percentage of households with access to proper sanitation ($X_5$) increases by one percentage point, the average number of low birth weight cases is multiplied by exp(-0.0340) = 0.9666, a decrease of about 3.3%.
3) If the percentage of pregnant women at risk of chronic energy deficiency receiving additional food ($X_6$) increases by one percentage point, the average number of low birth weight cases is multiplied by exp(-0.0193) = 0.9809, a decrease of about 1.9%.
4) If the percentage of pregnant women who received blood-boosting tablets ($X_7$) increases by one percentage point, the average number of low birth weight cases is multiplied by exp(0.0374) = 1.0381, an increase of about 3.8%.
5) If the percentage of antenatal care ($X_8$) increases by one percentage point, the average number of low birth weight cases is multiplied by exp(0.0291) = 1.0295, an increase of about 3.0%.
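The multiplicative effects in this interpretation follow directly from exponentiating the coefficients of the significant covariates. A minimal sketch of that arithmetic is given below; the dictionary keys are descriptive labels chosen for this example, and the coefficients are those reported in the text.

```python
import math

# Coefficients of the significant covariates as reported in the text.
coefficients = {
    "poverty rate": 0.0494,
    "proper sanitation": -0.0340,
    "additional food for at-risk pregnant women": -0.0193,
    "blood-boosting tablets": 0.0374,
    "antenatal care": 0.0291,
}

# Rate ratio: factor by which the expected count changes per one-unit increase.
rate_ratios = {name: math.exp(b) for name, b in coefficients.items()}
# Same effect expressed as a percentage change in the expected count.
percent_change = {name: 100.0 * (rr - 1.0) for name, rr in rate_ratios.items()}

for name in coefficients:
    print(f"{name}: ratio {rate_ratios[name]:.4f}, change {percent_change[name]:+.2f}%")
```

A ratio above 1 corresponds to a positive coefficient and an increase in the expected count; a ratio below 1 corresponds to a decrease.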

CONCLUSION
Generalized Poisson regression is an appropriate regression technique for modeling count data that exhibit over-dispersion under Poisson regression. The generalized Poisson regression was developed from the generalized linear models. The maximum likelihood and Fisher-scoring methods were used to estimate the generalized Poisson regression model parameters, whereas the likelihood ratio test and Wald test methods were employed to test the significance of the parameters. The generalized Poisson regression model was applied to modeling low birth weight in Indonesia in 2021. Based on the generalized Poisson regression model, the factors affecting low birth weight in Indonesia were: poverty rate, percentage of households with access to appropriate sanitation, percentage of pregnant women at risk of chronic energy deficiency receiving additional food, percentage of pregnant women who received blood-boosting tablets, and percentage of antenatal care. Future research should extend this study with a spatial regression approach, such as geographically weighted generalized Poisson regression, because the data may exhibit spatial heterogeneity that the generalized Poisson regression model does not capture.