Development of a multiple linear regression model to predict COD concentration based on West Tarum Canal surface water quality data

COD level indicates the organic matter pollution in water. It is normally measured using time-consuming and costly laboratory tests. A predictive analysis, such as Multiple Linear Regression (MLR), could make COD measurement more efficient. Objectives: This research aims to determine the parameters that can predict COD concentration using correlation analysis and to develop a Multiple Linear Regression model for predictive analysis of the COD level in the West Tarum Canal surface water. Method and results: The surface water quality data used in this study were collected from the official website of PAM Jaya and cover the period from August 2017 to May 2020. The correlation analysis used to determine the predictors was done in Microsoft Excel using the Pearson product-moment correlation. The predictors selected are TDS, SO4, and Fluoride. The water quality dataset was imported into RStudio to build the MLR model. The model was validated using a t-test. The results showed that none of the models at any intake point gave good predictions, and the predictors showed no effect on the COD level. Conclusion: Multiple Linear Regression is not a suitable tool for predicting the COD in the West Tarum Canal surface water.


Introduction
The West Tarum Canal, which runs 69.8 km from the Curug Dam to the Ciliwung River, began operating in 1968. It delivers raw water for the drinking, industrial, and agricultural needs of Karawang Regency, Bekasi Regency, and Bekasi City. It supplies 80% of the raw water to the citizens of Jakarta [1].
COD is defined as the amount of oxygen equivalents consumed by a strong oxidant in the chemical oxidation of organic matter [2]. In the West Tarum Canal surface water, the COD level is reported to exceed the allowable level [3]. A high COD value indicates that the water is polluted by organic matter [4]. APHA (1995) includes four primary methods for measuring COD in water: the titrimetric method, the closed reflux method, the open reflux method, and the closed reflux/colorimetric method [5].
This study concentrates on designing statistical models to predict COD levels in order to make measuring COD more efficient and cost-effective. It focuses on developing a Multiple Linear Regression model for predictive analysis of the COD level in the West Tarum Canal surface water, using several predictors selected through correlation analysis. Maulani et al. (2016) studied the effect of BOD, TSS, and Oil & Grease on the COD level. They used the correlation method and Multiple Linear Regression as statistical tools to make the prediction [6].

Method
The first step of this study was to define the research idea and review the literature to strengthen the research base. After that, the author collected the data and pre-processed it, so that the data were cleaned and ready to be analyzed. The next step was to run a correlation test of COD against all parameters to obtain the correlation coefficients used in the predictor selection process. The author then ran a Multiple Linear Regression analysis on the predictor and response variables to obtain the regression coefficients. Once the regression coefficients are obtained, the model can predict the COD from the selected predictors.

Water Quality Data
The water quality data are time-series data from August 2017 to May 2020, collected from the official website of PAM Jaya. They contain physical, chemical, and biological parameters. The intake points located in the West Tarum Canal, which are also used as the sampling points of this research, are the Curug Dam, Bekasi Dam, Cawang, and Ciliwung River intake points.

Data Analysis
This research uses two software tools in the analysis: Microsoft Excel and RStudio. Microsoft Excel is used for the correlation tests, while RStudio is used to carry out the regression analysis and the t-test.

Correlation Test
Correlation analysis is a statistical analysis used to assess the strength of the relationship between two variables [7]. The correlation technique used in this research is the Pearson product-moment correlation coefficient, the most widely used coefficient, denoted r (usually called the Pearson r). This technique calculates the degree of linear correlation between two variables: a value above 0 represents a positive correlation (direct correlation), and a value below 0 represents a negative correlation (inverse correlation), as can be seen in figure 2 [8]. Based on COD's characteristics, the level of COD in water is affected by organic matter. The higher the COD level, the higher the organic matter content of the water [4]. COD is also related to dissolved oxygen (DO): higher COD levels mean a higher level of oxidizable organic material, which decreases the DO level [8,9].
However, the water quality data collected from PAM Jaya do not include DO. Therefore, correlation coefficient analysis is the tool used to select the COD predictors, and the chemical and physical characteristics of COD are neglected.
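The Pearson r described above can be sketched in a few lines. The study computed it in Microsoft Excel; the Python version below is only an illustration, and the function name `pearson_r` and the sample numbers are hypothetical, not taken from the PAM Jaya dataset.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)

# Illustrative numbers only (assumption), not values from the study's dataset
tds = [120.0, 135.0, 150.0, 160.0, 180.0]
cod = [10.0, 12.0, 11.0, 15.0, 16.0]
r = pearson_r(tds, cod)
print(f"r = {r:.3f}")  # rounded to 3 decimal places, as in the study's tables
```

The result always lies between -1 and 1; its sign gives the direction of the linear relationship and its magnitude gives the strength.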

Multiple Linear Regression Model and Predictive Analysis
Multiple Linear Regression (MLR) is a statistical tool that uses several independent variables to predict the outcome of a dependent variable by modeling the relationship between them [11]. The independent and dependent variables are also called predictor and response variables, respectively [12]. In the general equation of the Multiple Linear Regression model, Y is the response variable, Xk is the k-th predictor variable, and βk is the regression coefficient of Xk.
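In its standard textbook form, and using the symbols defined above, the model can be written as follows. The intercept β0 and the error term ε are assumptions of this sketch, since neither is named in the text; the second line simply substitutes the three predictors selected in this study.

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon

\widehat{\mathrm{COD}} = \beta_0 + \beta_1\,\mathrm{TDS} + \beta_2\,\mathrm{SO_4} + \beta_3\,\mathrm{Fluoride}
```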

Significance Test (t-Test)
The t-test is mostly used to test the null hypothesis that there is no difference between two means [8]. One of the comparative tests (compare means) is the paired sample t-test. This test is useful for testing two interrelated/correlated samples or "paired samples" from populations with the same average [13]. The paired sample t-test is also called the repeated measures t-test or the related t-test.
This t-test is performed when the samples are related, commonly because the same participants appear in each sample [14].
The t-test begins with determining the null hypothesis (H0) and the alternative hypothesis (Ha). The null hypothesis says that there is no difference between the means, while the alternative hypothesis says that there is some difference between the means.
To test the hypothesis, a significance value is used. If the significance value is below 0.05, the null hypothesis is rejected, and the alternative hypothesis is accepted. If the significance value is above 0.05, the null hypothesis is accepted, and the alternative hypothesis is rejected.
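The paired comparison described above can be sketched as follows. The study ran its t-test in R; this Python version is only an illustration. The function names and the sample numbers are hypothetical, and the critical value `t_critical` is assumed to be looked up in a t-distribution table for the chosen significance level and degrees of freedom. Comparing |t| with the critical value is equivalent to comparing the significance value with 0.05.

```python
import math

def paired_t_statistic(actual, predicted):
    """Return (t, degrees_of_freedom) for a paired sample t-test."""
    n = len(actual)
    if n != len(predicted) or n < 2:
        raise ValueError("samples must have the same length (at least 2)")
    # Work on the pairwise differences; H0 says their mean is zero
    diffs = [a - p for a, p in zip(actual, predicted)]
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    if var_d == 0:
        raise ValueError("t is undefined when all differences are identical")
    standard_error = math.sqrt(var_d / n)
    return mean_d / standard_error, n - 1

def reject_h0(t_stat, t_critical):
    """Two-tailed decision: reject H0 when |t| exceeds the table value."""
    return abs(t_stat) > t_critical

# Illustrative numbers only (assumption), not COD values from the study
actual_cod = [10.0, 12.0, 14.0, 16.0]
predicted_cod = [9.0, 12.0, 13.0, 17.0]
t_stat, df = paired_t_statistic(actual_cod, predicted_cod)
# 3.182 is the two-tailed 0.05 table value for 3 degrees of freedom
print(t_stat, df, reject_h0(t_stat, t_critical=3.182))
```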

The Predictors
The coefficients for Curug Dam, Bekasi Dam, Cawang, and Ciliwung River intake points are shown in Tables 1, 2, 3, and 4, respectively. The correlation coefficients are rounded to 3 decimal places.
According to the correlation test results, the correlation coefficient between KMnO4 (an organic matter parameter) and COD is very low. Thus, organic matter is not selected as a predictor. Since the dissolved oxygen concentration is not provided in the dataset, COD characteristics are neglected in the predictor selection.
Theoretically, KMnO4, as an organic matter parameter, is supposed to have a reasonable correlation with COD because they are related compositionally. Some abnormalities in the raw data might have caused this result.
Nonetheless, the coefficients show only weak correlations between the parameters and COD. Several parameters have a reasonably high coefficient at one intake point but a very low one at another. Therefore, the parameters with the highest coefficients across all intake points are selected as the predictors: TDS, SO4, and Fluoride.

MLR Model and Prediction Results
The predicted COD concentration is obtained using the regression equations generated by the MLR models computed in RStudio. The R package used for the MLR is tidyverse. At the Ciliwung River intake point, the equation excluded Fluoride because the significance value (1-tailed) of Fluoride is 0.
After the regression equations are generated, the next step is to substitute the predictor variables in each equation with the values from the dataset. Figures 3, 4, 5, and 6 show the comparison charts between the actual and predicted COD concentrations at the Curug Dam, Bekasi Dam, Cawang, and Ciliwung River intake points, respectively. One reason the predictions did not show good results is that the water quality parameters in the raw data were not measured consistently over standard timeframes. Several months have two measurement results in the raw data, while other months have only one. For some months, no measurement data are available at all. Therefore, it is likely that the measured pollutant matrix also differs.
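The two steps above, fitting the regression coefficients and then substituting predictor values into the equation, can be sketched as follows. The study used RStudio; this Python version is only an illustration that solves ordinary least squares through the normal equations. The function names `fit_mlr` and `predict` are hypothetical, and the rows are synthetic values generated from known coefficients, not PAM Jaya measurements.

```python
def fit_mlr(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X is a list of rows of predictor values, y the list of responses.
    Returns [b0, b1, ..., bk], where b0 is the intercept.
    """
    rows = [[1.0] + [float(v) for v in row] for row in X]  # prepend intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    # Gaussian elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(A[k][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; coefficients are not unique")
        A[col], A[pivot] = A[pivot], A[col]
        for k in range(col + 1, p):
            factor = A[k][col] / A[col][col]
            for c in range(col, p + 1):
                A[k][c] -= factor * A[col][c]
    # Back substitution
    beta = [0.0] * p
    for i in reversed(range(p)):
        known = sum(A[i][j] * beta[j] for j in range(i + 1, p))
        beta[i] = (A[i][p] - known) / A[i][i]
    return beta

def predict(beta, row):
    """Substitute one row of predictor values into the fitted equation."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))

# Synthetic rows of [TDS, SO4, Fluoride] (assumption, not study data)
predictors = [[120, 15, 0.20], [135, 18, 0.30], [150, 16, 0.25],
              [160, 22, 0.40], [180, 20, 0.35], [140, 25, 0.30]]
cod = [2 + 0.05 * t + 0.1 * s + 3 * f for t, s, f in predictors]
beta = fit_mlr(predictors, cod)
print([round(b, 3) for b in beta], round(predict(beta, [150, 20, 0.3]), 3))
```

Because the synthetic responses are generated without noise, the fit recovers the generating coefficients up to rounding error; real measurements would leave residuals, which is what the comparison charts in the figures visualize.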

MLR Model Validations using t-Test
The means of the actual and predicted COD concentrations at all intake points are shown in table 6. Those values were calculated in Microsoft Excel using the average formula. Descriptively, the table shows no differences between the actual and predicted values at any sampling point. To test the hypothesis statistically, the t value is examined. The R package used for the t-test is ggpubr. Table 7 shows the t values obtained from the t-test. At all intake points, the calculated t values are less than the critical values from the t-distribution table. Therefore, Ha is rejected and H0 is accepted, which means there are no differences between the means. Since there are no differences between the means, this indicates no effect of the predictor variables on COD concentration as the response variable. Thus, the prediction is not reliable.

Conclusions
Based on the study's result, the parameters that are selected to be the predictor of