Multivariate monthly water demand prediction using ensemble and gradient boosting machine learning techniques

Abstract — Water management planning requires reliable and accurate water demand forecasting. Water demand prediction is affected by variables such as climatic, socio-economic, and demographic data. This paper investigates urban monthly average water demand prediction with classical, ensemble, and gradient boosting-based machine learning models, using the available monthly water demand, climatic, economic, and demographic data. Three train-test data split schemes on the water demand timeseries were considered to determine the effect of data size on water demand prediction. Sensitivity analysis was employed to reduce input feature dimensionality while maintaining model accuracy. A univariate timeseries (water demand only) produced R2 scores up to 0.91, which increased to 0.94 with the addition of calendar and climatic features. Increasing the training data size from 70% to 90% improved the RMSE and MAE scores of the ensemble and gradient boosting methods, with the random forest and the AdaBoost models showing improvements of up to 69%. The sensitivity analysis revealed a successful input reduction scheme from a potential 17 input attributes to seven inputs. Gradient boosting models showed robust and faster execution times, especially with the increase in training data, which is attractive for medium-term urban water demand forecasting.


I. INTRODUCTION
Water demand prediction is an essential paradigm for water resource planning and utilisation in the water sector. It can strategically inform the development and expansion of water sources and treatment plants for present and future generations. Water utility authorities make strategic decisions from water demand forecasting models. Many countries in the world are expected to face water shortages. Global population growth, climatic change, and economic advancement continue to pressure the already scarce water resource. These factors point to the need for careful planning and management of the available water resources [1]. Knowledge-driven methods were used to develop short-term water demand prediction models using temperature and other weather forecasting information [2,3]. Knowledge-driven methods consider explaining and accounting for the factors that affect water demand, such as population, economy, temperature, and rainfall, among others. Data-driven techniques, broadly categorised into traditional and artificial intelligence-based (A.I.) procedures, account for most water demand modelling studies.

A. AI-based water demand prediction studies
Machine learning techniques provide a useful approach to water demand prediction by learning the structures in the input and output data patterns from historical data.

B. Classic and hybrid ANN
A wavelet analysis on the water demand timeseries combined with an ANN (wavelet-ANN model) outperformed the MLR, multiple nonlinear regression (MNLR), ARIMA, and ANN for daily water demand forecasts using weather variables [4]. Rainfall amount was a useful variable for weekly municipal water demand forecasting using the ANN model [5]. A seasonal wavelet transform-based model, combined with ANN, was effective in monthly water consumption prediction, compared to a discrete wavelet transformed ANN [6]. Prediction models were developed for monthly water demand forecasts for a city in Turkey, using neural network variations on eight-year-long datasets, with correlations between actual and predicted values reaching as high as 0.78 [7]. Neural network variations and K-nearest neighbour techniques were used for daily, weekly, and monthly water demand predictions, using nine-year-long datasets in Iran [8]. A constant rate model [9] was successfully used to model water demand in the United Arab Emirates. A dynamic ANN using weather variables outperformed the ARIMA and traditional ANN models for hourly, daily, weekly, and monthly urban water demand forecasting [10].

C. Fuzzy and hybrid systems
A 9-year water consumption profile was used for water consumption prediction for Istanbul city using a fuzzy logic system that performed better than the ARIMA model [11]. Fuzzy cognitive maps outperformed the ARIMA, linear regression, Holt-Winters, and ANN models in daily water demand forecasting [12]. A parallel adaptive weighting strategy [13] was effective in short-term water demand forecasting. A successful fuzzy inference system tool was designed for monthly agricultural and rural water reuse [14]. A hybrid fuzzy neural network model was proposed in [15] to forecast daily water consumption in Tehran. An ANFIS model was proposed to predict the hourly water demand in northern China [16]. Fuzzy and neuro-fuzzy models were developed for daily water demand forecasting in Tehran using climatic and calendar variables [17].

D. Deep learning techniques
Genetic programming techniques in water demand forecasting were reviewed in [18]. Extreme learning machines were used for short-term urban water demand forecasting [19], while a deep belief network proved useful for daily forecasting of water demand [20]. Most recently, hourly and daily urban water demand forecasting was conducted using the popular long short-term memory (LSTM) deep learning model [21]. The review paper [22] summarises the available studies that have employed traditional and AI-based techniques.

E. Contribution of the study
Ensemble and gradient boosting tree-based machine learning methods have demonstrated great success in different application domains [23]. However, these ensemble and gradient boosting methods have been scarcely applied for short-term water demand modelling [16], [17], [18], and rarely applied for medium-term (monthly) water demand forecasting. Despite the popularity of deep learning models, machine learning models are still competitive, particularly where training data is small, training time must be short, and parameter tuning requirements are minimal. This study discusses ensemble learning tree-based algorithms for monthly water demand prediction. Water demand data is challenging to collate because of financial and time constraints; thus, determining the most critical inputs for creating accurate water demand models is an invaluable process. In this study, in addition to developing robust water demand prediction models, a well-defined input selection process that aids model training efficiency while maintaining prediction accuracy is explained.

II. PREDICTIVE ALGORITHMS FOR WATER DEMAND FORECASTING
This section gives a brief overview of the machine learning algorithms used for water demand prediction. Prediction of water demand uses regression analysis to determine the input-output relationship, that is, to ascertain the relationship between the independent and the dependent variables. Six models comprising classical, ensemble, and gradient boosting machine learning algorithms, namely, multiple linear regression (MLR), decision tree, random forest (R.F.), AdaBoost, LightGBM, and XGBoost, were used to map the relationship between inputs and outputs. The need to explore more algorithms and to improve water demand prediction accuracy was the primary consideration for selecting these models. Furthermore, the encouraging prediction performance results in similar water demand studies [32-35] inspired further investigation of these and other models. A flow chart of the monthly average water demand modelling process is shown in Fig. 1. It begins with data collection, understanding the data, visualising the timeseries, and drawing up data summary statistics. The next step is to pre-process the data to get it ready for modelling, which comprises steps like standardising the data and converting the timeseries to supervised learning. The model selection step includes deciding on the model parameters, the input sensitivity analysis, and the model's training and testing. Finally, the model is evaluated, and if the results are not satisfactory, the process reverts to sensitivity analysis again (dashed line).

A. Random forest model
Random forest (R.F.) is an ensemble-based learning algorithm used for regression and classification problems [24]. R.F. is an ensemble of models that uses the decision tree approach. R.F. overcomes the decision tree's instability problem by using a set of trees rather than a single tree for prediction. Initially, an individual tree is trained on a random subset of the observations. A random subset of the variables is then considered at each split decision, thus creating a diverse set of trees essential for improving the ensemble model's accuracy. For an n-dimensional input vector X = (x1, x2, …, xn), the ensemble of H trees produces predictions Y'h(X), h = 1, …, H, where Y'h is the prediction of decision tree number h. The final prediction Y' is the average score of all the single randomly generated decision trees:

Y' = (1/H) Σh=1..H Y'h(X)
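The ensemble average above reduces, in code, to a mean over the H single-tree predictions. A minimal numpy sketch with hypothetical tree outputs (the values are illustrative only, not taken from the study):

```python
import numpy as np

# Hypothetical predictions (in ML) from H = 4 individual trees for one input X.
tree_predictions = np.array([710.0, 738.5, 725.0, 746.5])

# Random-forest regression output: the mean of the single-tree predictions.
y_ensemble = tree_predictions.mean()
```

Averaging across diverse trees is what dampens the variance of any single unstable decision tree.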

B. Adaptive Boosting (AdaBoost) ensemble model
Boosting in ensemble learning attempts to create more robust models from several weaker regressors or classifiers. To correct the first model's errors, boosting creates more models until the maximum number of models is reached or the required training accuracy is attained. AdaBoost is used to boost the performance of any machine learning model, primarily weak learners [25]. Decision trees are the popular candidate models considered for boosting and are adopted in this study.
Given a training set {(xi, yi)}, with the ith response value yi ∈ ℝ, i = 1, …, N, the final AdaBoost prediction is the weighted median of the base learners [26]:

F(x) = inf{ y ∈ ℝ : Σm: hm(x) ≤ y αm ≥ (1/2) Σm=1..M αm }

where inf represents the infimum of the resulting set, hm is the mth base learner, and αm is the coefficient of the mth base learner.

C. Light Gradient Boosting (LightGBM) ensemble model
The LightGBM algorithm is another gradient boosting decision tree (GBDT) technique that determines the split value by using gradient-based one-side sampling (GOSS) and exclusive feature bundling [27]. The LightGBM algorithm grows its trees leaf-wise (best-first), in which the leaf with the maximum delta loss is chosen to grow. Leaf-wise tree growth differs from the depth-wise tree growth implemented by conventional ensemble techniques in that it does not calculate the information gain of all the samples, which allows faster convergence and improved accuracy. However, leaf-wise tree growth often results in overfitting when data is small. There is a need to guard against overfitting by adjusting parameters, for example, the minimum data in a leaf. The optimisation method of the LightGBM algorithm retains some of the smaller-gradient samples while keeping all larger-gradient samples (one-side sampling), which reduces the search space and strengthens the information gain contribution of the smaller-gradient samples. A more detailed explanation of the operation of the LightGBM algorithm is given in [27].

D. Extreme Gradient Boosting (XGBoost) ensemble model
XGBoost is an alternative framework for implementing gradient-boosted trees that uses histogram-based statistics to represent discretised feature values [28]. XGBoost is popular in algorithm competitions, such as Kaggle, because it is a fast, scalable, and efficient stochastic gradient tree boosting algorithm. In this study, the XGBoost algorithm is adopted for timeseries forecasting. The XGBoost model is an ensemble in which each new tree aims to correct the errors of the other trees within the model. While the gradient boosting model uses only the first-order derivative information during model training, XGBoost performs a more accurate second-order Taylor expansion of the cost function, leveraging both the first-order and second-order derivative information. The XGBoost model also has well-defined strategies for processing missing values, making it robust, and the use of the L1 and L2 regularisation terms ensures stronger generalisation and prevents overfitting. A simplified XGBoost objective function at step t, which is a sum of quadratic functions of one variable, can be written as [28]:

L(t) ≈ Σi=1..N [ gi ft(xi) + (1/2) hi ft2(xi) ] + Ω(ft)

where gi and hi are the first and second-order gradient statistics on the loss function, ft is the function (tree) added at step t, xi represents the input observation, and Ω is the regularisation term.

III. DATA COLLECTION
Melbourne is a coastal city located in southeastern Australia, with a population of almost 5 million people covering an area of 9,992 km2 of land. Melbourne's water is mainly sourced from protected catchments and the state's forests. According to the Köppen climate classification, Melbourne has a temperate oceanic climate. The city warms up in summer, with mean temperatures between 14 and 25.3 °C, while winter averages range from 6.5 to 14.2 °C.
Melbourne's water demand (W.D.) dataset does not contain missing values; all the calendar timestamp values are present. A total of twelve potential climatic and economic input variables were investigated to test their effect on water demand prediction. These input variables are temperature (T) in °C, rainfall (Rn) in mm, solar radiation (Rs) in MJ/m2, evaporation (Ev) in mm, water price in $, gross domestic product (GDP), water restriction levels 1-4 (RL1, RL2, RL3, and RL4), population size (Pop), and the previous month's water demand. In addition to these input variables, calendar inputs (Cals), comprising the quarter of the year, the month of the year, the day of the week, the day of the year, and the week of the year (a total of 5 inputs), were also included to bring the total number of inputs to 17 variables. The output is the current water demand value.

A. Dataset visualisation
This section visualises the water demand timeseries' temporal structure and the distribution of the observations using plots, to derive useful insights for the modelling process, such as model choice. Fig. 2 shows the decomposed water demand timeseries and its elements: the trend, seasonality, and residuals. The line plots show an undulant water demand profile each year, with high demand observed towards the end and beginning of the year. In the trend plot, an increasing trend can be seen up to 1997, after which a decreasing trend takes precedence until 2011. After 2011, a rising water demand trend again begins to manifest. A robust seasonal fluctuation is evident in the timeseries, while the residuals are highly irregular in pattern; all these factors present challenges during the modelling process. The spike in 1997 may be attributed to the start of a drought in the city, leading to the high water demand.
The available dataset ranges from January 1983 to December 2016, equivalent to 34 years of monthly average observations (408 monthly observations).
Melbourne's mean water demand was 744.3 ML, with a maximum water demand of 1325 ML obtained during the peak of the summer season. The lowest water demand value of 500.8 ML was recorded during the winter season. The heaviest average monthly rainfall event recorded 186.8 mm, while the peak air temperature was 30.50 °C.

B. Selection of effective lags for water demand modelling
Formally, for data values Y1, Y2, …, YN, the k-period (or kth) lag of the value Yi is the value Yi-k that occurred k time points before time i:

lagk(Yi) = Yi-k

A lag plot, shown in Fig. 3, is a timeseries visual analysis technique used to identify effective lags. This lag plot shows a scatter plot of the water demand timeseries against its lagged values. The lag plot can check for correlation between this month's water demand and the previous lagged months and can also reveal outliers. The correlation between water demand and its lagged values can be positive or negative, while an unidentifiable pattern represents no relationship. Lags 1, 12, and 24 in Fig. 3 show moderate correlation, while lag 6 shows a weak correlation to the water demand timeseries (yt).
Following the presented visual analysis technique, lags 1, 12, 24, 36, and 48 were selected for use as inputs and were combined with the climate and calendar inputs.
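The visual lag analysis can also be checked numerically: the Pearson correlation between the series and its k-step lagged copy quantifies what the lag plot shows. A minimal sketch on a toy seasonal series with a 12-month period mimicking the annual cycle (the real water demand values are not reproduced here):

```python
import numpy as np

def lag_correlation(y, k):
    """Pearson correlation between a series and its k-step lagged copy."""
    return np.corrcoef(y[k:], y[:-k])[0, 1]

# Toy monthly series with a 12-month seasonal cycle.
t = np.arange(120)
demand = np.sin(2 * np.pi * t / 12)

corr12 = lag_correlation(demand, 12)  # aligned with the annual cycle
corr6 = lag_correlation(demand, 6)    # half a cycle out of phase
```

For a purely seasonal series, lag 12 correlates strongly and positively while lag 6 correlates negatively; on the real data, the moderate lag-1, 12, and 24 correlations similarly justified their selection as inputs.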

C. Variable correlation and feature selection
The Pearson correlation coefficient [29] describes the linear relationship between two Normally distributed datasets. The coefficients vary between -1 and +1, and a zero value implies no correlation, while correlations of -1 and +1 point to an exact linear relationship. Positive correlations suggest an increase in x with an increase in y and vice versa.
Both positive and negative correlations exist among the available input datasets, with rainfall, water price, GDP, and population being negatively correlated to water demand. Evaporation and temperature values have the highest correlations to water demand. The results from the correlation table assist in input selection and dimensionality reduction.
Redundant features in the dataset can negatively influence model accuracy and training time and result in overfitting. There is a need to select features that significantly affect the prediction output of choice. There is a strong correlation among climatic variables, such as solar radiation and evaporation, temperature and evaporation, and temperature and solar radiation. This high correlation necessitates using just one of the three inputs; using all three introduces redundancy, as these inputs have an almost identical effect on water demand. Similarly, strong correlations exist between population and GDP and between price and population, necessitating the selection of one representative input from each pair. Overall, from 5 highly correlated potential variables, only 2 (population and temperature) were chosen as representative variables due to the relative ease of obtaining those data.
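The pruning rule described here — keep one representative from each group of mutually highly correlated inputs — can be sketched with pandas on toy data. The 0.9 threshold is an assumption for illustration; the paper does not state a numeric cut-off:

```python
import numpy as np
import pandas as pd

def drop_correlated(df, keep, threshold=0.9):
    """Drop columns whose absolute Pearson correlation with the kept
    representative column exceeds the threshold."""
    corr = df.corr().abs()
    redundant = [c for c in df.columns
                 if c != keep and corr.loc[keep, c] > threshold]
    return df.drop(columns=redundant)

# Toy stand-ins: Ev is a linear function of T (perfectly correlated),
# Rn varies independently of T.
t = np.arange(100, dtype=float)
df = pd.DataFrame({
    "T": np.sin(t),
    "Ev": 2 * np.sin(t) + 1,
    "Rn": np.cos(3 * t),
})
reduced = drop_correlated(df, keep="T")
```

Here Ev is dropped as redundant with the chosen representative T, while the weakly related Rn survives — mirroring the temperature/evaporation/solar-radiation reduction described above.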

D. Re-framing series to supervised learning
The water demand timeseries dataset can be restructured into a supervised learning problem using the Pandas shift function [30], which takes in the values 1, 12, 24, 36, and 48 (previous time step values) to predict the next month's water demand value (next time step value). The Pandas shift function helps create columns pushed forward or backward for both lag observations and forecast observation.
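The re-framing step above can be sketched directly with the Pandas shift function; column names and the toy series are illustrative only:

```python
import pandas as pd

def series_to_supervised(series, lags=(1, 12, 24, 36, 48)):
    """Re-frame a univariate series into a supervised-learning table.

    Each lag k becomes an input column holding the value observed k steps
    earlier (via shift); the unshifted series is the prediction target.
    """
    df = pd.DataFrame({"WD(t)": series})
    for k in lags:
        df[f"WD(t-{k})"] = df["WD(t)"].shift(k)
    # Rows earlier than the largest lag contain NaNs and are dropped.
    return df.dropna()

demand = pd.Series(range(60), dtype=float)  # toy stand-in for monthly demand
table = series_to_supervised(demand)
```

Each surviving row then pairs the lagged observations (inputs) with the current month's demand (output), ready for any of the regression models above.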

E. Water demand timeseries training and testing set split
Timeseries forecasting involves making predictions into the future, with the training and testing sets used for model creation and model evaluation, respectively. The test set should comprise the most recent data to simulate the natural real-world set-up. In a timeseries, random sampling is not ideal as it creates a look-ahead scenario, where future water demand observations may be contained in the training set. Random splitting also negates the temporal component embedded in the timeseries. Therefore, timeseries data should be split following the temporal order. Multiple train/test split strategies of 70-30, 80-20, and 90-10, following the water demand timeseries' temporal order, are adopted in this study. This strategy provides a more robust estimate of the expected model performance across the various data sizes.
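A chronological split amounts to slicing the ordered series at a cut point, never shuffling. A minimal sketch (truncating the boundary index to an integer is one common convention; the paper does not state how the boundary month is rounded):

```python
def temporal_split(values, train_frac):
    """Split a timeseries chronologically: the earliest observations form
    the training set and the most recent the test set (no look-ahead)."""
    cut = int(len(values) * train_frac)
    return values[:cut], values[cut:]

# 408 monthly observations (Jan 1983 - Dec 2016), 90-10 split.
series = list(range(408))
train, test = temporal_split(series, 0.9)
```

Every training observation precedes every test observation, so the model is always evaluated on months it has never seen.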

F. Data transformation
The considerable differences in numerical ranges between the input and output values require that the dataset be standardised. Standardisation scales each feature such that its distribution is centred around 0, with a standard deviation of 1. Standardisation allows comparability among input datasets, and it also enhances the training efficiency of models since the numerical condition of the optimisation is improved. As such, the mean and standard deviation for each feature are calculated, then the feature is scaled based on (5):

z = (xi − µ) / σ (5)

where z is the standardised value, xi is the observed water demand value, µ is the mean, and σ is the standard deviation of the water demand dataset.
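Equation (5) can be applied per feature in a few lines of numpy (the demand values below are toy numbers for illustration, drawn loosely from the summary statistics reported earlier):

```python
import numpy as np

def standardise(x):
    """Scale a feature to zero mean and unit standard deviation, per (5)."""
    mu, sigma = x.mean(), x.std()
    return (x - mu) / sigma

demand = np.array([744.3, 1325.0, 500.8, 810.0, 650.5])  # toy ML values
z = standardise(demand)
```

In practice the mean and standard deviation must be computed on the training set only and then reused to scale the test set, so that no test-set information leaks into training.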
All machine learning algorithms were implemented in the Python programming language using the Scikit-Learn library [31]. All model development and experimental tasks were conducted on a Windows operated computer (Intel Core i5 2.40GHz 8GB RAM).
Assessment of the models' skill is done using the mean absolute error (MAE), root-mean-square error (RMSE), and R-squared (R 2 ) [10] evaluation metrics.
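The three metrics can be computed directly (equivalent functions exist in scikit-learn's sklearn.metrics); a self-contained numpy sketch on toy values:

```python
import numpy as np

def mae(y, yhat):
    """Mean absolute error."""
    return np.mean(np.abs(y - yhat))

def rmse(y, yhat):
    """Root-mean-square error."""
    return np.sqrt(np.mean((y - yhat) ** 2))

def r2(y, yhat):
    """Coefficient of determination (R-squared)."""
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
scores = (mae(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
```

Lower MAE and RMSE indicate better fits, while R2 approaches 1 for a perfect prediction, which is how the split and input-combination results below are compared.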

IV. RESULTS AND DISCUSSION
The need to discover more efficient and accurate water demand forecasting models motivated the selection and testing of the developed machine learning models for the average monthly water demand prediction task. Data were available from January 1983 to December 2016, from which three splits of training and testing data were created: 70-30, 80-20, and 90-10. Seventeen potential inputs were tested for their effect on water demand, and the obtained results are discussed in this section. Following a series of extensive experiments to identify the most significant lag, the study noted that lag 12 produced the best performance according to the set evaluation metrics. The results discussed in this section are for lag 12 only, to allow for brevity and adequate discussion of the findings.

A. Input sensitivity analysis
An understanding of the essential input datasets for accurate water demand prediction is crucial for efficient model performance and for reducing data collection costs. Six input combinations were selected and tested. Input combination (i) simulates the limited data scenario where only the water demand variable is available. Input combination (ii) takes advantage of feature engineering, wherein calendar inputs are generated in addition to the water demand values. The effect of temperature is tested in input combination (iii), which simulates the most common and readily available input combination. The impact of using climatic variables only is tested in (iv). The effect of dimensionality reduction is tested in input combination (v), where only those inputs with no strong correlations with other inputs are considered. Temperature has high correlations with evaporation and solar radiation, and so do population and water price, in which case only temperature and population are chosen as representative inputs. Lastly, input combination (vi) uses all available datasets regardless of any relations. Fig. 4 shows the relationships between input size and the prediction RMSE values, with the number of inputs on the x-axis and RMSE on the y-axis.
Apart from the MLR model, all other models show minimal variation in RMSE scores (model performance) with the increasing number of inputs, as seen in Fig. 4. The MLR model scored its highest performance with one input (previous W.D. at lag 12) for all tested training dataset sizes; any further addition of inputs resulted in a deterioration of model performance. The MLR model captures the linear relationship between inputs and outputs with input combination (i).
Further addition of other variables resulted in degradation of model performance, so it is advisable to use input combination (i) only when considering the linear regression model. This finding points to nonlinear relationships between the inputs and water demand. The rest of the models record their lowest RMSE values between seven inputs (i.e. combination (iii) of W.D., Cals, and T) and nine inputs (i.e. combination (iv) of W.D., Cals, T, Rs, and Ev) regardless of training data size. The best performing model (LightGBM) has input combination (iv) of W.D., Cals, T, Rs, and Ev, with the lowest RMSE score (20.86 ML) and the highest R2 value (0.94). This result is consistent with the 70-30 and 80-20 testing splits. However, for the LightGBM, an input combination without the Rs and Ev inputs suffers a loss in performance of 7% in RMSE and 2% in MAE. Weighed against the opportunity cost of data collection and model efficiency, a model using the W.D., Cals, and T inputs is still favourable. Table II shows the models' performance results in the testing phase following the sensitivity analysis exercise, and the results confirm the benefits of eliminating redundant features. The LightGBM model and most models using all inputs record higher RMSE scores relative to fewer inputs, and this finding is consistent across all data sizes and tested models. It can be concluded that the dimensionality reduction efforts through the sensitivity analysis exercise are beneficial; there was no significant loss in accuracy from using fewer input variables.

B. Effect of data size
A large training set risks model overfitting, while a smaller training data size results in high-variance model parameters. As indicated earlier, the evaluation metrics of the three training/testing splits derived from the Melbourne City monthly water demand dataset are shown in Table II for the tested models and input combinations. A general improvement in all models' performance occurs with increasing training data size and number of input combinations. RMSE and MAE values for the random forest and AdaBoost improved by up to 69% as the training data size increased from 70% to 90%. The 90-10 training/test split strategy had the best results for all tested models and input combinations, followed by the 80-20 and lastly the 70-30 split strategy. The observation that all models' accuracy tends to increase with increased training data size may be attributed to the existence of more recent water demand observations in the training data. Thus, the later values of water demand (closer to 2016) are more relevant to predicting water demand than earlier observations. The superior LightGBM model hyperparameters that provided the best fit to the data comprised a learning rate of 0.04, with five leaves forming a tree with a maximum depth of three; a bagging frequency of five and a feature fraction of 0.7 at each iteration ensured quick convergence. The LightGBM model also demonstrated the fastest training time among the tested ensemble methods. The Random Forest and AdaBoost models offered competitive accuracy; however, they experienced longer training times than the LightGBM model, making the LightGBM model the choice model for this prediction task. Fig. 5 shows the predictions of all the models against the actual water demand values for the duration of the testing phase using the 90-10 train-test split scheme. All the models have learned the pattern of rising and falling water demand and its trends.
However, the MLR and decision tree models deviate from the actual observations, with the latter almost always underestimating the lower water demand values. On the other hand, the R.F. model overestimated the observed dataset's higher water demands. The gradient boosting-based models, i.e. the LightGBM, XGBoost, and AdaBoost models, showed robustness as they followed the shape of the actual observations more accurately throughout the dataset's length. The state-of-the-art R.F. model did show some competitive performance; however, it overestimated the peak values of water demand, as shown in Fig. 5, and could not supersede the gradient boosting models' performance at any of the considered train-test splits. The R.F. model attempts to reduce the complexity of the models (bagging), whereas gradient boosting techniques increase model complexity, which was favourable for this dataset. As a result, the gradient boosting models, by adding more complexity, superseded the R.F. model. However, because the R.F. model is more straightforward to tune than its gradient boosting counterparts, it remains a reasonable choice for this prediction task.
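The best-fit LightGBM hyperparameters reported above can be collected into a parameter dictionary using the names from the lightgbm library's API; only the configuration is shown here (the `objective` entry is an assumption of a standard L2 regression objective, which the paper does not state):

```python
# Best-fit LightGBM hyperparameters reported in the study, expressed with
# the parameter names used by the lightgbm library.
lgbm_params = {
    "learning_rate": 0.04,
    "num_leaves": 5,
    "max_depth": 3,
    "bagging_freq": 5,
    "feature_fraction": 0.7,
    "objective": "regression",  # assumption: standard L2 regression objective
}
```

Such a dictionary would typically be passed to `lightgbm.train` or `lightgbm.LGBMRegressor(**lgbm_params)` when reproducing a comparable setup.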

V. CONCLUSION
Accurate monthly water demand prediction models are crucial for strategic planning and resource allocation for city water supply. Most water demand studies have been conducted for essentially short-term water demand predictions, with limited studies examining monthly (medium-term) water demand predictions using tree-based gradient boosting methods. The existing monthly water demand studies considered mostly univariate water demand timeseries on smaller datasets because of water demand data collection difficulties. This study investigated multivariate monthly water demand prediction using three different training data sizes and, in addition, conducted a sensitivity analysis to determine the most critical factors affecting urban monthly water demand in Melbourne. Seventeen potential inputs comprising climatic, economic, demographic, calendar, and previous water demand observations were investigated for their effect on water demand prediction.
The sensitivity analysis was conducted for input dimensionality reduction and model training efficiency improvement, leading to data collection cost reduction. The sensitivity analysis exercise determined that input combinations comprising the previous year's water demand observations, calendar inputs, and climatic inputs were essential for improving the monthly water demand prediction. The gradient boosting group of machine learning models, mainly the LightGBM model, offered efficient, robust, and satisfactory accuracy, with the R2 score reaching 0.94. The study observed that, for the city of Melbourne, more recent observations of water demand and climatic inputs (closer to the year 2016) are critical for improving monthly water demand prediction. In situations where only Melbourne's monthly water demand observations are available, reliable monthly water predictions with R2 scores of up to 0.91 are still possible, particularly with the gradient boosting LightGBM and XGBoost models. The development of monthly water demand prediction models is greatly hindered by data availability and collation, which has limited research advancement in medium to long-term water demand prediction studies. Future work will investigate transfer learning, which allows the transfer of information from water demand data-rich cities to cities with limited data to enhance the prediction accuracy of models for the latter.