Multivariate data mining for estimating the rate of discoloration material accumulation in drinking water distribution systems
Authors: Mounce, S. R., Blokker, E. J. M., Husband, S. P., Furnass, W. R., Schaap, P. G., Boxall, J. B.
Full Paper




Particulate material accumulates over time as cohesive layers on internal pipeline surfaces in drinking water distribution systems (WDS). When subsequently mobilised, this material can cause discoloration and other customer-impacting water quality issues. This paper explores some of the factors that are known or suspected to be involved in the accumulation process by applying data-driven techniques to significant amounts of real-world field data. The analysis of such data is challenging given the complexity of the underlying processes and the inability to directly observe and measure those processes. Two complementary machine learning methodologies are applied for multivariate data mining of observed phenomena from both a qualitative and a quantitative perspective. Firstly, Kohonen self-organizing maps (SOM) were used for integrative and interpretative multivariate data mining of the potential factors affecting accumulation. The visual output of the SOM analysis provides a rapid and intuitive means of examining covariance between variables and exploring hypotheses for increased understanding of the regeneration phenomena across datasets. Secondly, Evolutionary Polynomial Regression (EPR), a hybrid data-driven technique that combines genetic algorithms with numerical regression, was applied to develop simple and easily interpretable mathematical model expressions. Multiple models are generated by simultaneously optimizing fitness to training data and parsimony of the resulting mathematical expressions (in terms of number of terms and equation complexity). EPR was used to explore novel, simple expressions that capture and highlight the important factors in the accumulation rate of discoloration material, based on flushing programme data. Three case studies are presented at two scales: one UK national study and two detailed Dutch local studies.
The results highlight bulk water iron concentration, pipe material and looped network areas as key descriptive parameters for the UK national scale study. At the local level, a significantly larger data set for the third case study allowed k-fold cross-validation. The mean cross-validation Coefficient of Determination (CoD) was 0.945 for training data and 0.930 for testing data for an equation that uses the amount of material mobilised and soil temperature to estimate the daily regeneration rate. The approach shows promise for ultimately developing transferable expressions (for example, incorporating pipe diameter) usable for proactive WDS management.
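
The k-fold cross-validation of a fitted expression mentioned above can be illustrated with a minimal sketch. It is not the EPR procedure and does not use the study's data: the one-term model form, the coefficient 0.02, the variable ranges and the noise level are all hypothetical choices made purely for illustration. The sketch fits a single coefficient by ordinary least squares on each training fold and reports the mean CoD for training and testing data.

```python
import random


def fit_coefficient(features, targets):
    """Least-squares coefficient a for the one-term model y = a * f."""
    numerator = sum(f * y for f, y in zip(features, targets))
    denominator = sum(f * f for f in features)
    return numerator / denominator


def coefficient_of_determination(targets, predictions):
    """CoD = 1 - SSE / SST, with SST taken about the mean of the targets."""
    mean_y = sum(targets) / len(targets)
    sse = sum((y - p) ** 2 for y, p in zip(targets, predictions))
    sst = sum((y - mean_y) ** 2 for y in targets)
    return 1.0 - sse / sst


def k_fold_cod(features, targets, k=5, seed=0):
    """Return (mean training CoD, mean testing CoD) over k shuffled folds."""
    indices = list(range(len(targets)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    train_scores, test_scores = [], []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        # Fit on the training folds only, then score both subsets.
        a = fit_coefficient([features[i] for i in train],
                            [targets[i] for i in train])
        for subset, scores in ((train, train_scores), (fold, test_scores)):
            observed = [targets[i] for i in subset]
            predicted = [a * features[i] for i in subset]
            scores.append(coefficient_of_determination(observed, predicted))
    return sum(train_scores) / k, sum(test_scores) / k


# Hypothetical synthetic data: NOT the field data analysed in the paper.
rng = random.Random(42)
material = [rng.uniform(1.0, 10.0) for _ in range(100)]      # material mobilised
temperature = [rng.uniform(5.0, 20.0) for _ in range(100)]   # soil temperature
feature = [m * t for m, t in zip(material, temperature)]     # assumed model term
rate = [0.02 * f + rng.gauss(0.0, 0.05) for f in feature]    # assumed regeneration rate

train_cod, test_cod = k_fold_cod(feature, rate, k=5)
print(f"mean training CoD: {train_cod:.3f}, mean testing CoD: {test_cod:.3f}")
```

A testing CoD close to the training CoD, as reported in the abstract, suggests that the expression generalises to held-out data rather than merely fitting the training folds.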