Elsevier

Energy and Buildings

Volume 150, 1 September 2017, Pages 432-446
Engineering Advance
Building performance evaluation through a novel feature selection algorithm for automated ARX model identification procedures

https://doi.org/10.1016/j.enbuild.2017.06.009

Highlights

  • We developed a novel heuristic algorithm called GCS (Greedy Correlation Screening) for the feature selection problem in ARX models.

  • The performance of the GCS algorithm is compared with GRASP and GA.

  • We propose a method to select the best ARX model from a list of good candidates.

  • The methods are tested by identifying flow rate models of a heating system in a real case study.

Abstract

ARX models are an effective instrument to evaluate continuous building performance from insufficient monitoring data. However, selecting the right model features is NP-hard. The problem of finding a minimal subset of informative inputs has been studied extensively in various fields, but automatic, fast, and reliable procedures for finding optimal models for building performance evaluation are still missing. We propose a novel feature selection algorithm named Greedy Correlation Screening (GCS), which identifies one possible solution at a time by greedily maximizing the correlation between inputs and output and minimizing cross-correlations between inputs. These two objectives compete, so the result is a set of best trade-offs rather than a single optimum. Among these, the best model is automatically selected by applying filters and quality criteria such as the adjusted coefficient of correlation and non-correlation of residuals.

The performance of the proposed heuristic method is compared to two of the best algorithms used in the field: GRASP for feature selection and the Non-dominated Sorting Genetic Algorithm NSGA-II. The application to a real case study demonstrates that the proposed method solves the problem of feature selection in building performance estimation efficiently and reliably. Moreover, the model creation is automatic, making it ideal for integration into a Building Management System (BMS) in order to detect faults and perform short-term predictive control.

Introduction

Buildings rarely perform as intended. It is well known that buildings have a high impact on overall energy consumption and CO2 emissions. Since 1990, final energy consumption in the European building sector has increased by around 1.5%/year for non-residential buildings and by 0.6% for households [1]. In the last few years, the use of Information and Communication Technology (ICT) devices together with Building Automation and Control Systems (BACS) and Technical Building Management (TBM) has proven to be a promising way to decrease energy consumption, especially in existing buildings. Several European projects report the benefits of installing ICT devices to reduce the energy consumption of buildings. All of them confirm that installing a monitoring system makes it possible to assess real building performance and to achieve thermal energy and electricity savings of up to 20% after appropriate system optimization. Moran et al. [2] review and analyze results from EU pilot projects dealing with the application of ICT in buildings to reduce energy consumption. The authors show that 52 pilot sites, i.e. almost 50% of the analyzed case studies, achieved the energy saving target of 20%. They found that this value depends on several factors, such as the level of tenant motivation, perceived thermal comfort, quality of social interaction and communication, and ICT support.

Aghemo et al. [3] found that energy consumption for heating and cooling in a historic building can be reduced by up to 60% by designing smart ICT-based services to monitor and control environmental conditions, energy load and plant facilities. The same research, focusing on 10 offices, showed that lighting savings of up to 32% are achievable after the deployment of an automated lighting control system. The work of Roisin et al. [4] showed that lighting energy savings ranging between 41% and 61% are possible in an office building by applying different control systems. In this case, the variation is due to the location and orientation of the offices. Ippolito et al. [5] studied the application of different classes of ICT in residential buildings according to UNI EN 15252 [6], concluding that the application of BACS and TBM can improve the energy performance of buildings by up to 15% and that their installation is worthwhile if the building has a high energy consumption.

In a broader view, the use of ICT in buildings could be part of the commissioning process, specifically continuous commissioning: a process to make buildings work properly, solving operating problems, improving comfort, optimizing energy use and identifying strategies to reduce energy consumption. In Bynum et al. [7] an automated Building Commissioning Analysis Tool is presented. By monitoring the most important energy flows of the building and applying a top-down approach, the authors showed that it is possible to detect faults in the systems or in the forecast of the building energy consumption. In other studies [8], [9], [10], [11], a continuous commissioning methodology has been applied finding considerable energy savings, from 5% to 30%, through monitoring and improvement of control strategies.

One of the first steps in the cited studies is the creation of a model for estimating building performance in order to quantify energy savings and comfort improvements generated by a change in the control of the building systems, as described by the International Performance Measurement and Verification Protocol (IPMVP® [12]). This work stems from a similar requirement encountered in the SmartBuild project [13]. The main objective of the project was to quantify the energy savings achievable by optimal building control through ICT devices installed in an office building in San Michele all’Adige, located in the Province of Trento (IT). To improve the control of the heating equipment for each floor of the building on which a BMS was installed (two of three floors), we needed to know the energy consumption of each floor. From the bills, only an aggregated consumption at building level was available. As we were not allowed to shut down the heating system and therefore could not install a permanent flow meter, we measured flow rates for five consecutive winter days with a portable ultrasonic flow meter. With the data acquired in this way, we identified a model of the flow rate for each floor.

Three model types for building energy modelling are generally in use: statistical or black box models, physics-based or white box models, and a mix of the first two called grey box models.

Pure black box models [14], [15], [16] as we intend them here are derived from monitored time series only and do not require additional information on the building. White box models [16], [17] are based on physical principles, which require calibrated parameters to be able to accurately predict building performance. These parameters might not be readily available. Moreover, every white box model is based on assumptions and simplifications. This leads to a gap between building simulation and monitoring data.

Grey box models are derived from physical principles but contain unknown parameters that can be estimated from monitoring data. Most of the studies using grey box models on buildings have as the main purpose the prediction of energy performance of a single element, such as a single wall or facility components, or of a simplified building model [16], [18], [19], [20], [21].

In this study, we consider black box models. They can be linear or non-linear in the input variables. Among the non-linear models, Neural Networks (NNs) are widely used. Several studies [22], [23], [24] have demonstrated that NNs are highly reliable in predicting building performance. However, as shown in [25], [26], due to the complexity of generating NNs, their limited ability to explicitly identify possible relationships and their higher demand for computational resources, we consider in this paper only linear models of ARX type (autoregressive models with exogenous inputs). The latter are essentially linear regression models where the linear combination of time series consists of time-shifted and possibly transformed monitoring data. Because of their linearity, the creation of such models is simpler and faster than for non-linear models, and the model parameters are easier for the user to interpret.
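To illustrate the idea of an ARX model as a linear regression on time-shifted data, the following minimal sketch builds lagged regressors and fits them by ordinary least squares. It is not the implementation used in this paper: the function names, the lag orders `na` and `nb`, and the synthetic signals are assumptions made for the example, and only one exogenous input is considered.

```python
def build_arx_regressors(y, u, na, nb):
    """Return (rows, targets) for y[t] = sum a_k*y[t-k] + sum b_k*u[t-k] + c."""
    start = max(na, nb)  # first time index with all lags available
    rows, targets = [], []
    for t in range(start, len(y)):
        past_outputs = [y[t - k] for k in range(1, na + 1)]
        past_inputs = [u[t - k] for k in range(1, nb + 1)]
        rows.append(past_outputs + past_inputs + [1.0])  # 1.0 = constant term
        targets.append(y[t])
    return rows, targets


def least_squares(rows, targets):
    """Solve the normal equations (X^T X) beta = X^T y by Gauss-Jordan elimination."""
    p = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(p)
    ]
    for col in range(p):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][p] / aug[i][i] for i in range(p)]


# Synthetic (hypothetical) data generated by y[t] = 0.5*y[t-1] + 2.0*u[t-1] + 1.0
u = [float((2 * t) % 5) for t in range(40)]
y = [0.0]
for t in range(1, 40):
    y.append(0.5 * y[t - 1] + 2.0 * u[t - 1] + 1.0)

rows, targets = build_arx_regressors(y, u, na=1, nb=1)
coefficients = least_squares(rows, targets)  # order: [a_1, b_1, c]
```

On this noise-free data the fit recovers the generating coefficients up to rounding. With real monitoring data, a numerically more robust solver than the normal equations would normally be preferred.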

The ARX model structure is considered as the simplest one to find analytical solutions with excellent performance and an accurate description of the physical model [27]. Regression analysis is used in [28] to predict the energy consumption of a residential building, finding that this type of model had reasonable accuracy and could be implemented easily compared with other methods. Other studies consider the application of linear models, such as multiple regression models, either to evaluate the energy consumption in air-conditioned office buildings in different climates [29] or to develop energy consumption indicators for U.S. commercial buildings [30]. In these studies, multiple regression analysis is defined as a simplified version of a dynamic model aimed at predicting energy demand as a function of environmental variables lagged in time.

Even though it is well known that buildings behave non-linearly, several studies show that building performance can be identified using linear models. The first and simplest one is the energy signature of the building. Ghiaus [31] shows that the energy consumption of a building can be predicted using robust linear regression, reporting a relative error of 0.5% on the calibration data and of about 5–10% for validation on different data sets. Dong et al. [32] demonstrate that the variations in energy consumption of six commercial buildings in Singapore are related to the outdoor dry-bulb temperature and can be predicted well, at a 90% confidence level, from this variable alone. Virk [33] used stochastic multivariable models to predict both temperature and humidity in a test chamber. According to Virk “The complexity associated with nonlinear models is in general non desirable for practical purposes, a linear modelling approach often being adopted and frequently being found to be adequate”. Jimenez et al. [34] successfully used multi-output ARX models to identify the U and g values of two different building components from outdoor testing in a test cell, assuming the system to be linear and time invariant. Zmeureanu et al. [35] developed an energy rating system for existing houses using a linear modelling approach.

Building linear models with a minimum number of variables without overfitting or underfitting represents a major concern. Overfitting occurs when a model captures the noise of the training data and thus fails on validation data. Underfitting means that the model cannot fit the training data well enough.

Generally, the number of sensors installed in the building and thus the amount of data acquired depends on the type of building and technical systems, the purpose of the monitoring, the facility manager’s requests, and the available budget. However, the more information is available, the more difficult and time-consuming it is to generate a good model. There are several reasons for this: 1) some variables may be uninformative for the selected output, increasing the noise of the model; indeed, it is not possible to know a priori which variables affect the output most; 2) a high cross-correlation between variables makes the model equation more difficult to interpret; 3) a smaller set of features identified during the training period may produce better predictions during validation [36], [37] because of a lower risk of overfitting and multicollinearity; and 4) a high number of variables increases the computational time needed to create a good model. For these reasons, “feature selection”, i.e., selecting the minimum number of features required to best describe a selected output, is a crucial step in the creation of a meaningful ARX model. Feature selection is widely used in different fields [38], [39], [40], such as medicine, economics, or biology [41], [42], [43], but is rarely applied to building performance evaluation.

In the last 40 years, numerous studies on this topic have been carried out [37], [38], [40], [44], [45], [46], [47], proposing various deterministic or heuristic algorithms, each with advantages and disadvantages.

The simplest deterministic algorithm is sometimes called “brute force” because it tries all input combinations, generating $2^n$ models, and is thus prohibitive for a large number of features $n$. Heuristic algorithms try to generate a model close to the best one in a reasonable time. Several heuristic strategies are proposed in the literature. According to their search methods, they can be divided into three main branches: wrappers, filters, and embedded methods.
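To make the cost of the exhaustive search mentioned above concrete, the short sketch below enumerates every subset of a feature list. The feature names are hypothetical and serve only as an illustration; the point is that the number of candidate models doubles with every additional feature.

```python
from itertools import combinations


def all_feature_subsets(features):
    """Yield every subset of `features`, from the empty set to the full set."""
    for size in range(len(features) + 1):
        for subset in combinations(features, size):
            yield subset


# Hypothetical candidate inputs; four features give 2**4 = 16 subsets.
candidates = ["T_out", "T_in", "valve", "hour"]
n_models = sum(1 for _ in all_feature_subsets(candidates))
```

With 30 candidate features the same loop would visit more than a billion subsets, which is why heuristic strategies are used instead.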

Wrappers assign a score to a subset of features based on their predictive power. The most popular algorithms of this type are Forward Selection (FS) and Backward Elimination (BE). FS builds the model starting from a constant term and adding one variable at a time in decreasing order of correlation with the model output. The process finishes when either all variables have been included in the model or no significant improvement according to a specified criterion is obtained. On the contrary, BE starts with all variables included in the model, deleting negligible variables one at a time. Usually, the more a variable helps reduce the sum of squared errors, the more likely it is kept.

FS and BE are widely used because of their simplicity. Their major drawback is that the quality of the models created depends strongly on the order in which features are added or removed.
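The forward selection loop described above can be sketched as follows. This is a simplified illustration, not a reference implementation: the `score` callback (any model quality measure where higher is better), the `min_gain` stopping threshold, and the toy data are assumptions introduced for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def forward_selection(inputs, output, score, min_gain=0.01):
    """Add inputs in decreasing order of |correlation| with the output.

    inputs: dict mapping a feature name to its time series
    score:  callable taking a list of names, returning model quality
    Stops as soon as adding the next feature improves the score by
    less than `min_gain`.
    """
    ranked = sorted(inputs, key=lambda name: abs(pearson(inputs[name], output)),
                    reverse=True)
    selected, best = [], score([])
    for name in ranked:
        candidate = score(selected + [name])
        if candidate - best < min_gain:
            break
        selected.append(name)
        best = candidate
    return selected


# Toy data and a toy additive score, purely for illustration.
output = [1.0, 2.0, 3.0, 4.0, 5.0]
inputs = {
    "a": [2.0, 4.0, 6.0, 8.0, 10.0],
    "b": [1.0, 2.0, 2.0, 4.0, 5.0],
    "c": [1.0, -1.0, 1.0, -1.0, 1.5],
}
toy_gain = {"a": 0.6, "b": 0.2, "c": 0.001}
selected = forward_selection(inputs, output, lambda names: sum(toy_gain[n] for n in names))
```

In practice the score would come from fitting a regression model on the selected features, for example an adjusted coefficient of determination. Because features are only ever tried in one fixed order, the result depends on that order, which is the drawback noted above.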

Filters select model features according to a measure computed from the data. The measure may be information, distance, dependence, or consistency. Algorithms based on filters are Genetic Algorithms (GAs) [48], the Max-min method [47], and Branch and Bound [49].

Embedded methods try to use the advantages of both wrappers and filters. They learn which feature is the best contributor to the accuracy of the model while the model is being created [46]. They generally use an independent criterion (as in the filter approach) to select the best subsets of variables for a known cardinality. Subsequently, the optimal subset is chosen from this list by using a learning algorithm (as in wrapper approaches).

Most of the cited algorithms become prohibitive if the number of features increases. One reason is the curse of dimensionality, which states that the more dimensions a search space has, the less one can explore it fully.

Even if computation time is not an issue, evaluating model quality should not only take into account the goodness-of-fit or predictive power of the model but also residuals and multicollinearity. The latter two are essential to validate the model. Indeed, even if the adjusted coefficient of determination ($R^2_{adj}$) obtained during validation is high, the variables of the model may still be highly multicollinear. In that case, the model equation is meaningless.
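As a hedged illustration of two such quality checks, the sketch below computes the adjusted coefficient of determination using its standard textbook definition and a lag-1 autocorrelation of the residuals as a simple indicator of residual correlation. The function names and the choice of lag 1 are assumptions for this example; the criteria actually applied in this paper are described in the methodology section.

```python
def adjusted_r2(observed, predicted, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1), p = n_features."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)


def lag1_autocorrelation(residuals):
    """Lag-1 autocorrelation of the residuals; values near 0 suggest no serial correlation."""
    n = len(residuals)
    mean_res = sum(residuals) / n
    centered = [r - mean_res for r in residuals]
    numerator = sum(centered[t] * centered[t - 1] for t in range(1, n))
    denominator = sum(c * c for c in centered)
    return numerator / denominator


observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
quality = adjusted_r2(observed, predicted, n_features=1)
residuals = [o - p for o, p in zip(observed, predicted)]
serial = lag1_autocorrelation(residuals)
```

Neither number says anything about multicollinearity, which has to be checked separately on the inputs, for instance through their cross-correlations.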

To our knowledge, an automated procedure able to create ARX models for integration in a Building Management System (BMS) based on monitored values is still missing. This paper proposes such a procedure. It consists of two steps. In the first step, a novel heuristic algorithm, which we named GCS (Greedy Correlation Screening), builds potentially suitable feature sets. The second step applies various criteria to these sets to select the most suitable ARX model.

We applied the procedure on 5 days (from 5th of December, 3 pm to 10th of December, 2 pm in 2015) of monitoring data collected in the aforementioned office building. The building was a case study of the SmartBuild project [13]. We then compared the performance of the GCS with the high-performing heuristic feature selection algorithm GRASP [38], [45], [50] and the general-purpose genetic algorithm NSGA-II [51].

The model selection methodology presented in this paper is applicable in different types of buildings and for different purposes, such as control, fault diagnosis and short-term prediction. Depending on the application, the model can have various outputs, such as energy consumption, indoor temperature and humidity, or, as shown for the case study treated in this paper, the flowrates in an office building.

This paper is organized as follows. Section 2 gives a brief overview of the background of the topic, showing the problem of selecting regression variables together with feature selection algorithms (GRASP, GA and GCS). In Section 3, we describe the novel methodology to create a full range of ARX models for a problem and to select the best one according to a series of criteria. Section 4 shows how the methodology applies to the case study. Section 5 presents the obtained results, which are then discussed in Section 6. Conclusions are given in Section 7.

Section snippets

Background: the feature selection problem in regression analysis

Numerous studies focus on the problem of selecting the best subset of predictor variables in regression [44], [52], [53], [54]. Our work is rooted in a heuristic approach [38], which is based on the assumption that good multiple linear regression models have high input-output correlations and low cross-correlations between inputs. More precisely, we are trying to solve the following multi-objective optimization problem:

$$f_1(S) = \sum_{i,j \in S,\; j<i} |\rho_{ji}| \to \min, \qquad f_2(S) = \sum_{i \in S} |\rho_{0i}| \to \max$$

$\rho_{0i}$ denotes the (Pearson

Multiple linear regression models

After the approximation of the Pareto front, a multiple linear regression model is built for each point on the polygonal curve.
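The two objectives f1 and f2 from the background section and the notion of a Pareto front can be made concrete with the sketch below. It is an illustration under stated assumptions, not the GCS algorithm: the correlation matrix is a made-up example, index 0 is taken to be the output (following the ρ0i notation), and the front is obtained by a plain pairwise dominance check over a few hand-picked subsets.

```python
def objectives(subset, rho):
    """Return (f1, f2) for a set of input indices.

    rho: correlation matrix as nested lists; row/column 0 is the output,
         rows/columns 1..n are the inputs.
    f1:  sum of absolute cross-correlations between selected inputs (minimize)
    f2:  sum of absolute input-output correlations (maximize)
    """
    s = sorted(subset)
    f1 = sum(abs(rho[j][i]) for pos, i in enumerate(s) for j in s[:pos])
    f2 = sum(abs(rho[0][i]) for i in s)
    return f1, f2


def pareto_front(points):
    """Keep the (f1, f2) points not dominated by any other point."""
    def dominates(p, q):
        # p is at least as good in both objectives and differs from q.
        return p[0] <= q[0] and p[1] >= q[1] and p != q

    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical correlation matrix: output (0) and two inputs (1, 2).
rho = [
    [1.0, 0.9, 0.8],
    [0.9, 1.0, 0.5],
    [0.8, 0.5, 1.0],
]
points = [objectives(s, rho) for s in ([1], [2], [1, 2])]
front = pareto_front(points)
```

Here the subset containing only input 2 is dominated by the subset containing only input 1, which has the same cross-correlation cost but a higher input-output correlation. The other two subsets are trade-offs and both remain on the front.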

Broadly speaking, regression analysis is a methodology to find the linear relationship between a selected output (or response), which is the dependent variable, and inputs (or predictors), which should be as independent of each other as possible. In buildings, variables describing a specific process, such as energy consumption for heating and

Application to a real case study

The methodology has been tested using data from a monitored laboratory and office building in San Michele all’Adige, Province of Trento, Italy, built in 1874 and renovated in 2000. The building has three floors of 425 m² each. The ground floor hosts laboratories, the first and second floor offices. The façade consists of thick lime and stone walls, with double-glazed wooden frame windows. A monitoring system has been installed to evaluate the Indoor Environmental Quality (IEQ) and the energy

Analysis of monitoring building data

This section presents the results obtained using the methodology introduced in the previous sections. The flow rate profiles of the 1st and 2nd floor are depicted in Fig. 6. Table 5 shows the respective results of the Augmented Dickey-Fuller (ADF) test.

After computing the correlation matrix based on the inputs shown in Table 1, we applied three heuristic feature selection algorithms to our case.

In the following, results refer to the 2nd floor flow rate estimation. Figs. 7 and 8 show the

Performance evaluation of GRASP, GA and GCS

In this section, the performance of the GCS algorithm is compared with the performance of GRASP and the GA. According to Figs. 7 and 8, GCS performed well with respect to the GA and GRASP in the identification of the non-dominated solutions. In all cases, GRASP struggled to find some of the non-dominated solutions found by GCS. This is due to the difficulty in selecting the parameter a as explained in Section 2.3. The GA found a greater number of non-dominated solutions than GRASP

Conclusion

The main problems encountered in automating model creation are finding the right parameters for a certain feature selection algorithm and combining the algorithm with a methodology selecting the best model according to a series of criteria.

On the one hand, this paper proposes a new heuristic feature selection algorithm, which we named GCS, based on maximizing output-input correlations, minimizing cross correlations between inputs, and approximating the Pareto front of the respective

Acknowledgements

We acknowledge and thank the SmartBuild project, project ID 297288, co-funded under the ICT Policy Support Programme as part of the Competitiveness and Innovation Framework Programme by the European Community, for permission to use real case study data and to apply this methodology to them.

References (65)

  • A. Afram et al.

    Gray-box modeling and validation of residential HVAC system for control system design

    Appl. Energy

    (2015)
  • M.A. Fayazbakhsh et al.

    Gray-box model for energy-efficient selection of set point hysteresis in heating, ventilation, air conditioning, and refrigeration controllers

    Energy Convers. Manage.

    (2015)
  • H. Harb et al.

    Development and validation of grey-box models for forecasting the thermal response of occupied buildings

    Energy Build.

    (2016)
  • G. Reynders et al.

    Impact of the Heat Emission System on the Identification of Grey-box Models for Residential Buildings

    Energy Procedia

    (2015)
  • S. Paudel et al.

    Pseudo dynamic transitional modeling of building heating energy demand using artificial neural network

    Energy Build.

    (2014)
  • M. Benedetti et al.

    Energy consumption control automation using Artificial Neural Networks and adaptive algorithms: proposal of a new methodology and case study

    Appl. Energy

    (2016)
  • F. Ascione et al.

    Artificial neural networks to predict energy performance and retrofit scenarios for any member of a building category: a novel approach

    Energy

    (2017)
  • J.V. Tu

    Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes

    J. Clin. Epidemiol.

    (1996)
  • N. Fumo et al.

    Regression analysis for prediction of residential energy consumption

    Renew. Sustain. Energy Rev.

    (2015)
  • J.C. Lam et al.

    Multiple regression models for energy use in air-conditioned office buildings in different climates

    Energy Convers. Manage.

    (2010)
  • S.S. Amiri et al.

    Using multiple regression analysis to develop energy consumption indicators for commercial buildings in the U.S

    Energy Build.

    (2015)
  • C. Ghiaus

    Experimental estimation of building energy performance by robust regression

    Energy Build.

    (2006)
  • B. Dong et al.

    A holistic utility bill analysis method for baselining whole commercial building energy consumption in Singapore

    Energy Build.

    (2005)
  • G.S. Virk et al.

    Practical stochastic multivariable identification for buildings

    Appl. Math. Modell.

    (1995)
  • M.J. Jiménez et al.

    Application of multi-output ARX models for estimation of the U and g values of building components in outdoor testing

    Sol. Energy

    (2005)
  • R. Zmeureanu et al.

    Development of an energy rating system for existing houses

    Energy Build.

    (1999)
  • B. Eksioglu et al.

    Subset selection in multiple linear regression: a new mathematical programming approach

    Comput. Ind. Eng.

    (2005)
  • N.H. Jadhav et al.

    Subset selection in multiple linear regression in the presence of outlier and multicollinearity

    Stat. Method.

    (2014)
  • I. Inza et al.

    Feature subset selection by genetic algorithms and estimation of distribution algorithms: a case study in the survival of cirrhotic patients treated with TIPS

    Artif. Intell. Med.

    (2001)
  • P. Bermejo et al.

    A GRASP algorithm for fast hybrid (filter-wrapper) feature subset selection in high-dimensional datasets

    Pattern Recog. Lett.

    (2011)
  • S.C. Yusta

    Different metaheuristic strategies to solve the feature selection problem

    Pattern Recog. Lett.

    (2009)
  • P. Pudil et al.

    An analysis of the Max-Min approach to feature selection and ordering

    Pattern Recog. Lett.

    (1993)