With the large-scale deployment of smart meters worldwide, research in non-intrusive load monitoring (NILM) has grown significantly, driven by its dual role in real-time monitoring of end-user appliances and user-centric feedback on power consumption. NILM is a technique for estimating the state and power consumption of individual appliances on a consumer’s premises from a single point of measurement, such as a smart meter. Although several NILM techniques exist, there is no meaningful and accurate metric for evaluating them on multi-state devices such as fridges and heat pumps. In this paper, we demonstrate the inadequacy of existing metrics and propose a new metric that combines event classification and energy estimation of an operational state to give a more realistic and accurate evaluation of NILM performance. In particular, we use unsupervised clustering to identify the operational states of a device from a labeled dataset, and we use these states to compute a penalty threshold for predictions that are too far from the ground truth. Our work includes an experimental evaluation of state-of-the-art NILM techniques on widely used datasets of power consumption measured in real-world environments.
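
To make the clustering step concrete, the following minimal Python sketch shows one way operational states and a penalty threshold could be derived from labeled power readings. It is an illustration under stated assumptions, not the metric proposed in this paper: the text does not name a clustering algorithm, so a simple one-dimensional k-means is used as a stand-in; the rule that sets the threshold to half the gap to the nearest neighbouring state is a hypothetical choice; the function names `kmeans_1d`, `penalty_thresholds`, and `is_penalized` are invented for this example; and the readings are synthetic.

```python
from statistics import mean


def kmeans_1d(values, k, iters=50):
    """Cluster scalar power readings (watts) into k operational states.

    Stand-in clustering method; the paper does not specify which one it uses.
    """
    ordered = sorted(values)
    # Deterministic initialisation: pick centers spread across the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # Keep the previous center if a cluster ends up empty.
        updated = [mean(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


def penalty_thresholds(centers):
    """Hypothetical rule: half the gap to the nearest neighbouring state."""
    thresholds = []
    for i, c in enumerate(centers):
        gaps = [abs(c - other) for j, other in enumerate(centers) if j != i]
        thresholds.append(min(gaps) / 2)
    return thresholds


def is_penalized(predicted, truth, centers, thresholds):
    """Flag a prediction whose error exceeds the threshold of the true state."""
    state = min(range(len(centers)), key=lambda j: abs(truth - centers[j]))
    return abs(predicted - truth) > thresholds[state]


# Synthetic three-state readings in watts, made up for illustration only.
readings = [0, 1, 2, 0, 1, 118, 120, 122, 119, 121, 398, 400, 402, 401, 399]
centers = kmeans_1d(readings, k=3)
thresholds = penalty_thresholds(centers)
```

With these definitions, `is_penalized(200, 120, centers, thresholds)` flags a prediction that has drifted well away from the true operating state, whereas a prediction of 125 W against a ground truth of 120 W is not flagged. Any real evaluation would replace the threshold rule with the one defined by the metric itself.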