The RF classification model consistently outperformed the KNN and SVM models. The differences between models were negligible in the calibration setting but increased drastically in the free-living validation. In the lab setting, the RF model's classification accuracy for workplace activities was consistently high except for standing and sitting (Fig. 1). In the free-living setting, on the other hand, the classification accuracy was initially low across all activities (Fig. 2). After combining standing and sitting into a single stationary activity and stair ascending and descending into stair walking, the accuracy for both combined activities increased in the lab and free-living environments (Fig. 3-4). The good overall performance of the second RF model in the free-living setting (71%) can be explained by the fact that 99% of the captured samples consisted of either walking or stationary activity.
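The label merging described above, and the reason overall accuracy can hide poor per-activity accuracy, can be illustrated with a minimal sketch. The label names, the `MERGE` mapping and the toy labels are assumptions made for illustration only; they are not the study's data or code.

```python
from collections import Counter

# Hypothetical mapping from the original labels to the merged labels.
MERGE = {
    "sitting": "stationary",
    "standing": "stationary",
    "stair_ascending": "stair_walking",
    "stair_descending": "stair_walking",
}


def merge_label(label):
    """Map a label to its merged class; other labels pass through unchanged."""
    return MERGE.get(label, label)


def accuracy(y_true, y_pred):
    """Overall accuracy: fraction of samples whose prediction matches the truth."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_accuracy(y_true, y_pred):
    """Accuracy computed separately for each true class."""
    total = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / n for label, n in total.items()}


# Made-up example labels (not study data).
y_true = ["sitting", "standing", "standing", "walking", "walking",
          "stair_ascending", "stair_descending"]
y_pred = ["standing", "sitting", "standing", "walking", "stair_ascending",
          "stair_descending", "stair_descending"]

acc_before = accuracy(y_true, y_pred)

# Merge both the reference and the predicted labels before re-scoring.
y_true_m = [merge_label(label) for label in y_true]
y_pred_m = [merge_label(label) for label in y_pred]
acc_after = accuracy(y_true_m, y_pred_m)
by_class = per_class_accuracy(y_true_m, y_pred_m)
```

Because overall accuracy is weighted by sample count, a dataset dominated by one or two classes yields an overall figure that mostly reflects those classes, which is why per-class accuracy is worth reporting alongside it.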
Combining sitting and standing, and thus not being able to distinguish between the two, might be considered a major shortcoming of the classification model. On the other hand, although there is a small increase in energy consumption from standing compared to sitting [13, 14], standing is still considered a sedentary activity, and there are no cardiovascular health benefits of standing compared to sitting. During both sitting and standing, the feet are usually parallel to the ground and no movement occurs. Since the inclination and movement of the sensor are used for classification, this explains the difficulty in discriminating between the two stationary activities. Stair ascending and descending were also combined into a single activity in the second model. However, unlike sitting and standing, these activities are associated with significantly different energy expenditure. The workload from stair walking might therefore be underestimated, although in most cases stair ascending and descending can be assumed to be equally distributed.
The initial classification models performed poorly for all free-living activities (Fig. 2). Only standing/sitting and stair ascending/descending were combined because these activities were clearly confused with each other but not with any other activity. For the other activities, the misclassification was more spread out. The differentiation between walking and stationary activities could probably just as well have been performed using an acceleration intensity metric alone. However, activity type might be a more applicable output for the workplace than abstract intensity measures.
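As a rough illustration of what an intensity-only separation of walking from stationary activity could look like, the sketch below thresholds the standard deviation of the acceleration vector magnitude within a window. The choice of metric, the threshold value and the synthetic signals are all assumptions for illustration; the text above does not specify which intensity metric would be used.

```python
import math
import statistics

# Hypothetical threshold in g; in practice it would be tuned on calibration data.
INTENSITY_THRESHOLD_G = 0.05


def intensity(window):
    """Standard deviation of the acceleration vector magnitude over one window.

    `window` is a sequence of (x, y, z) acceleration samples in g.
    """
    magnitudes = [math.sqrt(x * x + y * y + z * z) for x, y, z in window]
    return statistics.pstdev(magnitudes)


def classify_window(window, threshold=INTENSITY_THRESHOLD_G):
    """Label a window as walking if its intensity exceeds the threshold."""
    return "walking" if intensity(window) > threshold else "stationary"


# Synthetic signals (not study data): a sensor at rest and a periodic movement.
still = [(0.0, 0.0, 1.0)] * 50
moving = [(0.0, 0.0, 1.0 + 0.3 * math.sin(2 * math.pi * i / 10))
          for i in range(50)]

label_still = classify_window(still)
label_moving = classify_window(moving)
```

Such a metric yields only a two-way split; it would not separate activity types such as stair walking or weight carrying.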
Other weaknesses of the study are the significant sex, age and BMI differences between subjects in the two study parts. The validation group included more men, and its participants were older and had a higher BMI than those in the calibration group, which might have affected the classification performance. It should also be considered that the validation was performed indoors only, whereas parts of the calibration were performed outdoors. Although the outdoor walking in the calibration part was also done at a slow pace, most of the indoor walking in the validation part might have been done at an even slower pace. Calibration of stair walking and weight carrying was performed indoors and at slower speeds than normal walking speed, which could explain the misclassification of free-living walking as these activities (Fig. 4). The workers in the logistics warehouse covered larger areas while walking, whereas the workers in production mainly walked a few steps between machines. Covering larger areas could make the difference between walking and stationary activity more prominent and explain the higher accuracy with the logistics warehouse data.
Although direct observation is considered the criterion method for activity classification in a free-living setting, this method is not perfect. Direct observation has been shown to have an accuracy of 87% when the activity classification of senior researchers was used as the reference. In a free-living setting, the activities might not be equivalent to the standardized lab activities. Most of the validation data consisted of standing work that was stationary most of the time, with a few steps of walking in between (Fig. 2), which could be difficult to define with the current classification scheme.
Many studies on accelerometer-based machine-learning classification models have been published previously, most of them using techniques similar to those of the current study [9, 18]. We have found only one other study that investigated the performance of a lab-calibrated machine-learning method in a free-living setting, and there the accuracy was 49–55%. The accuracy in the current study is substantially higher at 71% (Fig. 4), although it is very low for some activities. However, the activities classified are different in the two studies. Lab calibration of activity classification may be prone to overfitting, even when the model is validated using leave-one-subject-out cross-validation. RF classification models are in general relatively robust to overfitting, but may perform poorly on data that deviate much from the training data. The main limitation in generalizing lab-developed activity classification models is thought to be the diverse activity types, the different characteristics within each activity type and individual variation. The limited number of samples of activities other than stationary and walking at the two workplaces in the validation part limits further analysis of the classification accuracy for these activities. The classification model might perform better in a setting where other activities are more common. The difference in accuracy between workers in the logistics warehouse and industrial production also supports this assumption. A more diverse free-living dataset with a more even activity distribution could also be used to improve the classification algorithm further by analysing the temporal structure of activities.
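For readers unfamiliar with the leave-one-subject-out cross-validation mentioned above, the splitting scheme can be sketched as follows. The function name and the subject identifiers are hypothetical, and this is a generic sketch of the technique, not the study's implementation.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject.

    `subject_ids` holds one subject identifier per sample. All samples of the
    held-out subject form the test fold, so no subject appears in both folds.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Made-up subject identifiers, one per sample (not study data).
subject_ids = ["s1", "s1", "s2", "s3", "s3", "s3"]
folds = list(leave_one_subject_out(subject_ids))
```

Each fold tests on a subject the model has never seen, but every subject still performed the same standardized lab protocol, so the estimate need not carry over to free-living data.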
Certain work-related physical activity patterns are suggested to have a negative health effect. For example, prolonged physical activity elevates 24-h heart rate, and static postures and lifting are suggested to elevate 24-h blood pressure. Such patterns might be detected by the activity classification system suggested in this paper. Continuous monitoring of workload among industrial workers could be used in many ways for preventive measures and for improving health. The monitoring gives basic data on the distribution of physical workload across different tasks at the workplace. This can be utilized when partitioning tasks between workers, both to share heavy work among more employees and to lower the physical demand on specific individuals. Continuous monitoring also makes it possible to follow workload over time, which might enable early identification of employees who are becoming fatigued. This could potentially result in less long-term sickness absence overall and better health status among workers. However, constant monitoring of activities during work does raise concerns regarding the privacy of the workers.