Evaluating Performance of EEG Data-Driven Machine Learning (Vivaldi et al.)
IEEE Transactions on Biomedical Engineering, Vol. 68, No. 11, November 2021
Abstract—Objectives: Big data analytics can potentially benefit the assessment and management of complex neurological conditions by extracting information that is difficult to identify manually. In this study, we evaluated the performance of commonly used supervised machine learning algorithms in the classification of patients with traumatic brain injury (TBI) history from those with stroke history and/or normal EEG. Methods: Support vector machine (SVM) and K-nearest neighbors (KNN) models were generated with a diverse feature set from the Temple EEG Corpus, for both two-class classification of patients with TBI history from normal subjects and three-class classification of TBI, stroke and normal subjects. Results: For two-class classification, an accuracy of 0.94 was achieved in 10-fold cross validation (CV), and 0.76 in independent validation (IV). For three-class classification, accuracies of 0.85 and 0.71 were reached in CV and IV, respectively. Overall, linear discriminant analysis (LDA) feature selection and SVM models consistently performed well in both CV and IV, for both two-class and three-class classification. Compared to normal controls, both TBI and stroke patients showed an overall reduction in coherence and relative PSD in the delta band, and an increase in higher-frequency (alpha, mu, beta and gamma) power, but stroke patients showed a greater degree of change and an additional global decrease in theta power. Conclusions: Our study suggests that EEG data-driven machine learning can be a useful tool for TBI classification. Significance: Our study provides preliminary evidence that EEG ML algorithms can potentially provide the specificity to separate different neurological conditions.

Index Terms—EEG database, machine learning (ML), traumatic brain injury (TBI), stroke.

Manuscript received December 17, 2020; revised January 25, 2021; accepted February 22, 2021. Date of publication February 26, 2021; date of current version October 20, 2021. The work of Meijun Ye was supported by the research fund of the Office of Science and Engineering Laboratories (OSEL), Center for Devices and Radiological Health (CDRH) of the Food and Drug Administration (FDA). (Nicolas Vivaldi and Michael Caiola are co-first authors.) (Corresponding author: Meijun Ye.) Nicolas Vivaldi was with the Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, U.S. Food and Drug Administration, USA. He is now with the MathWorks, USA (e-mail: nvivaldi16@gmail.com). Michael Caiola is with the Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, U.S. Food and Drug Administration, USA. Krystyna Solarana was with the Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, U.S. Food and Drug Administration, USA. Meijun Ye is with the Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, MD 20993 USA (e-mail: meijun.ye@fda.hhs.gov). This article has supplementary downloadable material available at https://doi.org/10.1109/TBME.2021.3062502, provided by the authors. Digital Object Identifier 10.1109/TBME.2021.3062502
I. Introduction

Traumatic brain injury (TBI) presents a significant challenge to civilian and military medicine. According to the Centers for Disease Control and Prevention (CDC), an estimated 2.5 million people sustain a TBI annually, contributing to a third of all injury-related deaths in the United States. Given the high societal and economic costs of untreated TBI, it is recognized as a significant military and public health concern. Currently, the Glasgow Coma Scale (GCS) is a clinical index universally used to classify TBI as mild, moderate or severe, and CT scans are used to detect structural brain lesions. Though useful in the clinical management of TBI, these methods do not provide enough sensitivity to detect mild TBI or to monitor the progression of TBI at different severities. Therefore, efforts are ongoing to seek alternative clinical assessment tools for TBI, including body-fluid analysis, advanced imaging modalities (e.g., diffusion tensor imaging [DTI], positron emission tomography [PET]) and neurophysiological signals (e.g., eye movement and electroencephalography [EEG]). Among these modalities, EEG has the advantages of being non-invasive, easy to use, portable and cost effective. However, when applied to TBI research, EEG yields mixed results in the literature. Views on the clinical significance of EEG in TBI assessment are historically controversial [1]–[4]. Studies have shown significant differences in EEG-based power spectra between mild TBI and normal groups [5], [6], while other studies report no such distinction [7]. Researchers have also evaluated post-TBI changes in connectivity [8], [9] and entropy [10], [11]. Abnormal electrophysiological signals were observed to occur without structural and biochemical changes following neurally disruptive interventions, or even in the absence of apparent neurocognitive abnormality [10], [12], suggesting that EEG has the potential to be a sensitive indicator of neuropathology. However, how specific these changes are to TBI is questionable. With the advancement of computational analytical technologies, the clinical utility of EEG signals may be propelled significantly. Health-related research has benefited from data-mining machine learning (ML) techniques built on the increasingly
available wealth of information provided by large-scale repositories [13]. Due to the inherent complexity of TBI, including the absence of consensus on biomarkers, underlying relationships between data, and patient-to-patient variability, big data analytics has the potential to make determinations about population characteristics that would otherwise be too difficult or impossible to identify manually [14], [15]. While clinicians receive extensive training to interpret EEG signals, advances in ML and deep learning may enable data-driven computational systems to emulate and even improve this process. In particular, multiple kinds of temporal and spectral analyses can be performed on EEG recordings across multiple channels, generating feature sets that are well suited for ML applications. In addition, due to EEG's long history of use in neurological conditions, multiple EEG databases have already been formed and are publicly available. This provides a platform to evaluate the feasibility of implementing ML for the investigation of TBI assessment. Multivariate EEG data has previously been shown to be effective in classifying acute TBI patients with positive CT scans [16]. However, more study is necessary for broader application among groups with different stages and severities of TBI as well as different demographics. Here, we evaluated the performance of multiple commonly used ML algorithms in the classification of patients with TBI history from normal subjects, using a diverse feature set composed of demographic information, power spectral density, channel-to-channel coherence, phase-amplitude coupling, and spectral entropy from the Temple EEG Corpus. In addition, we further assessed the accuracy of the algorithms in the classification of TBI, stroke, and normal patients to determine their specificity.
II. Methods

A. EEG Data
Raw EEG data were obtained from the Temple University Hospital EEG Corpus repository (v1.1.0), the world's largest clinical EEG database [17]. Subjects were identified through patient records associated with each EEG file (.EDF format). Records were parsed using Python scripts as described in [18] with the key words listed in Sup. Fig. 1 (available at https://github.com/dbp-osel/qEEG-consistency), further curated by a custom MATLAB (MathWorks, Version 2019b, Natick, MA, USA) script, then verified by manually inspecting the content of each automatically selected record to assure compliance with the inclusion and exclusion criteria listed below.
1) Inclusion Criteria: TBI group: 1) ages 1-85 and 2) medical record includes a diagnosis of TBI or concussion. Stroke group: 1) ages 1-85 and 2) medical record of a stroke diagnosis. Normal group: 1) ages 1-85 and 2) clinicians' notes indicated that the EEG was within normal ranges for the subject's demographic group.
2) Exclusion Criteria: TBI group: documented history of epilepsy, seizure, tremors, or other neurological conditions other than TBI within the clinicians' note record. Stroke group: documented history of epilepsy, seizure, tremors, or other neurological conditions other than stroke within the clinicians' note record.
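As a rough illustration of this record-screening step, the sketch below applies keyword-based inclusion/exclusion to one clinical note. The keyword lists and helper name are illustrative stand-ins, not the paper's actual list (which is given in Sup. Fig. 1 of the linked repository).

import re
from typing import Optional

TBI_KEYWORDS = {"tbi", "traumatic brain injury", "concussion"}   # inclusion (illustrative)
STROKE_KEYWORDS = {"stroke", "cva", "infarct"}                   # inclusion (illustrative)
EXCLUDE_KEYWORDS = {"epilepsy", "seizure", "tremor"}             # exclusion criteria above

def label_record(note_text: str) -> Optional[str]:
    """Return a tentative cohort label for one clinical note, or None."""
    text = note_text.lower()
    if any(k in text for k in EXCLUDE_KEYWORDS):
        return None                      # excluded: other documented neurological history
    has_tbi = any(k in text for k in TBI_KEYWORDS)
    has_stroke = any(k in text for k in STROKE_KEYWORDS)
    if has_tbi and not has_stroke:
        return "TBI"
    if has_stroke and not has_tbi:
        return "stroke"
    if re.search(r"normal (eeg|study|record)", text):
        return "normal"                  # clinician noted a normal EEG
    return None                          # ambiguous: left for manual review

In the paper's pipeline, records selected this way were additionally curated and manually verified rather than trusted directly.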
Fig. 1. Flowchart describing data processing and model training. (ICA: independent component analysis, PSD: power spectral density, SE: spectral entropy, PAC: phase-amplitude coupling, LDA: linear discriminant analysis, PCA: principal component analysis, FSFS: forward sequential feature selection, BSFS: backward sequential feature selection).
Normal group: documented history of epilepsy, seizure, tremors, or other neurological conditions within the clinicians' note record.
All patient data in the database were de-identified. Therefore, this study did not constitute human subjects research and was exempted from Food and Drug Administration institutional review board review.
A total of 13550 subjects were analyzed and processed according to the description in Sup. Fig. 1. The final dataset for training machine learning models included 292 subjects with 79 normal labels, 98 TBI labels, and 115 stroke labels. A total of 26 normal-labeled, 44 TBI-labeled, and 50 stroke-labeled subjects' data were randomly reserved for use as an independent validation (IV) set. There was no overlap between any of the cohorts.
It needs to be acknowledged that in the Temple database, information on the occurrence time of TBI/stroke was often unavailable, and the severity and cause of the injury were not consistently reported. Therefore, the patient composition was heterogeneous in both disease groups.
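The train/IV reservation can be sketched as a stratified random split; the snippet below is a minimal stand-in (the paper only states that IV subjects were randomly reserved, and its per-cohort IV proportions differ slightly from exact stratification). The synthetic X and y here are placeholders for the real feature matrix and cohort labels.

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the paper's cohort totals: 79+26 normal, 98+44 TBI, 115+50 stroke.
X = np.random.default_rng(0).normal(size=(412, 1333))        # 1333 features per subject
y = np.array(["normal"] * 105 + ["TBI"] * 142 + ["stroke"] * 165)

X_train, X_iv, y_train, y_iv = train_test_split(
    X, y,
    test_size=120 / 412,      # 120 of 412 subjects held out, as in the paper
    stratify=y,               # keep cohort proportions roughly equal across sets
    random_state=0,
)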
B. EEG Data Preprocessing

Fig. 1 depicts the data processing, feature generation, feature selection, model training and validation flowchart. EEG signals from different subjects were first normalized so that individual records conformed with one another in terms of channels used, length of the recording, consistent epoching, and sampling frequency (fs). Specifically, for each subject, 3 minutes of awake, resting-state, stimuli-free EEG recording were included (excluding the first minute of the recording). EEG data were further pre-processed using MATLAB and eeglab (v.2020.0) [19]. Signals (fs = 250 Hz) from 19 common channels in the standard 10-20 arrangement (FP1,
TABLE I. Demographic information of normal, TBI and stroke subjects. P-values calculated with chi-square (sex) or one-way ANOVA and post-hoc Tukey test (age and medication). No statistics were done on age range.
E. Model Training
Feature vectors identified by statistics, PCA, LDA, FSFS, and BSFS were fed into support vector machine (SVM) models using six different kernels and K-nearest neighbors (KNN) classifiers using six definitions for training. SVM used three polynomial kernels (linear, quadratic, and cubic) and three Gaussian kernels. The Gaussian kernel scale was set to 4·sqrt(N), sqrt(N), and sqrt(N)/4, where N is the number of features, for models named coarse, medium, and fine, respectively. Three KNN models were trained based on Euclidean distance for K = 1, 10, and 100, also denoted coarse, medium, and fine, respectively. Two more KNN models were trained with K = 10 using cosine and cubic distances. The last model trained was KNN with K = 10 where neighbors were weighted by the squared inverse of their Euclidean distance. For both raw and clean data, using five selection methods and 12 total model types, 120 total models were trained and tested. An additional 120 models were trained and tested as above, intentionally excluding demographic features. These model variations were trained on two-class data (normal vs TBI) and three-class data (normal vs TBI vs stroke), for a total of 480 models.
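The twelve model variants can be sketched in scikit-learn as stand-ins for the paper's MATLAB classifiers. Mapping the MATLAB-style Gaussian kernel scale s to scikit-learn's rbf parameter as gamma = 1/s^2 is our reading, and the dictionary keys are our own names.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def build_models(n_features: int) -> dict:
    # Gaussian kernel scales: 4*sqrt(N), sqrt(N), sqrt(N)/4 for coarse/medium/fine.
    scales = {"coarse": 4 * np.sqrt(n_features),
              "medium": np.sqrt(n_features),
              "fine": np.sqrt(n_features) / 4}
    sq_inv = lambda d: 1.0 / (d ** 2 + 1e-12)   # squared-inverse distance weights
    models = {
        "svm_linear": SVC(kernel="linear"),
        "svm_quadratic": SVC(kernel="poly", degree=2),
        "svm_cubic": SVC(kernel="poly", degree=3),
        # KNN named per the text above: K = 1, 10, 100 -> coarse, medium, fine
        "knn_coarse": KNeighborsClassifier(n_neighbors=1),
        "knn_medium": KNeighborsClassifier(n_neighbors=10),
        "knn_fine": KNeighborsClassifier(n_neighbors=100),
        "knn_cosine": KNeighborsClassifier(n_neighbors=10, metric="cosine"),
        "knn_cubic": KNeighborsClassifier(n_neighbors=10, metric="minkowski", p=3),
        "knn_weighted": KNeighborsClassifier(n_neighbors=10, weights=sq_inv),
    }
    for name, s in scales.items():
        models[f"svm_{name}_gaussian"] = SVC(kernel="rbf", gamma=1.0 / s ** 2)
    return models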
F. Model Validation
Performance of the classifiers was validated with 10-fold cross-validation (CV), label randomization, and an IV data set (Table I). Model accuracy, F1 score, sensitivity, and specificity were recorded to assess the performance of each model in classifying TBI and normal data. For three-class classifiers distinguishing TBI, normal, and stroke, macro and weighted variants of precision/sensitivity/F1 scores were calculated. To assess a baseline metric of performance, group labels (TBI, normal and stroke) were randomly assigned to the feature vectors, and each randomized set was trained and tested following the same methods used for the true labeled data. Both true and randomly labeled data were trained using 10-fold
CV partitions 1000 times in order to generate distributions of cumulative prediction accuracies over each fold. In addition, classifiers generated with the full training data set were further evaluated by predicting classifications of an independent test data set of 120 true labeled subjects, which was withheld from the training data set (see Table I for demographic information of each data set). As a further check, a model was considered a "success" only if its validation score was above that of the Zero Rule (ZeroR) benchmark. The ZeroR benchmark is the accuracy of a model that predicts the largest class regardless of the input. For the normal vs TBI models, the ZeroR benchmarks were 55.37% for the CV tests and 62.86% for the IV tests. For the three-class models, ZeroR was 39.7% and 41.67% for CV and IV, respectively.
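A minimal sketch of the ZeroR benchmark and the label-randomization baseline, assuming the X_train/y_train arrays from the split sketch above; scikit-learn's permutation_test_score implements the same refit-on-shuffled-labels idea as the paper's 1000 random-label iterations (fewer permutations are used here for brevity).

from collections import Counter
from sklearn.model_selection import StratifiedKFold, permutation_test_score
from sklearn.svm import SVC

def zero_r(y) -> float:
    """Accuracy of always predicting the largest class."""
    counts = Counter(y)
    return max(counts.values()) / len(y)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
score, perm_scores, p_value = permutation_test_score(
    SVC(kernel="linear"), X_train, y_train,
    cv=cv, n_permutations=100, scoring="accuracy")

print(f"ZeroR benchmark:     {zero_r(y_train):.3f}")
print(f"10-fold CV accuracy: {score:.3f} (permutation p = {p_value:.3f})")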
G. Statistical Analysis

Statistical analysis was performed to provide a baseline measure of the differences between the TBI, stroke and normal classes on the individual features, and to compare the performance of models. A one-sample Kolmogorov-Smirnov test (K-S test) was used to test for normality of the data at the 5% significance level. If the data were not normally distributed, Wilcoxon rank-sum, two-sample K-S, or signed-rank tests were used to analyze differences. When the data were normally distributed, Student's t-test was used. Where applicable, the positive false discovery rate was calculated for multiple comparison correction. For categorical sex data, the chi-square test was used. When comparing the five feature selection methods, one-way ANOVA and a post-hoc Tukey test were used. All statistical data are expressed as mean ± std.
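For the two-class case, the statistics-based feature screening described above can be sketched as follows (again assuming X_train/y_train from earlier); the Benjamini-Hochberg correction here is a stand-in for the paper's positive false discovery rate, which SciPy and statsmodels do not implement directly.

import numpy as np
from scipy.stats import kstest, ranksums, ttest_ind, zscore
from statsmodels.stats.multitest import multipletests

def screen_features(X, y, alpha=0.05):
    """Return indices of features that differ significantly between cohorts."""
    a, b = X[y == "normal"], X[y == "TBI"]
    pvals = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        # one-sample K-S test of the z-scored feature against a standard normal
        is_normal = kstest(zscore(X[:, j]), "norm").pvalue > alpha
        test = ttest_ind if is_normal else ranksums
        pvals[j] = test(a[:, j], b[:, j]).pvalue
    reject, *_ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    return np.flatnonzero(reject)

selected = screen_features(X_train, y_train)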
III. Results

A. Performance of Individual Two-Class Algorithms

All 1330 EEG features and 3 demographic features (sex, age, and medication) of the normal and TBI cohorts were put through statistical analysis, LDA, FSFS, BSFS, and PCA for feature selection. Sex was treated as categorical data. Medication was encoded as the number of unique medications prescribed to each subject; drug interactions were not investigated in this study.
1) Models Trained With Features Selected by Statistics: In statistical analysis, rank-sum and false discovery rate analyses identified 98 features out of the 1333 in the raw set (∼7.35%) and 82 in the clean set (∼6.15%) that were significantly different between TBI and normal subjects, including sex. When comparing the performance of models trained with truly labeled data and randomly labeled data with 10-fold CV, all models trained with truly labeled data performed significantly better than those trained with randomly labeled data, with the only exception of SVM fine Gaussian, at the 10^-10 significance level (SL) (Sup. Fig. 2) (two-sample K-S test, p < 10^-10, 1000 iterations). Models trained with truly labeled data had an average accuracy of 0.68 ± 0.05 with median at 0.68. In addition to 10-fold CV, we evaluated the performance of models with an independent data set which was withheld from training. In general, the performance of models in predicting the classification of the independent data set was better than in 10-fold CV (Sup. Fig. 2). The average accuracy of the 24 models was 0.70 ± 0.05 with median at 0.71.
Fig. 2. Performance of models trained with features selected by linear discriminant analysis (LDA). The figure shows distributions of accuracies for 1000 iterations of training using randomly labeled data (orange), true labeled data (blue), and the independent dataset (green line) for models based on features selected by LDA. Black and green dotted lines show ZeroR benchmarks for cross-validation (CV) and independent validation (IV), respectively. All models trained with true labeled data performed significantly better than randomly labeled data at the 10^-10 significance level in 10-fold CV, except for SVM fine Gaussian models (two-sample K-S test). (SVM: support vector machine, KNN: K-nearest neighbors).
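Subsection 2) below selects the features whose LDA weight exceeds the mean plus one standard deviation; a minimal sketch of that rule follows, under the assumption (ours) that the δ values are the magnitudes of the standardized LDA coefficients.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

def lda_select(X, y):
    """Indices of features whose LDA weight exceeds mean + 1 SD."""
    Xs = StandardScaler().fit_transform(X)      # put features on a common scale
    lda = LinearDiscriminantAnalysis().fit(Xs, y)
    delta = np.abs(lda.coef_).max(axis=0)       # per-feature discriminant weight
    return np.flatnonzero(delta > delta.mean() + delta.std())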
2) Models Trained With Features Selected by LDA: With a threshold of the mean of the δ values plus one standard deviation, LDA selected 208 features from the raw data features and 224 from the clean features, including two demographic features, sex and age. 1000 iterations of 10-fold CV showed that 22 out of 24 models trained with truly labeled data performed significantly better than those trained with randomly labeled data (Fig. 2) (n = 1000, p < 10^-10, two-sample K-S test). As with models trained with statistically selected features, the SVM algorithm with fine Gaussian kernel could not distinguish TBI and normal subjects at all. When classifying the independent data set, the 24 models showed an average accuracy of 0.76 ± 0.09 with median at 0. (Fig. 2), which is slightly better than the accuracy calculated with 10-fold CV, 0.70 ± 0.04 with median at 0.70.
3) Models Trained With Features Selected by FSFS: With the criterion described in Methods, FSFS selected sex and an additional 8 EEG features from raw data, and 12 from clean data. All 24 models trained with truly labeled data performed significantly better than those trained with randomly labeled data when evaluated with 1000 iterations of 10-fold CV at the 10^-10 SL, with a mean accuracy of 0.71 ± 0.06 and median at 0. (Sup. Fig. 3) (two-sample K-S test). When these models were evaluated on the independent data set, they showed an average accuracy of 0.67 ± 0.05 with median at 0.67 (Sup. Fig. 3).
4) Models Trained With Features Selected by BSFS: Working backwards from the features selected by LDA, BSFS selected sex, age, and an additional 179 EEG features from raw data and 216 from clean data. Similar to the LDA models, all models trained with truly labeled data, with the exception of the SVM algorithm with fine Gaussian kernel, performed
significantly better than those trained with randomly labeled data when evaluated with 1000 iterations of 10-fold CV at the 10^-10 SL. The mean accuracy on the truly labeled data set was almost as high as LDA, at 0.75 ± 0.09 with median at 0.75 (Sup. Fig. 4) (two-sample K-S test), which is not too surprising as these features were derived from the LDA feature set. However, when these models were evaluated on the independent data set, they performed especially poorly, with an average accuracy of 0. ± 0.06 and median 0.56 (Sup. Fig. 4).
5) Models Trained With Principal Component (PC) Features: When putting all 1332 non-categorical features through PCA, 132 principal components (PCs) were necessary to reach the 95% threshold using raw or clean data. Models were trained with the selected PC features and the categorical sex information. The majority of models trained with true labeled data performed significantly better than with randomly labeled data (p < 10^-10), except the coarse and fine variants of SVM and KNN, as well as the clean KNN cubic distance model (Sup. Fig. 5) (two-sample K-S test). Additionally, 6 of the 24 models failed to beat the Zero Rule (ZeroR) benchmark of 0.55 (black dotted line). When evaluated with 10-fold CV, the mean accuracy of these 24 models was 0.58 ± 0.05 with median at 0.57. The performance of the 24 models in classifying the independent data set was also unsatisfactory, with a mean of 0.48 ± 0.10, a median at 0.43, and none of the models performing better than the ZeroR benchmark of 0.63 (green dotted line).
We compared the performance of models with different feature selection methods (Fig. 3a and b) and found that models trained with PC features performed the poorest in both 10-fold CV and IV. Models trained with features selected by LDA performed consistently well in both CV and IV. Interestingly, models trained with features selected by statistics performed
Fig. 6. Changes in clean EEG features. (a) shows the fraction of features selected by statistics, LDA, and FSFS out of total number of features in each type of features (i.e., 171 coherence and 19 relative PSD features in each frequency band) without consideration of channels for 2-class classification. (b) shows the fraction of selected features for 3-class classification. (c) shows the median z score for each type of features in normal, TBI and stroke subjects respectively. (d) shows the broadband coherence change from normal to TBI. Main panel shows the median z score of coherence coefficients of all channel pairs. Inset demonstrates the channel pairs with median z score higher than 0.5 or lower than -0.5. (e) shows the topographic map of relative PSD based on z scores. (f) indicates the z score of stroke broadband coherence to TBI. Inset shows the channel pairs with median z score higher than 0.5 or lower than -0.5. (g) shows the topographic map of relative PSD z score of stroke subjects to TBI. (LDA: linear discriminant analysis, FSFS: forward sequential feature selection, Stats: statistics, PAC: phase-amplitude coupling, PSD: power spectral density).
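The per-feature z scores summarized in Fig. 6 (and used again in Section F below) can be computed along the following lines; this is our formulation, assuming per-cohort feature matrices and using the normal cohort's mean and standard deviation as the baseline.

import numpy as np

def cohort_median_z(features: dict) -> dict:
    """features: cohort name -> (n_subjects, n_features) array of qEEG features."""
    mu = features["normal"].mean(axis=0)     # normal-cohort baseline mean
    sd = features["normal"].std(axis=0)      # normal-cohort baseline SD
    return {name: np.median((X - mu) / sd, axis=0)   # median z per feature
            for name, X in features.items()}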
accuracy for demographic group remains significantly higher (Sup. Fig. 6). Considering the significant difference in sex between normal and TBI groups in both training and IV data sets (Table I), this finding is not surprising.
C. Effect of EEG Artifact Removal on the Performance of Models
In conventional quantitative EEG (qEEG) analysis, artifact removal is an unavoidable step, but it is computationally demanding and time-consuming. Moreover, there is no perfect artifact removal
method. Therefore, we were interested in understanding whether ML models built upon raw EEG can provide performance comparable with those built on artifact-removed clean data. When comparing the performance of models trained with clean and raw EEG features, we found that the raw models performed slightly better (Fig. 3e). Interestingly, when predicting independent data, there appears to be no such difference in performance between most models trained with raw EEG and those trained with clean data (Fig. 3f). This remains true when looking at just those models that performed better than the ZeroR benchmark (Sup. Fig. 6).
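For orientation, an ICA-based cleaning step of the kind used in this pipeline might look as follows in Python; this is a stand-in for the paper's EEGLAB/ICLabel workflow, using MNE plus the separate mne-icalabel package, and the file name, filter band, and component count are illustrative assumptions.

import mne
from mne.preprocessing import ICA
from mne_icalabel import label_components

raw = mne.io.read_raw_edf("subject_001.edf", preload=True)   # hypothetical file
raw.filter(1.0, 40.0)                 # restrict to the band analyzed in the paper

ica = ICA(n_components=15, random_state=0)
ica.fit(raw)

# Label components and drop those classified as ocular or muscle artifacts.
labels = label_components(raw, ica, method="iclabel")["labels"]
ica.exclude = [i for i, lab in enumerate(labels)
               if lab in ("eye blink", "muscle artifact")]
clean = ica.apply(raw.copy())         # "clean" EEG; keep raw for comparison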
D. Models With Best Performance in Classifying TBI
Though it appears that most ML models can distinguish TBI from normal subjects to some extent, demonstrated by significantly better performance than their counterpart models trained with randomly labeled data, their performance varies remarkably. Therefore, we identified models with accuracy higher than 0.9 in 10-fold CV and higher than 0.73 in IV (Sup. Table I). In 10-fold CV, 12 models showed an accuracy higher than 0.9, all of which were SVM models based on LDA or BSFS feature selection with linear or polynomial kernels. In IV, 18 models had accuracy higher than 0.73, but none of these were among the 12 best performers in 10-fold CV. These models were built upon features selected by LDA, FSFS or statistics, and included both SVM and KNN models.
E. Performance of Three-Class Models
To determine whether the models above were actually distinguishing normal from TBI EEGs, rather than just normal from abnormal EEGs, we included EEGs from a third cohort of stroke patients. Features were reselected using the same feature selection techniques as before, including clean, raw, demographic-free and demographic variants. These newly selected features were then used to train error-correcting output codes (ECOC) models with a one-vs-one coding design, using the same SVM and KNN variants from the two-class case as learners. Overall, 240 models were trained. Accuracy was calculated on 1000 iterations of 10-fold CV for both truly and randomly labeled data (Sup. Figs. 7-11) as well as on an IV set. There was a large variation in the CV accuracies of these models, with some just over 0.85 (Sup. Table II). All models had a median accuracy above 0.5 except for those trained with features selected by PCA (Fig. 4a). For the IV, the LDA, FSFS, and conventional statistics methods performed the best, with PCA still being the least accurate (Fig. 4b). Our results suggest that the features selected by LDA, FSFS, or BSFS are sufficient to produce adequately accurate classification models between one normal group and two different disease cohorts. Ideally, with more data, this accuracy would increase. Nevertheless, we believe this shows that our feature selection, and the resulting models, are detecting more than just differences between normal and abnormal EEGs. Put another way, if one were to assume the models were only detecting a difference between normal and abnormal EEGs, then these models would only have an accuracy as high as
P(H)P(Ĥ) + P(T)P(T̂) + P(S)P(Ŝ),

with H, T, S representing the ground truth of a subject's cohort; Ĥ, T̂, Ŝ representing the model's classification of a subject's cohort; and P(·) the probability of selecting it, assuming classifications are chosen at random. That is, for the case that the model always selects subjects in the Normal cohort correctly, we'd have a CV accuracy as high as

(79/292) + (98/292)^2 + (115/292)^2 ≈ 0.54,

and an IV accuracy as high as

(26/120) + (44/120)^2 + (50/120)^2 ≈ 0.52.
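These bounds follow directly from the cohort sizes reported in Methods; a quick arithmetic check:

def bound(n_normal: int, n_tbi: int, n_stroke: int) -> float:
    """Accuracy ceiling if normal is always right and TBI/stroke are guessed at cohort priors."""
    n = n_normal + n_tbi + n_stroke
    return n_normal / n + (n_tbi / n) ** 2 + (n_stroke / n) ** 2

print(f"CV bound: {bound(79, 98, 115):.2f}")   # -> 0.54
print(f"IV bound: {bound(26, 44, 50):.2f}")    # -> 0.52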
In total, 127/240 of our models had CV accuracy greater than 0.54 and 97/240 had IV accuracy greater than 0.52. Regarding the effect of demographic information and artifact removal on three-class classification: consistent with the two-class models, including demographic information can increase the accuracy of models (Fig. 4c). However, opposite to the two-class case, removing EEG artifacts significantly increased the accuracy of the three-class models (Fig. 4d). Taken together, our results suggest that 1) LDA is the best feature selection method among those tested for our EEG dataset (Fig. 5a and 5b), 2) SVM models perform better overall, 3) raw EEG can provide performance comparable with artifact-removed clean EEG in two-class classification, but significantly inferior to clean EEG in three-class classification, and
F. Differences in qEEG Features Between Normal, TBI and Stroke Patients

As models built upon features selected by LDA, FSFS, or statistics performed best in both two- and three-class classification, we investigated the composition of these qEEG features and compared them between normal, TBI, and stroke subjects. Fig. 6 and Sup. Fig. 12 show results obtained from clean and raw EEG, respectively. When looking at the fraction of features selected in each qEEG category, most features selected by LDA, FSFS and statistics were PSD and coherence features, for both two- and three-class classification and for both clean and raw EEG (Fig. 6a and 6b, Sup. Fig. 12a and 12b).
To compare normal, TBI and stroke subjects, we calculated each feature's z score relative to the distribution of the normal subjects. Fig. 6c and Sup. Fig. 12c show the median z score (with normal baseline) for each type of feature for all three cohorts. It appears that relative PSD, as well as coherence, had the biggest differences in median z scores between the three groups. In addition, in clean EEG data, TBI and stroke patients had significantly reduced entropy compared to normal controls.
To further understand the change in coherence, we studied the coherence between all channel pairs in normal and TBI subjects in each frequency band. Though 20-40% of all channel pairs had significant changes in each frequency band compared to normal subjects, the change in coherence was more channel-pair dependent than frequency-band dependent (data not shown); that is, the same channel pair often showed the same trend of change across all frequency bands. Therefore, we analyzed the broadband (1-40 Hz) coherence change from normal subjects in TBI for each pair of channels, which is plotted in Fig. 6d and Sup. Fig. 12d. We observed an overall reduction in broadband coherence when only channel
EEG for distinguishing and characterizing different neurological disorders, as the specificity of EEG for different types of neurological disorders is still questionable, which has long prevented its widespread adoption in clinical practice. Our study provides preliminary evidence on this point by demonstrating that ML algorithms can yield high accuracy in separating TBI and stroke patients.
Repository data, while highly desirable for big data and machine learning applications due to its sheer size, is difficult to manage in cases where documentation and metadata formatting are inconsistent. Management tools often require customization on the part of the user, which causes difficulties in the open-source sharing of both data and algorithms. Medical big data research especially, which aims to uncover relationships and distinctions between various populations, will greatly benefit from continued efforts to normalize data collection and reporting procedures. In this study, differences in record formatting, gaps in the recorded information, and unclear diagnostic outcomes were among the limiting factors in textual analysis, which in turn limited the potential information pool for processing.
B. EEG Machine Learning and TBI
With the development of advanced analytical techniques and improvements in computational capability, machine/deep learning has been under intense investigation for implementation in multiple neurological fields, including mind decoding in brain-computer interfaces [26], identification of sleep-wake stages [27], prediction of seizures [28], and prognosis of stroke [29]. Thatcher pioneered the use of machine learning in the TBI field by applying discriminant analysis of multivariate qEEG features to classify TBI patients and differentiate severe TBI patients from mild ones [30], [31]. With 20 EEG features, he achieved an accuracy of >90% in cross-validation and IV. Thornton further tested 31 high-frequency EEG features to distinguish mild TBI subjects and obtained an accuracy of about 87% [32]. A recent discriminative index developed from a large sample size and a more homogeneous subject population was reported to have >95% sensitivity in predicting positive CT findings in acute TBI [16], and may perform better in monitoring functional recovery from a TBI compared to other clinical outcomes [33]. Furthermore, a multimodal study found that algorithms incorporating EEG signals into symptom questionnaires can increase the accuracy by 10% [34]. Due to the inherent flexibility and wide array of potential algorithmic combinations, as well as the constant advances being made in the field, it is reasonable to assume that newer, more complex models would improve classification results. Here, we evaluated several common ML techniques with a range of parameters in order to determine their potential utility in TBI classification tasks. In our study, we achieved an accuracy of up to 94% in CV and 76% in IV. Six of the top 12 performing models in 10-fold CV (with the other 6 using BSFS based on LDA features) and 6 of the 18 best performing models in IV used LDA as the feature selection method. Since LDA can function independently as a binary classifier and has been used most often in prior EEG TBI reports, it is well suited for separating the classes examined here. When used as a multiclass classifier, a one-vs-one coding strategy was used, where only two classes
were considered at each iteration. The BSFS selection method was unique in that it started with the features selected by LDA but removed any additional features it could, leaving a smaller feature set with comparable performance in 10-fold CV. However, the models trained with features from BSFS performed significantly worse in IV. This is most likely due to overfitting on the original training data caused by optimization on an already optimized dataset.
The LDA and BSFS methods selected 178-224 EEG features, PCA selected just over 130 features, conventional statistics selected 80-98 features, and FSFS selected 12 or fewer features. This is surprising, since the other best performing models in the IV used the smallest feature sets: those selected by FSFS or those showing significant statistical differences. Overall, the features selected by LDA and conventional statistics shared up to 28 raw features, while only 2-4 raw FSFS features were found in both. Though the number of features from LDA and statistics appeared high relative to the number of subjects, we implemented multiple layers of validation to reduce the impact of overfitting, including 10-fold CV, label randomization, and IV. In addition, SVM was used for its good performance in handling high-dimensional data. Indeed, 21 of the 30 best performing models were SVM models, with only nine KNN models among the two-class best performers, and all of the best performers in three-class classification were SVM models. In future work, other methods such as LASSO and convolutional neural networks (CNN) may be implemented to further improve the dimensionality reduction and possibly the classification results. CNNs have been shown to have superior performance in neurological applications compared to conventional algorithms and may provide higher sensitivity and specificity [35]-[38].
Although we had moderate success with CV accuracy, the 18-point drop in IV accuracy is concerning. In general, a drop in performance between CV and IV is indicative of overfitting during training. A common way to address this is to increase the sample size of the training set, which was not feasible here given the inclusion/exclusion criteria and the data available in the Temple EEG Corpus. A larger sample size for model training, with consistent data labels, could address this limitation. Alternatively, this study used randomly labeled data to generate a baseline performance margin for each model. IV classification consistently outperformed classification of subjects randomly assigned to a cohort, showing that, while overfit to the training set used for CV, the overall models were still able to generalize to unseen data. Future work will focus on improving IV accuracy, either through larger training sets or more advanced algorithms (deep learning, ensemble methods, etc.).
As suspected, most models performed better than conventional statistics. The exception was the surprisingly poor performance of the PCA-based models (Fig. 5). Since PCA relies on a linear transformation, it is possible that the features are better suited to nonlinear transformations. However, this contradicts the success found with the linear LDA SVM model, so further investigation is needed. The other high performing models (Sup. Tables I and II) consist mostly of polynomial SVMs, perhaps due to their versatility on this heterogeneous data set, which could not be captured by the KNN models.
It should also be noted that almost all models published for TBI classification utilize supervised learning. However, one of the challenges in the detection and monitoring of TBI is the lack of an early and sensitive outcome measure, which restrains the performance of classifiers to within our current breadth of knowledge. McCrea et al. [33] reported that an EEG-based algorithm could potentially be more sensitive than conventional neurocognitive assessment in monitoring recovery from TBI. Our own study in mice also suggests EEG changes can be observed without an apparent neuroinflammatory reaction [12]. Therefore, in the future, an unsupervised approach could be explored to mitigate this limitation.
Our study also explored the effect of demographic information and artifact removal on the performance of models. Due to the significant difference in the rate of TBI between males and females, including demographic information can slightly increase the accuracy of models developed with the same algorithm. However, its effect on model performance is smaller than that of input features, algorithms, and kernels. Interestingly, models developed from features calculated from raw EEG data demonstrated performance comparable with those trained with clean EEG features in two-class classification, whereas artifact removal significantly increased the performance of three-class models. We speculate that some information embedded within artifacts, e.g., eye movement, could differ between TBI and normal subjects, explaining this discrepancy. A further investigation into the ICLabel components removed in our work indeed showed a significant difference in the number of eye movement artifacts between the normal and TBI groups (Sup. Fig. 13). Moreover, a significant difference in the number of muscle and eye movement artifacts was revealed between TBI and stroke patients. In addition, it appears that artifact removal noticeably changed the coherence difference between stroke and TBI subjects. This may partially explain why artifact removal did not affect two-class model performance but increased the performance of three-class models.
C. qEEG Differences Between Normal, TBI and Stroke
Though identifying qEEG biomarkers was not the primary goal of this study, understanding the changes can help us compare results from the database with prior reports and determine the features that significantly contribute to a well-performing model. LDA, statistics, and FSFS selected a remarkable fraction of features from coherence and relative PSD for both two- and three-class classification (Fig. 6a). Stroke and TBI differ in the cause of brain injury (internal vs. external), but share some pathological processes, including primary cranial cell death and blood-brain-barrier disruption followed by secondary neuroimmune responses triggered by cytokines. Due to the similarities in pathophysiology and associated functional deficits between these two conditions, the specificity of EEG in distinguishing them has always been questionable. Indeed, our study suggests a similar trend of qEEG changes in coherence and relative PSD between the two groups, with the changes in stroke patients more prominent. However, ML models reasonably separated these two groups, suggesting advanced analytics can
potentially be more sensitive in identifying differences than conventional statistics.
In the literature, the reported trend of coherence change related to TBI and stroke varies; the most commonly reported finding is a reduction in global or inter-hemispheric coherence [6], [8]. Other reports suggest the change can be pathway-specific [30], [39]. Our study seems to support both. The grouped analysis (Fig. 6d inset) shows a symmetric pattern between the two hemispheres in TBI patients. Though this does not mean the same patient would have symmetric changes, it suggests the same pathway in both hemispheres is equally susceptible to the same change. When comparing stroke to TBI, stroke patients showed a further reduction in global coherence, reflecting a more severe interruption of inter-neuronal communication, which is consistent with a prior report [40].
Unlike the diverse findings on coherence change after TBI and stroke, reports of PSD alteration have been more consistent, with an increase in lower frequency bands (delta and theta) and a reduction in higher frequency bands (alpha, beta and gamma) [41], [10], [6], [42]. Our study revealed the same trend of change in relative PSD, as shown in Fig. 6e and 6g; however, stroke had an additional increase in theta power.
In addition to coherence and PSD, LDA extracted several entropy features in raw EEG, and a large number of entropy features were found to differ significantly between groups in clean data. Different metrics of entropy have been employed to identify EEG biomarkers of TBI and stroke. A decrease at the acute phase of injury followed by a recovery has mostly been reported in both animal and human studies of TBI [12], [43], [44]. Though the TBI subjects in the Temple database were diverse, possibly including chronic injuries with recovered entropy and local injuries [45], an overall reduction was remarkable in clean EEG. For stroke patients, an increase in sample entropy was reported previously [40]; this conflicts with our finding in clean EEG data, where a significant reduction was revealed in the stroke group. Further studies on entropy changes in stroke and TBI patients are needed to determine its post-stroke and post-TBI alterations. The reason that clean data demonstrated entropy changes but raw EEG did not may be that the cleaning process alters features associated with entropy: since signal noise increases overall variance, removing artifacts reduces noise and therefore should alter extracted entropy features. Without artifact signals (e.g., eye movement) buffering entropy values, changes in these features can be more easily attributed to the classes.
While further work is necessary to develop clinically applicable spectral feature biomarkers and accompanying diagnostic models for TBI, our research indicates that EEG data provide a measure of separability between normal and TBI subjects, with the potential to separate TBI from other neurological conditions, e.g., stroke. Other non-invasive, portable modalities may be combined with EEG to enhance the available information within the feature set for these types of analyses. Future work will investigate the biological basis of the relationship between selected features and TBI pathology, as well as algorithmic improvements for modeling neurological disorders for classification purposes.
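For reference, the spectral entropy discussed above is commonly computed as the Shannon entropy of the normalized power spectral density; the sketch below uses that common definition (the paper does not spell out its exact formulation) over the 1-40 Hz band used elsewhere in the study.

import numpy as np
from scipy.signal import welch

def spectral_entropy(x: np.ndarray, fs: float = 250.0,
                     fmin: float = 1.0, fmax: float = 40.0) -> float:
    """Normalized Shannon spectral entropy of one EEG channel, in [0, 1]."""
    f, pxx = welch(x, fs=fs, nperseg=int(2 * fs))   # Welch PSD with 2-s windows
    band = (f >= fmin) & (f <= fmax)                # restrict to the 1-40 Hz band
    p = pxx[band] / pxx[band].sum()                 # normalize PSD to a probability mass
    p = np.clip(p, 1e-12, None)                     # guard against log(0)
    return float(-(p * np.log2(p)).sum() / np.log2(p.size))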