This semester QIC welcomes Natalie Paquette as a Human Factors Intern! Natalie is a Ph.D. student in the Human Factors and Cognitive Psychology program at the University of Central Florida (UCF). Natalie earned her MA in Applied Experimental and Human Factors Psychology in 2020 at UCF and her MA in Psychology focused on Cognitive and Behavioral Neuroscience at George Mason University in 2017. Her work has examined performance issues related to mismatched expectations, reliance on visual working memory, and the effect of restricted time intervals on error processing. Natalie’s interests include examining the neurophysiological and perceptual aspects of cognition and performance in various environments to determine optimal parameters for successful task completion.
This semester QIC welcomes Nicolas Uszak as a Human Factors Intern! Nicolas is a Ph.D. student at the University of Central Florida’s Human Factors and Cognitive Psychology Program. Nicolas has a Master’s in Applied Experimental Human Factors and a graduate certificate for Design Usability in Industrial Engineering, both from UCF. Previously he graduated summa cum laude with his B.A. in Psychology from Cleveland State University. His interests lie in motivation, situational awareness, automation, multi-tasking, vigilance, and machine learning. Nicolas is currently working on his dissertation involving situational awareness while operating automated vehicles.
I've attended and presented at several conferences this year, such as the Human Systems Digital Experience, World Aviation Training Summit (WATS), and Applied Human Factors and Ergonomics (AHFE), and have noticed a simple yet powerful construct appearing over and over again…self-efficacy. Self-efficacy is "concerned with people’s beliefs in their capabilities to produce given attainments" (Bandura, 2006). In other words, it's the confidence in the ability to exert control over one's own motivation, behavior, and social environment (Carey & Forsyth, 2009). Extensive evidence has shown self-efficacy to be a significant predictor across a variety of contexts and domains, such as college academic performance (Choi, 2005), pre-career pilot performance (Wilson, 2021), weight loss success (Armitage et al., 2014), health management (Arslan, 2012), second language skills (Raoofi, Tan, & Chan, 2012), and work burnout and engagement (Ventura, Salanova, & Llorens, 2015). While there are a host of other areas that have explored the role of self-efficacy, one of particular interest to me that has been gaining more attention is usability.
Usability assessments typically focus on effectiveness, efficiency, and satisfaction, but research suggests that integrating self-efficacy can provide a more robust assessment (Martin, 2007). As technological solutions continue to spread across all aspects of our personal and work lives, our dependency on them will impact our ability to complete tasks. Therefore, belief in our ability to complete a task with a technological solution should impact the way in which these solutions are designed. Plenty of evidence exists indicating the role of self-efficacy in the adoption of technology, such as mobile learning solutions (Bettayeb, Alshurideh, & Al Kurdi, 2020), desktop virtual environments for learning (Makransky & Petersen, 2019), fitness devices (Rupp, Michaelis, McConnell, & Smither, 2018), and medical support tools (Lindblom, Gregory, Wilson, Flight, & Zajac, 2012).
Although some usability measures focus on or integrate the concept of self-efficacy, not all are implemented correctly based on Bandura's guidance (2006). One major flaw is using Likert-type bipolar ratings (e.g., strongly agree to strongly disagree) instead of unipolar ones (e.g., 0 to 10). The issue is that if you have zero confidence in your ability to complete a task, then ratings below zero make little sense and lead to skewed interpretations of the results. Further, when bipolar ratings are used, the mid-point (usually labeled as neither agree nor disagree) gets converted into a moderate level of self-efficacy, which is incorrect and not a true reflection of the construct (Bandura, 2012). Leveraging self-efficacy as a usability metric can provide valuable insight into the design and evaluation of technology, but it's critical that measures be developed, implemented, and interpreted appropriately.
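To make the mid-point problem concrete, here is a minimal Python sketch. It assumes a unipolar 0 to 10 confidence scale (the example range mentioned above) and a hypothetical five-point agree/disagree item; the function names and the sample ratings are illustrative assumptions, not part of any published instrument.

```python
from statistics import mean

# Assumed unipolar scale: 0 = no confidence at all, 10 = complete confidence.
SCALE_MIN, SCALE_MAX = 0, 10


def self_efficacy_strength(ratings):
    """Average the confidence ratings a participant gave across task items."""
    for rating in ratings:
        if not SCALE_MIN <= rating <= SCALE_MAX:
            raise ValueError(f"rating {rating} is outside {SCALE_MIN}-{SCALE_MAX}")
    return mean(ratings)


def bipolar_to_unipolar_naive(likert, points=5):
    """Naively rescale a 1..points agree/disagree response onto 0-10.

    This is the conversion the text warns against, shown only to
    illustrate what happens to the mid-point.
    """
    return (likert - 1) / (points - 1) * (SCALE_MAX - SCALE_MIN)


# Hypothetical participant rating four tasks on the unipolar scale.
unipolar_score = self_efficacy_strength([2, 6, 7, 9])

# "Neither agree nor disagree" (3 on a 5-point item) after naive rescaling.
midpoint_as_confidence = bipolar_to_unipolar_naive(3)
```

Here the "neither agree nor disagree" response comes out as 5.0, which reads as moderate confidence even though the participant expressed no judgment about capability either way. Collecting the ratings on a unipolar scale from the start avoids the conversion entirely.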
Bandura (2012) stated that "there is no single all-purpose measure of self-efficacy with a single validity coefficient." This indicates that new measures of self-efficacy are expected to be created for, as he puts it, "activity domains." Activity domains are the topic areas in which the tasks under evaluation are performed. For example, one activity domain might be driving a monster truck, with self-efficacy evaluated for the tasks that involves. Use these guidelines when developing your measures and scales for usability evaluations (or any evaluation for that matter):
Have you been capturing self-efficacy as part of your usability assessments? Tell us how.
Armitage, C. J., Wright, C. L., Parfitt, G., Pegington, M., Donnelly, L. S., & Harvie, M. N. (2014). Self-efficacy for temptations is a better predictor of weight loss than motivation and global self-efficacy: Evidence from two prospective studies among overweight/obese women at high risk of breast cancer. Patient Education and Counseling, 95(2), 254-258.
Arslan, A. (2012). Predictive power of the sources of primary school students' self-efficacy beliefs on their self-efficacy beliefs for learning and performance. Educational Sciences: Theory and Practice, 12(3), 1915-1920.
Bandura, A. (2006). Guide to the construction of self-efficacy scales. In Pajares, F., Urdan, T. (Eds.), Self-efficacy beliefs of adolescents, Vol. 5: 307-337. Greenwich, CT: Information Age.
Bandura, A. (2012). On the functional properties of perceived self-efficacy revisited. Journal of Management, 38(1), 9–44.
Bettayeb, H., Alshurideh, M. T., & Al Kurdi, B. (2020). The effectiveness of mobile learning in UAE universities: a systematic review of motivation, self-efficacy, usability and usefulness. Control and Automation, 13(2), 1558-1579.
Carey, M.P. & Forsyth, A.D. (2009). Teaching tip sheet: Self-efficacy. American Psychological Association. https://www.apa.org/pi/aids/resources/education/self-efficacy.
Choi, N. (2005). Self‐efficacy and self‐concept as predictors of college students' academic performance. Psychology in the Schools, 42(2), 197-205.
Lindblom, K., Gregory, T., Wilson, C., Flight, I.H.K., & Zajac, I. (2012). The impact of computer self-efficacy, computer anxiety, and perceived usability and acceptability on the efficacy of a decision support tool for colorectal cancer screening. Journal of the American Medical Informatics Association, 19(3), 407–412.
Makransky, G. & Petersen, G. B. (2019). Investigating the process of learning with desktop virtual reality: A structural equation modeling approach. Computers & Education, 134, 15-30.
Martin, C. V. (2007). The Importance of self-efficacy to usability: Grounded theory analysis of a child’s toy assembly task. Proceedings of the Human Factors and Ergonomics Society Annual Meeting, 51(14), 865–868.
Raoofi, S., Tan, B. H., & Chan, S. H. (2012). Self-Efficacy in Second/Foreign Language Learning Contexts. English Language Teaching, 5(11), 60-73.
Rupp, M. A., Michaelis, J. R., McConnell, D. S., & Smither, J. A. (2018). The role of individual differences on perceptions of wearable fitness device trust, usability, and motivational impact. Applied Ergonomics, 70, 77-87.
Ventura, M., Salanova, M., & Llorens, S. (2015). Professional self-efficacy as a predictor of burnout and engagement: The role of challenge and hindrance demands. The Journal of Psychology, 149(3), 277-302.
Wilson, N. (2021). Pre-Career Pilots and Motivation – What is the Best Predictor of Performance? 23rd World Aviation Training Summit (WATS), Orlando, FL, June 15-16, 2021.
The major challenge with the use of artificial intelligence (AI) is that it is often difficult to explain how AI or machine learning (ML) solutions and recommendations come to be. Previously this may not have mattered as much because AI’s use was limited and its recommendations were confined to relatively trivial decisions. In the past few decades, however, AI use has become more pervasive and some of these AI solutions are impacting high-stakes decisions, so this problem has become increasingly important. Fueling the urgency are findings that AI solutions can be unintentionally biased, depending on the type of data used to train the algorithms. For instance, the algorithm used by Amazon to hire staff was found to be biased against women because the algorithm was trained on previous data that largely comprised resumes from male applicants (Shin, 2020). Algorithms used by COMPAS (i.e., Correctional Offender Management Profiling for Alternative Sanctions) to predict likelihood of recidivism were also found to predict that black offenders were twice as likely to reoffend compared to white offenders (Shin, 2020). These are only two of many instances of algorithm bias.
These algorithms tend to be from “black box” models that are developed from powerful neural networks performing deep learning. These neural networks are usually what researchers have to use for computer vision and image processing which are less amenable to other AI/ML techniques. Researchers in the field of explainable AI (or XAI) have tried to shine a light into these “black boxes” in different ways. For instance, one popular way to help explain deep learning AI processing images is to use heat or saliency maps that show the regions/pixels of a picture that seem to highly influence the algorithm’s prediction (i.e., what the network is “paying attention to”). If these regions aren’t relevant to the algorithm’s task at hand, then the researcher may be looking at a biased algorithm. In her recent talk at a Deep Learning Summit, Rohrbach (2021) showed an example of wrong captioning by a black box model, mislabeling the woman sitting at a desk in front of a computer monitor as “a man” (Figure 1a).
The saliency map showed that the network was attending more to the computer monitor than the person in the picture (Figure 1b), when it should have been focusing more on the person (Figure 1c). Presumably the bias arose because the training data contained more images of men sitting in front of computers than women doing the same.
Saliency maps also provide a way to evaluate the “logic” of the algorithm even when the captioning seems appropriate (see Figures 2a & 2b).
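For readers who want to see the idea in miniature, below is a rough Python sketch of one simple way to build a saliency map: occlusion. It blanks out one pixel at a time and records how much the model's score changes. This is an illustrative technique chosen for brevity, not necessarily the method used to produce the figures in the talk, and the tiny "image" and `toy_model` are made-up assumptions.

```python
def occlusion_saliency(model, image, baseline=0.0):
    """Return a map of how much the model's score changes per occluded pixel."""
    base_score = model(image)
    saliency = []
    for r, row in enumerate(image):
        saliency_row = []
        for c in range(len(row)):
            occluded = [list(pixels) for pixels in image]  # copy the image
            occluded[r][c] = baseline                      # blank one pixel
            saliency_row.append(abs(base_score - model(occluded)))
        saliency.append(saliency_row)
    return saliency


def toy_model(image):
    """Hypothetical stand-in for a network: it only looks at the last column."""
    return sum(row[-1] for row in image)


# Made-up 2x3 "image": imagine the person on the left, the monitor on the right.
image = [
    [0.9, 0.1, 0.5],
    [0.8, 0.2, 0.5],
]

saliency = occlusion_saliency(toy_model, image)
# saliency == [[0.0, 0.0, 0.5], [0.0, 0.0, 0.5]]
```

All of the importance lands on the right-hand column, so a reviewer would see at a glance that this toy model ignores the region where the person is, which is the kind of mismatch the saliency maps above are meant to surface.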
A major criticism of such saliency maps is that they only show which inputs were important to the algorithm's prediction. They do not reveal how these inputs are used. For instance, if two models had different captions or predictions but had very similar saliency maps, the maps would not explain how the models reached their different predictions (Wan, 2020).
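This limitation can be reproduced with a deliberately contrived Python sketch. It assumes two hypothetical linear "models" and an occlusion-style importance measure that, like a typical heat map, reports only the magnitude of each input's influence; none of these names come from the cited work.

```python
def occlusion_importance(model, inputs, baseline=0.0):
    """Magnitude of the score change when each input is blanked out in turn."""
    base_score = model(inputs)
    importance = []
    for i in range(len(inputs)):
        occluded = list(inputs)
        occluded[i] = baseline
        importance.append(abs(base_score - model(occluded)))
    return importance


def model_a(x):
    """Hypothetical model whose score rises with both inputs."""
    return 2.0 * x[0] + 1.0 * x[1]


def model_b(x):
    """Hypothetical model that uses the same inputs in the opposite way."""
    return -2.0 * x[0] - 1.0 * x[1]


inputs = [1.0, 3.0]

score_a = model_a(inputs)   # 5.0
score_b = model_b(inputs)   # -5.0

map_a = occlusion_importance(model_a, inputs)   # [2.0, 3.0]
map_b = occlusion_importance(model_b, inputs)   # [2.0, 3.0]
```

The two models reach opposite scores, yet their importance maps are identical, so the map alone cannot say why they disagree. A signed attribution method would separate these two toy models, but the general point stands: knowing which inputs mattered is not the same as knowing how they were used.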
Due to such limitations, other researchers, such as Dr. Cynthia Rudin, propose developing interpretable models rather than trying to make black box AI explainable. The difference is that interpretable models are inherently understandable: the researcher is able to see how the model derives its solutions, instead of merely trying to coax explanations from a model by reviewing which inputs were more influential than others after the algorithm has been developed (Rudin, 2021). Dr. Rudin’s work to make neural networks interpretable has received wide acclaim, and she is a strong proponent of using interpretable models, especially for high-stakes decisions (Rudin, 2019). However, at the recent Deep Learning Summit, Dr. Rudin also acknowledged that developing such interpretable models takes more time and effort. This is because, unlike black box models, where researchers may not know whether the model is working properly, interpretable models force researchers to troubleshoot the data when they see that the model is not working as it should, even if the solutions look OK (Rudin, 2021).
Nevertheless, there are some who still argue for the use of black box models that are low on explainability. Black box models are harder to copy and can give the company that developed them a competitive advantage. They are also easier to develop. Proponents of this view believe that the goal is not to explain every black box model but to identify when to use black box models (Harris, 2019). If a black box algorithm is able to perform its task to a high degree of accuracy, we may not need to know exactly how it did it. Besides, what is a valid explanation to one person may not be a valid explanation to another. These researchers argue that even physicians use things that they do not fully understand all the time, including common drugs that have been shown to be consistently effective even though no one totally comprehends how they work in every patient. What is important is that enough testing is done to ensure that the algorithm is dependable and suitable for its intended use (Harris, 2019).
When I first learned and read about AI and the problem of inexplicability some years back, I remember taking the position that the ends justified the means. That is, so long as the algorithm was accurate, I’d be willing to sacrifice explainability. However, as I learned more about how people are using AI solutions for all sorts of important medical, hiring, and criminal justice decisions, I started to reconsider my position. After attending the Deep Learning Summit on Explainable AI (XAI) earlier this year, I am more inclined to think that it really depends on the application and type of decision involved. Just because a black box model has a sterling track record of highly accurate predictions in the past does not mean that its next prediction cannot be poor. The higher the stakes of the decision, the better our understanding of the workings of the AI should be.
What do you think? For what applications or decisions would you accept a model that is unexplainable but has been consistently accurate?
Harris, R. (2019). How can doctors be sure a self-taught computer is making the right diagnosis? Retrieved from https://www.npr.org/sections/health-shots/2019/04/01/708085617/how-can-doctors-be-sure-a-self-taught-computer-is-making-the-right-diagnosis
Rohrbach, A. (2021). Explainable AI for addressing bias and improving user trust. Presented at the Deep Learning Summit 2021.
Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 206-215.
Rudin, C. (2021). Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. Deep Learning Summit 2021.
Shin, T. (2020). Real-life Examples of Discriminating Artificial Intelligence: Real-life examples of AI algorithms demonstrating bias and prejudice. Retrieved from https://towardsdatascience.com/real-life-examples-of-discriminating-artificial-intelligence-cae395a90070
Wan, A. (2020). What explainable AI fails to explain (and how we fix that). Retrieved from https://towardsdatascience.com/what-explainable-ai-fails-to-explain-and-how-we-fix-that-1e35e37bee07
Preprints first caught my eye in May 2020. As a human factors researcher who has studied wearables for several years and owned a few (remember the Jawbone UP?), I was intrigued by a Washington Post headline, “Wearable tech can spot coronavirus symptoms before you even realize you’re sick” (Fowler, 2020). The wearable mentioned in the Post article, the Oura ring, must have already been developed before COVID-19 reached the U.S., with a novel application then developed for the existing hardware. Physical and digital prototypes go through iterative design and development processes, and while the process can be rushed, speed comes at the expense of quality (e.g., Cyberpunk 2077). Quality is paramount when developing an application for monitoring individual and public health during a pandemic. The initial findings Fowler referenced were reported in preprint form. Preprints are scholarly works posted by researchers before the manuscripts have undergone peer review. Scientific research must be developed over time, much like hardware and software. Studies take months and even years to design, develop, and pilot; to collect and analyze data; and to interpret findings and report the results. Peer review, the process of the scientific community evaluating research for its scientific soundness and practical and applied merit, takes additional months. Preprints are a way for scientists and researchers to put their (unvalidated) findings into the world, thus making sure nobody else can gain credit for the work. No industry standard exists for preprints. Do “initial findings” contain the results of two study participants or 200? Work-in-progress papers have been presented at academic conferences for years. Technical reports are another avenue for quickly reporting research. Why are preprints, which are posted online for the world to consume before peer review has vetted the work, necessary for researchers to make sure no one else receives credit for their work?
People may rely on news headlines and information gained through word-of-mouth for health advice. There’s a lot of money to be made. Newspapers need headlines to drive subscriptions, tech startups need investment capital, and researchers need funding. The financial incentives reinforce the people involved until the echo chamber increases sales. The preprint problem is magnified when journalists, who are also looking to not get scooped, take the initial findings and generate headlines that may be contradictory to the final, validated results. What happens if a study that’s initially reported as a preprint is found to be invalid? Is the preprint pulled from the internet? Does the news media issue a retraction? Will future researchers who are looking to develop and test hypotheses be able to distinguish between a preprint and a peer-reviewed study? Some article repositories, such as medRxiv (https://www.medrxiv.org/), are taking steps to address this issue by clearly labeling which articles are preprints. The following cautionary statement is prominently displayed on medRxiv’s homepage (emphasis medRxiv’s): “Caution: Preprints are preliminary reports of work that have not been certified by peer review. They should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.” This is a good first step; however, it is incumbent on journalists and news outlets to responsibly report information. For example, a reporter could state that research into using wearables to aid in early COVID-19 diagnoses is ongoing while also refraining from mentioning any features or capabilities of devices that have not been validated.
The results of the first Oura studies are promising. Maybe the Oura ring and similar devices can detect the symptoms of COVID-19 before most people would otherwise spot them, but we don’t know yet. Validation takes time. In the meantime, wash your hands, wear a mask, and physically distance as much as you can. Relying on an unvalidated, non-FDA-approved device for disease prevention and detection may give some people a false sense of security when their wearable does not indicate that they may be ill, even though they are. This false sense of security could then lead to infecting others. The scientific community can more clearly indicate that preprints are not to be used for individual or organization-level health guidance, as medRxiv has done. Scientists and researchers can choose to not cite preprints in their work, since it is still unclear what happens when a preprint is invalidated and retracted. Companies and institutions can choose to not use preprints as a basis for hiring, promotion, and tenure. The question of whether researchers should use information reported in preprints as a foundation upon which to scaffold scientific theories needs to be answered. We stand on the shoulders of giants, but without a firm footing we risk regressing down a slippery slope.
What are your thoughts on preprints? Let us know below!
Fowler, G.A. (2020, May 28). Wearable tech can spot coronavirus symptoms before you even realize you’re sick. The Washington Post. https://www.washingtonpost.com/technology/2020/05/28/wearable-coronavirus-detect/
These posts are written or shared by QIC team members. We find this stuff interesting, exciting, and totally awesome! We hope you do too!