Inclusion in practice: a systematic review of diversity-focused STEM programming in the United States

Colleges across the United States have shown a commitment to advancing diversity in the STEM fields by creating programs aimed at improving outcomes of women and/or racially and ethnically minoritized students. However, most existing literature focuses on the successes of singular college programs rather than comparing these STEM interventions across the higher education landscape. This systematic review investigates the literature on diversity-focused “STEM intervention programs” (SIPs) at the postsecondary level. We categorize key features of these programs and their outcomes, and we look at which program components have the most empirical support. We examine 82 articles that reported on SIPs with disaggregated outcomes, coding each initiative’s features and outcomes. Across these articles, we found six common program components, with most programs including more than one component, and five common program outcomes. Just 53 articles tested differences in outcomes of participants relative to a comparison group. This subset of research found support for the effectiveness of all coded components for improving student outcomes, though studies of multi-component programs did not parse the relative contributions of each component. Based on these findings, we conclude multi-component interventions that create a welcoming environment and focus on the successes of minoritized students help redress existing institutional shortcomings and are a promising step towards diversity, equity, and inclusion in STEM. However, more rigorous quantitative studies are needed to empirically assess the effectiveness of individual SIP program components.


Introduction
The lack of gender and racial parity in the field of STEM has been studied for more than four and a half decades (Kanny et al., 2014; National Center for Science and Engineering Statistics [NCSE], 2021; Ong et al., 2011). Yet students belonging to certain ethnic and racial groups, including Latinx, Indigenous, and Black/African-American students, still earn a disproportionately low percentage of bachelor's degrees in STEM fields in comparison to their representation in the general population of the United States (Fry et al., 2021; NCSE, 2021). Black and Latinx STEM majors transfer out of those majors more often than White STEM students (Flynn, 2016). Furthermore, while women earn almost as many science undergraduate degrees as men, the number of women awarded degrees in "hard science" fields like computer science and engineering is very low (Fry et al., 2021; NCSE, 2021). Black and Latinx workers and women have historically been underrepresented in STEM occupations, and they are particularly underrepresented in the highest-earning STEM occupations in tech, computer science, and engineering (Funk & Parker, 2018; Muro et al., 2018; U.S. Bureau of Labor Statistics, 2021). In turn, the racial/ethnic and gender wage gaps are even larger in STEM occupations than in non-STEM occupations (Funk & Parker, 2018). Women of color are affected by racial and gender trends simultaneously, a phenomenon labeled within STEM literature as "the double bind" (Hall & Sandler, 1982; Ong et al., 2011). The Pew Research Center reports that "[c]urrent trends in STEM degree attainment appear unlikely to substantially narrow these gaps", even "amid longstanding efforts to increase diversity in STEM" (Fry et al., 2021, pp. 4-5).
The enduring struggle to diversify STEM fields necessitates continued research into diversity-focused interventions. As Kanny et al. (2014) note, the fact that research on diversity, equity, and inclusion (DEI) in STEM has continued for so long without solving issues of underrepresentation points to a major gap in both literature and praxis. This article examines one prevalent tactic used throughout the United States to encourage STEM persistence: "STEM intervention programs" (SIPs), college programs dedicated to helping students historically underrepresented in STEM to prepare for and graduate from STEM fields (Rincón & George-Jackson, 2016a, p. 743, 2016b). In this article, we refer to groups that have been historically underrepresented in STEM, particularly in technology, computer science, and engineering, as minoritized populations, where minoritized describes groups with STEM outcomes tied to their experiences of historical and present marginalization by dominant group members (Chase et al., 2014, p. 671).
While STEM programming is different at every institution, certain program features and components appear repeatedly across the higher education landscape (Castro, 2014; Chubin et al., 2015; Rincón & George-Jackson, 2016a, 2016b; Tsui, 2007). However, for many years there has been a lack of up-to-date, comprehensive reviews of STEM college programming in the United States, especially ones that look into these different features. Thus, the purpose of this article is to systematically review DEI-focused STEM interventions, categorize these programs' features and outcomes, and hypothesize patterns of association between STEM programming reported in the literature and outcomes for minoritized populations in STEM, particularly in technology, computer science, and engineering. We ask: what are the features of STEM programs that produce positive outcomes for underrepresented minorities? We argue that diversity-focused STEM programs have clear features associated with successful outcomes and that overall, DEI-focused STEM programs show promise in fighting the discriminatory environments and lack of institutional support in many college STEM departments.

Institutional failures to support diversity in STEM
Educational institutions serve as a primary site of experiences and opportunities that shape students' STEM interest, efficacy, and outcomes (Fouad & Santana, 2017; Lent & Brown, 2019). A large body of literature has documented ways educational institutions have failed to support STEM diversity at the postsecondary level. This literature has focused on how academic environments cause students of color and women to feel excluded, how schools provide insufficient academic preparation to minoritized youth, the prevalence of overly complex institutional course structures and financial aid requirements, and other institutional shortcomings.
Numerous studies have found that the climate of STEM higher education programs is often unwelcoming for certain minoritized populations. The "chilly climate" theory (Hall & Sandler, 1982) was coined almost 40 years ago and continues to be discussed in contemporary literature on higher education (Bottia et al., 2021; Giles, 2015; Lee & McCabe, 2020; Morris, 2003; Rincón & George-Jackson, 2016a; Rolin, 2008). According to this theory, minoritized students in STEM may experience discrimination in almost every aspect of college life, from interactions with peers to faculty to administrators, due to a "chilly" culture which tacitly allows discrimination and hostility towards these students (Bottia et al., 2021; Giles, 2015; Kanny et al., 2014; McGee, 2020; Ong, 2005; Rolin, 2008). For example, Lee and McCabe (2020) found that "gendered expectations", including those perpetuated by professors, may lead to an environment where female students speak less and/or more hesitantly during STEM classes (p. 50). Although the phrase "chilly climate" originally aimed to encapsulate women's negative experiences in higher education, it has been expanded to include male students from minoritized ethnic and racial groups as well (Bottia et al., 2021; Giles, 2015; Hall & Sandler, 1982; Harper, 2012). Building on this work, Harper (2012) and Giles (2015) assert that higher educational environments are characterized by discriminatory actions and barriers better labeled as racist and sexist than by euphemistic terms like "chilly". Lord et al. (2009) add, "If the climate has been characterized as 'chilly' for women […] the terrain is 'icy' for minority women" (p. 170). Many negative behaviors that contribute to a chilly environment for women and racial and ethnic minorities also fall under the category of microaggressions, which have been shown to be prevalent in higher education STEM spaces.
Whether this theory is called chilly climate, the discriminatory environment of STEM, or something else, it has a noticeably negative effect on minoritized students in STEM majors.
A second approach posits that minoritized students leave STEM because they lack sufficient academic preparation. There are two ways to think about this approach: through a deficit lens, and through an anti-deficit lens. Deficit thinking is described by Valencia (2010) as "an endogenous theory-positing that the student who fails in school does so because of [their] internal deficits or deficiencies" (pp. 6-7). Deficit thinking steers higher education institutions to hold minoritized students liable for their insufficient preparation in STEM prior to college, ultimately victim blaming students and faulting them for institutions' own lack of diversity (Castro, 2014; Harper, 2010, 2012; McGee, 2020; Valencia, 2010). Anti-deficit thinking recognizes that in educational institutions in the United States, "[r]acialized opportunity structures lead to racialized academic achievement patterns", which includes "school failure" from students both before and during college (Valencia, 2010, pp. 2-3). For example, students in racial and ethnic minoritized groups are more likely to attend high schools that prepare them inadequately for college-level academics (Bound et al., 2009; Ciocca Eller & DiPrete, 2018; Deil-Amen & DeLuca, 2010; Jennings et al., 2015). A lack of funding for racially minoritized students' K-12 schools could affect their precollege exposure to fields like computer science or to high-level coursework in STEM (Bottia et al., 2021; Byrd, 2020). Academic under-preparation is also due in part to academic curricular tracking. A large body of literature has found that between-class sorting based on perceived academic ability disproportionately channels racial minority students into low-level academic coursework, where they perform worse than their peers in heterogeneously grouped classes (Gamoran, 2009; Loveless, 2009; Oakes, 1985; Rosenbaum, 1976; Rui, 2009).
Rather than "pathologiz[ing]" students, anti-deficit thinking asks for reflection and action from education institutions on their role in ensuring the retention of minoritized students in STEM (Castro, 2014, p. 415).
Even when they are equally academically prepared, Black and Latinx students are more likely than White students to attend open-access community colleges and less likely to attend selective 4-year colleges (Carnevale et al., 2018). Black and Latinx students are then disproportionately assigned to remedial coursework, which increases the time and cost of degree completion (Palmer et al., 2010; Sanabria et al., 2020). Open-access public colleges also receive much less in state appropriations than the selective public colleges that White students are more likely to attend. Researchers have found that lower institutional resources are associated with lower degree completions (Bound et al., 2009). Due to their more ample funds, selective public colleges are able to offer higher-quality instruction, advising, and other student support services (Brock, 2010; Carnevale et al., 2018). Open-access community colleges, on the other hand, often have severely limited advising services that are difficult for students to access (Rosenbaum et al., 2017). Community college students who aim to attain a bachelor's degree in STEM must navigate complex institutional requirements that hamper many students' efforts to make the transition to a 4-year school (Jenkins & Fink, 2016).
A lack of institutional support may also encompass other aspects of higher education for students. For example, because of a paucity of women and people of color in STEM faculty roles, students of color and female students may not see themselves in the upper echelons of STEM and may experience a lack of support and mentoring from institutional figures (Espinosa, 2011; McGee, 2020). Espinosa (2011) also faults colleges, particularly "predominantly White, large public research institutions", for perpetuating "impersonal, large classrooms; unapproachable professors; and competitive grading practices resultant from a system that actively attempts to 'weed' students out of STEM majors" (p. 214), while McGee (2020) suggests that "Eurocentric" STEM departments sustain a culture of "meritocracy", "unrelenting competition", and more (p. 634). These systemic cultural facets of university STEM departments may be particularly discouraging for minoritized populations. For example, among students with low performance in introductory college STEM courses, racial/ethnic minority and female students were less likely than White male students with similar performance to complete a STEM degree (Hatfield et al., 2022). Finally, complex financial aid requirements reduce rates of college completion (Dynarski & Scott-Clayton, 2013; Ciocca Eller & DiPrete, 2018). STEM programs may suffer if they do not have consistent and "strategic" institutional support in the form of funding for both students and program administrators, especially if these SIPs are aimed at low-income or first-generation students (Chubin et al., 2015; Linley & George-Jackson, 2013, p. 101; National Center for Education Statistics, 2019; Rincón & George-Jackson, 2016).

STEM intervention programs
Postsecondary STEM intervention programs (SIPs) are designed to address underrepresentation in STEM. Rincón and George-Jackson (2016a) describe SIPs as "supplemental programs offered by colleges and universities to attract, retain, and support traditionally underrepresented students" (p. 743). Although many SIPs are dedicated to increasing diversity in STEM, the institutional rationale behind these programs varies (George et al., 2019). For example, a qualitative study of 39 SIPs by George et al. (2019) lists "recruitment and retention", "external funding opportunit[ies]", and the documented achievements of other SIPs as common motivators for universities to enact STEM programming for their students (p. 1654). While this article focuses on features and outcomes of STEM programs, it is important to note that the reasons programs are initiated are likely to have effects on their results (George et al., 2019). Additionally, although certain programs have better reputations and are more widely studied than others (George et al., 2019, p. 1646), geographic and community context is crucial to the formation, running, and discussion of each SIP (Chubin et al., 2015, p. 275; Lent & Brown, 2019).
Palid et al. International Journal of STEM Education (2023) 10:2
There are a number of critiques of SIPs: for example, George et al. (2019) express skepticism towards STEM initiatives solely focused on increasing statistical representation in the student body, as "the presence of individuals from particular backgrounds does not necessarily result in salient markers of postsecondary success, such as students' inclusion, sense of belonging, or persistence" (p. 1652). Achieving these less tangible markers of meaningful postsecondary STEM diversity may require broader institutional reform. Scholars including López et al. (2022), McGee (2016, 2020), Miriti (2020), Robinson (2022), and Whittaker and Montgomery (2012) have argued that student-focused interventions, independent of broader systemic transformation around higher education's biased culture, values, structures, and practices, are insufficient for, or even distracting from, broad and sustained progress toward diversity, equity, and inclusion in STEM. According to this viewpoint, efforts toward systemic change require actively confronting the ways dominant cultural biases are embedded in higher education's research, teaching, and service. Such biases may shape perceptions of which research agendas are legitimate, which activities are most rewarded (where publications and grants outrank mentoring and service toward diversity efforts), and which faculty members' perspectives are considered in the construction of institutional policy (Whittaker & Montgomery, 2012, pp. 239-240). SIPs and other interventions that target students rather than higher educational institutions themselves are unlikely to move the needle on these cultural and social dynamics. As Linley and George-Jackson (2013) state, "Programs that seek to repair students rather than initiate institutional change will fail to contribute to the social change that is needed to include and advance underrepresented students in the STEM fields" (p. 100).
Additionally, theoretical frameworks that have been used to understand student persistence in STEM, such as the widely used social cognitive career theory (Fouad & Santana, 2017; Lent & Brown, 2019), emphasize that learning experiences like SIPs are only one piece of a much broader puzzle. Students' interest in and persistence in STEM are shaped by their feelings regarding their own self-efficacy and possible outcomes in STEM, which in turn are influenced by personal factors and contextual factors such as community access to supports and information about STEM, experiences with family and friends, and so forth. Learning experiences are generally not designed to impact these other important influences.
Despite these critiques and considerations, previous reviews and syntheses of STEM program features have documented the benefits of SIPs for women and/or racial and ethnic minorities. One of the most comprehensive articles may be the work of Tsui (2007), who divides the STEM program features reported in the literature into ten distinct categories: summer bridge, mentoring, research experience, tutoring, career counseling and awareness, learning center, workshops and seminars, academic advising, financial support, and curriculum and instructional reform. Their study reports that mentorship and research experience are some of the most commonly reported STEM program features, although the most successful programs are those which provide the best comprehensive support through multiple features (Tsui, 2007, p. 21). Tsui's (2007) categories have parallels in other articles exploring the features of SIPs. For example, in Rincón and George-Jackson's (2016b) examination of 48 STEM programs, the authors classify these programs' "services" as academic advising, financial support, professional development, exposure to STEM, social interaction, structured learning and tutoring, hands-on experience and research, residential experiences, recruitment, and mentoring and networking (p. 433); and George et al. (2019) use almost identical program categories. The National Academies of Sciences, Engineering, and Medicine (2016) breaks down SIPs (which they call co-curricular programming) and their features into the categories of internships, summer bridge programs, student professional groups, peer tutoring, living and learning environments, and comprehensive interventions (pp. xviii, 95-102). Most recently, Pearson et al. (2022) detailed features and outcomes of 25 STEM intervention programs for low-income, first-generation, and underrepresented student groups.
The authors focused on STEM programs related to engineering, and they categorized 13 program features: recruitment; professional development/networking; research experiences; tutoring and study skills; targeted academic interventions; graduate school/GRE prep; mentoring; social integration experiences; community service; summer bridge transition programs; experiences influencing character traits; and financial support. They found that these features were correlated with positive outcomes, including year-over-year retention, graduation rate, grade point average, and students' beliefs linked to persistence.
The current study
In this study, we build upon previously published reviews of STEM programming, developing a comprehensive and up-to-date view of DEI-focused SIPs. Like Tsui (2007) and Pearson et al. (2022), we review STEM programs that center racially minoritized students. We also include programs that target women, focusing specifically on programs to improve DEI in technology, computer science, and engineering, the STEM fields in which future disparities in career outcomes may be critical, as discussed in the introduction. While previous literature has introduced categorizations of program features, few expanded on their labeling schema to the same level as Tsui (2007), and only one has categorized program outcomes (Pearson et al., 2022). While Pearson et al. (2022) provide an up-to-date review of diversity-focused STEM interventions, our study differs in several important ways: we include studies from a broader range of years as well as studies of technology and computer science programs, and we limit our focus to the undergraduate level.
Here, we categorize SIP program features and outcomes, characterizing each in-depth and describing how they relate to one another. We frame program features as components of institutional supports for individual students and outcomes as evidence of institutional supports, recognizing that the student-centered nature of SIPs may limit the extent to which programs directly address the roots of systemic institutional failures.

Methods
In this article, we conducted a systematic review (Alexander, 2020; Booth et al., 2016) of 82 qualitative, quantitative, and mixed-methods studies published between 1991 and 2020 that report on STEM programs at 4-year U.S. colleges. The central research question for this article was: what are the features of STEM programs that produce positive outcomes for underrepresented minorities? Our goal was to describe and synthesize this literature using systematic procedures, as outlined below (Alexander, 2020; Booth et al., 2016). Given weaknesses in the underlying literature, this review does not contain a meta-analysis of the literature. We do not estimate the size or directionality of associations between STEM program components and student outcomes, as this kind of analysis was not appropriate for this literature set. Next, we detail the methods for our selection, coding, and analysis of the literature reviewed in this study.

Criteria for inclusion of literature
After defining our research questions, we outlined search criteria and criteria for inclusion. Our original criterion for literature was that studies should be focused on a technology, computer science, or engineering education program. For the purposes of this article, we defined a "program" similarly to Rincón and George-Jackson (2016a, 2016b), whose definition of a SIP is incorporated into the literature review. Upon finding few studies that focused exclusively on technology, computer science, or engineering, we expanded our search to STEM in general, seeking to include programs that may have addressed the fields of technology/computer science/engineering but used the term "STEM" to describe their programs. Programs narrowly focused on specific disciplines within the "science" or "math" realms of STEM (e.g., environmental science, chemistry) were excluded, as the information yielded in such studies may not apply to our interest area of technology, computer science, and engineering. Studies also needed to speak directly about a program or intervention, or specific program or intervention features, rather than about general practices that could be used in any program or initiative. Next, we limited our search to articles that were published in peer-reviewed journals and that focused on the postsecondary education level, and we later limited our analysis to undergraduate education due to a lack of literature on graduate student and workforce programming. We only included programs implemented in community college settings if they were also implemented at or in partnership with a 4-year institution. We also restricted the search to the United States to collect a cohesive body of evidence in a national context, and we narrowed our review to studies that included some documentation of outcomes. These outcomes could be either quantitative or qualitative, but they needed to be present in some form.
Studies reporting any outcomes were included, even if those outcomes were unrelated to our constructs of interest. Program descriptions without outcomes, policy pieces, and editorials were excluded.
Articles included in the review were required to have disaggregated data for the minoritized groups of interest. This meant one of several things: either the program was geared towards or targeting students from a specific minoritized group, so outcomes were implicitly disaggregated by group; demographic data for participants were reported and included some participants from minoritized groups; or outcome data were disaggregated based on some sort of demographic data for the minoritized groups. The last major requirement for studies was that the program or intervention was directly student-focused; faculty- or institution-focused interventions that may or may not have indirect effects on students via faculty or institutional behaviors were excluded.

Search and eligibility
We searched the Education Resources Information Center (ERIC) database on both the ProQuest (covers 1966-2021) and EBSCOHost (does not specify dates covered) search engines to find the most comprehensive body of literature possible. We generated search terms by brainstorming possible iterations of the terms STEM or technology education, postsecondary education, and diversity, and running these terms through the databases in various combinations using Boolean shortcuts and syntax. NOT terms were added when searches yielded too many articles displaying characteristics that did not meet eligibility criteria (e.g., NOT K-12). See Additional file 1: Appendix A for a comprehensive list of search strings. All articles were imported into RefWorks citation management software, and all screening for eligibility was done by one author. These searches initially yielded 9187 articles after removing all duplicates. Due to the high volume of articles, we utilized the tag function of RefWorks to remove articles including tags that would exclude them from eligibility. Once studies were flagged by the tag function, the screener scanned all titles and saved those they thought might be pertinent to the review. See Additional file 1: Appendix B for a complete list of tags searched and removed, along with articles saved by title search. The remaining articles underwent abstract scanning by one author based on the predetermined eligibility criteria. After abstract scanning, 144 articles remained. Two authors reviewed the 144 articles' contents, and articles found to be ineligible through full text review during this phase were then excluded. An additional three articles were excluded because we were unable to access the full text. Two additional articles were added at this stage; these articles were referenced by one of the excluded articles, and we found that they fit our criteria.
After the full texts of the articles were analyzed by the authors, 93 studies were found to be eligible for this review of literature. Because we then limited the scope of this article to undergraduate education, the final number of studies included was 82.

Analysis
Three of the authors coded the studies in an Excel spreadsheet using a binary coding scheme. We devised a coding scheme utilizing the PICOS framework. PICOS typically stands for Participants, Intervention, Comparison, Outcomes, and Study Design (Pollock & Berge, 2018; Methley et al., 2014). In coding participant data for the studies, we coded for the program's intended target population and then documented the number of participants, participant race and ethnicity, and gender. When participants included multiple racial groups, as was the case with most articles, all groups mentioned in the study were included. When participant race or gender was not specified, this absence was also coded. We also coded for several other factors that emerged during our analysis, including whether participants were considered low-income or were first-generation college students. While we focused on articles that targeted minoritized groups, especially in terms of race and ethnicity and gender, we also included articles that disaggregated for or noted characteristics of students such as first-generation, low-income, disabled, and academically at risk.
With regard to the interventions covered in the literature, our coding focused on documenting the type of program. Specifically, we coded for types of activities or components implemented as part of the programs or interventions. We separated these programs into areas of study: engineering; computer science; or STEM in general, defined as programs that targeted STEM students but did not specify subject areas. We also coded whether a program targeted subjects in addition to engineering, computer science, or STEM in general (such as other science or math courses). These categories were maintained throughout coding, and when multiple areas of emphasis were noted in the study, all categories were indicated. We coded for outcomes and for how each study addressed the counterfactual: whether it compared participants to a control or comparison group, and whether it calculated the statistical significance of the difference between groups. This specific coding was an iterative process which followed the model of hybrid coding (Braun & Clarke, 2012). We also recorded study design based on type of study: qualitative or quantitative.
All three coders coded 19 of the 82 articles, and inter-rater reliability (IRR) for coding of program interventions, outcomes, and statistical significance was calculated using Fleiss' kappa. Kappa was calculated to be 0.61, indicating substantial agreement (Landis & Koch, 1977). Disagreements were discussed between the coders, and a consensus was reached for all disagreements. Coding was then revised to reflect consensus decisions, but these revisions were not included in the calculation of IRR. The remaining 63 studies were split between the three coders. When questions about coding arose, the coders met to resolve confusion and recoded data as necessary.
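To make the agreement statistic concrete, the following is a minimal sketch of how Fleiss' kappa can be computed for a fixed number of coders. It assumes each coding decision is treated as one item rated by every coder; the function name fleiss_kappa and the toy ratings are hypothetical illustrations and are not the data or code used in this review.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Assumes every item was rated by the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p ** 2 for p in proportions)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical binary codes (1 = feature present) from three coders on four decisions.
toy_ratings = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
print(round(fleiss_kappa(toy_ratings), 2))
```

Under the Landis and Koch (1977) benchmarks cited above, values from 0.61 to 0.80 are conventionally read as substantial agreement.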

Program component descriptions
In order to determine what commonly found program components looked like in practice, we gathered all articles containing each specific component that found statistically significant results in one or more areas, and we analyzed the descriptions of each specific component using deductive coding (Braun & Clarke, 2012). We present a narrative summary of this qualitative review in our results.

Results
These results are organized by the elements of the PICOS framework examined in this study. See Additional file 1: Appendix C for details on individual articles.

Participants
Articles often reported on the number of participants in the actual program (or programs) being studied, but they more frequently and reliably provided the number of participants in the studies of the program (which might include control groups). The reported program sizes ranged from quite small (serving 16 or 19 students, for instance) to quite large (serving thousands of students). Table 1 shows the range of participant numbers in studies of the programs. In interpreting Table 1, please note that while 52 articles in our review contained a single study of a program or program set, 30 articles contained multiple studies of a program or program set, each with slightly different samples. Thus, Table 1 reports the frequency for each study, meaning the total N is higher than the 82 total articles in our review. These sample sizes ranged from 17 to 12,000 participants. (We cannot calculate an exact mean, as studies sometimes reported sample sizes in a fashion that allowed general inference about size, but not a specific n.) Sample sizes did relate to the kinds of studies conducted. The 30 articles that contained multiple studies used both quantitative and qualitative methods 70% of the time, while articles with single studies used both methods 10% of the time. Indeed, articles with single studies were predominantly quantitative only (75%); both multiple- and single-study articles measured significance at fairly similar rates (70% and 62%, respectively). Similarly, articles that included small samples of study participants (< 50) were much more likely to use both methods (60%) than articles with only large samples (23%). Articles with larger samples tended to use solely quantitative methods (69% of the time) and to measure significance more frequently than small-sample articles (73% and 40%, respectively).
This makes logical sense, as many articles conducted quantitative analyses with large numbers of program participants, then conducted more qualitative analyses with a subsample.
Frequencies of articles (N = 82) that reported participants in each gender and race/ethnicity category are reported in Table 2. Most articles described programs that served both genders, and programs most commonly served multiple racial/ethnic groups. In addition, 13 articles reported participants' age, while 69 did not. All 13 articles that reported age had participants between 17 and 25 years old, and 5 of them also had participants outside that age range. Eleven articles reported that some or all of their participants were from low-income households, two articles included some or all participants with disabilities, and 19 articles reported that some or all participants were first-generation college students.

Interventions/conditions
The 82 articles in this review examined a wide range of programs. Seventy articles examined a single program, while 12 articles examined a set of programs deemed similar in some regard. A handful of programs were examined by multiple articles, led by the Meyerhoff Scholars Program (9 articles), but most articles focused on unique programs. As noted previously, some articles examined the program(s) with one study/sample (N = 52), while others examined the program(s) with multiple studies/samples (N = 30), often using mixed methods (i.e., a quantitative look followed by a qualitative look with fewer participants). However, no articles examined multiple programs separately; indeed, all articles that examined multiple programs did so with one combined study/sample. Out of the total number of articles (N = 82), 47 studied engineering programs, 35 studied computer science or technology programs, 14 studied general STEM programs, and 52 studied programs that targeted some other combination of STEM disciplines. Since many programs had more than one disciplinary target, these counts sum to more than 82; for instance, an article that looked at a program for engineering and computer science students would be counted twice: once for engineering and once for computer science.
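The double counting described in the preceding paragraph can be checked with a quick tally. The following is a minimal Python sketch that uses only the discipline counts reported above; the variable and function names are illustrative assumptions and are not part of the review's coding scheme.

```python
# Discipline counts as reported in the text; one article may carry several tags.
discipline_counts = {
    "engineering": 47,
    "computer science or technology": 35,
    "general STEM": 14,
    "other combination of STEM disciplines": 52,
}
N_ARTICLES = 82  # total articles in the review


def summarize_tags(counts, n_articles):
    """Return (total tags, tags beyond the article count, mean tags per article)."""
    total = sum(counts.values())
    return total, total - n_articles, total / n_articles


total, excess, mean_tags = summarize_tags(discipline_counts, N_ARTICLES)
print(f"total tags: {total}, excess over articles: {excess}, "
      f"mean tags per article: {mean_tags:.2f}")
```

Under these counts, the tags sum to well above 82, which is what one would expect if a typical article was coded under nearly two disciplinary targets.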
Articles most commonly were focused on programs targeting students who were underrepresented in STEM on the basis of race or ethnicity (N = 38), while a substantial number focused on women (N = 27) and generalized "underrepresented minorities" (N = 20). Some studies (N = 16) concentrated on another target population, such as students who were low-income, first-generation, or academically underprepared. Many articles (N = 19) studied programs with multiple targets (making the total count greater than 82). There were also several programs without a target population (N = 9), but they were included because they disaggregated the data for different student populations. Table 3 documents the most frequently reported outcomes among articles included in this review. This does not differentiate which outcomes had better results, only the frequency with which they were found in the scope of the review. In interpreting Table 3, note that 56 out of the 82 articles included multiple outcomes, so the N is much higher than 82.

Outcomes
Of the 82 articles included in the review, 72 used quantitative methods and 36 used qualitative methods, with 26 using both methods. One major finding of this review pertains to the quality of the quantitative research base on this topic. Of the 72 articles with a quantitative component, only 53 measured statistical significance in relation to a null hypothesis based on some sort of comparison or control group, such as through longitudinal designs or pre-post analyses, and none used other statistical methods to evaluate outcomes. Table 4 displays information about significant findings for Retention Outcomes, Academic Outcomes, and Psychological Outcomes. For each of these outcomes, we present number of articles that measured significance, number and percentage of those articles that reported positive significance, and the frequency of features reported for the program(s) studied in those articles. Note that many of the articles measured more than one of the outcomes. Because there were not enough studies to make meaningful comparisons in articles measuring the statistical significance of Graduate School Outcomes and Employment Outcomes, those articles are not included in Table 4 (changing the total number of articles to 47).

Table 3 Most frequently studied program outcomes in the literature

| Outcome | Definition | Frequency |
| --- | --- | --- |
| Retention/graduation rate | Measure of students remaining in their discipline during the measurement period, or remaining until graduating | 42 |
| Academic outcomes | Academic performance such as GPA, pass rates, class grades, etc. | 32 |
| Psychological outcomes | Psychological benefits such as increased self-efficacy or sense of belonging | 41 |
| Graduate school admission/intent | Number of students admitted into graduate programs, or number of students intending on enrolling in graduate school | 22 |
| Employment | Students gaining employment in their discipline after graduation | 5 |
| Other | Studies that examined an idiosyncratic outcome, often a program evaluation or other non-generalizable measure | 46 |
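As a consistency check on the method counts reported in this section, the following minimal Python sketch applies the inclusion-exclusion principle to the reported figures (72 quantitative, 36 qualitative, 26 using both) and computes the share of quantitative articles that measured statistical significance. The function and variable names are illustrative assumptions.

```python
def union_size(n_a, n_b, n_both):
    """Inclusion-exclusion for two sets: |A or B| = |A| + |B| - |A and B|."""
    return n_a + n_b - n_both


# Counts as reported in the text.
quantitative, qualitative, both = 72, 36, 26
tested_significance = 53

total_articles = union_size(quantitative, qualitative, both)
quant_only = quantitative - both   # articles using quantitative methods alone
qual_only = qualitative - both     # articles using qualitative methods alone
share_tested = tested_significance / quantitative

print(total_articles, quant_only, qual_only, f"{share_tested:.1%}")
```

The union recovers the 82 articles in the review, which suggests that every article used at least one of the two method types.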
It is important to keep in mind that the findings reported by these articles, as a group, should not be interpreted causally. Few articles employed experimental or quasi-experimental designs, and most programs that were studied had multiple components, meaning that the components were studied together, not individually. In interpreting these results, we note that publication bias may have prevented studies that found null or negative results from being included in this review. In turn, the positive associations found should be interpreted cautiously (Torgerson, 2006). A version of Table 4 in which positively significant findings are related to features can be found in Additional file 1: Appendix E. Because many of the features were combined in programs, we believe such a breakdown can be slightly misleading, and thus do not include it in the main text.
Finally, Table 5 delineates the most common program features found as a part of our review and the frequency with which they showed up in the 82 articles without accounting for effectiveness or outcomes. The categories of program features will be explained according to their descriptions in the existing literature. Again, note that the sum of frequencies is greater than 82 because most articles (N = 67) studied programs with multiple features.

Skill building
Skill building refers to opportunities for students to apply academic or professional skills in context. In the existing literature, skill building programs were most frequently present in the forms of undergraduate research and service learning.
Most undergraduate research programs were carried out during the summer (Dunn et al., 2018; Hrabowski & Maton, 1995; Huziak-Clark et al., 2015; Kassaee & Rowell, 2016; Maton et al., 2000; Pender et al., 2010). These programs usually lasted around 8 to 10 weeks (Dunn et al., 2018; Huziak-Clark et al., 2015) and required a full-time commitment of 40 hours per week during that time (Huziak-Clark et al., 2015). However, there were several undergraduate research programs that ran during the school year and required students to participate for around five hours per week (Fisler et al., 2000; Windsor et al., 2015). These programs were generally staffed by existing university or college faculty (Baron et al., 2020; Dunn et al., 2018; Fisler et al., 2000; Huziak-Clark et al., 2015; Kassaee & Rowell, 2016; Windsor et al., 2015). Although it was not specified in many cases, it seems that most students were placed into research teams on their own campus, though some programs placed students at other universities, government, and corporate research sites (Hrabowski & Maton, 1995; Maton et al., 2000; Pender et al., 2010). Programs with skill building components might include workshops or classes on research skills (Baron et al., 2020; Fisler et al., 2000) in addition to more hands-on research activities. Some programs placed students into existing faculty research projects (Fisler et al., 2000), with assignment based on student interests (Huziak-Clark et al., 2015). Many programs offered participants compensation or a stipend for the time they spent working on these projects (Dunn et al., 2018; Fisler et al., 2000; Kassaee & Rowell, 2016; Windsor et al., 2015).
The other way skill building frequently played a part in undergraduate SIPs was in the form of service learning. In the existing body of literature, service learning was often built into for-credit courses as a part of a program (D'Souza et al., 2018; Liou-Mark et al., 2018). According to Howard (2001), service learning includes three main parts: "relevant and meaningful service with the community", "enhanced academic learning", and "purposeful civic learning" (p. 15). Examples of service-learning activities include peer leadership on campus (Liou-Mark et al., 2018) and participating in STEM outreach programs in K-12 schools (D'Souza et al., 2018).

Supplemental learning
Supplemental learning refers to opportunities for learning content, academic skills, or professional skills outside of regular university programming. Supplemental learning was presented in several ways in the established body of research, including workshops and seminars, tutoring, supplemental instruction, and learning communities.
Many SIPs included workshops or seminars for program participants. The content of these activities varied, but often included instruction on study skills and learning strategies as well as college life skills (Dunn et al., 2018; Kassaee & Rowell, 2016; Lisberg & Woods, 2018; Van Sickle et al., 2020). Workshops and seminars also featured guest speakers and information on different careers or opportunities (Allen, 1999; D'Souza et al., 2018; Dunn et al., 2018; Gibson et al., 2019; Huziak-Clark et al., 2015; Van Sickle et al., 2020). The frequency of these events varied by program. When scheduling was specified in the research, it was most commonly documented that they occurred on a monthly basis (D'Souza et al., 2018; Dunn et al., 2018; Huziak-Clark et al., 2015).
Tutoring is another activity that falls under the umbrella of supplemental learning. Though many studies reported that programs included or required a tutoring component, most did not disclose detailed information on these activities. However, some studies reported that tutoring was staffed by graduate assistants or peer tutors (Dagley et al., 2016; D'Souza et al., 2018; Pender et al., 2010). Supplemental instruction was another frequently observed supplemental learning activity. It was sometimes required for program participants or was graded and attendance-based (Peterfreund et al., 2008; Van Sickle et al., 2020). Supplemental instruction sessions may have been staffed by peer leaders who had done well in the class previously (Archat-Mendes et al., 2019; Van Sickle et al., 2020). When documented in the studies, supplemental instruction took up 90 minutes (Peterfreund et al., 2008) or 150 minutes (Van Sickle et al., 2020) per week.

Mentoring
Mentorship was an integral part of many programs examined in the scope of this review. This included peer, faculty, and professional mentoring. Peer mentors in these programs were frequently upperclassmen who were alumni of the programs themselves (Good et al., 2002; Huziak-Clark et al., 2015; Ikuma et al., 2019; Lisberg & Woods, 2018). Configurations and models of peer mentoring varied greatly among programs, from one-on-one mentoring (Huziak-Clark et al., 2015) to one mentor per ten students (Dunn et al., 2018). Peer mentors' roles included providing social-emotional support (Dunn et al., 2018), sharing their own experiences (Lisberg & Woods, 2018), and leading workshops or supplemental instruction for mentees (Liou-Mark et al., 2018; Van Sickle et al., 2020).

Socializing
Social components were described in less detail than other types of program components. However, from the body of literature, we concluded that these types of activities may include cultural events (Allen, 1999; Hrabowski & Maton, 1995; Maton et al., 2000), dinners (Allen, 1999; Good et al., 2002), field trips (Gibson et al., 2019; Windsor et al., 2015), and networking events (Windsor et al., 2015). One program, the Meyerhoff Scholars Program, included family members as part of its program community by inviting them to events (Hrabowski & Maton, 1995; Maton et al., 2000). The frequency of social activities varied greatly, from weekly (Good et al., 2002) to once per semester (Van Sickle et al., 2020). A notable strategy used by one SIP was to recruit upperclassmen to serve as leaders during networking events and interact with underclassmen (Windsor et al., 2015).
Learning communities were also included in several programs. Study groups were often a large part of these communities (D'Souza et al., 2018; Pender et al., 2010; Windsor et al., 2015). Several learning communities provided an option to live together in the same dorm community, also known as living-learning communities (Allen, 1999; Fisler et al., 2000; Szelényi & Inkelas, 2011). Learning communities seemed to be a larger framework into which other types of program components were integrated.
Page 11 of 16 Palid et al. International Journal of STEM Education (2023) 10:2

Financial aid
Of the programs that offered some sort of financial aid or incentive, aid was primarily offered in the form of scholarships or stipends. Although some studies of programs that included scholarships did not specify the amount of aid given, those that did were primarily focused on the Meyerhoff Scholars Program, which provides a full academic scholarship and room and board, and covers books and fees for participating students, on the condition that they maintain a B average and a science or engineering major (Hrabowski & Maton, 1995; Maton et al., 2000; Pender et al., 2010). Furthermore, one SIP based scholarship amounts on a sliding scale dependent on students' scores on application criteria (D'Souza et al., 2018). Those programs that included stipends varied in the amount of support provided, from $250 (Baron et al., 2020) to $3500 (Dunn et al., 2018), depending on the amount of work or time commitment expected in return. Several programs' stipends were dependent on participation in program activities or research work (Baron et al., 2020; Dunn et al., 2018; Fisler et al., 2000; Lisberg & Woods, 2018).

Discussion and implications
This study began by asking the question: what are the features of STEM programs that produce positive outcomes for underrepresented minorities? We systematically reviewed 82 published articles on STEM intervention programming in the United States. Studies focused particularly on the fields of technology, computer science, engineering, or STEM in general, and they disaggregated information on students' gender and/or racial or ethnic identity. Like Tsui (2007), George et al. (2019), Rincón and George-Jackson (2016), and Pearson et al. (2022), this article categorizes the common features of STEM intervention programs. We found six groups: supplemental learning, mentorship, skill building, financial aid, socializing, and bridge programs. All of these components can be considered institutional supports to address prior educational system failures, where failures include excluding underrepresented minorities from STEM environments (Bottia et al., 2021; Giles, 2015; Hall & Sandler, 1982; Harper, 2012; Kanny et al., 2014; Lee & McCabe, 2020; Lord et al., 2009; McGee, 2020; Morris, 2003; Ong, 2005; Rincón & George-Jackson, 2016a); inadequately preparing them for rigorous coursework (Bottia et al., 2021; Bound et al., 2009; Byrd, 2020; Ciocca Eller & DiPrete, 2018; Deil-Amen & DeLuca, 2010; Gamoran, 2009; Jennings et al., 2015; Loveless, 2009; Oakes, 1985; Rosenbaum, 1976; Rui, 2009; Valencia, 2010); imposing arduous course requirements coupled with inadequate advising (Brock, 2010; Carnevale et al., 2018; Palmer et al., 2010; Rosenbaum et al., 2017; Sanabria et al., 2020); and maintaining burdensome financial aid processes for low-income students, who are disproportionately students of color (Dynarski & Scott-Clayton, 2013; Ciocca Eller & DiPrete, 2018). Additionally, we categorized commonly reported program outcomes into five groups: retention or graduation rate, academic outcomes, psychological outcomes, graduate school admission or intent, and employment.
Only 53 of the 72 quantitative articles included in this review (about three-quarters) used statistical techniques to evaluate their outcomes. Articles may have not measured or not reported how participant outcomes fared relative to comparison students for a number of reasons, including low numbers of participants, poorly matched comparison groups, or a lack of statistically significant findings. In turn, the sample of articles we used to evaluate program effectiveness may be biased, such as by having an overrepresentation of articles with positive findings. We also note that the majority of the articles we reviewed measured correlations rather than causal relationships between program features and student outcomes. In turn, we document these correlations but are not able to provide evidence that program components directly cause any of the outcomes observed. These findings point to the need for more rigorous quantitative methodological designs that evaluate the effectiveness of various program features, implemented independently and in combination with one another. When limiting our analyses to articles that evaluated participant outcomes relative to a null hypothesis based on comparison students, we found that each category of features showed promise for improving outcomes for minoritized students in STEM, as all were included in studies that found statistically significant positive outcomes. One reason these program features appear to be successful at achieving positive outcomes may be a negation or softening of STEM's chilly climate. Because these programs are dedicated to uplifting minoritized students, students have the chance to socialize and learn with others who share their experiences with and feelings about STEM (Tsui, 2007). For example, in Ramsey et al.'s (2013) study of the University of Michigan's Women in Science and Engineering program, the researchers found that "environmental reminders of ingroup success made women seem more prevalent in STEM careers and reduced participants' stereotyping concerns" (p. 393). The committed support for diverse students within university STEM programming could explain why these programs are successful at achieving their outcomes, and could point to a possible way of helping students feel like they belong in STEM spaces.
The STEM program features studied here may also be successful because they are dedicated to improving academic preparation and providing other student supports (Valencia, 2010). STEM programs with features like financial aid, supplemental learning, skill building, and bridge programming give students educational and institutional support beyond the norm. The Meyerhoff Program, mentioned above, is one such example. By providing financial support that is contingent on high grades, the Meyerhoff Program and others like it aid students while driving them to succeed. Meyerhoff participants are also provided access to tutors, optional study groups, a preparatory summer bridge program, and academic counselors, ensuring that students do not have to struggle by themselves to meet high program expectations (Maton et al., 2000; Tsui, 2007). This initiative is just one example of how colleges can use their resources to address systemic institutional failures and help students in STEM to thrive.
By showing the success of program features at achieving positive outcomes for students from diverse backgrounds, this review provides evidence that SIPs can help students who historically have been insufficiently supported in STEM, particularly the technology, computer science, and engineering fields, to persist and achieve. Therefore, colleges devoted to diversity in STEM fields should consider creating or expanding these STEM-focused programs. With technology and science as omnipresent as they are, helping present and future college students explore their passions for STEM is more critical than ever (Bottia et al., 2021; Funk & Parker, 2018; U.S. Bureau of Labor Statistics, 2021).
However, the slow progress at improving diversity in STEM over the past 40 years (Kanny et al., 2014; Hall & Sandler, 1982; NCSE, 2021; Ong et al., 2011), despite the increasing prevalence of SIPs (Rincón & George-Jackson, 2016a, 2016b), suggests that scaling these programs alone is insufficient to achieve equitable representation. The program features identified do not show the full breadth of actions that program-running institutions can and should take to promote diversity in STEM and tackle discrimination in the world of academia and beyond (Allen-Ramdial & Campbell, 2014; BrckaLorenz et al., 2021; George et al., 2019; McGee, 2020). While this systematic review primarily focuses on programs implemented within existing systems or institutions, what may really be necessary to remedy this issue sustainably is a fundamental change in the way that these systems and institutions operate (López et al., 2022; McGee, 2016, 2020; Miriti, 2020; National Academies of Sciences, Engineering, & Medicine, 2016; Robinson, 2022; Whittaker & Montgomery, 2012).

Limitations
This review has several limitations. For example, we may have passed over interventions that were very effective or promising because the article studying the program did not provide indicators to assess effectiveness. Publication bias may also have been a significant limitation in our systematic review. There is a documented tendency for journals to publish studies that have positive results (Torgerson, 2006). We could be missing out on a well-rounded body of literature because studies deeming STEM program practices to be ineffective may not be published.
Because this review focused specifically on undergraduate programs, we cannot be sure that DEI programs housed within other types of postsecondary programs would have the same outcomes. Similarly, because we grouped different populations together in our analysis, we are unable to tease apart relationships between program features and outcomes for specific minoritized subgroups. In turn, not all findings may be applicable to all groups underrepresented in STEM (Chubin et al., 2015). SIP program features that are effective overall may not be effective for certain racial or ethnic groups or for women, and SIPs that are effective for White women may not be helpful for women of color because of "the way in which gender operates together with race" (Lord et al., 2009). As an example, Lord et al. (2009) note, "Women in engineering do not necessarily share common experiences of marginality. For example, women of color may experience both sexism and racism, compounding their experiences of exclusion." Therefore, STEM programs serving women of color must actively work to address "the double bind" of oppression that these students face, or else they may fail or only partially succeed (Hall & Sandler, 1982; Lord et al., 2009; Ong et al., 2011).

Implications
This systematic review yields several implications for practice. The first is that all the program components included in this review show promise for improving outcomes for minoritized students in undergraduate STEM programs. Supplemental learning and mentorship appeared the most in articles showing positively significant findings. However, skill building, socializing, bridge programs, and financial aid also show potential for success. These features were less common overall, but studies of programs that included them often found positive results. The majority of programs covered in this systematic review had multiple program components. As others have argued (Tsui, 2007), we believe it is likely that providing a wide range of features in a program increases participant success, in part because such programs seem to address multiple institutional failures underlying underrepresentation in STEM rather than just one. Programs focused on minoritized students in STEM can address not only academic preparation, but also issues of campus climate and culture as well as institutional structures such as admission policies and distribution of resources. Future research should examine not only the efficacy of individual components, but also the efficacy of components as they interact. More research should be done specifically on financial aid, socializing, and bridge programs as interventions for minoritized groups in STEM. These components show promise in the existing research, but further study would help to support or refute their efficacy. Another implication for research relates to the quality of the body of literature we found. A greater number of high-quality quantitative studies on SIPs must be published in order to promote best practices and ensure the successes of future generations of minoritized students in STEM. This means using the necessary statistical methods to support conclusions about programs and interventions, something that is not prevalent in the current body of research about STEM programming. Finally, researchers should examine how SIPs and other student-focused interventions are implemented alongside, or instead of, interventions that address institutions' systemic biases and barriers to inclusion.

Conclusion
In this article, we presented an updated systematic review of 82 articles about diversity-focused STEM programs and their features and outcomes. The aim of this review was to answer the question: what are the features of STEM programs that produce positive outcomes for underrepresented minorities? Following in the footsteps of prior literature reviews on this topic, we created new categories for STEM program features, and went a step further to classify commonly studied outcomes of these programs. We found that the program features examined here (supplemental learning, mentorship, skill building, financial aid, socializing, and bridge programs) represent various forms of institutional support for STEM students, and all have demonstrated associations with positive outcomes for SIP participants. Thus, students struggling in STEM due to an unwelcoming climate, inadequate prior academic preparation, or other institutional shortcomings may find their retention, academic success, and psychological well-being improved after participating in an SIP. Although the interventions investigated in this review were associated with positive outcomes, more work needs to be done to enhance our understanding of how to promote and sustain equity in STEM for minoritized students.