Scientific validation of Sapiens' Self-Assessments
Sapiens has selected scientifically validated assessments based on four criteria:
- Reliability: internal consistency (e.g., Cronbach’s α), test–retest stability.
- Construct validity: factor structure and expected relationships with related constructs (e.g., burnout).
- Criterion validity / diagnostic accuracy (when applicable): performance against a reference standard (structured clinical interview, clinician ratings, or established case definitions), often summarized by sensitivity/specificity at a cutoff.
- Responsiveness: ability to detect change over time (important for interventions).
Important: many of the tools used in a wellness setting are not designed as diagnostic tests (e.g., Sapiens does not diagnose depression or other diseases), so “sensitivity/specificity” is either not applicable (purely dimensional scales) or is cutoff- and population-dependent (screening tools). At Sapiens, results are interpreted as risk signals and severity gradients to guide prevention and follow-up - never as a medical diagnosis.
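Where sensitivity/specificity is reported for a cutoff, it is computed against a binary reference standard as follows (a minimal sketch; the function name and the toy scores/labels are illustrative, not from any specific study):

```python
def sensitivity_specificity(scores, labels, cutoff):
    """Sensitivity/specificity of a questionnaire cutoff against a
    binary reference standard (1 = case, 0 = non-case).
    Scores at or above the cutoff count as screen-positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    return tp / (tp + fn), tn / (tn + fp)
```

Because both quantities depend on the chosen cutoff and the population's case mix, published values (e.g., "≥10") transfer only approximately to new settings.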
Self-assessments used in the Sapiens diagnostics
Self-reported stress:
Perceived Stress Scale
Brief Resilience Scale
Fatigue Assessment Scale (FAS)
Maslach Burnout Inventory (MBI)
Work environment:
Work Demand–Support–Control (JCQ)
Psychological Safety (PSQ)
Effort–Reward Imbalance (ERI)
Mental health:
Emotional Wellbeing (PHQ-8)
Anxiety (GAD-7)
Body system health:
Gastrointestinal Symptoms (GISI)
Immune System Symptoms (ISQ)
Musculoskeletal System Health (NMQ)
Habits:
Pittsburgh Sleep Quality Index (B-PSQI)
Food Frequency Questionnaire (adapted by Sapiens based on various studies)
Other habit questionnaires (movement/exercise, emotional regulation, attention regulation, ...)
Other:
Self-reporting of physical/mental health issues
Survey of confounding factors for biomarker data
Self-reported stress scales
Perceived Stress Scale (PSS-4)
What it measures
The PSS measures the degree to which people appraise situations in their lives as unpredictable, uncontrollable, and overloaded in the past month - core components of perceived stress appraisal. The 4-item form (PSS-4) is an ultra-brief version intended for low-burden settings.
How it was validated
The original PSS development established construct validity by showing expected associations with stress-related outcomes and life-event measures; shorter forms (including PSS-4) were subsequently used where brevity is required, typically with somewhat reduced reliability relative to longer forms. Modern reviews summarize its broad use and psychometric performance across contexts.
Sensitivity/specificity
PSS/PSS-4 is not a diagnostic test for a disorder; sensitivity/specificity is generally not the primary framework. It is best treated as a continuous stress appraisal index.
Interpretation (how Sapiens uses it)
- Higher scores indicate higher perceived stress burden in the last month.
- Interpretation should account for context (acute deadlines vs. sustained load) and triangulate with objective recovery/stress physiology (e.g., sleep recovery, autonomic strain).
- Most informative when repeated over time (trend direction and change magnitude).
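Scoring follows the published PSS convention: each of the four items is rated 0–4, and the two positively worded items (items 2 and 3) are reverse-scored before summing, giving a 0–16 total. A minimal sketch (the function name is ours):

```python
def pss4_score(responses):
    """PSS-4 total: four items rated 0-4; the two positively worded
    items (2 and 3, 1-indexed) are reverse-scored before summing.
    Total ranges 0-16; higher = more perceived stress."""
    assert len(responses) == 4 and all(0 <= r <= 4 for r in responses)
    reversed_items = {1, 2}  # 0-indexed positions of items 2 and 3
    return sum(4 - r if i in reversed_items else r
               for i, r in enumerate(responses))
```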
Key limitations
Recall bias, response style (social desirability), cultural/language differences, and reduced precision for ultra-short forms compared with PSS-10/PSS-14.
Key sources
Cohen et al., 1983: A global measure of perceived stress
Harris et al., 2023: The Perceived Stress Scale as a Measure of Stress: Decomposing Score Variance in Longitudinal Behavioral Medicine Studies
Brief Resilience Scale (BRS)
What it measures
The BRS was designed to measure resilience specifically as the ability to “bounce back” from stress—distinct from resilience resources (e.g., social support) or protective factors.
How it was validated
The original validation tested factor structure (largely unidimensional), internal consistency, and convergent/discriminant validity across multiple samples (including clinical groups). Subsequent work supports its psychometric properties in additional languages and clinical contexts (including German validation work).
Sensitivity/specificity
BRS is typically used dimensionally; diagnostic sensitivity/specificity is not a standard use case.
Interpretation (how Sapiens uses it)
- Lower resilience scores can flag reduced recovery capacity and may contextualize why similar stress exposure produces different symptom or fatigue profiles across individuals.
- Best used alongside stress exposure (workload, sleep disruption) and physiology (recovery metrics), not as a standalone “trait label.”
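For reference, the validated 6-item BRS is scored as the mean of the six items (rated 1–5), with the negatively worded items 2, 4, and 6 reverse-scored. A minimal sketch (the function name is ours):

```python
def brs_score(responses):
    """BRS: six items rated 1-5; items 2, 4, 6 (1-indexed) are
    negatively worded and reverse-scored. The score is the item
    mean (1-5); higher = greater ability to bounce back."""
    assert len(responses) == 6 and all(1 <= r <= 5 for r in responses)
    reversed_items = {1, 3, 5}  # 0-indexed positions of items 2, 4, 6
    adjusted = [6 - r if i in reversed_items else r
                for i, r in enumerate(responses)]
    return sum(adjusted) / 6
```

A brief selected-item indicator (as used at Sapiens) cannot be scored this way and should be reported separately from the validated total.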
Key limitations (especially with selected items)
Using only selected questions reduces measurement precision and may change the scale’s reliability and factor properties. In Sapiens, selected BRS items should be treated as a brief resilience indicator, not a full substitute for the validated 6-item total score.
Key sources
Smith et al., 2008: The brief resilience scale: assessing the ability to bounce back
Broll et al., 2024: Psychometric properties of the German version of the brief resilience scale in persons with mental disorders
Fatigue Assessment Scale (FAS)
What it measures
The FAS assesses fatigue as a largely unidimensional construct capturing both mental and physical fatigue symptoms in a brief format.
How it was validated
The original validation work examined dimensionality, reliability, and convergent validity against related fatigue measures and adjacent constructs (e.g., mood). It has been widely used in working populations and chronic disease contexts as a fatigue severity index.
Sensitivity/specificity
FAS is usually interpreted dimensionally; sensitivity/specificity depends on how one defines a “case” of clinically significant fatigue, which varies by context.
Interpretation (how Sapiens uses it)
- Higher fatigue scores suggest elevated fatigue burden; interpretation should consider sleep quality, workload, recovery physiology, and mood symptoms (fatigue overlaps with depression/anxiety and sleep disruption).
- Repeated measures help distinguish transient overload from persistent fatigue trajectories.
Key limitations (especially with selected items)
Selected questions are best treated as a fatigue signal, not a fully validated total score. Fatigue is also non-specific: it can reflect sleep loss, high stress load, infection, anemia, endocrine issues, mood disorders, medication effects, or overtraining.
Key sources
Michielsen et al., 2003: Psychometric qualities of a brief self-rated fatigue measure: The Fatigue Assessment Scale
Maslach Burnout Inventory (MBI)
What it measures
The MBI operationalizes burnout as a work-related syndrome classically characterized by exhaustion, cynicism/depersonalization, and reduced professional efficacy/personal accomplishment, depending on the version (Human Services, Educators, General Survey).
How it was validated
Foundational work established the scale structure and relationships with job factors and outcomes. A large body of subsequent literature supports its reliability and validity across occupations and countries, including meta-analytic reliability generalization.
Sensitivity/specificity
Burnout measurement is primarily dimensional. There is no universally accepted clinical gold standard for burnout as a diagnosis, so sensitivity/specificity is not consistently defined across studies. (Some studies evaluate screening cutoffs for specific subscales, but these are context-dependent.)
Interpretation (how Sapiens uses it)
- Patterns across subscales matter (e.g., high exhaustion with rising cynicism suggests worsening job strain and disengagement risk).
- Interpretation should be paired with work environment measures (demand/control/support, effort–reward imbalance, psychological safety) to identify plausible upstream drivers.
- Burnout signals should trigger prevention action and—when severe or persistent—professional evaluation.
Key limitations
MBI is copyrighted and exists in multiple versions; scores are influenced by occupational context, role expectations, and response styles. Burnout overlaps with depression and chronic stress but is conceptually distinct; misclassification is possible if interpreted without context.
Key sources
Maslach & Jackson, 1981: The measurement of experienced burnout
Maslach et al., 2001: Job Burnout
Aguayo et al., 2011: A meta-analytic reliability generalization study of the Maslach Burnout Inventory
Work environment
Effort–Reward Imbalance Questionnaire (ERI)
What it measures
ERI quantifies stress arising when high effort at work is not matched by adequate rewards (salary, esteem, job security/career opportunities), often complemented by “overcommitment” as a personal pattern amplifying risk.
How it was validated
The ERI model and questionnaire have extensive validation across occupational groups and countries, including factor structure, internal consistency, and predictive associations with health outcomes and work stress indicators.
Sensitivity/specificity
Primarily dimensional; commonly interpreted via ERI ratio (effort relative to reward) and overcommitment levels rather than diagnostic cutoffs.
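The ER ratio mentioned above is conventionally computed as effort divided by reward, with a correction factor for the unequal number of effort and reward items; values above 1 indicate effort exceeding reward. A minimal sketch (the function name is ours):

```python
def eri_ratio(effort_sum, reward_sum, n_effort, n_reward):
    """Effort-reward ratio with the usual correction factor for
    unequal item counts: ratio = E / (R * c), c = n_effort / n_reward.
    Ratios > 1 indicate effort exceeding reward (imbalance)."""
    c = n_effort / n_reward
    return effort_sum / (reward_sum * c)
```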
Interpretation (how Sapiens uses it)
- High effort + low reward profiles indicate a high-risk work stress configuration; this helps explain persistent autonomic strain and impaired recovery despite “healthy habits.”
- Supports targeted organizational and individual interventions (role clarity, reward structures, workload, boundary setting).
Key limitations
Context dependent (industry, seniority), susceptible to response bias, and not a clinical diagnosis tool.
Key sources
Zurlo et al., 2010: Validity and reliability of the effort-reward imbalance questionnaire in a sample of 673 Italian teachers
Schaufeli et al., 2000: The Validity and Reliability of the Dutch Effort-Reward Imbalance Questionnaire
Work demand–support–control (JCQ / DCSQ family)
What it measures
Demand–control–support frameworks quantify psychosocial job strain through psychological demands, decision latitude/control, and social support. The Job Content Questionnaire (JCQ) is the classic instrument family; the Demand Control Support Questionnaire (DCSQ) is a widely used derivative.
How it was validated
JCQ has extensive international reliability evidence and cross-national applicability work; DCSQ versions have published psychometric validation including language adaptations.
Sensitivity/specificity
Not typically framed as diagnostic. Interpretation often uses quadrant approaches (e.g., high demand/low control “high strain”) and continuous scale analysis.
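The quadrant approach can be sketched as a median split on demand and control (medians usually come from the study sample or a reference population; the function name is ours):

```python
def job_strain_quadrant(demand, control, demand_median, control_median):
    """Classic quadrant classification from the demand-control model,
    using median splits on psychological demands and decision latitude."""
    high_d = demand > demand_median
    high_c = control > control_median
    if high_d and not high_c:
        return "high strain"
    if high_d and high_c:
        return "active"
    if not high_d and high_c:
        return "low strain"
    return "passive"
```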
Interpretation (how Sapiens uses it)
- Identifies whether stress load is driven mainly by demand volume, low autonomy, or lack of support—each implies different intervention levers.
- Helps interpret recovery suppression that does not resolve with lifestyle improvements alone.
Key limitations
Job strain is multidimensional; industry and role norms affect interpretation; cross-cultural comparability can vary by scale.
Key sources
Karasek et al., 1998: The Job Content Questionnaire (JCQ): An Instrument for Internationally Comparative Assessments of Psychosocial Job Characteristics
Mauss et al., 2018: Validating the Demand Control Support Questionnaire among white-collar employees in Switzerland and the United States
de Araújo & Karasek, 2008: Validity and reliability of the job content questionnaire in formal and informal jobs in Brazil
Psychological safety (PSQ; Edmondson-derived)
What it measures
Team psychological safety reflects a shared belief that the team is safe for interpersonal risk-taking—speaking up, admitting uncertainty, and learning from errors without fear of embarrassment or punishment.
How it was validated
Foundational work linked psychological safety to learning behavior and performance in real-world teams and introduced validated survey items; subsequent research supports construct validity across contexts and adaptations.
Sensitivity/specificity
Not diagnostic. Interpreted as a team climate construct, typically at group level (though individuals’ perceptions matter for intervention targeting).
Interpretation (how Sapiens uses it)
Low psychological safety can sustain chronic stress via social threat, anticipatory anxiety, and suppressed recovery (e.g., constant monitoring and reduced willingness to set boundaries). It also predicts whether behavior change and health interventions will “stick” in a team environment.
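Because the construct is defined at group level, individual ratings are typically aggregated to a team mean, with dispersion as a caution signal for fragmented climates. An illustrative aggregation (not an official scoring rule; the function name is ours):

```python
def team_psych_safety(individual_scores):
    """Aggregate individual psychological-safety ratings to team level:
    the team mean carries the construct; high dispersion suggests the
    'shared belief' may not actually be shared within the team."""
    n = len(individual_scores)
    mean = sum(individual_scores) / n
    var = sum((s - mean) ** 2 for s in individual_scores) / n
    return {"team_mean": mean, "dispersion_sd": var ** 0.5}
```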
Key limitations
Best assessed at team level; individual-level scores can be influenced by personality, tenure, and local conflict. Cross-team comparisons require caution unless measurement invariance is established.
Key sources
Edmondson, 1999: Psychological Safety and Learning Behavior in Work Teams
Mental health
Emotional Wellbeing (PHQ-8)
What it measures
The PHQ-8 measures depressive symptom severity over the past two weeks, excluding the suicidal ideation item present in PHQ-9, which can be advantageous in non-clinical and population screening workflows while still tracking depression severity. Because Sapiens is not a medical product and cannot diagnose depression, we use the term "emotional wellbeing" for this scale.
How it was validated
PHQ-8 has been validated in large population datasets and clinical contexts, showing good internal consistency and strong concordance with definitions of current depression using either an algorithm or a cutoff score.
Sensitivity/specificity
In Kroenke et al. (2009), a cutoff ≥10 is supported for defining current depression in population studies, with diagnostic accuracy comparable to PHQ-9 approaches in many settings.
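The severity bands commonly used with the PHQ family can be sketched as follows (the function name is ours):

```python
def phq8_severity(total):
    """Map a PHQ-8 total (0-24) to the severity bands commonly used
    with the PHQ family; >= 10 is the usual screening flag."""
    assert 0 <= total <= 24
    if total <= 4:
        return "none/minimal"
    if total <= 9:
        return "mild"
    if total <= 14:
        return "moderate"
    if total <= 19:
        return "moderately severe"
    return "severe"
```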
Interpretation (how Sapiens uses it)
- Use as a severity tracker and risk flag.
- Elevated scores should prompt context review (sleep, overload, recent stressors) and, if persistent or high, referral to qualified clinical assessment.
- Because PHQ-8 omits suicidality, it should not be used to rule out risk when clinical concern exists.
Key limitations
Screening performance depends on population prevalence and setting; depression symptoms overlap with fatigue, sleep disturbance, and medical causes. PHQ-8 is not a substitute for a diagnostic interview.
Key sources
Kroenke et al., 2009: The PHQ-8 as a measure of current depression in the general population
Anxiety severity (GAD-7)
What it measures
GAD-7 measures generalized anxiety symptom severity over the last two weeks and is widely used both for screening and monitoring symptom change.
How it was validated
The original validation demonstrated strong internal consistency, test–retest reliability, and expected associations with functional impairment and related anxiety measures.
Sensitivity/specificity
A cutoff ≥10 is commonly used for identifying probable generalized anxiety disorder, with reported sensitivity about 0.89 and specificity about 0.82 in the original validation context.
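The standard GAD-7 bands from the original validation can be sketched as (the function name is ours):

```python
def gad7_severity(total):
    """Map a GAD-7 total (0-21) to the standard severity bands;
    >= 10 is the commonly used screening cutoff."""
    assert 0 <= total <= 21
    if total <= 4:
        return "minimal"
    if total <= 9:
        return "mild"
    if total <= 14:
        return "moderate"
    return "severe"
```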
Interpretation (how Sapiens uses it)
- Treat as a risk and severity marker; look for concordance with sleep quality, recovery physiology, and perceived stress.
- Elevated scores warrant evaluation of triggers (workload, uncertainty, health concerns) and consideration of evidence-based interventions; persistent/high scores warrant clinical follow-up.
Key limitations
GAD-7 is sensitive to general anxiety burden and may be elevated in other anxiety-related conditions; optimal cutoffs may differ by population and comorbidity.
Key sources
Spitzer et al., 2006: A brief measure for assessing generalized anxiety disorder: the GAD-7
Body system health
Gastrointestinal symptoms (GISI)
What it measures
“GISI” is commonly used to describe brief GI symptom severity indices (e.g., GI severity indices assessing constipation, diarrhea, stool characteristics, gas/flatulence, and abdominal pain). In the published literature, GI Severity Index variants have been used as symptom burden screeners (not disease-specific diagnostic tools).
How it was validated (and what that implies)
The GI Severity Index referenced in parts of the literature originated in autism/GI dysfunction research as a symptom severity tracker (e.g., Schneider et al., 2006; later modified to 6 items in some studies). Its psychometric validation is not as universally standardized across adult occupational populations as tools like PHQ-8/GAD-7/PSQI.
Sensitivity/specificity
Generally not established in a way comparable to diagnostic questionnaires for functional GI disorders (e.g., Rome criteria). Treat it as a symptom severity index, not a diagnostic classifier.
Interpretation (how Sapiens uses it)
- Higher scores indicate higher GI symptom burden.
- Interpretation should consider confounders: recent infections, diet changes, travel, antibiotics, alcohol, sleep loss, and stress load.
- Persistently high or worsening symptoms should trigger medical evaluation (alarm features always override questionnaires).
Key limitations
Non-specificity, population-specific origin in parts of the literature, and limited standardized cutoffs for adult occupational screening.
Key sources
Schneider et al., 2006: Oral human immunoglobulin for children with autism and gastrointestinal dysfunction: a prospective, open-label study
Adams et al., 2011: Gastrointestinal flora and gastrointestinal status in children with autism – comparisons to typical children and correlation with autism severity
Immune system symptoms (ISQ)
What it measures
The Immune Status Questionnaire (ISQ) assesses perceived immune functioning over the past 12 months using a brief symptom-based format (e.g., frequency of common infection-related complaints).
How it was validated
Development and validation work supports face/content validity, construct validity, reliability, and a scoring approach including a cutoff for “reduced immune functioning” in the intended use as perceived immune status screening.
Sensitivity/specificity
ISQ is primarily a perceived immune status measure; diagnostic sensitivity/specificity for immunodeficiency is not its purpose. Where cutoffs are proposed, they are intended to flag reduced perceived immune functioning, not clinical diagnosis.
Interpretation (how Sapiens uses it)
- Elevated symptom burden can reflect frequent infections, poor recovery, high stress load, sleep disruption, or high exposure environments.
- Best interpreted with sleep, stress, and recovery signals—and with clinical follow-up when symptom burden is high or atypical.
Key limitations
Recall bias (12-month window), symptom non-specificity, and mismatch between perceived and objectively measured immune function in some individuals.
Key sources
Versprille et al., 2019: Development and Validation of the Immune Status Questionnaire (ISQ)
Musculoskeletal system health (NMQ)
What it measures
The Nordic Musculoskeletal Questionnaire (NMQ) is a standardized instrument to assess the presence and distribution of musculoskeletal symptoms across body regions, commonly in occupational health and ergonomics.
How it was validated
The original NMQ development emphasized standardized symptom ascertainment and acceptable reliability for occupational screening, with extensive subsequent use and extensions (including test–retest work and adaptations).
Sensitivity/specificity
NMQ is used to assess symptom prevalence and regions affected; sensitivity/specificity for specific diagnoses is generally not the primary framework.
Interpretation (how Sapiens uses it)
- Identifies burden patterns (e.g., neck/shoulder + low back clusters) that often align with workload, sedentary behavior, training errors, sleep, and stress.
- Supports targeted prevention actions (ergonomics, movement programming, recovery).
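The NMQ yields region-by-region yes/no reports rather than a single total; an illustrative aggregation for burden-pattern reporting (not an official NMQ score, which does not exist) might look like:

```python
def nmq_summary(region_reports):
    """Summarize NMQ-style reports: dict of body region -> bool
    (trouble in the reporting window). Returns the affected regions
    and a simple region count as a burden indicator."""
    affected = [r for r, present in region_reports.items() if present]
    return {"affected_regions": affected, "n_regions": len(affected)}
```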
Key limitations
Self-report symptom attribution is imperfect; it does not identify pathology, and symptom prevalence varies with job type and reporting norms.
Key sources
Kuorinka et al., 1987: Standardised Nordic questionnaires for the analysis of musculoskeletal symptoms
Dawson et al., 2009: Development and Test–Retest Reliability of an Extended Version of the Nordic Musculoskeletal Questionnaire (NMQ-E): A Screening Instrument for Musculoskeletal Pain
Habits
Sleep quality (B-PSQI)
What it measures
The Pittsburgh Sleep Quality Index (PSQI) measures sleep quality and disturbances over the past month. The brief version (B-PSQI) provides a reduced-item screening approach while preserving much of the discriminatory ability of the full PSQI.
How it was validated
The original PSQI paper established internal consistency, test–retest reliability, and diagnostic discrimination between good and poor sleepers; a commonly used global cutoff >5 distinguished good vs. poor sleepers with high sensitivity and specificity in the original validation.
The B-PSQI has published validation supporting reliability and screening utility, including reported sensitivity/specificity in population-based samples.
Sensitivity/specificity
- PSQI global score >5: sensitivity ~89.6%, specificity ~86.5% (original validation context).
- B-PSQI reports more moderate screening performance (population dependent).
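The classic PSQI decision rule is a simple threshold on the global score (a sketch; the function name is ours):

```python
def psqi_flag(global_score, cutoff=5):
    """PSQI global score (0-21): scores above the classic cutoff of 5
    flagged 'poor sleeper' in the original validation context."""
    assert 0 <= global_score <= 21
    return "poor sleeper" if global_score > cutoff else "good sleeper"
```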
Interpretation (how Sapiens uses it)
- High scores indicate poor sleep quality or sleep disturbance burden.
- Interpreted alongside objective sleep/recovery signals where available and contextual drivers (work hours, travel, alcohol, late meals, stress).
Key limitations
PSQI correlates with mood symptoms and can be influenced by depression/anxiety; it is not a sleep-disorder diagnosis tool (e.g., sleep apnea requires separate evaluation).
Key sources
Buysse et al., 1989: The Pittsburgh Sleep Quality Index: a new instrument for psychiatric practice and research
Sancho-Domingo et al., 2021: Brief version of the Pittsburgh Sleep Quality Index (B-PSQI) and measurement invariance across gender and age in a population-based sample
Food Frequency Questionnaire (FFQ)
What it measures
FFQs estimate habitual dietary intake patterns over weeks to months by querying frequency (and sometimes portion size) of common foods. They are optimized for ranking individuals and capturing typical patterns rather than exact intake on a single day.
How it was validated
Classic validation work compared FFQ results with repeated diet records over a year, evaluating reproducibility and validity for nutrient and food group estimates. Many subsequent FFQ validations confirm moderate-to-good ranking performance depending on nutrient/food group and design.
Sensitivity/specificity
Not typically defined; FFQs are not diagnostic tests. Performance is usually reported as correlation with reference methods and the ability to classify individuals into intake quantiles.
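Because FFQs are evaluated on ranking ability, a typical analysis step is assigning individuals to intake quantiles of a reference distribution, e.g. quartiles (a sketch; the function name is ours):

```python
import bisect

def intake_quartile(value, reference):
    """Assign an FFQ-estimated intake to a quartile (1-4) of a
    reference distribution: FFQs are used to rank individuals,
    not to measure exact intake."""
    ref = sorted(reference)
    n = len(ref)
    # quartile cutpoints at the 25th/50th/75th percentile positions
    cuts = [ref[n // 4], ref[n // 2], ref[(3 * n) // 4]]
    return 1 + bisect.bisect_right(cuts, value)
```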
Interpretation (how Sapiens uses it)
- Used to identify dietary patterns relevant to recovery and symptoms (e.g., protein distribution, fiber/plant diversity proxies, alcohol frequency, meal timing).
- Best interpreted with awareness of measurement error; repeated reassessment improves signal.
Key limitations
FFQs have systematic measurement error (memory, portion estimation, social desirability) and can attenuate true associations; they are stronger for patterns than precise micronutrient quantification.
Key sources
Willett et al., 1985: Reproducibility and validity of a semiquantitative food frequency questionnaire
Freedman et al., 2011: Dealing with dietary measurement error in nutritional cohort studies


