“I Never Ran Two Miles in Combat”
And Other Terrible Arguments on How to Test Fitness
Everyone has an opinion about how the military should test fitness. Mysteriously, these opinions tend to align with whatever the person holding them happens to be good at. One of the most common arguments is also among the most misguided: “I never ran two miles in combat, so we shouldn’t have a two-mile run in the test.” The critics making this argument are focused on simulation rather than test validity. It’s time to break down the purpose behind fitness assessments and, hopefully, dispel some of the persistent confusion around them.
In athletics, fitness assessments serve several purposes. Most importantly, they measure a component of fitness that is relevant to the desired performance and serve as a guide for programming future training. Because sports happen in very controlled environments, sports scientists can precisely quantify the physical demands, allowing for very specific fitness assessments. Soccer teams often use the yo-yo intermittent recovery test or the Manchester United running test because of the frequent starting and stopping and high total running required in soccer matches. Basketball and volleyball require a high level of jumping ability, so teams in those sports place heavy emphasis on vertical jump height.
This becomes difficult with the military, since the physical demands of both daily job tasks and combat are far less predictable. Physical demands vary widely between jobs, missions, and environments. The military also has two additional considerations: First, assessments are often tied to career opportunities, creating powerful incentives to train in a particular way, so they should encourage the desired training. Second, they frequently serve as enforcement mechanisms where failure to meet certain standards results in administrative punishments up to and including separation from service.
Because of the complexity of combat, it is important to understand the distinction between simulating a task and measuring a component of fitness. Combat task simulations often include sled (or rescue dummy) drags, crawling, negotiating obstacles, and loading weights onto a platform. A dead giveaway that an assessment is a combat simulation is when it is done in body armor. Each of these events relies on several components of fitness. While different organizations have slightly different lists and definitions, generally accepted components of fitness include things like:
Aerobic Endurance
Anaerobic Endurance
Flexibility
Balance
Body Composition
Muscular strength
Muscular endurance
Power
Speed
Agility
A sled drag, for example, incorporates elements of muscular strength or muscular endurance (depending on the weight), and anaerobic or aerobic endurance (depending on the distance/duration). Negotiating obstacles introduces elements like power and balance. Fitness assessments that are obvious simulations of combat tasks may seem superficially appropriate, but if we return to the purpose of assessments they fail on several criteria.
First, however, we need to take a quick detour and talk about different types of validity. In research, validity is “the extent to which a concept, conclusion, or measurement is well-founded and likely corresponds accurately to the real world.” There are many different types of validity, but two are most important for this discussion.
Face Validity is “the extent to which a test is subjectively viewed as covering the concept it purports to measure.” In simple terms, if an outside observer thinks a test “looks like” it measures the right thing, then it has face validity. For better or for worse, combat simulations tend to have higher face validity among military populations than traditional tests of fitness.
Construct Validity is “how well a set of indicators represent or reflect a concept that is not directly measurable.” This may sound strange, but it is common in many scientific fields. There are many tests to measure the construct of stress, but there is no objective measure for stress (in other words, stress is a “construct” that we measure indirectly). Similarly, all the components of fitness listed above are constructs: we can point to examples of them, but there is no single objective measure for any of them.
This brings us back to assessment design, and we can use a concrete example to simplify things. The sprint drag carry event of the ACFT has high face validity because casual observers can immediately see its relevance. It simulates common tasks like sprinting between cover, carrying ammo cans, and dragging a casualty. Unfortunately, this mashup makes it very difficult to say what it measures as an assessment. One soldier may struggle because of a lack of lower body strength, while another may be plenty strong but lack anaerobic endurance. Despite its face validity, the sprint drag carry suffers both in construct validity (what is it actually measuring?) and as a diagnostic tool for planning future training (without further testing, you don’t know what that soldier needs to focus on).
The Army’s own materials state that the sprint drag carry tests muscular endurance, muscular strength, anaerobic power, anaerobic endurance, balance, coordination, agility, flexibility, and reaction time. If it tests all those things, how does a soldier know what they need to work on based on their results?
Every time soldiers take an ACFT, they are told, “The results of this test will give you and your commanders an indication of your state of physical readiness and will act as a guide in determining your physical training needs.” If this is truly the case, then construct validity of the test events is crucial. Replacing the sprint drag carry with an event more focused on a particular component of fitness (for example, the 300-yard shuttle run is a well-studied assessment of anaerobic endurance) would allow the results to be more easily interpreted.
So what does this all have to do with running (or lack thereof) in combat? Every military service includes a run in its physical fitness assessment. The Air Force and Navy run 1.5 miles, the Army runs 2 miles, and the Marine Corps runs 3 miles. In fact, the Department of Defense mandates (in DoD Instruction 1308.03) that all of the services test aerobic fitness. Separate from any combat relevance, aerobic fitness is a strong predictor of health, longevity, and injury risk. But research has also demonstrated that aerobic capacity influences combat performance in a variety of ways, even if extended running is rare in combat.
At the most superficial level, combat requires repeated bouts of activity at a variety of intensities. This might look like long, slow dismounted patrols punctuated by the intensity of reacting to contact. Aerobic capacity determines a soldier’s ability to recover between bouts of intensity and to sustain performance over long periods.
A more specific example is illustrated below from Vince Paikowski’s analysis of Rangers’ performance shooting under fatigue. Aerobic capacity (as measured by 12-mile ruck times) correlates strongly with how well these Rangers were able to maintain marksmanship performance under fatigue. The more aerobically fit they were, the less their marksmanship suffered.
Fundamentally, the run is included in military fitness assessments not because service members are expected to run long distances in combat, but because it is a field-expedient (simple, no equipment, etc.) test of aerobic capacity, which is crucial to the health and performance of all service members and will serve them well in combat scenarios.
Here are some examples from research on how running, as a test of aerobic capacity, serves as a reliable and valid predictor of performance on combat tasks:
“Of the physical fitness component groups evaluated (2-mile run, sit ups, push-ups, jump tests, squats, sprints, pull-ups, grip tests, arm lifts, curls, and various extension machine tests) aerobic capacity is most strongly correlated across the greatest number of military tasks.” - Hauschild et al, 2014
“Aerobic capacity (assessed via treadmill graded exercise test protocol) and anaerobic capacity (assessed via Wingate protocol) accounted for the most variance in time to completion of the warrior task simulation test.” - Huang et al, 2018
“The mean work intensity in the measured military tasks was close to 50% of soldiers' maximal aerobic capacity, which has been suggested to be the maximal limit of intensity for sustained work. As a practical implication of the present and previous studies, it can be concluded that the minimum requirement of VO2max for army soldiers seems to be 45 to 50 mL kg−1 min−1.” - Pihlainen et al, 2015
“It can be concluded that the 2-mile run test protocol is fairly accurate and valid to predict the VO2max values in male military participants. This field test is also applicable to a great number of participants, taking into consideration the variability in age and beginning level of physical preparation for every soldier.” - Goran Sporis, 2013
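The arithmetic behind the Pihlainen et al. conclusion can be made explicit. As a rough sketch, if work can only be sustained at about 50% of maximal aerobic capacity, then the VO2max needed to sustain a task is the task's oxygen cost divided by 0.5. The function names and the task costs below are hypothetical, chosen only to illustrate the calculation; they are not figures from the study.

```python
# Sustainable fraction of VO2max for prolonged work, per the quote above (~50%).
SUSTAINABLE_FRACTION = 0.5


def required_vo2max(task_cost, sustainable_fraction=SUSTAINABLE_FRACTION):
    """VO2max (mL/kg/min) needed to sustain a task with the given oxygen cost."""
    return task_cost / sustainable_fraction


def relative_intensity(task_cost, vo2max):
    """Fraction of a soldier's VO2max that a task demands."""
    return task_cost / vo2max


# Hypothetical task costs in mL/kg/min (illustrative only, not study data).
for cost in (22.5, 25.0):
    print(f"task cost {cost} -> required VO2max {required_vo2max(cost)}")

# The same hypothetical task is relatively easier for a fitter soldier.
print(round(relative_intensity(22.5, 45.0), 2))  # at the sustainable limit
print(round(relative_intensity(22.5, 55.0), 2))  # comfortably below it
```

Under these assumptions, tasks costing 22.5 to 25 mL/kg/min map onto the 45 to 50 mL/kg/min minimum quoted above, and a soldier with a higher VO2max performs the same task at a lower fraction of capacity.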
So how does the run perform as an assessment?
Measure a relevant component of the desired performance: Reams of research support the 2-mile run as a valid assessment of aerobic capacity.
Guide future training: Many tools are available for determining appropriate training prescriptions (target interval times, training paces, etc.) based on 2-mile run times.
Encourage desired training: This event encourages cardiorespiratory endurance training, which has demonstrated value for both health and combat performance.
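The pacing arithmetic that such training tools start from can be sketched in a few lines. This is a minimal illustration, not an official Army or coaching formula: it converts a 2-mile time into an average pace per mile and per 400 m lap, which a coach might then adjust into interval targets. The function names and the example times are hypothetical.

```python
METERS_PER_MILE = 1609.344
TWO_MILES_M = 2 * METERS_PER_MILE


def parse_time(mmss):
    """Convert an 'MM:SS' string to total seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def format_time(total_seconds):
    """Convert seconds to an 'M:SS' string, rounded to the nearest second."""
    total = round(total_seconds)
    return f"{total // 60}:{total % 60:02d}"


def average_splits(two_mile_time):
    """Average pace per mile and per 400 m lap for a given 2-mile time."""
    total = parse_time(two_mile_time)
    return {
        "per_mile": format_time(total / 2),
        "per_400m": format_time(total * 400 / TWO_MILES_M),
    }


print(average_splits("16:00"))  # current time (hypothetical)
print(average_splits("14:00"))  # goal time (hypothetical)
```

In this example, a soldier who runs 16:00 averages about 1:59 per 400 m lap, while a 14:00 goal requires about 1:44 per lap. That gap is the sort of concrete target a single-construct event can hand directly to a training plan.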
It is for the same reason that events like the deadlift are relevant even when soldiers are unlikely to encounter hex bars or barbells on the battlefield. We could rely on face validity and suggest that it simulates lifting and carrying a litter to transport a casualty, but if we focus on construct validity we see that the lower body strength it tests is relevant to far more tasks soldiers will encounter in both their daily tasks and combat.
Even the standing power throw, which has the least face validity of all the ACFT test events, has construct validity to back it up. Several studies (they mostly refer to it as the backward overhead medicine ball throw, or BOMB throw) conducted over the last twenty years in populations ranging from football players to firefighters contribute evidence that it is a reliable and reasonably valid assessment of total body power. It does have some notable drawbacks, especially a strong learning effect (indicating a large skill component), but it has performed favorably as a field-expedient assessment of power.
There is no perfect fitness assessment, particularly for something as nebulous as combat, but some relatively simple criteria can help us understand what makes an assessment effective. Although face validity might lead us to choose combat simulations, an understanding of construct validity shows that testing components of fitness is more valuable. Most importantly, the results of these assessments must be used to guide training to improve health, safety, and performance.