Epistemic Challenges for Subsurface Engineering, Part II: Creating Value with a Hypothesis-Driven Workflow


  1. Introduction
  2. Strengths and weaknesses of different sources of information
  3. Correlation and causation
  4. A hypothesis-driven workflow
  5. Pitfalls with field data collection
  6. Organizing clean field tests
  7. Culture and mindset


In Part I of this blog post, I discussed how group dynamics enable false beliefs to arise and persist. Once embedded, false paradigms can delay innovation and lead to significant loss of economic value.

Now, in Part II, I focus on solutions. The shale industry has become dramatically more efficient over time (Abivin et al., 2020). How do we continue that pace of improvement? What concepts are most promising to pursue in the field, and how should we evaluate the results?

Ground truth comes from field testing. To assess the impact of design changes on production, we should compare production between wells. To determine what is happening in the reservoir, we should make direct, in-situ observations at the wellbore.

We must recognize that the other tools in our toolbox – computational modeling, statistical studies, laboratory measurements, and indirect field observations – provide evidence, but not proof. They are useful sources of information, but they alone cannot firmly establish truth about what is happening at the field scale.

In this post, I outline a hypothesis-driven workflow. The workflow starts by assessing priorities (such as maximization of free cash flow) and identifying the key design decisions that we can control to maximize performance. Then, physics-based and data-driven tools are used to generate hypotheses about changes that may lead to improvement. Field tests are organized to test those hypotheses.

Field tests are prioritized based on a cost-benefit analysis, drawing on the results from the physics-based and data-driven tools. Most field tests should be designed to provide insights that yield measurable improvement within 1-2 years, and then continue to deliver value over the long term.


Field testing does not necessarily require additional spending. It only requires that we coordinate field operations to enable clean well-to-well production comparisons and that we plan field data collection so that we can draw strongly supported conclusions.


In the hypothesis-driven paradigm, the goal is to think critically about our assumptions and uncertainties, and ask – “what field data could we collect in order to validate our hypotheses?” It is the engineer’s job to generate and prioritize hypotheses, design the field data collection and/or trials that test hypotheses, implement in the field, and then evaluate results and draw conclusions. Optimization is not one shot – it’s an iterative cycle: design, execute, analyze, and repeat (Starfield and Cundall, 1988).

In the absence of a hypothesis-driven approach, operators may tend to ask engineers and geoscientists to do the impossible – to provide high confidence answers to difficult questions with incomplete information. In Part I, I discussed how this mindset leads to trouble. It incentivizes engineers to be overconfident and make weakly supported claims. Then, they become locked in and struggle to change direction when problems arise.

Shale developments are particularly well-suited to a hypothesis-driven paradigm. Wells are relatively independent of one another, and geology is relatively uniform. This allows companies to test changes and receive feedback within 6-12 months.

In contrast, in conventional formations, all the wells in a field interact, geology may be more variable, and development plans often involve complex, long-term secondary or tertiary recovery schemes. It may require years to determine the outcome of decisions, and it may be impossible to ever know the counterfactuals of what would have happened if other choices had been made.

Strengths and weaknesses of different sources of information

Computational models

Computational models encapsulate our physical understanding; they are calibrated to field data, and they allow users to test ideas digitally before trialing them in the field. They help answer the question ‘why,’ which inspires new ideas and helps us prioritize data collection. However, they may rely on uncertain model inputs and assumptions, and they require the user to exercise critical thinking.

When assessing confidence in the result from a computational model, ask: how dependent are the results on the underlying assumptions? How well-founded are those assumptions? If uncertain assumptions turned out to be invalid, would the results change? How deeply has this model been calibrated to data?

For practical applications, it is usually important to include all relevant physical processes, even if this requires a reduction in numerical accuracy. For more academic/research applications, high numerical accuracy may be the priority, even if this requires simplifying the physics.

Data-driven analysis

Data-driven analyses synthesize large volumes of information and can be used to draw useful conclusions, even in the absence of strong physical understanding. However, in subsurface engineering applications in shale, data-driven analyses are usually ‘observational studies.’ Thus, they are vulnerable to confounding covariates, hidden variables not included in the analysis, and ‘false discoveries’ caused by coincidence and by the use of post hoc analysis (Benjamini and Hochberg, 1995; Prasad and Cifu, 2011). Also, data-driven analyses struggle to predict out-of-sample behavior.

When assessing confidence in data-driven analyses, ask: could these apparent relationships have been caused by variables not considered in the analysis? Are key model inputs correlated in the dataset? How many different relationships were tested in order to find the reported correlations? Are there complex/nonlinear relationships that may be difficult to identify from a statistical analysis?
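
The question of how many relationships were tested can be made concrete with the false discovery rate procedure of Benjamini and Hochberg (1995). The following is a minimal sketch, not a full statistical workflow. The function name and the ten p-values are hypothetical, chosen only to illustrate how a lookback that tests many candidate correlations can overstate the number of real effects.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the hypothesis is rejected
    while controlling the false discovery rate at level `fdr`."""
    m = len(p_values)
    # Sort p-values while remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * fdr.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_k = rank
    # Reject every hypothesis ranked at or below that rank.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from testing ten candidate correlations in a lookback.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
naive = [p < 0.05 for p in p_values]
adjusted = benjamini_hochberg(p_values, fdr=0.05)
print("significant at p < 0.05:", sum(naive))
print("significant after correction:", sum(adjusted))
```

With these made-up inputs, fewer relationships survive the correction than pass the naive threshold. The more relationships a lookback screens, the more a bare p < 0.05 cutoff should be discounted.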

Laboratory experiments

Laboratory experiments allow for careful control of the physical setup and high precision measurements of results. However, it is difficult to reproduce reservoir conditions in the lab, and processes may behave differently at lab scale than at reservoir scale (McClure, 2018).

When assessing confidence in laboratory experiments, ask: how closely were in-situ conditions reproduced in the experiment? How might the differences impact the conclusions when applied to field scale? How differently does the process behave at small scale versus large scale? It may be useful to perform a formal dimensional analysis.

Correlation and causation

In the book Causality, Judea Pearl lays out the formal mathematical theory of establishing causality. A powerful tool for determining causality is random assignment.

For example, in a randomized trial, you could enroll 1000 people in a study and randomly assign half of them to perform yoga regularly. At the end of the study, you could measure blood pressure of all participants and test the correlation between yoga participation and blood pressure change. Because the yoga-participants were randomly assigned, a statistically significant correlation could be formally, mathematically interpreted as proving causality.

In contrast, in an observational study, you could sample 1000 people from the general population and compare blood pressure among people who do yoga and who do not do yoga. Perhaps you would find lower blood pressure among the yoga practitioners. But if so, this would not prove causality. Perhaps people who do yoga tend to be healthier because of other factors – such as gender, age, an overall interest in a healthy lifestyle, or diet. Because of the nonrandom assignment of ‘who does yoga’ in the general population, correlation between yoga and low blood pressure cannot prove that yoga causes lower blood pressure.

Here is a hypothetical example from the oil patch. Let’s suppose that you analyze a large dataset and identify correlation between a frac fluid additive and production. Perhaps this is a meaningful causal relationship, but perhaps not. What if companies on better acreage tend to use more expensive frac fluids and include this additive? What if companies tend to use this additive on larger jobs? What if there is one large company that disproportionately uses the additive, and that large company has either better acreage or more effective frac designs? What if companies have increased the use of the additive over time, and there have been other simultaneous design changes over time that caused a positive impact on production?

Because of these issues, in an observational study, we cannot be sure to what extent the correlation is affected by confounding covariates, and cannot prove causal relationships. Data scientists possess tools to identify and mitigate these challenges, but they are imperfect.
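
The additive example can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of any real dataset: the additive is given zero true effect, production depends only on a made-up ‘acreage quality’ variable plus noise, and in the observational case operators on better acreage are more likely to use the additive. All names and numbers are hypothetical.

```python
import random
import statistics


def simulate(n_wells, randomized, seed=0):
    """Simulate wells where an additive has ZERO true effect on production.

    Acreage quality drives production. In the observational case, operators
    on better acreage are more likely to use the additive (a confounder).
    Returns mean production with the additive minus mean production without.
    """
    rng = random.Random(seed)
    with_additive, without_additive = [], []
    for _ in range(n_wells):
        acreage_quality = rng.random()  # 0 = poor rock, 1 = excellent rock
        if randomized:
            uses_additive = rng.random() < 0.5  # coin flip, independent of rock
        else:
            uses_additive = rng.random() < acreage_quality  # confounded choice
        # Production depends only on acreage quality plus noise.
        production = 100.0 + 100.0 * acreage_quality + rng.gauss(0.0, 10.0)
        if uses_additive:
            with_additive.append(production)
        else:
            without_additive.append(production)
    return statistics.mean(with_additive) - statistics.mean(without_additive)


observational_gap = simulate(5000, randomized=False)
randomized_gap = simulate(5000, randomized=True)
print(f"observational: additive wells appear better by {observational_gap:.1f}")
print(f"randomized:    additive wells appear better by {randomized_gap:.1f}")
```

Under these assumptions, the observational comparison shows a large apparent benefit even though the additive does nothing, while the randomized comparison shows a gap near zero. The apparent benefit comes entirely from which wells received the additive.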

A hypothesis-driven workflow

To establish causality, you can run an experiment in which you control the independent variable. While this may sound complex or expensive, it may be as simple as tweaking well spacing, or proppant loading. These are things that companies often do anyway, and so there is not necessarily additional cost, aside from a bit of extra thinking and planning to ensure good experimental design.

Practically, we cannot test everything with field tests. However, for the most important beliefs, which have the biggest impact on return on investment, we can benefit greatly from putting them to the test. For beliefs that we cannot test, because of practical constraints, we must evaluate evidence from data-driven and physics-based approaches and think critically.

The idea of a hypothesis-driven workflow is to use physics, statistics, and laboratory experiments to generate hypotheses about what is happening in the field and how we can maximize return on investment. Then, we use field-testing to perform hypothesis testing of those claims. Field testing is also used to gather evidence regarding why things happen. Insight into mechanisms allows us to make more specific hypotheses, clarifies what data to collect to confirm/refute beliefs, and helps motivate new ideas.


The hypothesis-driven workflow has several benefits: (a) it mitigates overconfidence in data-driven and physics-based workflows by framing them as being merely hypothesis-generating, rather than unrealistically asking them to be absolute sources of truth, (b) it mitigates dynamics where engineers become locked into solutions and struggle to change their mind in response to new information, and (c) it facilitates genuine assessment of claims.


In a hypothesis-testing workflow, physics, statistics, and laboratory experiments remain very important. They are used to prioritize and assess which ideas are most promising and most important to test. This is critical because there are practical limits on how many ideas we can test in the field.

A hypothesis-driven workflow does not require that we test everything with field trials. If physics or data provide sufficiently confident assessments, if testing is too expensive or impractical, or if the outcome of the test is not sufficiently important to our objectives, then testing may not be justified.

The hypothesis-driven workflow is:

  1. Define the overall objectives (for example: maximize free cash flow)
  2. Identify key decisions that need to be made that impact overall objectives (for example: well spacing, frac sequencing, etc.)
  3. Use engineering and geoscience approaches to generate hypotheses about how to improve performance by modifying key decisions (for example: use computational modeling, data analytics, expert opinion, field data collection, etc.)
  4. Evaluate and prioritize hypotheses; decide which to test in the field, based on assessment of probability of success and potential benefit
  5. Design field trials to test the hypotheses
  6. Implement in the field
  7. Evaluate the results, and iterate
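
Step 4 can be sketched as a simple expected-value ranking. This is one possible way to frame the prioritization, not a prescribed method. The class, the candidate hypotheses, and every number below are hypothetical placeholders; in practice the probabilities and benefits would come from the physics-based and data-driven assessments in step 3.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    prob_success: float     # subjective probability that the change helps (0 to 1)
    benefit_if_true: float  # value gained if it helps, in $MM (hypothetical)
    test_cost: float        # incremental cost of the field test, in $MM (hypothetical)

    @property
    def expected_net_value(self) -> float:
        # Probability-weighted benefit minus the cost of running the test.
        return self.prob_success * self.benefit_if_true - self.test_cost


candidates = [
    Hypothesis("new fluid additive", 0.1, 20.0, 1.0),
    Hypothesis("higher proppant loading", 0.3, 60.0, 5.0),
    Hypothesis("wider well spacing", 0.5, 40.0, 2.0),
]

# Rank candidates from most to least attractive to test.
ranked = sorted(candidates, key=lambda h: h.expected_net_value, reverse=True)
for h in ranked:
    print(f"{h.name}: expected net value = {h.expected_net_value:.1f} $MM")
```

A ranking like this is only as good as its inputs, so it is best treated as a way to make assumptions explicit and open to debate.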

Pitfalls with field data collection

‘Field tests’ require field data collection, but not all field data collection leads to a strong ‘field test.’ High-quality field tests require either: (a) sufficiently direct observation to unambiguously provide answers, or (b) an experimental design enabling comparison between a ‘control’ and ‘treatment’ group.

Very often, field data is not collected in a way that enables strongly supported conclusions. For example:

  • A company gathered microseismic data, observed that fracture growth stopped at a certain depth interval, and concluded that this was caused by the diverter they had pumped. Their conclusion was not well supported because they did not run a second well without the diverter to test whether fracture growth would have been different.
  • A company changed many design parameters and got moderately better production. Perhaps some of the changes helped, some hurt, and some had no effect? It is impossible to know because multiple things were changed at the same time.
  • A company used different frac designs in different stages, and then used microseismic to evaluate the performance of the designs. However, proxies like microseismic are not necessarily correlated with production (Raterman et al., 2019), and so their conclusions regarding performance are not high confidence.
  • A company performed a look-back to compare production from different frac designs, but found it difficult to normalize for differences in well landing depths, artificial lift, geologic heterogeneity across the field, proximity to other wells, correlated changes in inputs, and other factors.

Because of these challenges, ‘field data’ and ‘field experience’ may or may not lead to improved knowledge. The value of ‘experience’ is context-specific.

Experience is very valuable in situations where you receive immediate and clear feedback. For example, frac engineers may learn over time how to anticipate and prevent screenout. If the job screens out, this is immediately apparent.

On the other hand, experience is less useful for answering questions where feedback is delayed or unclear. For example, “to maximize net present value, should we pump 1000 lbs/ft, 2000 lbs/ft, or 3000 lbs/ft?” Unless something goes wrong during execution, the frac engineer does not receive simple, unambiguous feedback to determine whether they pumped the economically optimal job.

In Part I, I discussed historic examples of doctors who persisted in using bloodletting on sick patients, even many decades after trials had demonstrated that bloodletting was responsible for a high death rate among patients (Akerlof and Michaillat, 2018). In these cases, experience did not prevent doctors from making bad decisions; in fact, experience may have led to overconfidence, misconceptions, and a higher probability of bad decisions.

Design of field tests

A ‘field test’ need not cost additional money. It only requires that you structure your operations so that it is possible to make clean production comparisons between wells.

The ultimate ‘ground truth’ from field data is production. But in practice, it is not always easy to assess the effect of design changes from production data. We can benefit from thoughtful experimental design.

In order to facilitate production comparison between wells:

  1. Identify a previous well/pad/group of wells as a baseline for comparison to your upcoming wells/pad/group of wells. Alternatively, split your upcoming well/pad into groups, each receiving a different design.
  2. To account for random variability, quantify the spread in production observed on a well-to-well and pad-to-pad basis. This helps estimate the sample size required to reach a reasonably confident conclusion.
  3. Select an ‘experimental design.’ Here are a few options:
    • Well versus well. Within a pad, use different fracture designs and compare between the wells. This method has potential drawbacks because within a pad, there can be differences unrelated to the design change: (a) between inner/outer wells, (b) due to the sequencing/order of the fracturing, and (c) because of random variability.
    • Pad versus pad. If you have four upcoming pads, use Design A in two pads, and Design B in the other two pads. This experimental design improves sample size relative to ‘well versus well.’ However, because the pads are spatially in different locations, there is some risk that geologic variability causes observed differences.
    • Well versus well, across pads. If you have four upcoming pads, use Design A on Wells 1 and 2 in each pad and use Design B on Wells 3 and 4 in each pad. Compare the performance of Wells 1 and 2 against Wells 3 and 4 within each pad, and then look at the distribution of differences across the four pads. This experimental design may be the best option, because it controls for geologic variability and achieves a larger sample size.
    • Within the framework of these designs, you might consider randomizing selection of which wells/groups receive each design. This removes any risk of unconscious selection bias.
  4. To the extent practical, try to change one variable at a time (unless the variables are necessarily linked together, such as perf design and cluster spacing). If you change multiple parameters simultaneously, it can be unclear which change caused the observed differences, or the effects may offset one another.
  5. Unless you plan to use technology allowing stage-by-stage production allocation (such as dip-in fiber), use uniform fracturing design along the well. Without stage-by-stage allocation, it becomes ambiguous how to assess the impact of stage-by-stage changes. Indirect proxies such as microseismic may not be sufficiently reliable to draw confident conclusions about stage-by-stage allocation.
  6. Try to produce wells similarly. If wells vary in their artificial lift strategy, they will be more difficult to compare.
  7. Define standards for assessing proof prior to performing the test. Post-hoc exploratory analysis can suffer from a high ‘false discovery rate’.
  8. In some cases, it may be useful to design a bespoke set of field operations designed to test a hypothesis. For example, injection at pressure held slightly below Shmin can be used to test the viability of shear stimulation in a particular formation (McClure and Horne, 2014).
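
Several of the steps above (estimating sample size from observed spread, randomizing which wells receive each design, and comparing wells within each pad) can be sketched in a few lines. This is an illustration under simplifying assumptions, not a complete statistical treatment. The sample-size formula is the standard normal approximation for a two-sided comparison of two equal-sized groups, and all function names, pad names, and production numbers are hypothetical.

```python
import math
import random
import statistics
from statistics import NormalDist


def wells_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Approximate wells needed per design to detect a mean difference `delta`
    given well-to-well standard deviation `sigma` (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_beta) * sigma / delta) ** 2)


def assign_designs(pads, wells_per_pad=4, seed=0):
    """Randomly give Design A to half the wells in each pad and B to the rest."""
    rng = random.Random(seed)
    plan = {}
    for pad in pads:
        wells = list(range(1, wells_per_pad + 1))
        rng.shuffle(wells)
        half = wells_per_pad // 2
        plan[pad] = {"A": sorted(wells[:half]), "B": sorted(wells[half:])}
    return plan


def paired_summary(production, plan):
    """Mean of the per-pad (B minus A) differences and its t-statistic."""
    diffs = []
    for pad, groups in plan.items():
        mean_a = statistics.mean(production[pad][w] for w in groups["A"])
        mean_b = statistics.mean(production[pad][w] for w in groups["B"])
        diffs.append(mean_b - mean_a)
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, mean_diff / std_err


# Wells per design if well-to-well scatter is 20 units and we care about 15.
print("wells per design:", wells_per_group(sigma=20.0, delta=15.0))

plan = assign_designs(["Pad 1", "Pad 2", "Pad 3", "Pad 4"])

# Hypothetical data: each pad has its own geologic baseline, and Design B
# adds a pad-specific uplift on top of that baseline.
baseline = {"Pad 1": 100.0, "Pad 2": 140.0, "Pad 3": 90.0, "Pad 4": 120.0}
uplift = {"Pad 1": 8.0, "Pad 2": 10.0, "Pad 3": 12.0, "Pad 4": 10.0}
production = {
    pad: {w: baseline[pad] + (uplift[pad] if w in plan[pad]["B"] else 0.0)
          for w in range(1, 5)}
    for pad in plan
}
mean_diff, t_stat = paired_summary(production, plan)
print(f"mean within-pad uplift: {mean_diff:.1f}, t-statistic: {t_stat:.1f}")
```

In this made-up dataset, the pad baselines differ by far more than the design effect, yet the within-pad differences recover the uplift because each pad serves as its own control. Real wells add random scatter within each pad, which is why the sample-size estimate matters.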

Very often, we will gather diagnostic information to supplement field tests. When evaluating this data, consider that:

  1. Direct wellbore observations provide the strongest proof. ‘Direct’ observations include production data, wellbore pressure, fiber optic, downhole imaging, and core-through.
  2. Remote imaging (such as microseismic) is useful, but provides weaker evidence than direct observations. Results are derived from complex, non-unique processing based on simplifying assumptions. Remote imaging should be validated and revalidated against hard data whenever possible. Evaluate your confidence in interpretations on a case-by-case basis.
  3. Focus interpretation and prioritize data collection based on what matters most to your objectives.

Culture and mindset

Here are some ideas on how to maintain a healthy mindset and culture:

  1. Look at the big picture and ask ‘what is happening and why?’ Focus on questions that matter most for your objectives.
  2. Be ready to change your mind if you encounter new information. Encourage a group culture where it’s ok to change your mind.
  3. Be intentional about fostering a culture of continuous improvement.
    • Try to avoid situations where people become committed to certain approaches or perspectives and take things personally. You need open, nonjudgmental conversations between people on your team. This is not easy to achieve, but if you openly discuss these issues and establish values, this can help.
    • Create an internal ‘best practices’ document. Every three months, revisit the document, discuss it, and make changes.
  4. When evaluating a service provider (or an employee), look for critical thinking and ‘intellectual humility.’ Everybody is going to be wrong about something sooner or later. People who can recognize issues, and respond in the face of new information, will achieve much better outcomes in the long run. Someone with intellectual humility listens to people who disagree with them and carefully considers what they have to say. You can be confident in your beliefs and still have intellectual humility, as long as you are willing to engage with people who disagree with you.
  5. Seek a diversity of perspectives. Everybody is biased one way or another. Mitigate groupthink by actively seeking out different viewpoints. If you pull one interesting nugget out of a conversation (or a blog post), then it was worthwhile!


  1. A hypothesis-driven workflow places field testing at the center of the process of continuous improvement. It mitigates overconfidence and the adoption of false beliefs by framing physics and data-driven modeling as being hypothesis-generating, and framing field testing as hypothesis testing.
  2. Causal relationships can be proven by performing an experiment in which you control the independent variable. Consequently, field testing is the best way to assess beliefs about field-scale processes. It is less reliable to rely solely on statistical lookbacks or physical modeling.
  3. Gathering field data does not necessarily result in a well-designed field test. Quality field testing requires either the gathering of high confidence direct measurements and/or careful experiment design to enable clean comparisons between a ‘control’ and ‘treatment’ group.
  4. A culture of intellectual humility can help drive innovation.


Thank you very much to colleagues who provided valuable feedback on this blog post.


Abivin, P., Vidma, K., Xu, T. et al. 2020. Data Analytics Approach to Frac Hit Characterization in Unconventional Plays: Application to Williston Basin. Paper presented at the International Petroleum Technology Conference, Dhahran, Saudi Arabia, IPTC-20162-MS.

Akerlof, George A., and Pascal Michaillat. 2018. Persistence of false paradigms in low-power sciences. Proceedings of the National Academy of Sciences 115(52).

Benjamini, Yoav, and Yosef Hochberg. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B 57(1), 289-300.

McClure, Mark, and Roland N. Horne. 2014. Characterizing hydraulic fracturing with a tendency for shear stimulation test. SPE Res Eval & Eng 17(02), 233–243.

McClure, Mark W. 2018. Bed load transport during slickwater hydraulic fracturing: insights from comparisons between published laboratory data and correlations for sediment and pipeline slurry transports. Journal of Petroleum Science and Engineering 161: 599-610.

Prasad, Vinay, and Adam Cifu. 2011. Medical reversal: Why we must raise the bar before adopting new technologies. Yale Journal of Biology and Medicine 84, 471-478.

Raterman, Kevin T., Yongshe Liu, and Logan Warren. 2019. Analysis of a drained rock volume: An Eagle Ford example. Paper URTeC-2019-263 presented at the Unconventional Resources Technology Conference, Denver, CO.

Starfield, A. M., and P. A. Cundall. 1988. Towards a methodology for rock mechanics modelling. International Journal of Rock Mechanics and Mining Sciences & Geomechanics Abstracts 25(3), 99-106.
