Corpus Design & Selection Criteria


This document outlines the main principles adopted to ensure the integrity and continued relevance of the Sustainability & Health Corpus to a wide range of researchers in the broad field of medicine and healthcare, and to those interested in examining the intersection of health and sustainability.

Corpus Design

A corpus is an electronic collection of texts built according to specific design criteria and for a specific purpose.

Corpus design and selection criteria vary, depending on the type of corpus being compiled. The Sustainability & Health Corpus is an open-ended corpus that is intended to grow dynamically and organically as new priorities arise and as the textual universe it aims to capture continues to expand and change. It is a freely accessible resource that is designed to support a wide range of studies on health and health-related topics, including the intersection of health and sustainability, and to facilitate the analysis of very large collections of texts and millions of running words. The difficulty of drawing clear boundaries around either of the overlapping fields of medicine and healthcare aside, the overall topic of health is broadly conceived, and the project strives to maintain an optimal balance between different types of medical discourse on an ongoing basis.

The medical field is no stranger to textual analysis. The highly regulated domain of systematic reviews in particular has propelled a variant of text analysis into a vital methodological tool in medical research. There is therefore room to explore complementary and alternative methods of text analysis. In this respect, the Sustainability & Health Corpus (hereafter the SHE Corpus) is intended to serve as a connective and complementary tool: to allow scholars to study health and health-related discourses from diverse perspectives, while also facilitating a variety of interdisciplinary encounters.

Unlike most corpora, the SHE Corpus is not designed to support linguistic analysis per se but rather to enable researchers and students in the field of health to analyse the evolution and contestation of key concepts in their specific domain. Examples of such concepts include evidence, equity, viability, sustainable development, degrowth and preparedness, among others.

Open-ended corpora such as the SHE generally give more priority to size and currency than to applying stringent criteria to achieve representativeness and balance.

Representativeness and Balance

Representativeness and balance are key considerations in corpus design. They concern the number of texts (or tokens) we decide to include and the proportions in which we include them. As far as balance is concerned, the corpus builder has to decide between two options. The balance may be internal to the corpus, meaning that the proportions of different variables (document types, range of sources, etc.) should be roughly the same irrespective of their level of influence or the proportion they represent of the relevant domain. Alternatively, it may reflect what we estimate to be the proportions of these variables in the textual universe to be represented. This is not an exact science; conceiving of the process of building an open-ended corpus such as the SHE as “fluid, organic and cyclical” is therefore considered “the bottom line in corpus design” (Biber 1993:256).

In practice, then, representativeness and balance are ideals we strive for but can never fully achieve in an open-ended, dynamic corpus such as the SHE. There are several reasons for this.

First, the extent to which a corpus can claim to be representative depends on a clear definition of the population under study. But the size of the population to be represented – in this case all texts about health and health-related topics – can never be delineated. No one knows precisely how many texts are available on these topics at any one time, nor can anyone produce a full list of all the sources they may be drawn from.

Second, for an open-ended corpus like the SHE, the textual universe we are attempting to capture is not fixed. It is constantly changing as more texts are produced and new priorities and topics arise. Consider, for example, the extent to which that entire textual universe has changed following the outbreak of Covid-19 in 2020. We believe that an open-ended corpus such as the SHE must be allowed to grow and change parameters if it is to continue to reflect a constantly changing universe and remain responsive to the needs of the research community.

Third, balance is usually defined as a measure of the internal consistency of a corpus in terms of the proportions that are contributed by each variable. This is often (but not always) understood to require the corpus builder to approximate to the actual proportions of the different types of text that exist in the domain they wish to represent. How many texts are produced by policy makers such as WHO and ECDC, for instance, as opposed to texts produced by journals such as The Lancet or by grassroots organisations such as Doctors for Extinction Rebellion or Health Poverty Action? What percentage of these texts are reports, (draft) resolutions, blogs, journal articles, books, or other formats? No one has accurate statistics on these variables at any one time, a situation that is further complicated by the fact that these proportions are not fixed given the fluidity of the entire textual universe. As Sinclair (2004) asserts, “there are no such things as ‘correct proportions’ of components of an unlimited population”.

Fourth, attempts to improve the representativeness and balance of a corpus are further complicated by more pragmatic considerations. Chief among these, in the case of a corpus such as the SHE, which is designed to be freely accessible to the research community, are copyright restrictions. Other pragmatic considerations include the relative difficulty of acquiring and preparing particular types of text for inclusion in a corpus. Including spoken encounters such as clinical interactions, for instance, requires addressing issues of privacy and confidentiality and involves far more investment in time and effort than including written documents. Even the heavily funded 100-million-word British National Corpus consisted of 90% written and only 10% spoken language (Weisser 2022:90; Rees 2022:394), and the Corpus of Contemporary American English is 80% written and 20% spoken language (Weisser 2022:90). This is despite the fact that we are all exposed to and engage in producing far more spoken than written discourse.
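The two notions of balance discussed in this section can be illustrated with a minimal sketch. It is not part of the SHE Corpus workflow: the token counts, the category names and the estimated domain proportions are all invented for illustration, and the function names `token_shares` and `deviation_from_target` are hypothetical. The sketch computes each document type's share of the corpus and compares it first with an equal-share (internal) target and then with an assumed estimate of the proportions in the wider textual universe.

```python
from collections import Counter


def token_shares(documents):
    """Return each category's share of the total token count."""
    totals = Counter()
    for category, tokens in documents:
        totals[category] += tokens
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {category: count / grand_total for category, count in totals.items()}


def deviation_from_target(shares, target):
    """Return observed share minus target share for every category in the target."""
    return {
        category: shares.get(category, 0.0) - goal
        for category, goal in target.items()
    }


# Hypothetical (category, token count) pairs; not figures from the SHE Corpus.
documents = [
    ("report", 60_000),
    ("journal article", 30_000),
    ("blog", 10_000),
]

shares = token_shares(documents)

# Internal balance: every category is given the same target share.
internal_target = {category: 1 / len(shares) for category in shares}

# Proportional balance: targets follow an estimate of the domain (assumed values).
estimated_domain = {"report": 0.5, "journal article": 0.4, "blog": 0.1}

print(deviation_from_target(shares, internal_target))
print(deviation_from_target(shares, estimated_domain))
```

A positive deviation means a category is over-represented relative to the chosen target. As the reasons given above make clear, the second target can only ever be a rough estimate for an open-ended corpus.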

Selection Criteria

Beyond the general issues of representativeness and balance, corpus builders have to select individual sources and texts on the basis of clear, transparent criteria. These may be divided into external and internal criteria.

External criteria are based on evidence external to the body of the text proper and are less dependent on subjective judgement than internal criteria. They guide the initial selection of sources and of individual texts to be included in the corpus. In the case of the SHE Corpus, external criteria include the following parameters. Details under each heading are indicative only (they do not constitute exhaustive lists).

1. Source

Specific sources are selected on the basis of their relevance to the field of health in general and/or to priority topic areas (see item 1 under internal evidence). Examples include the following:

Policy makers

WHO, UNAIDS, ECDC, CDC, Wellcome Trust

Specialist journals

The Lancet, BMJ, New England Journal of Medicine, BJGP Open


Non-governmental organisations

Amnesty International, Oxfam

Grassroots organisations

Abortion Rights Campaign, Doctors for Extinction Rebellion, Advocates for Youth

Online magazines

The Conversation, OpenDemocracy, The Nation


Blogs

Jason Hickel blog, Science-based Medicine

2. Document format

Reports, (draft) resolutions, journal articles, articles in online magazines, blogs, books

3. Time span

All things being equal, priority is generally given to more recent publications. But because the selection of individual items is guided by the topic areas identified as priorities for the SHE community (see below), the time span for selecting texts varies depending on the nature of each topic and the needs of a particular project. In the case of MEDRA, for instance, the starting point is 1973 for the US, 1983 for Ireland and 1985 for Argentina.

4. Region

This relates to both the origin of a document and the geographical region on which the text focuses. In this sense, region is both an external and an internal criterion.

Many of the documents in the corpus are sourced from international or pan-national organisations such as WHO or ECDC and mostly focus on the global context. Others are produced by groups and institutions located in and addressing issues relevant to a particular region. Examples include Abortion Rights Campaign (Ireland), Doctors in Unite (UK), and Asociación Médica Argentina. Different regions are prioritised for different topic areas and different projects.

5. Copyright status

Material included in the corpus must be in the public domain, published under a CC licence that allows for inclusion in an electronic corpus, or covered by explicit permission granted to the SHE by the copyright holder.

Internal criteria are drawn from closer examination of individual texts to determine their relevance and fit within the overall design of the corpus.

1. Topic Area

For the SHE Corpus, priority is given to specific topic areas considered of particular interest to the SHE community of scholars and students. These currently include: pandemics/epidemics; health and environmental sustainability; reproductive & sexual health & rights; and adolescent & young people’s health. Priority subtopics are identified under each area and guide the search for and selection of individual texts. For example, abortion is a key subtopic under reproductive & sexual health & rights; HIV and polio are among a number of key subtopics under pandemics/epidemics.

2. Region

As mentioned under external criteria above, region is both an external and internal criterion. What region of the world the content of a document focuses on is taken into consideration and the selection is guided primarily by its relevance to priority topic areas or to a specific project such as MEDRA or Erasmus+.

3. Text and Graphics

Unlike the vast majority of corpora, the SHE Corpus is designed to capture both textual and visual material and to provide separate or complementary access to both through the software interface (continually under development). Nevertheless, corpora are designed primarily for the analysis of running text. For a text to be included in the SHE Corpus the balance must be largely in favour of running text.
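Some of the criteria above lend themselves to a simple mechanical pre-check before expert review. The following is a minimal sketch under stated assumptions, not a description of how the SHE team actually screens documents. The `Candidate` fields, the licence labels and the 0.5 threshold for "largely in favour of running text" are all hypothetical. Only the MEDRA start years are taken from the text.

```python
from dataclasses import dataclass

# Licence labels treated as acceptable; the labels themselves are assumptions.
ACCEPTED_LICENCES = {"public-domain", "cc-corpus-compatible", "permission-granted"}

# Start years mentioned for MEDRA in the text; other projects would differ.
MEDRA_START_YEAR = {"US": 1973, "Ireland": 1983, "Argentina": 1985}


@dataclass
class Candidate:
    title: str
    region: str
    year: int
    licence: str
    running_text_ratio: float  # share of the document that is running text (0 to 1)


def is_eligible(doc, start_years=MEDRA_START_YEAR, min_text_ratio=0.5):
    """Apply copyright, time-span and text/graphics checks to one candidate."""
    # Copyright status: reject anything without an accepted licence label.
    if doc.licence not in ACCEPTED_LICENCES:
        return False
    # Time span: enforced only for regions that have a defined starting point.
    start = start_years.get(doc.region)
    if start is not None and doc.year < start:
        return False
    # Text and graphics: running text must outweigh visual material.
    return doc.running_text_ratio > min_text_ratio


# Hypothetical candidates, invented for illustration.
candidates = [
    Candidate("Example report", "Ireland", 1990, "public-domain", 0.9),
    Candidate("Example pamphlet", "Ireland", 1975, "public-domain", 0.9),
    Candidate("Example infographic", "US", 2020, "cc-corpus-compatible", 0.2),
    Candidate("Example article", "US", 2021, "all-rights-reserved", 0.95),
]

eligible = [doc.title for doc in candidates if is_eligible(doc)]
print(eligible)
```

Topic area and relevance are deliberately left out of the sketch, since they depend on closer reading of each text rather than on metadata alone.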

Guiding Principles for Purposeful Sampling

Purposeful sampling, a strategy extensively used in qualitative research, involves identifying and choosing cases that are rich in information, thereby maximising the effectiveness of limited resources (Patton 2002).

The SHE Corpus guiding principles for implementing purposeful sampling are as follows:

1. Expert Judgement

Selection of documents within a specified genre or from a specific source relies on expert judgement. Initial pre-screening is conducted, followed by consultation with experts in both corpus analysis and health science. This is to ensure that what is included is relevant to the subject matter and to identify any key documents that warrant prioritisation.

2. Ongoing Monitoring of Content

The selection process is open-ended and ongoing. While it is impossible to cover every document within a domain, we continue to add material until saturation is reached within a particular category.

3. Maximum Variation

The Sustainability & Health Corpus is not a corpus of the medical canon. It represents a variety of voices and is medical in content, not in terms of expertise. We deliberately attempt to represent both mainstream and non-mainstream sources and to select documents that exhibit the greatest possible diversity of voices and opinions. For instance, when exploring the topic of abortion, our aim is to ensure that our sample encompasses a wide spectrum of perspectives, including ‘pro-life’ and ‘pro-choice’ voices and various positions in between.

4. User Feedback

Given the open-endedness and ongoing expansion of the SHE Corpus, users are encouraged to suggest additional content within any of the topic areas we prioritise. Please write to Mona Baker, Jan Buts, Gabriela Saldanha or Kyung Hye Kim.

Transparency and Accountability

Details of the full list of texts included in the SHE Corpus at any one time are readily available to the research community through the website. Click on Contents of the Sustainability & Health Corpus to access the full database.



References

Biber, D. (1993) ‘Representativeness in Corpus Design’, Literary and Linguistic Computing 8(4): 243-257.

Patton, M. Q. (2002) Qualitative Research and Evaluation Methods, third edition, Thousand Oaks, CA: Sage Publications.

Rees, G. (2022) ‘Using Corpora to Write Dictionaries’, in A. O’Keeffe and M.J. McCarthy (eds) The Routledge Handbook of Corpus Linguistics, second edition, Abingdon: Routledge, 387-404.

Sinclair, J. McH. (2004) ‘Corpus and Text: Basic Principles’, in M. Wynne (ed.) Developing Linguistic Corpora: A guide to good practice. Available at

Weisser, M. (2022) ‘What Corpora Are Available?’, in A. O’Keeffe and M.J. McCarthy (eds) The Routledge Handbook of Corpus Linguistics, second edition, Abingdon: Routledge, 89-102.