Effective and efficient literature reviews: Actual data extraction

First Published: Sep 2012

Welcome to the third in our series on Literature Search and Analysis. In the first two articles we covered the search process – how to decide the best approach to your question, and then how to retrieve your data effectively and ethically. In this article we address the actual data extraction: how to build a good worksheet that best supports evaluation and analysis.

We should point out here that this advice is NOT intended for researchers undertaking Cochrane reviews, which are in a class of their own and have a set methodology for data extraction and analysis, as well as dedicated software for statistical analysis. In this series, we assume you are a beginner.

So here you are, with a huge pile of papers and some sort of report to write at the end of it that answers your research question. What’s the best way to begin to make sense of it all?

There are 5 key factors to consider for the analysis of your data that will help ensure you assess them properly:

  • Decide your parameters – go large first?
  • Spreadsheets and filters
  • Consistency in your definitions
  • An independent eye
  • Keeping the end in mind

Deciding the Parameters

The items that you want to include in your data analysis and extraction are decided by the research question and the inclusion and exclusion criteria.

The research question will undoubtedly have the most influence on the parameters you want in your spreadsheet, and the degree to which you want to analyse them. For example, if you are considering the quality of life associated with a condition, do you need to separate out the methods used to measure it? The age of the patient? The comorbid conditions? The types of treatment the patients are receiving? If you are not very familiar with your topic at the start of the project, you may find that there are factors that could impinge on the answer to your research question that you were unaware of (let’s face it, this is probably the reason you are doing the literature review in the first place!). The point here is to ensure that you don’t analyse a large number of papers and then find that you have to go back and review them all again, either because you did not analyse them in enough detail or because you overlooked an important factor that may affect the interpretation of your findings. Hence our suggestion to “go large”, at least initially, rather than keep your number of parameters (columns on a spreadsheet) small. A pilot study of about 10 representative papers, before you perform the research in earnest, can help you decide which parameters to include or exclude.

Let’s consider the inclusion and exclusion criteria. These should have been set clearly at the beginning, but sometimes it’s only when you get right into the data that you find you should have been more stringent (or expansive) on some of these elements.

For example, we have been working with a client on a particular project that was designed to look at the methodology used to determine costs associated with a common hospital-acquired infection. The aim was to determine true costs of the infection, by looking at when and how the infection was contracted – either as a consequence of hospitalisation, or as the original reason the patient was hospitalised. However, as the data were collated it was found that, in a large number of studies, costs could not be reasonably attributed to infection from the way they were reported. In this instance, perhaps the criteria could be refined to exclude these studies.

However, more often you may find that your exclusion criteria have been too rigid, and there is so little evidence that no conclusion can be formed. In this instance you may need to go back and re-think your inclusion criteria – perhaps, for example, also including data from observational studies.

When thinking about inclusion criteria for studies, you should think in terms of levels of evidence for your own clarity of mind. While the perception of value of data at different levels has changed in recent years, it still forms a useful classification, particularly when considering inclusion criteria. This URL will give you a classification used by the Centre for Evidence-Based Medicine http://www.cebm.net/index.aspx?o=1025 which is very complex; there are plenty of others out there that are simpler. Just look for levels of evidence in Google images and you will find a large number to choose from. My advice is to find one from a reasonable source and stick to it. And of course, explain what you have done in your report methodology, which we will discuss in the next issue.

The important issues when having to rethink inclusion or exclusion criteria are:

  • Do we need to revisit all the studies again? (The answer to this is usually yes, just to be sure.)
  • Will the alteration in the criteria affect the analysis or assessment method? (Probably.)
  • Can we still answer our question with certainty? (Yes, or you shouldn’t alter them.)


Spreadsheets and Filters

The second key success factor is the design of the spreadsheet to be used for the data extraction. The important thing here is to try to make the content of the worksheet as simple and as similar as possible without deviating from what the authors meant. The aim is to be able to create something that is filterable (e.g. by patient characteristics, stage of disease, treatment etc.). Perhaps this isn’t as important when you only have 10 studies to analyse, but it’s vital when you have a large number and are trying to draw conclusions from it all.

Some simple words of advice:

  • Keep the flow of the spreadsheet columns logical and consistent with your papers, and you will find it easier to fill. The less jumping between different worksheets you have to do, the fewer the inaccuracies you will generate by putting data inadvertently in the wrong columns. If more than one person is extracting data, make sure that you have the columns in the same order in everyone’s copy of the worksheets.
  • If you are dealing with a lot of papers, use a system that keeps them under control and easily found again. Rather than put them in alphabetical order, I tend to give them a number and ensure that the number column is on every worksheet, particularly if I have so many parameters that I need a number of worksheets to keep them manageable.
  • Keep one worksheet exclusively for the citation reference, breaking this down so that the authors are in different columns from the title, the date, the journal and the citation details. It’s very helpful if you can put all this information straight into your worksheet from PubMed; this keeps the details accurate, which will save you time in constructing your bibliography. You can pull citation details straight from PubMed without having to cut and paste by using reference management software – and if you have this, and find it easy to use, it’s certainly an advantage. However, it’s unlikely that you will be able to add and evaluate all your parameters within such a database, unless you are using the RevMan software used for Cochrane reviews. (And if you are, this article is WAY below your level of capability – stop reading now!) So normally another step is required to extract the information from the reference management software into your working spreadsheet; not a difficult task in itself, but make sure you know what you are doing before you attempt it. There is nothing worse than ruining all your work by carrying out an extraction that corrupts your data.
  • Don’t forget copyright constraints, and don’t be tempted to cut and paste parts of the text of the article straight into your worktable – this is illegal. You can add an author’s comment in quotes if you plan to quote directly in your report, but even this borders on breach of copyright if too extensive. Paraphrase to be on the safe side, but make this as accurate a reflection of the authors’ intention as you can. See below for hints on consistency.
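
As a rough illustration of the advice above, the sketch below keeps citation details on one sheet and extracted parameters on another, links them by a paper number, and then filters the extraction sheet. All column names and values are invented for the example; a real worksheet would use the parameters that follow from your own research question.

```python
# Hypothetical worksheets: every column name and value below is illustrative only.
citations = [
    {"paper_no": 1, "authors": "Author A; Author B", "title": "Example study one", "year": 2009, "journal": "Journal X"},
    {"paper_no": 2, "authors": "Author C", "title": "Example study two", "year": 2010, "journal": "Journal Y"},
    {"paper_no": 3, "authors": "Author D; Author E", "title": "Example study three", "year": 2011, "journal": "Journal X"},
]

extraction = [
    {"paper_no": 1, "study_design": "RCT", "population": "adults", "qol_measure": "SF-36"},
    {"paper_no": 2, "study_design": "cohort", "population": "adults", "qol_measure": "EQ-5D"},
    {"paper_no": 3, "study_design": "RCT", "population": "adults", "qol_measure": "EQ-5D"},
]


def filter_rows(rows, **criteria):
    """Return the rows whose columns match every given criterion exactly."""
    return [row for row in rows
            if all(row.get(col) == val for col, val in criteria.items())]


def with_citation(rows, citation_rows):
    """Attach citation details to extraction rows via the shared paper number."""
    by_number = {c["paper_no"]: c for c in citation_rows}
    return [{**by_number[row["paper_no"]], **row} for row in rows]


adult_rcts = filter_rows(extraction, study_design="RCT", population="adults")
report_rows = with_citation(adult_rcts, citations)
```

Because the paper number appears on every sheet, exact-match filters like this only work if each column is filled with consistent wording, which is the subject of the next section.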

Consistent Definitions

Always record the definitions that you are using somewhere on your worksheet. For example, definitions of disease severity (if these are not standardised), types of procedures, etc. It can be easy, particularly where the area is complex and researchers often vary in their definitions, for these to slide during your data extraction and analysis. You will be grateful to have these in front of you when you are working on your 300th paper in the dead of night.
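
One way to stop definitions sliding is to check each coded entry against the agreed list. The following is a minimal sketch under assumed names: the `DEFINITIONS` table, its categories and the column names are all hypothetical, and your own definitions sheet would supply the real ones.

```python
# Hypothetical definitions sheet: the allowed wording for each coded column.
DEFINITIONS = {
    "severity": {"mild", "moderate", "severe", "not reported"},
    "study_design": {"RCT", "cohort", "case-control", "case series"},
}


def find_undefined_values(rows, definitions):
    """List (paper_no, column, value) for every entry outside the agreed definitions."""
    problems = []
    for row in rows:
        for column, allowed in definitions.items():
            value = row.get(column)  # None when the cell was left empty
            if value not in allowed:
                problems.append((row["paper_no"], column, value))
    return problems


# Illustrative entries: paper 2 uses non-standard wording, paper 3 has a gap.
entries = [
    {"paper_no": 1, "severity": "moderate", "study_design": "RCT"},
    {"paper_no": 2, "severity": "Moderate-to-severe", "study_design": "cohort"},
    {"paper_no": 3, "severity": "mild"},
]
issues = find_undefined_values(entries, DEFINITIONS)
```

Running a check like this every few dozen papers flags drift early, while it is still cheap to correct.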

If you intend to do statistical analyses on your findings, rather than just a descriptive evaluation, you need to be extra careful about your parameters and make sure that your data entry is absolutely accurate.

Setting your definitions will be helped by…

An Independent Eye

Having another experienced person check your reasoning, your definitions of your parameters, and your first dozen or so data extractions is of infinite value.

This person should be unassociated with the project, so they bring a fresh perspective to it. Our writers, when doing these reviews, always check out the parameters and the first ten entries in two separate stages, so our clients get two chances to amend the parameters before the bulk of the work is done. And our internal reviewers have a lot to say at this point, also. Nonetheless, often a third independent look at the work in progress will discover a potential issue in the analysis that the two earlier ones did not.

In formal systematic literature reviews, a second reviewer is often part of the methodology, and what to (or what not to) include is a collaborative decision. However, I would advocate also having someone else review this, as it is easy, particularly with a very large project, for both researchers to become so involved in the study that they no longer see the bigger picture and may wander from the point, which brings me to the last key success factor.

Keeping the end in mind

The last thing to remember is to keep your research question uppermost in your mind at all times. It can be very easy to follow interesting but distracting side routes with the data that leave you feeling intrigued and informed, but haven’t helped you find the answer to your research question!

Our advice to “go large” with your parameters at first can encourage this; it is always a choice between wandering from the point (by including too many parameters) or having to re-analyse your papers (because you didn’t make your extraction detailed or comprehensive enough). Keeping your research question at the forefront will do a lot to help you steer your way through the extraction and assessment processes successfully, and ensure that your final report answers the research question as fully as possible.

At Rx Communications we have recently been developing proposals for a number of literature review projects – including one where an original review and analysis required updating before publication, and others that were designed to find answers to fairly obscure or unusual questions. We find, at the start of the year, many of our clients looking for vendors for these types of projects, and so we thought it might be useful to share our experiences. Rx has been performing literature searches and reviews since the company began; we are one of the few boutique medical communication agencies to have a dedicated information manager who specialises in this work. With all the literature benchmarking, systematic and comprehensive reviews we have undertaken over the years, we have developed a useful 4-step thought process that helps ensure our clients get the results they want. Here’s hoping it gets you the results you want too.

Step One – Decide what the question REALLY is.

Although this seems fairly obvious, it is often the make or break point of success, and we find that clients often haven’t thought through what it is they really want. This can make a tremendous difference to the size and complexity of the project. For example, is this research going to inform future drug or device developments or in-licensing opportunities? With a large future investment riding on the results of the search, you would do well to spend a little more on the literature search and analysis to ensure that your future investments will be well spent. Is this search (for example, of study methodologies) going to inform the way future clinical trials will be conducted, or perhaps establish the basis for an HTA submission? In this case inclusion and exclusion criteria need to be very clearly defined so that only the most robust data are collected. Are you looking to publish the results as a foundation or background to your own development research? It must be very clear that data have not been ‘cherry picked’, and that the search has not been skewed to omit unfavourable results or a competitor’s pivotal study.

On the other hand, you don’t want to end up analysing every single citation that may be identified in a comprehensive search, if all you want to know is the most common study approach taken to determine the parameter you are interested in. Part of deciding what the question is, is the issue of budget – do you need to know absolutely everything about your chosen question, or is a general indication sufficient?

Step Two – Choose the right literature resources

Once you have a better idea of what you need from the literature review, the next step is to determine which databases have the best results for the field of interest. In general, the four most important databases in the medical scientific field are MEDLINE, EMBASE, SCISEARCH and BIOSIS, but this is by no means guaranteed for any particular search requirement. For example, for health economics topics it may be better to use the HEED bibliographic database; for safety data TOXFILE might be useful; in another setting a good database to use might be the Cochrane literature reviews.

Rx uses ProQuest – a search engine that accesses 64 relevant databases and that recently superseded Datastar. The search engine used will have some effect on the results, but in general it is best to use one that allows the removal of duplicates – otherwise costs increase considerably. However, if other databases or sources (e.g. abstract books, grey literature) are deemed to be important but are not available through the main search engine, these can be added to the results and de-duplicated manually. Normally, for a comprehensive search strategy, Rx would use the 4 databases that retrieve the most citations, particularly if the number of duplicates is low.

PubMed is a logical start because it doesn’t cost anything, but researchers should be aware that PubMed and MEDLINE are not exactly the same entities, and switching search engine from PubMed to MEDLINE can also make a slight difference to the results. Depending on the therapeutic area in question, PubMed will usually pick up between 60 and 80% of the available published full papers; hence the need to use more than one literature source if the results must be definitive. We always include some form of benchmarking for databases and search engines to get a feel for the margin of error.
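
When results from extra sources have to be de-duplicated by hand, a short script can do the first pass. This is only a sketch: the record fields (`doi`, `title`, `year`, `source`) and the example records are assumptions, and the matching rule is deliberately simple. It will not spot the same paper listed once with a DOI and once without, so a manual check is still needed.

```python
import re


def normalise_title(title):
    """Lower-case a title and strip punctuation and extra spaces."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", title.lower())
    return " ".join(cleaned.split())


def record_key(record):
    """Identify a record by DOI if present, else by normalised title plus year."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    return ("title", normalise_title(record["title"]), record.get("year"))


def deduplicate(records):
    """Keep the first occurrence of each record, preserving the original order."""
    seen = set()
    unique = []
    for record in records:
        key = record_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Illustrative records only; the DOIs and titles are made up.
records = [
    {"source": "MEDLINE", "doi": "10.1000/example.1", "title": "Treatment response in psoriasis", "year": 2010},
    {"source": "EMBASE", "doi": "10.1000/EXAMPLE.1", "title": "Treatment response in psoriasis.", "year": 2010},
    {"source": "abstract book", "doi": None, "title": "Switching therapy: a cohort study", "year": 2011},
    {"source": "grey literature", "doi": "", "title": "Switching Therapy - A Cohort Study", "year": 2011},
]
unique_records = deduplicate(records)
```

Listing the main databases first means their records are the ones kept when a duplicate turns up in a secondary source.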

Step Three – Choose the right search strings and limits

Once the right databases have been selected, the search string needs to be constructed. Although in most databases the order of the terms won’t make a difference, this must be thoroughly checked because the optimal order may differ between databases. In addition, limiting the search to just titles, abstract and title, or using a full text search will yield considerable differences.

Testing of search strings, the terms used and their order, and whether a truncated term and asterisk gives better results, is an iterative process. The construction of search strings will differ considerably between search engines, so it may not be possible to exactly duplicate the results using the same search string if the search engine is not known. Experienced personnel who understand the search engine requirements and the way each bibliographic database is indexed should undertake the construction of search strings, together with researchers who understand the subject area and can advise on alternative terms to ensure information is not missed because different terminology has been used.
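
Because testing search strings is iterative, it can help to assemble them from lists of terms so that each variation is easy to record and repeat. The sketch below is an assumption-laden illustration: the helper names are invented, the terms are borrowed from the sample psoriasis search shown later in this article, and the exact syntax for truncation, phrases and field limits differs between search engines.

```python
def or_group(terms):
    """Join alternative terms with OR, quoting any multi-word phrase."""
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_search_string(*term_groups):
    """AND together several OR-groups, one group per concept."""
    return " AND ".join(or_group(group) for group in term_groups)


condition_terms = ["psoriasis", "psoriatic arthritis"]
response_terms = ["outcome*", "respon*", "nonrespon*", "switch*"]

query = build_search_string(condition_terms, response_terms)
# A broader variant for comparison, adding treatment-related terms.
broader_query = build_search_string(condition_terms, response_terms + ["treat*", "therap*"])
```

Keeping the term lists alongside the counts each variant returns gives you a record of the testing that can go straight into the methodology section.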

Deciding the fields in which to search is crucial. Searching on only the “title” field may not pick up relevant content; nor will searching on the “abstract” field, particularly if the search is for some secondary endpoint or minor element of study results. Although searching on the full text will pick up a great deal of literature that is not useful and must be discarded, we prefer this approach because in our experience we have picked up many pivotal papers that would have otherwise been missed.

The table below shows the differences between adding terms involved in treatment (rows) or in searching on titles, abstracts or full text (columns). In this particular instance we wanted to determine how treatment responses differed in adult patients with psoriasis. This table is a sample for demonstration purposes only – so for brevity’s sake (and for confidentiality reasons) we have not included the full brief, methodology or study design.

| Search # | Search string | No. of results searching titles only | No. of results searching titles and abstracts | No. of results searching full manuscripts |
|---|---|---|---|---|
| 1 (Medline, Embase) | (psoriasis OR “psoriatic arthritis”) AND (outcome* OR respon* OR nonrespon* OR switch* OR heterogen* OR predict* OR mediat* OR moderat*) Limits: After 1 December 2001; Humans; Adults; Language: English | 629 | 3348 | 5030 |
| 2 (Medline, Embase) | all((psoriasis OR “psoriatic arthritis”) AND (outcome* OR respon* OR nonrespon* OR switch* OR heterogen* OR predict* OR mediat* OR moderat* OR treat* OR medic* OR manag* OR therap*)) Limits: After 1 December 2001; Humans; Adults; Language: English | 2219 | 5441 | 9525 |

Using the example above, when the first few pages of titles from the results of rows 1 and 2 were compared, to determine if any useful literature had been missed by not including the “treatment” related terms, it was evident that approximately half of the relevant literature had been omitted.

In addition, an even larger amount of literature is retrieved if the abstract or the full text of the article is searched for the relevant terms. When the first few pages of the results from columns 1 and 2 were examined, 15 relevant papers out of the first 100 had been missed by searching on titles only. As can be seen above, it is possible that this particular search performed on titles only (which incidentally is what the NICE guidelines for literature reviews recommend!) picked up less than a tenth of the potential literature. This may not be an issue if one is looking for a simple answer to the question “IS there a difference in treatment response in adults with psoriasis?”, but if one wants to understand HOW a difference manifests and in WHAT PROPORTION of patients, then the smaller number of citations may be insufficient for an accurate answer.

Please note, we are not saying here that, as a rule of thumb, you will miss 50% of relevant literature by leaving out key words, or that you will miss 15% of the relevant literature by searching on titles only. These are our results for one search series in one therapy area on a particular day. Our point is that you should clearly establish in your search strategy how much data you may be missing using your chosen method, and make a judgement call on whether you can afford to miss it or not.
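
One simple way to put a number on what a narrower search may be missing is to compare it against a broader search on a sample you have screened by hand. The sketch below assumes you have record identifiers for both result sets and a list of the sampled records judged relevant; the function name and the identifiers are hypothetical.

```python
def missed_fraction(narrow_ids, broad_ids, relevant_ids):
    """Share of the relevant records in the broad search that the narrow search missed."""
    relevant_found = set(broad_ids) & set(relevant_ids)
    if not relevant_found:
        raise ValueError("no relevant records in the broad search sample")
    missed = relevant_found - set(narrow_ids)
    return len(missed) / len(relevant_found)


# Illustrative identifiers only.
titles_only = {1, 2, 3}
full_text = {1, 2, 3, 4, 5, 6, 7, 8}
judged_relevant = {2, 3, 5, 7}

share_missed = missed_fraction(titles_only, full_text, judged_relevant)
```

The result describes that one sample only, so it is best reported as an indication of the margin of error rather than as a general rule.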

Step Four – Use the right inclusion/exclusion criteria – a pilot phase?

Naturally, the right inclusion and exclusion criteria must be set a priori and stated in the methodology before the data extraction begins, if the analysis is to be systematic and consistent. However, a pilot phase can be a useful means of determining if the searches have in fact yielded the relevant results, and have not excluded anything important. This helps prevent a change in plan after the analysis begins, when it is discovered that the criteria are too stringent or are so broad as to waste the researcher’s time. A pilot phase has the advantage of ensuring that a single search sequence used on a particular day can then be replicated if need be by other researchers, giving you confidence that your final search methods will be transparent and repeatable. A rapidly conducted pilot study would help determine if more stringent exclusion criteria could limit the number of papers to be analysed, without losing important information. This may reduce the number of papers to a more manageable amount.

In summary then, these are the first 4 steps to get right in the lengthy process of performing a good literature review.  Good luck with your searches, and if you want any more information about how to put together a good search there are several people to talk to at Rx who would be happy to give you more insights.  For getting the question right, contact Ruth or Caroline; for deciding on databases and perfecting search strings, William is your man.  And we can all help with inclusion and exclusion criteria.

Understanding Data Retrieval

In this article we continue our discussion of the best way to conduct literature searches, review and analysis.

This is the second phase in the Literature Review process: actually getting hold of the right data. If you have been following our series, we covered the search process in our first article as 4 steps. Firstly, decide what your question really is, and take the appropriate path to answer that question. In a nutshell, don’t go overboard on highly systematic comprehensive reviews if you want a yes/no answer, but also be aware of the search process pitfalls if your question is more complex. Secondly, choose the right literature sources. This choice can be influenced by data retrieval, which is discussed in this article. Thirdly, choose the right search strings and limits (we advise getting a professional to do this if possible). And finally, select the right inclusion and exclusion criteria. It can be helpful to run a pilot phase to confirm that the limits you have set actually work for your particular research project.

Now read on… At this point, we have assumed that you have run your searches, decided your criteria and are now getting set to retrieve your data. Here’s where the expenses of the study start to skyrocket, so it can be very useful to know how you might be able to reduce the costs by some judicious rethinking. You have 3 key factors to consider at this point: Search vs. Retrieval Costs, Project Timelines vs. Cost Savings, and Copyright Compliance. Let’s get Copyright Compliance out of the way first.

Copyright Compliance

In the current environment, it’s essential to cover the thorny problem of copyright, which most academics have been ignoring for years. However, ignorance of copyright constraints is no excuse – for institutions that have been caught breaching copyright law, the penalties can be financially very painful. Copyright is treated differently from country to country – and in the UK it is a very convoluted and complex matter. So the first thing to do is to familiarise yourself with the particular laws in your own country.

Copyright Conundrums in the UK

When writing a manuscript and getting it reviewed by a colleague, the original writing is owned by the writer but the reviewer changes are owned by the reviewer – one reason why, in Rx, our writers are bound by copyright deeds that transfer everything to us so that we can in turn transfer it to our clients in the knowledge that it cannot be challenged in the future.  Most journals require authors to transfer copyright to the journal for this reason – although many authors are now challenging this stricture on the reuse of their own material.

In the UK we have a non-profit organisation known as the CLA – Copyright Licensing Agency (http://www.cla.co.uk/). This organisation covers printed material from 29 countries and digital material from 6 countries. Most UK medical communications companies and all research organisations should have a CLA licence – the minimum cost of which is around £130 annually or about £25 per employee for a business licence. Costs have been drifting upwards for years; our licence costs have more than quadrupled since we started business. And as costs have risen, the interpretation of what is and isn’t permitted under copyright has become more stringent and yet more confusing, particularly for medical communications agencies such as ourselves. Cutting through the verbiage of the legalese in our CLA Licence Agreement, it appears that the Grant of Licence allows us to:

  • Make paper copies and distribute them to authorised people within the UK
  • Scan Material Licensed for Scanning to produce Digital copies unless we already subscribe to a digital version
  • Make digital copies solely within the licensee’s intranet
  • Store licensed digital copies on the intranet for a maximum of 30 days

In the most stringent interpretation of this licence it appears that you could be breaking the rules every time you:

  • Send digital or paper copies to reviewers or writers outside the UK
  • Keep digital copies for longer than 30 days (but let’s face it – most lit reviews take far longer than 30 days to complete!)
  • Don’t pay another copyright fee for any papers you share with your client
  • Send digital or paper copies to a consultant/author who is not within your client company.

Another highly unenforceable part of this copyright law is the potential prohibition of any annotations on the paper – i.e. highlighting sections, making notes or comments on it.  I personally find this untenable – what researcher doesn’t scribble their comments in the margins of their source material?  And many pharma clients demand annotated papers as part of their quality review and approval processes, so if this component of copyright law is upheld, the implications could be staggering. It seems that the only acceptable annotations are those for teaching purposes.  However, the legalities are evolving continuously, and the CLA are amending licences for different organisation types e.g. for public relations agencies.  I believe some of these more ambitious strictures will be lifted soon, but at present be afraid, be very afraid.  Only individual researchers using one copy of the material for personal use are exempt.

So,  in summary:

  • Find out what copyright law exists in your country and follow it
  • If in doubt, follow at least the minimal CLA requirements
  • Don’t copy and distribute your retrieved articles to all and sundry

Search vs. Retrieval Costs

Anyone who has performed searches on large database engines knows the costs of the searches alone used to be significant, even without any substantive data retrieval; however, charges for just searching seem to be moving out of fashion at last. Rather than a Pay As You Go system, commercial search engine companies are asking for monthly standing charges, independent of the amount of time spent searching. For example, when ProQuest replaced Datastar, charges altered – even searches on previously expensive bibliographic databases now cost nothing until you download documents.

Fortunate academics who have unlimited access to library systems, and employees of big pharmaceutical companies with library and information managers at their service, are probably shielded from search and retrieval costs to a certain extent. Believe me, they can be horrific. So use PubMed (www.ncbi.nlm.nih.gov/pubmed/) first as it’s free. (We should all offer up a vote of thanks to the US National Institutes of Health who have provided us poor but keen scientists with such a magnificent resource.) However, as we mentioned in our earlier article, if you are performing searches on more than one bibliographic database, you will need a search engine that removes duplicate references from your search list, and these are the ones that cost money. Wikipedia gives a good list of bibliographic databases/search engines that cover scientific fields if you are looking for free alternatives.

So, in summary:

  • Start with PubMed (http://www.ncbi.nlm.nih.gov/pubmed/)
  • If funds are very limited, use the Wikipedia list to find free databases that cover your topic
  • Use a service that will remove duplicates from your searches of more than one database

Project Timelines vs. Cost Savings

The first thing to note is that document suppliers will charge to supply documents that are freely (and legally) available, so it’s important to check whether the reference is classed as “open access”. The easiest way to check this is to search for the document on PubMed; if it is in their database then a link to the publisher’s site is usually shown. This link will often say whether the document is free – look for terms such as “open access” or “free full text”. Many of the free papers are available on PubMed Central (http://www.ncbi.nlm.nih.gov/pmc/), which is a valuable resource. Now here’s the bit you may not know: not all the free papers are marked as being free on the publisher’s link. The only way to be sure is to proceed to the publisher’s site and attempt to acquire the paper. At this stage you will probably be confronted with a request for payment, but in some cases the reference is available for download without any mention of it being free. It’s also worth noting that following a search in PubMed, it’s often possible to filter the results to display only free articles.

Another useful resource is the Free Medical Journals site (http://www.freemedicaljournals.com/). This is not a document supplier but simply provides information on journals that offer access to free papers. Several mainstream medical journals offer free access to papers that are over a year old – this time limit can vary but you will rarely find that a newly published article is free.

So where does this leave you as regards cost savings? In our agency, we tend to decide our approach on the amount of time we have – if a project is very urgent we have to retrieve our articles by the fastest route, which generally means Infotrieve (http://www.infotrieve.com/document-delivery-service) or the British Library (http://www.bl.uk/reshelp/index.html). One thing to be aware of is that the British Library will supply encrypted documents with a time limit (i.e. you cannot open the .PDF file after 30 days, so you have to print it out), which adds to the inconvenience. And naturally, the faster the retrieval, the greater the cost. As the costs can be over £100 per article, this is very important – is the project REALLY so urgent? It is worth checking before spending large amounts of your project budget.

We tend to try to find as many articles as possible at minimal or no cost first, before buying the newest articles from a document provider. Another way to save money is to pay to download only the abstracts and titles via the search engine, and then order the full papers only for the references that are the most promising. Some analyses (often the ones where the research question is the yes/no kind) can be performed solely or mainly on abstracts, so you can largely avoid the cost of the article retrieval. Even for these particular analyses we tend to go to PubMed first, and only use document providers for the abstracts of papers that are not available from anywhere else. However, all these approaches add time to the project so it is always a compromise: time or cost.

The table below summarises the factors you need to weigh up in determining how to go about your data retrieval, and which choices are likely to lower or raise your costs:

| Factor to consider | Cost-increasing | Cost-saving |
|---|---|---|
| What is the nature of the research question? | Needs in-depth analysis | Needs a yes/no answer |
| Can we depend on abstract analysis? | No | Yes |
| What are the timelines? | Short | Not so urgent |
| What percentage of the published research has occurred in the last 12 months? | Most of it | Not so much |
| Are most of the papers available in PubMed? | No | Yes |
| Are there other free databases that can be used? | No | Yes |
| Do the most common journals in the search have an open access policy? | No | Yes |
| Do other people require copies of the papers? | Yes | No |
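
To see how the time-versus-cost compromise plays out for a given project, a back-of-envelope estimate is often enough. The sketch below is purely illustrative: the function name, the number of papers, the share available free and the £25 standard price are all assumptions, and only the figure of roughly £100 per urgently supplied article comes from the text above.

```python
def estimate_retrieval_cost(n_papers, free_share, cost_per_paper):
    """Estimated spend on full papers once the freely available ones are set aside."""
    if not 0 <= free_share <= 1:
        raise ValueError("free_share must be between 0 and 1")
    paid_papers = round(n_papers * (1 - free_share))
    return paid_papers * cost_per_paper


# Hypothetical project: 200 papers wanted, 40% of them available free.
standard_cost = estimate_retrieval_cost(200, 0.4, 25)   # assumed standard price per paper
urgent_cost = estimate_retrieval_cost(200, 0.4, 100)    # roughly the urgent price quoted above
```

Re-running the estimate with a smaller number of papers, for example after screening on abstracts first, shows quickly how much an extra screening step might save.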

And of course, if you need any help with deciding your strategy or managing your retrieval costs, don’t hesitate to give us a call for advice.

Ruth Whittington
CEO of Rx Values Group Ltd
MSc(hons), NZSRN
Copyright Rx Communications Ltd