Sampling for Internet Surveys. An examination of respondent selection for Internet research.

 

by Nigel Bradley, University of Westminster

 

Abstract

 

The traditional methods of probability and non-probability sample selection are applicable to Internet surveys and respondents can be selected in various ways, some of which are unique to the Internet. Specific to the Internet, and important to note from a sampling viewpoint, is the self-completion method of data collection. For this article the method has been sub-divided into six categories: three web-page style questionnaires and three e-mail style questionnaires. Data that show the number of individuals with access to a networked computer are of limited use in sampling. Thirteen types of computer user are identifiable, each of which poses a challenge for sampling. Published examples illustrate how these principles have been applied in practice. Two techniques, saturation surveying and sifting, may be usefully employed in some situations. There are numerous solutions to sampling problems for Internet research and many avenues for further inquiry.

 

Introduction

 

In 1998 ESOMAR issued a guideline paper for conducting Marketing and Opinion Research Using the Internet. The guideline acknowledges that the growth of the Internet "raises a number of ethical and technical issues which must be addressed"; it goes on to state that "it is not practicable to discuss in detail all the technical features of Internet research". It does, however, advocate that the researcher "follows scientifically sound sampling methods within the constraint of the medium". This article aims to support this ESOMAR guideline by discussing sampling methods that have been employed to date.

 

Basics of Sampling

 

Any survey is only as representative as the subjects chosen to be interviewed. This fact has led to a body of literature that examines how to select respondents; this literature distinguishes between probability and non-probability sampling.

 

Probability sampling is characterised by the fact that the sample is selected by chance: population members have a known, and sometimes equal, probability of being selected. The probability techniques which are most familiar to us include Simple Random, Systematic, Stratified and Cluster sampling. However, at least 32 different probability techniques can be derived from combinations of the aspects of element and cluster sampling; equal unit probability and unequal probabilities; unstratified and stratified selection; random and systematic selection; single-stage and multistage techniques (Malhotra 1999:335).
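
The three most familiar element-sampling techniques can be sketched in a few lines of Python. This is a minimal illustration only: the sampling frame, the stratum names ("home" and "work") and the sample sizes are hypothetical, and the systematic version assumes the frame size is a multiple of the sample size.

```python
import random

def simple_random_sample(frame, n, rng):
    """Simple random: every member has the same chance of selection."""
    return rng.sample(frame, n)

def systematic_sample(frame, n, rng):
    """Systematic: pick a random start, then take every k-th member."""
    k = len(frame) // n          # sampling interval
    start = rng.randrange(k)     # random start within the first interval
    return [frame[start + i * k] for i in range(n)]

def stratified_sample(strata, n_per_stratum, rng):
    """Stratified: draw a simple random sample within each stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

rng = random.Random(1)                            # fixed seed for repeatability
frame = [f"member{i:03d}" for i in range(100)]    # hypothetical frame of 100
srs = simple_random_sample(frame, 10, rng)
sys_sample = systematic_sample(frame, 10, rng)
strata = {"home": frame[:60], "work": frame[60:]}  # hypothetical strata
strat = stratified_sample(strata, 5, rng)
```

In each case the selection is left to the random number generator rather than to the researcher, which is what distinguishes these techniques from the non-probability methods.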

 

In contrast non-probability sampling uses human intervention. Non-probability techniques include quota sampling (used very commonly in UK personal interviewing), judgement (or purposive) sampling and convenience sampling. All of these are described adequately by their names. Other types of non-probability sampling are snowballing, whereby contacts provide other respondent names, and self-selection sampling, whereby respondents "volunteer themselves" to undertake the research. One might also add plausibility sampling - "a sample selected because it appears plausible that the members are representative of a wider population, without any real evidence" (Talmage 1988:82).

 

Full descriptions of these methods are readily available in standard Marketing Research textbooks. Here it is sufficient to state that each method has its own advantages and disadvantages and some techniques are more appropriate than others, depending on the aims of the study. Furthermore both probability and non-probability techniques can be applied to Internet research and if they are used in combination, the researcher has a choice of many sampling approaches.

 

Sample Sources

 

The sampling frame is "a list or set of directions for identifying the target population" (Malhotra 1999:330). In terms of the Internet, the sampling frame could be envisaged in two ways. The first is 'Internal', whereby respondents are found on the Internet itself - either as visitors to web sites or among listings of e-mail addresses. The second is 'External' whereby respondents are found elsewhere - perhaps from panels or from paper directories; these respondents are then 'invited' to the Internet which is used as a data collection medium.

Some methods are shown in Table 2. The following sources, specific to the Internet, deserve a few words of explanation.

Pop-up Surveys are questionnaires that appear in a window on the screen. They can be used to select a random sample of visitors to a web site. The software may trigger an invitation to participate, or an actual questionnaire may appear. The appearance of this window is triggered by some mechanism: it may occur after a specified time interval, after a specified number of visitors have viewed the page, or when specific parts of a web site are visited. It is argued that the "pop-up survey" gives a better response rate than a simple banner invitation which is "fixed" to the page (Malhotra 1999:350-1). Both methods are currently popular among practitioners. A variation of the pop-up survey is the use of browsing intercept software, which is used "to actively intercept and redirect" a sample of visitors to a survey request page (Pfleiderer & Gente 1998:248).
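
Two of the simplest triggering rules can be sketched as follows. This is an illustration under stated assumptions, not a description of any particular pop-up product: the function names, the interval of 20 and the 5 per cent rate are hypothetical, and the visitor number is assumed to come from a counter kept by the web server.

```python
import random

def nth_visitor_trigger(visitor_number, interval):
    """Systematic rule: invite every `interval`-th visitor."""
    return visitor_number % interval == 0

def random_trigger(rate, rng):
    """Random rule: invite each visitor independently with probability `rate`."""
    return rng.random() < rate

rng = random.Random(42)          # fixed seed for repeatability
visitors = range(1, 1001)        # hypothetical visitor numbers 1 to 1000

invited_nth = [v for v in visitors if nth_visitor_trigger(v, 20)]
invited_random = [v for v in visitors if random_trigger(0.05, rng)]
```

The systematic rule invites exactly one visitor in twenty, whereas the random rule invites about one in twenty on average, so the number of invitations it issues varies from run to run.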

If the number of visitors is used to trigger a pop-up survey or browsing intercept software, the researcher should be aware that accurate web site audience measurement is still under debate (Smith 1998:562, Foan & Read 1999). Several factors can cause traffic to be overestimated or underestimated. These include mirror sites, cache facilities, frames and robot activity.

 

 

The term Interest Group has been used to embrace the many terms used by Internet users; these include Discussion Group, Newsgroup, Usenet, Discussion Lists, E-Mail Lists, ListServ and ListBot.

Hypertext links can be made from host sites, search engines or elsewhere.

Harvested addresses are e-mail addresses that appear on web-sites and have been collected (or harvested). This may be done by hand or automatically by software. At least one national public e-mail directory is maintained automatically in this way (see Batagelj 1999:163). The method was also adopted for an ambitious study to identify good web-site design in Europe (see http://www.smeguide.gr/results.html). It should be said that there are hazards in using this automatic harvesting method. Two software packages are enjoying popularity. One is the 'Spam Bait Creator', which automatically creates web pages with bogus e-mail addresses. The logic is that these addresses will be harvested and will reduce database quality, thereby leaving list-sellers unable to operate. The second software package 'protects' the e-mail address that appears on the web-site: the visitor can use it, but robots cannot copy it. Password protected sites can also deter such automatic harvesting.
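
The mechanics of automatic harvesting can be illustrated with a short sketch. It is deliberately simplified: the pattern below is an assumption that covers only common address shapes, the page text is invented, and nothing in it can tell a genuine address from a bogus "spam bait" one.

```python
import re

# Deliberately simple pattern; real address syntax is more permissive.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def harvest_addresses(page_text):
    """Return unique addresses found in the text, in order of first appearance.

    Addresses are lower-cased so that the same mailbox written in
    different cases is counted once (an assumption that suits most,
    but not all, mail systems).
    """
    seen = []
    for match in EMAIL_PATTERN.findall(page_text):
        address = match.lower()
        if address not in seen:
            seen.append(address)
    return seen

# Hypothetical page text
page = "Contact Sales@Example.com or support@example.org. Again: sales@example.com."
addresses = harvest_addresses(page)
```

Removing duplicates at this stage matters for sampling, because an address listed twice in the frame would otherwise have twice the chance of selection.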

Some web sites ask the visitor to register with the site. This means that profile data is collected on Registration Forms. These forms may provide a suitable source of e-mail addresses and a basis for deciding on sample composition. However, surveys by the Graphics, Visualization and Usability (GVU) Center (see http://www.cc.gatech.edu/gvu/user_surveys/survey-1998-10/) indicate that about half of respondents "falsify the information" on such forms.

 

E-mail addresses themselves, however obtained, must also be treated with caution. A certain percentage of users change their Internet Service Provider, and with it their e-mail address. This "churn" can have consequences for under-representation, particularly, we might surmise, among price sensitive consumers. Additionally, users may hold multiple e-mail addresses, some of which may no longer be used (but not closed) or may not be checked regularly.

 

Internet Questionnaires

 

Also unique to the Internet, and important to note from a sampling viewpoint, is the self-completion method of data collection. To date Internet questionnaires have been delivered as a web page, as part of an e-mail message or as a combination of both methods. In all cases Internet questionnaires are CASI questionnaires (Computer Assisted Self Completion Interviews). The intention is to return such questionnaires to the sender on-line, although they can also be printed and returned by post or fax. Table 3 summarises the six main alternatives.

 

Web Page Questionnaires can be divided into three types. Type I is part of a web site "open" to any visitor; there is no control over who visits. This type includes the banner invitation. Type II is "closed", and respondents are invited to visit the site to complete the questionnaire, which may be password protected. Type III is "hidden", and the questionnaire appears to a visitor when triggered by some mechanism (e.g. date, visitor number, interest in a specific page etc.). This type includes the pop-up survey. For the purposes of this article we refer to these types as open-web, closed-web and hidden-web questionnaires.

 

E-Mail Questionnaires can also be divided into three types. Type I is a "simple" e-mail message with questions. Type II is an "attachment", which is delivered with a covering e-mail letter. Type III is "URL embedded", whereby an e-mail request for participation has a URL embedded in the message. The respondent simply clicks on this hypertext link, which then invokes their web browser, presenting the reader with a web-based questionnaire. For the purposes of this article we refer to these e-mail questionnaires as simple e-mail, e-mail attachment and e-mail URL embedded questionnaires.

 

These survey approaches can be used on the Internet, but also on closed computer networks (the Intranet, the Extranet etc.); this is the reason for not limiting this article to the World Wide Web. The advantages and disadvantages of these different approaches have been documented (see Witt & Poynter 1998, Frost 1998).

 

Respondent Considerations

 

It is pertinent, from a sampling viewpoint, to note that computers with an Internet connection have different capabilities. For example, some machines can only be used for e-mail, some can only access the World Wide Web, some can only access intranets, some e-mail software cannot invoke a browser, and some e-mail software cannot view attachments. Many television receivers that can access the Internet have such limitations. The use of web-based e-mail facilities (for example Yahoo free e-mail addresses) is another variation that must be acknowledged. It is also relevant to say that computer users have varied technical capabilities, presumably a function of the resources available to them, their personal background and other factors.

These varied configurations of hardware, software and user ability have major implications for sampling. The commonly cited statistics, which tell us the number of network connections or the number of people with access to a networked computer, have a reduced usefulness. It is valuable, from a sampling perspective, to know the number of users of e-mail and the number who use the World Wide Web. This then leads us to seek data on how frequently users check their e-mail; how many messages are waiting; how often they visit any web-site; whether the Internet is used at work, at home or elsewhere; whether multiple e-mail addresses are operated; how such addresses are used; whether addresses are shared with other users; whether both use the same equipment; and so on. See Coffey & Johnson (1998:105) for a discussion on representing infrequent users in the sample.

 

Table 4 illustrates why different computer capabilities and different user capabilities must be considered in sampling. It shows thirteen types of computer user.

 

Type 10 does not have email installed and isn't competent to use it anyway. Clearly this is not a desirable situation for the administration of an email questionnaire. Furthermore it is not as uncommon as one might think. Many companies allocate email addresses to individuals with the intention of providing appropriate training and equipment. Type 7 does not have browser software (Netscape or Internet Explorer for example) and isn't able to use it anyway, again an undesirable situation for the administration of a web based questionnaire.

At the other extreme is type 1 who is able to receive the email URL embedded questionnaire and, as a user, is able to open the web-based questionnaire and to reply. Between these extremes are variations which effectively leave respondents unable to receive or respond to questionnaires.
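
The interaction between a respondent's configuration and the six questionnaire types can be expressed as a simple check. The sketch below is illustrative only: the capability flags are assumptions chosen for the example rather than the categories of Table 4, and the rules are one plausible reading of the constraints described above.

```python
from dataclasses import dataclass

@dataclass
class Respondent:
    # Hypothetical capability flags; not the article's Table 4 categories.
    has_email: bool
    can_use_email: bool
    has_browser: bool
    can_use_browser: bool
    can_open_attachments: bool
    email_launches_browser: bool

def reachable_types(r):
    """List the questionnaire types this respondent could receive and answer."""
    email_ok = r.has_email and r.can_use_email
    web_ok = r.has_browser and r.can_use_browser
    types = []
    if web_ok:
        types += ["open-web", "closed-web", "hidden-web"]
    if email_ok:
        types.append("simple e-mail")
        if r.can_open_attachments:
            types.append("e-mail attachment")
        # The embedded link is only useful if it opens a usable browser.
        if web_ok and r.email_launches_browser:
            types.append("e-mail URL embedded")
    return types

fully_equipped = Respondent(True, True, True, True, True, True)
email_only = Respondent(True, True, False, False, False, False)
unequipped = Respondent(False, False, False, False, False, False)

print(reachable_types(fully_equipped))
print(reachable_types(email_only))
print(reachable_types(unequipped))
```

A fully equipped and competent user can be reached by all six types, an e-mail-only user by the simple e-mail questionnaire alone, and a user with neither facility by none of them.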

In any survey the researcher will encounter these different types and different target groups will contain different proportions. This serves to underline the point that knowledge of the number of people "on line" or with a "networked computer" is informative, but of limited value for sampling.

 

Published Examples

 

We now move to published sources to identify several cases which exemplify the use of the different sampling techniques, different sampling frames and different questionnaires. They also give indications of how the problem of different types of computer user can be solved.

An example of the open web questionnaire was used in a study conducted out of Slovenia (Vehovar & Batagelj 1996). The site was "introduced to the public using login messages, WWW announcement, news in classic media". This is a clear example of non-probability sampling using convenience sampling, self-selection and almost certainly snowballing. Over 1800 users linked to the questionnaire and 1200 respondents answered the complete questionnaire.

 

A similar example of an open web questionnaire was used in a study among Flemish web users (Schillewaert et al 1998). "Respondents were recruited in four different ways, namely a one page press release in a weekly business news magazine, newsgroup postings, hyperlinks from other web sites and an e-mailing". This is a further example of self-selection, a convenience sample and a judgement sample. The exercise resulted in 353 respondents.

 

An example of the closed web questionnaire was used for a panel maintained by RelevantKnowledge (Sundberg-Cohon et al 1998). The panel of approximately 5000 US residents was derived from "a telephone frame, random-digit-dialling sample using selection systems provided by a nationally known supplier"; this is therefore a randomly selected panel. The recruited person was given an ID number and a URL. A questionnaire was completed and software was then downloaded for further questioning. Incentives were offered to panellists to ensure their continued co-operation.

 

An example of both an open web and a URL embedded e-mail questionnaire was described by Willner & Mayr (1999:126-127). A questionnaire was placed on the sites of advertisers and "asked only a few questions about demographics, Internet usage habits, and e-mail address". This questionnaire was used on 5-10% of visitors to the sites and appeared before the selected site appeared. The authors describe it as a random sample. After two days the e-mail addresses were used to send "an e-mail that contained a hyper-link to the second questionnaire".

 

An example of the e-mail attachment questionnaire, described as an "executable email questionnaire", was used in a UK employee survey (Walker, 1998:131). It was sent to over 400 staff members, and "all staff undertook the survey (with a small proportion without email access completing the survey on laptop computers)". This illustrates how the weaknesses highlighted in Table 3 can be overcome.

 

An example of the simple e-mail questionnaire was used in a study of subscribers to the ocean division of SCIENCEnet (Walsh et al 1992:242). It was sent to 300 subscribers selected as a stratified random sample, and achieved a 76 per cent response rate. Interestingly, an additional 104 people selected themselves into the survey.

 

An example of a simple e-mail questionnaire was described by Witmer et al (1999:148-151), in a study to examine response rates by length of e-mail questionnaires. A probability sampling approach was adopted. Firstly a list of newsgroups was created, then "inappropriate" ones were eliminated. This left 1,835 newsgroups. These newsgroups divide into 5 groups or hierarchies and the researchers "drew a stratified random sample from the appropriate newsgroup hierarchies". The strata were based on "the relative percentages of the total number of newsgroups represented by each hierarchy". This yielded 31 groups, and various policies for replacement were specified. Finally, 12 addresses of newsgroup participants were selected 'randomly' from each group.
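
Proportional allocation of this kind can be sketched as follows. Only the totals (1,835 newsgroups, 5 hierarchies, 31 groups and 12 addresses per group) come from the study as described above; the hierarchy names and sizes are hypothetical, and the largest-remainder rounding rule is an assumption, since the rounding used by the researchers is not stated here.

```python
def proportional_allocation(stratum_sizes, n):
    """Share n selections across strata in proportion to stratum size.

    Integer arithmetic is used throughout; the units lost to rounding
    down go to the strata with the largest remainders.
    """
    total = sum(stratum_sizes.values())
    shares = {name: size * n // total for name, size in stratum_sizes.items()}
    remainders = {name: size * n % total for name, size in stratum_sizes.items()}
    shortfall = n - sum(shares.values())
    for name in sorted(remainders, key=remainders.get, reverse=True)[:shortfall]:
        shares[name] += 1
    return shares

# Hypothetical hierarchy sizes that sum to the 1,835 newsgroups reported.
hierarchies = {"A": 700, "B": 500, "C": 300, "D": 200, "E": 135}

allocation = proportional_allocation(hierarchies, 31)
total_addresses = sum(allocation.values()) * 12   # 12 addresses per group

print(allocation)        # {'A': 12, 'B': 9, 'C': 5, 'D': 3, 'E': 2}
print(total_addresses)   # 372
```

With 31 groups and 12 addresses drawn from each, the design implies an initial sample of 372 addresses before any replacement.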

 

From these few examples, it is evident that the methods of sampling and questionnaire delivery can be modified according to the task in hand. When searching for examples it became apparent that experience to date is predominantly with non-probability methods.

 

Discussion

 

Sampling may not be appropriate for all Internet users. Indeed a technique called "Saturation Surveying" (see Turner 1989:260) could usefully be employed. This method attempts to survey all identifiable targets. The low cost of Internet research makes this possible. This method overcomes the lack of reliable sampling frames.

It is also pertinent to repeat that a sample of Internet users is only representative of Internet users. McDonald (1999) used the classic diffusion of innovation model to illustrate that users have different profiles. The model has innovators, opinion leaders, the early and late majority and laggards. McDonald classifies C2DEs as laggards. If such a model is correct, then this group is likely to be under-represented in Internet surveys. This may be good news for the market researcher who finds these groups are over-represented in personal survey methods, and a joint methodology could be planned.

There are other useful ways of approaching the issue of sampling. Watt (1997) distinguishes between three categories of sample: unrestricted, screened and recruited. Unrestricted samples are open to anyone, and suffer from poor representativeness. Screened samples may be more representative. Recruited samples are likely to come from a panel, and the panel method has been used with success in recent years.

 

Farmer (1998) introduced the term 'Sifting': "Sifting is used when a universe of potential respondents can be 'over-sampled'." Farmer explains that this works by taking any respondents who care to answer a web questionnaire and rejecting those who are defined as ineligible. This may go some way to adjust the overall composition of the final sample and overcome some disadvantages of self-selection methods, but clearly misses people who have not chosen to participate.
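
In outline, sifting amounts to applying an eligibility rule to whatever responses arrive. The sketch below is a minimal illustration: the response fields and the eligibility rule (adults who use e-mail weekly) are hypothetical and are not taken from Farmer.

```python
def sift(responses, is_eligible):
    """Split self-selected responses into accepted and rejected lists."""
    accepted = [r for r in responses if is_eligible(r)]
    rejected = [r for r in responses if not is_eligible(r)]
    return accepted, rejected

def adult_weekly_email_user(response):
    """Hypothetical eligibility rule for the target population."""
    return response["age"] >= 18 and response["uses_email_weekly"]

# Hypothetical self-selected responses to an open web questionnaire
responses = [
    {"id": 1, "age": 34, "uses_email_weekly": True},
    {"id": 2, "age": 16, "uses_email_weekly": True},
    {"id": 3, "age": 45, "uses_email_weekly": False},
    {"id": 4, "age": 29, "uses_email_weekly": True},
]

accepted, rejected = sift(responses, adult_weekly_email_user)
print([r["id"] for r in accepted])   # [1, 4]
print([r["id"] for r in rejected])   # [2, 3]
```

The rule can only remove ineligible volunteers; it cannot add the eligible people who never saw or never answered the questionnaire.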

 

For telephone sampling, Random Digit Dialling (RDD) has the potential to provide a true probability sample (Conway 1999:312). This could be a suitable way to harvest e-mail addresses. It can identify addresses that are used regularly, and a telephone stage to collect e-mail addresses can take place immediately before fieldwork, thereby avoiding the problems of "churn" described above.

 

Since Random Digit Dialling has the potential to deliver a true probability sample, perhaps this could be extended to Internet research. If email addresses could be represented as numbers, then a similar random selection could be made. This is an area worthy of investigation.

 

Another area that deserves more attention is the relationship between timing and sample composition. It is reasonable to suggest that questionnaires released at a particular time of day may lead to a particular sample profile.

 

In conclusion there are numerous solutions to sampling problems for Internet research and many avenues for further inquiry.

 

References

 

Batagelj Z & Vehovar V (1999). Web Surveys. ESOMAR Internet Conference, London Feb.1999 papers 159-176.

Conway S & Rogers S. (1999). Comparing apples and pears: are we seeing the end of valid sampling for telephone research surveys? MRS 1999 Conference Papers, UK.

Esomar (1998). Conducting Marketing and Opinion Research Using the Internet, Esomar Guideline, NL

Farmer T (1998). Using the Internet for Primary Research Data Collection. May 1998. http://www.researchinfo.com/library/infotek/index.shtml

Foan R & Read M (1999). Website Auditing. MRG Meeting, London 10 February

Forrest E (1999). Internet Marketing Research, Resource and Techniques. McGraw-Hill, Sydney.

Frost, Fraser (1998). Electronic Surveys - New Methods of Primary Data Collection. European Marketing Academy (EMAC) Proceedings of the 27th conference. Track 5.

Malhotra NK (1999). Marketing Research. An Applied Orientation. International Edition. 3rd edition Prentice Hall, London.

McDonald M & Wilson H (1999). e-Marketing: Improving Marketing Effectiveness in a Digital World. Financial Times Prentice Hall, London.

Pfleiderer R & Gente J (1998). A New Dimension of Internet Research. ESOMAR Internet Conference, Paris, Jan 1998 Papers 165-182.

Smith PR (1998). Marketing Communications. Kogan Page, London.

Sundberg-Cohon, J & Peacock, J (1998). Projectable Internet Panels. Using Traditional "Best Practices" in an Untraditional Environment. ESOMAR Internet Conference, Paris, Jan 1998 Papers 219-232.

Talmage PA (1988). Dictionary of Market Research, MRS/ISBA, London.

Turner WJ (1989). Small business data collection by area censusing: a field test of 'Saturation Surveying' methodology. Journal of Market Research Society Vol.3, No.2 April 1989.

Vehovar V & Batagelj Z (1996). The Methodological Issues in WWW Surveys. Paper presented at CASIC '96, San Antonio. www.ris.org/casic96/

Walker D (1998). Email Research. A New Window of Opportunity? ESOMAR Internet Conference, Paris, Jan 1998 Papers 117-133.

Walsh, John P. et al (1992). Self selected and randomly selected respondents in a computer network survey, Public Opinion Quarterly, Vol 56:241-244.

Watt J (1997). Using the Internet for Quantitative Survey Research. Quirks Marketing Research Review June 1997 Article number 0248. http://www.Quirks.com

Willner C & Mayr M (1999). Impact Test on the Internet. ESOMAR Internet Conference, London Feb 1999 Papers 119-129.

Witmer DF, Colman RW & Katzman SL (1999). From Paper-and-Pencil to Screen-and-Keyboard. pp 145-161 of Jones S. (ed) - Doing Internet Research. Sage, London.

Witt K & Poynter R (1998). The Do's and Don'ts of Internet Interviewing. ESOMAR Internet Conference, Paris, Jan 1998 Papers 165-182.

 


For any corrections to information on this page contact bradlen@wmin.ac.uk