occupyresearch - wip-Survey-Visualization-Using-Just-Spreadsheet

Visualizing Some Survey Data Using Spreadsheet Tools
Even without coding, there's much that can be accomplished with everyday tools. This work in progress describes how to use a spreadsheet to crunch open ended survey responses. We'll use the ORGS data as our example (you can read more about it here).

Step One: Grab some data

Questions 16 and 17 of the survey are free text, open ended questions. The first one asks for reasons the respondent participated and the second for reasons the respondent did not participate. The data consist of six columns. We'll copy these to a new spreadsheet. Here's the headers to these columns:

	A	B	C	D	E	F
1	q0016_0001	q0016_0002	q0016_0003	q0017_0001	q0017_0002	q0017_0003
2	WHYYES01	WHYYES02	WHYYES03	WHYNO01	WHYNO02	WHYNO03
3

Let's start by creating a single field for the yes text and a single field for the no text. We'll use this formula in field G3:

=SUBSTITUTE(LOWER(A3&" "&B3&" "&C3)," "," ")

This formula collapses double spaces to single spaces as it concatenates the three response fields and converts everything to lower case. We do the same for the "no words" (the responses to "reasons I did not participate").

Step Two: Adding a Little Bit of Context

We want to break the "utterances" up into single words but we want to be sure we retain context information -- who said it, what it was said with, was it in a "yes" or a "no" response.

But before we do this we will tag our data records (rows) with some information. We'll create a response number field and a code that will tell us how many reasons for participating and how many reasons for not participating.

The record number will be a new column A and we'll just autofill. We type 1 in row 3 and 2 in row 4, select both cells and drag down.

To code the response numbers we test each of the six data columns to see if they are blank or not. We can test one cell like this:

=if(B2="",0,1)

and we can combine that formula like this

=if(B2="",0,1)+if(C2="",0,1)+if(D2="",0,1)

Then we repeat this for the three "reasons why not" columns:

=if(E2="",0,1)+if(F2="",0,1)+if(G2="",0,1)

To put these into a single index we can multiple one by ten and then add the other like this:

=10*(if(B2="",0,1)+if(C2="",0,1)+if(D2="",0,1))+if(E2="",0,1)+if(F2="",0,1)+if(G2="",0,1)

This produces a coding that looks like this

0	No yes responses and no no responses
1	No yes responses and 1 no
2	No yes responses and 2 no
10	1 yes responses and 0 no
...	...
33	No yes responses and 1 no

And so here's what we have now:

	A	B	C	D	E	F	G	H	I	J
1		q0016_0001	q0016_0002	q0016_0003	q0017_0001	q0017_0002	q0017_0003
2	RESP	WHYYES01	WHYYES02	WHYYES03	WHYNO01	WHYNO02	WHYNO03	YESTEXT	NOTEXT	CODE
3	1	blah	blah	blah	blah	blah	blah	blah	blah	33

Step Two: Converting the Text Strings into Lists of Words

We will now separate all the words. Our basic approach will be to replace each space by a tab character. The tab character has the code 9 and is represented by the function char(9). We use the spreadsheet's substitute function to change spaces to tabs:

=substitute(text, oldtext, newtext)
=substitute(h3, " ", char(9))

But we want a few other things included with each word. Specifically, we want to tag each with what response it was in and whether it was a positive or negative response. Since we have columns containing this information we will include this with each separated word. Here is how we can convert the YESTEXT

If CODE is greater than 3 then we the YESTEXT is not blank
We want to put RESP, CODE after each word. We change all the blanks to these (with tabs in between fields) AND we add same
bits at the end of the string. And we stick in a linefeed (CHAR10) at the end of each word sequence.

=IF(j3>3,SUBSTITUTE(h3," ",CHAR(9)&TEXT(A3,"0000")&CHAR(9)&text(j3,"00")&CHAR(10)) & CHAR(9)&TEXT(A3,"0000")&CHAR(9)&text(j3,"00")&CHAR(10),"")

The result of this formula will look like this:

WORD<tab>RESP<tab>CODE<LF> ...

for each word. Use the same formula in a column for the negative response.

Next copy the column and paste unformatted text in a word processor. It should look like this:

growing<tab>0002<tab>30
economic<tab>0002<tab>30
injustice<tab>0002<tab>30
and<tab>0002<tab>30
gaps<tab>0002<tab>30
in<tab>0002<tab>30

We'll do a quick search and replace for end of line (^p in MSWord) replacing it with y^p for the positive response column data and n^p for the negative.

Then well copy the results back to a new worksheet in the spreadsheet.

Removing Common Words

Next we remove common English words like "in" or "to" or "the" from the data. We'll do this by trying to look each word up in a list of what we call "stop words." We copy such a list from a web source and then take a careful look at it to be sure it won't eliminate words that might NOT be irrelevant in the context of our work. We drop, for example, words like "not," "neither," "nor" since these are likely meaningful in people's descriptions of their motivations. The stop list we use is in the worksheet "common-english-words" in the sample data.

What we want to do is tag all the words in our data that we should drop from the data set. An inspection of the data shows that we might have blank words and we'd like to drop them too. So, we want a formula that will

Tag if this word is in stop list OR this word is blank

We detect whether the word is in the stop list by looking it up on the common-english-words worksheet using the VLOOKUP function. If the function returns an error (the word is NOT found) then it's a good word. And if it is not a blank then it is a good word. Thus

VLOOKUP returns an error AND this word is not a blank

Here's the formua in sketch form -- it's value is 1 if we have either a successful lookup or the word is blank:

=if(AND(iserror(VLOOKUP(), word<>""),0,1)

Here's what it will look like in all its gory glory:

=IF(ISERROR(VLOOKUP(B2,'common-english-words'!$A$2:$B$112,2,FALSE)),0,1)

Here is what our data looks like now:

seq	word	response	yesnotype	yesorno	stopword
1	growing	2	30	y	0
2		2	30	y	1
3	in	2	30	y	1

Step Five. Adding info to preserve context and dropping the stop words

We add a new column B headed wordorder and put this formula in B2:

	A	B	C	D	E	F	G
1	seq	wordorder	word	response	yesnotype	yesorno	skipword
2	1	=IF(D2=D1,B1+1,1)	growing	2	30	y	0

We'll copy this information and paste just values in a new worksheet and sort on column G so we can discard the data that corresponds to stop words.

SKIPPING AHEAD -- details on next few steps still to be filled in.

Small taste of the data:

Reasons Why NOT

Reasons WHY

2grams Why NOT

3Grams WHY

Almost there....

Current Version of Working File is herewhyyeswhyno.xlsx ORGS-why-participate-or-not.xlsx

This file built on text from questions 16 and 17 of ORGS survey. Sheet "nowords" contains concatenated (respondents had three open ended responses -- we join these together as one string of words) responses from the question about reasons a respondent did not participate and the "yeswords" sheet contains concatenated responses to the question about reasons a respondent did participate.

The sheet "common-english-words" is a list of stop words that we remove from the data.

The sheet "realwords" is the data that remains after cleaning. Fields described below.

seq	sequence of the word within a given respondents "utterance"
word	the actual text
response	number of the respondent
yesnotype	code showing how many yes responses and how many no responses this respondent gave
yesorno	is this word from a yes reason response or a no reason response
2gram	2 word sequence (after removal of stop words)
3gram	3 word sequence (after removal of stop words)
4gram	4 word sequence (after removal of stop words)
5gram	5 word sequence (after removal of stop words)