Data Organization in Spreadsheets for Social Scientists: Instructor Notes

Instructor notes

Lesson motivation and learning objectives

The purpose of this lesson is not to teach how to do data analysis in spreadsheets, but to teach good data organization and how to do some data cleaning and quality control in a spreadsheet program.

Lesson design

Introduction

Formatting data

Common formatting problems

Dates as data

Quality assurance

The challenge with this lesson is that the instructor’s version of the spreadsheet software is going to look different than about half the room’s. It makes it challenging to show where you can find menu options and navigate through.

Instead discuss the concepts of quality control, and how things like sorting can help you find outliers in your data.

Exporting data

Concluding points

Technical tips and tricks

Provide information on setting up your environment for learners to view your live coding (increasing text size, changing text color, etc), as well as general recommendations for working with coding tools to best suit the learning environment.

Common problems

Excel looks and acts different on different operating systems

The main challenge with this lesson is that Excel looks very different and how you do things is even different between Mac and PC, and between different versions of Excel. So, the presenter’s environment will only be the same as some of the learners.

We need better notes and screenshots of how things work on both Mac and PC. But we likely won’t be able to cover all the different versions of Excel.

If you have a helper who has experience with the other OS than you, it would be good to prep them to help with this lesson and tell how people to do things in the other OS.

Apple Numbers

Apple Numbers does not have data validation, which is needed for part of this lesson. A note is included in the setup instructions pointing Numbers users to either Microsoft Excel or LibreOffice.

People are not interactive or responsive on the Exercise

This lesson depends on people working on the exercise and responding with things that are fixed. If your audience is reluctant to participate, start out with some things on your own, or ask a helper for their answers. This generally gets even a reluctant audience started.

Common questions raised by participants

How do you extract date components from the interview_date field in SAFI_clean.csv?

The interview_date field in SAFI_clean.csv when saved to SAFI_clean.xlsx is difficult to manage because there isn’t a way to format the column as a date field, even using the custom field formats. The easiest solution to this question is to show the student how to extract the date information from the field. Make a new column and format it as a date. In the first cell of the new column type =LEFT(C2,10) and then apply this to the column. This function extracts the first 10 characters from the left side of the interview_date field and inserts them into a new column.

How would you automatically transform the items_owned field into a usable format?

If you are not following the course immediately with the OpenRefine lesson it is important to make it clear that in the current format SAFI_clean.csv is not ready for analysis. The items_owned column ideally needs to be split into separate yes / no / null columns. Example: set up a new column ‘bicycle’ and format it as a number. You then need to extract information from the items_owned column about whether the word ‘bicycle’ is in the column. One way of doing this is to use an IF statement: =IF(ISNUMBER(SEARCH(“bicycle”,K2))1,0). The IF statement can include a wild character e.g. “bicy*”.