Handling data

Monday, 31 October 2016

Today we are going to discuss the creation of data and learn how to manipulate data structures.

We will learn a bit about using pipes and output redirection, and we will pick up some commands for working with data.

Data

So, data?

What is data?

Rather, we should ask: "What are data?"

datum, data, n - something given (past participle of the verb, dare, "to give").

Where does it come from? What do we use it for? What does it all mean?

The major question that we are going to be asking ourselves here is "How are we going to get data into and out of different formats?"

We will start with lists of similar data and then move to structured and ordered sets of lists (tables).

Eventually we will consider linked sets of data in the form of databases.

Raw data

"Raw" data is sort of an oxymoron. There is very little data available that is actually really raw in the sense that it has not been touched, manipulated, massaged, curated, or cleaned by some human intervention.

Remember, even data that is available on the web is not raw; it is text that we have marked up and structured in specific ways. However, web data can stand in as an analog for raw data.

The process through which we might gather data via the web is referred to as "scraping." A "scraper" is a program that reaches out into the web and grabs all of the text (including markup) available at a URL and saves it in some meaningfully structured way.

We're not going to dig into web-scraping too much, but I want you to be aware of how data can be gathered on the web.

One tool that can be used for web scraping is our friend, wget. We've used it to download remote files, but it can also be used to get whole websites and all of the data linked from them.

This can be useful for mirroring a website. It can also be useful in aggregating unstructured data so that it might be manipulated into structured data.
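For example, here is a sketch of a mirroring command (the URL is just a placeholder); these options tell wget to recurse through a site, rewrite the links so they work locally, and fetch the images and stylesheets that each page needs:

wget --mirror --convert-links --page-requisites --no-parent http://example.com/

Be polite with this: mirroring a large site downloads a lot of files.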

Structured data

One simple format for structured data is a table.

Rows in the table represent individual cases or instances of something.

Columns represent variables.

What is the difference?

In the data that we are going to create in class, our rows will represent individual people. The information contained in these rows will be given ("datum, a thing given") to us by every member of this class. The columns will represent a specifically defined aspect of data that we gather about every individual person.
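For example, here is a tiny preview of the table we will build today, using my own row as an example:

gh-username,height,wakeup,semesters-left,hometown-distance
eah13,175.26,06:00,2,344.4

The second line is a row: it describes one person (me). Each comma-separated position belongs to a column: a variable we record about every person.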

We will start with making our own individual lists and then aggregate them.

The humble and mighty CSV

Lists

We'll start with a list of data.

Create an assignment 4 workspace on Cloud9. Open a new file and name it with your GitHub user account and the extension .list.

Mine will be eah13.list.

Inside the file, I want you to give one-word or numerical answers to the following questions, in this order, each on their own line: your GitHub username, your height, the time you woke up this morning, how many semesters you have left, and the distance to your hometown.

If any answer doesn't apply to you, type NA ("not applicable").

My file will look like this:

eah13
175.26
06:00
2
344.4

We now have semi-structured data! Very simple.

Comma Separated Values (CSV)

Now that we have listed some information about ourselves, let's try to aggregate our data.

If we simply put all of our data together as it is, we will end up with one very long list that is difficult to use in any meaningful way. If instead we flip each list on its side, so that it becomes a single line, we can stack everyone's data together. We can separate the elements in that line with commas (or tabs, semicolons, pipe characters, or some other delimiter), and then we will have one row of what will become a Comma Separated Values file: structured data.

We can do this by hand, but that is boring.

Let's learn a command to do this:

paste -d, -s example.list

paste reads the lines of a file and writes them back out separated by a delimiter (a tab, by default). In this case we are asking it to separate the lines with a comma (-d,). The -s flag tells paste to work serially, joining all of the lines of a single file onto one output line, rather than in parallel across multiple files.

So the standard output (STDOUT) from the above command will be a single comma-separated row. Run on my eah13.list, that row is:

eah13,175.26,06:00,2,344.4

Each position in the row corresponds to one of our columns:

gh-username,height,wakeup,semesters-left,hometown-distance
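As an aside, without -s, paste works in parallel across multiple files: it joins the first line of each file onto one output line, then the second lines, and so on. A hypothetical example with two list files:

paste -d, eah13.list tfrahm.list

That form is handy for putting lists side by side, but for turning one list into one row, -s is what we want.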

Output redirection

To get this into a file, we will use one of several forms of output redirection.

Output redirection is simple. It merely takes the output of one command and writes it into a file. We can then use other programs to manipulate that output.

For example:

paste -d, -s example.list > example.csv

This will take the output from the first part of the command and overwrite the CSV file specified in the second part.

Sometimes you don't want to overwrite a file each time. In that case, this command will append the output to the file instead of overwriting it:

paste -d, -s example.list >> example.csv
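You can check what each form did by looking at the file afterwards:

cat example.csv       # print the file's contents
wc -l example.csv     # count how many lines it now contains

Run the append version twice and you will see the same row show up twice.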

Pipes

A "pipe" is an operator that tells a program to take its input from the output of another program. You'll find it on your keyboard as the | character (SHIFT+\ on most US keyboards).

Pipes translate the output of one program (STDOUT) into being input for another program (STDIN).

For example, if we wanted to count how many lines were in our csv file, we could run:

cat example.csv | wc -l
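Pipes can chain more than two programs together. For example, a hypothetical check on an aggregated file (the filename is a placeholder) that counts how many rows contain an NA answer:

cat example-all.csv | grep NA | wc -l

Here grep keeps only the lines containing "NA", and wc -l counts how many lines it was handed.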

Groups

We're grouping up for the next assignment. Here are the groups:


HigFig
tfrahm
kelhammer
===
gma96
minorfires
jpanken
===
aehaney
brynnaw
ErinGray19
===
celineyuwono
cjayscue
ohreagano
===
colergibson
gavvy
jamiemramos
===
emmacai
jpueb96
dylanjtastet
===
efcline
sanjkris
sarecht
===
danielevanday
cltomli
ectomli
===
pillaim
ldinkins

For this first part of the assignment it's very important that you each use unique filenames and don't modify each other's files. Otherwise we'll get conflicts, which will be a bummer. When you're in your groups:

Everyone should now have a file for each group member, named after their github username. Once you're all there, move on:

Your repository (and your Cloud9 workspaces) should have one base CSV file for each group member and one `-all` csv file for each member.
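As a rough sketch of what that looks like on the command line (the usernames here are placeholders for your own group's), each member turns their list into a one-line CSV and then appends every member's row into their own `-all` file:

paste -d, -s yourname.list > yourname.csv
cat yourname.csv > yourname-all.csv
cat teammate1.csv >> yourname-all.csv
cat teammate2.csv >> yourname-all.csv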

Alright we've got some basic collaboration done, hopefully without conflicts! You can use the rest of class to work on Assignment 4.

For next time

Next time, we are going to work in groups to learn to create and aggregate data using scripts. In your groups, you will write a script that asks the above questions of the user and then appends their answers to a CSV file. This will be the basis of the next assignment, which will be a group assignment.
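As a preview, here is a minimal sketch of such a script in bash (the prompts and the class-data.csv filename are my own placeholders, not the assignment spec):

#!/bin/bash
# Ask each question and read the user's answer
read -p "GitHub username: " username
read -p "Height: " height
read -p "Time you woke up (HH:MM): " wakeup
read -p "Semesters left: " semesters
read -p "Distance to your hometown: " distance
# Append the answers as one comma-separated row to the CSV
echo "$username,$height,$wakeup,$semesters,$distance" >> class-data.csv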

I would like you to review some commands for working with a CSV file, including how pipes work: Connelly, Brian. "Working with CSVs on the Command Line." bconnelly.net. Last modified September 23, 2013. http://bconnelly.net/working-with-csvs-on-the-command-line/.

I would also like you to watch the following video on working with CSV files. I think that it might be very helpful for those of you who are interested in the extra credit. Try watching it once and then following along a second time.

