This week, we learned some pretty cool Unix commands to help understand how we can combine programs to create more efficient code. Going through the Pipes and Filters exercise reminded me a bit of programming in R in terms of generating file objects. It's interesting to see how each exercise has actually given me a greater ability to decode unfamiliar programs. I guess fluency is the recurring theme. No need for intimidation; it seems that learning different programming languages is all about putting the pieces together. No matter the shape of each piece, they all fit together...somehow.
For this week's exercise, I re-read the assignment carefully to get familiar with each of the pieces of the shell commands the lesson introduced. For the pipes, I wanted to see what each process did, so I broke the command apart, then added each process back one at a time. This was a very useful approach not only for figuring out what each process did, but also for seeing how each behaved when appended to the command. It was easy to see how each process operated on the output of the preceding one. Fun!
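As a rough sketch of that approach, the pipeline from the final challenge can be rebuilt one stage at a time, with each partial result printing to the screen (the intermediate text depends on the contents of animals.txt, which comes from the lesson):

$ cat animals.txt                                # the raw file
$ cat animals.txt | head -5                      # only the first 5 lines
$ cat animals.txt | head -5 | tail -3            # only the last 3 of those
$ cat animals.txt | head -5 | tail -3 | sort -r  # reverse-sorted, ready to redirect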
Challenges
1. Explain the effect of -n on the sort command.
By including -n in the sort command, the inputs are treated as numerical values rather than strings and are therefore sorted numerically. Without -n, the inputs are treated as strings and are ordered as such.
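A quick illustration, using a hypothetical numbers.txt that contains the lines 10, 2, and 19:

$ sort numbers.txt      # string sort: compares character by character
10
19
2
$ sort -n numbers.txt   # numeric sort: compares values
2
10
19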
2. What is the difference between wc -l < mydata.dat and wc -l mydata.dat?
To output the number of lines in the mydata.dat file, we use the wc -l mydata.dat command; here wc opens the file itself and prints the filename next to the count. By adding < to create the command wc -l < mydata.dat, it is the shell that opens mydata.dat and sends its contents to wc's standard input, so wc only ever sees a stream of text and prints the count alone. This is referred to as redirecting input.
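A sketch of the difference, assuming a hypothetical mydata.dat containing three lines (the exact spacing of wc's output varies by system):

$ wc -l mydata.dat     # wc opens the file itself, so it can report the name
3 mydata.dat
$ wc -l < mydata.dat   # wc reads standard input and has no filename to report
3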
3. Why do you think uniq only removes adjacent duplicated lines? What other command could you combine with it to remove all duplicated lines?
Rather than reading the contents of the entire file before removing duplicates, uniq compares each line only with the line immediately before it as it reads. For large data files, reading the whole file and comparing every line to every other line would likely take a prohibitive amount of memory and time. By sorting the list first, every duplicate ends up adjacent, so uniq can do its work in a single pass down the list.
A tempting shortcut is to add -u and run uniq -u salmon.txt, but that does something different: it prints only the lines that are never repeated, and it still only compares adjacent lines. The sort -u salmon.txt command, on the other hand, removes all duplicates in a single step.
In a pipe, you could use sort salmon.txt | uniq, which first sorts the list so that duplicates are consecutive; uniq then recognizes and removes the adjacent duplicate lines.
The output of both sort -u salmon.txt and sort salmon.txt | uniq is:
coho
steelhead
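As a sketch of why the sort matters, assuming salmon.txt holds the lesson's six lines (coho, coho, steelhead, coho, steelhead, steelhead):

$ uniq salmon.txt              # only adjacent duplicates collapse
coho
steelhead
coho
steelhead
$ sort salmon.txt | uniq       # sorting first makes every duplicate adjacent
coho
steelhead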
4. What text passes through each of the pipes and the final redirect in the pipeline cat animals.txt | head -5 | tail -3 | sort -r > final.txt?
This pipeline tells the shell to take the first 5 lines of animals.txt, keep the last 3 of those, reverse-sort them, and write the result to the file final.txt, which contains the output text below:
2012-11-06,rabbit
2012-11-06,deer
2012-11-05,raccoon
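The staged commands near the top of this post show what passes through each pipe along the way; to confirm what the final redirect wrote, you can run the full pipeline and print the resulting file back out:

$ cat animals.txt | head -5 | tail -3 | sort -r > final.txt
$ cat final.txt
2012-11-06,rabbit
2012-11-06,deer
2012-11-05,raccoon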
5. What other command(s) could be added to cut -d , -f 2 animals.txt in a pipeline to find out what animals the file contains (without any duplicates in their names)?
A deduped list of animals can be generated using the pipeline cut -d , -f 2 animals.txt | sort -u. Here sort -u sorts the output of the preceding cut command and keeps only the unique lines, resulting in the following output:
bear
deer
fox
rabbit
raccoon
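For comparison, an equivalent pipeline spells out the sort-then-uniq combination from challenge 3, and tacking -c onto uniq (not required by the exercise, but handy) counts how many times each animal appears:

$ cut -d , -f 2 animals.txt | sort | uniq      # same deduplicated list as sort -u
$ cut -d , -f 2 animals.txt | sort | uniq -c   # each animal prefixed with its count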