
Thu-Mai's Pipes and Filters

by Thu-Mai Christian

07 Mar 2014

This week, we learned some pretty cool Unix commands to help understand how we can combine programs to create more efficient code. Going through the Pipes and Filters exercise reminded me a bit of programming in R in terms of generating file objects. It's interesting to see how each exercise has actually given me a greater ability to decode unfamiliar programs. I guess fluency is the recurring theme. No need for intimidation; it seems that learning different programming languages is all about putting the pieces together. No matter the shape of each piece, they all fit together...somehow.

For this week's exercise, I re-read the assignment carefully to get familiar with each of the pieces of the shell commands the lesson introduced. For the pipes, I wanted to see what each process did, so I broke the command apart, then added each process back one at a time. This was a very useful approach not only for figuring out what each process did, but also for seeing how each behaved when appended to the command. It was easy to see how each process applied to the output generated by the preceding one. Fun!

Challenges

1. Explain the effect of the -n flag in the sort -n command.

By including -n in the sort command, the inputs are treated as numerical values rather than strings. Thus, the inputs are sorted numerically. Without -n, the inputs are treated as strings and are ordered as such.
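A quick way to see the difference is to sort the same small file both ways (the numbers here are made-up sample values, not from the lesson's files):

```shell
# Made-up sample numbers, one per line
printf '10\n2\n19\n22\n6\n' > numbers.txt

sort numbers.txt      # string sort: 10, 19, 2, 22, 6 (compares character by character)
sort -n numbers.txt   # numeric sort: 2, 6, 10, 19, 22
```

In the string sort, 10 comes before 2 because the character 1 sorts before the character 2.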


2. What is the difference between wc -l < mydata.dat and wc -l mydata.dat?

Both commands output the number of lines in the mydata.dat file, but they deliver the file to wc differently. With wc -l mydata.dat, wc itself opens and reads the file, and its output includes the filename. With wc -l < mydata.dat, the shell opens the file and sends its contents to wc's standard input, so wc never sees the filename and prints only the count. This is referred to as redirecting input.
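The difference is easy to see with a small stand-in file (the three lines here are arbitrary):

```shell
# A three-line stand-in for mydata.dat
printf 'a\nb\nc\n' > mydata.dat

wc -l mydata.dat     # wc opens the file itself; the output includes the filename
wc -l < mydata.dat   # the shell feeds the contents to wc's stdin; only the count is printed
```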


3. Why do you think uniq only removes adjacent duplicated lines? What other command could you combine with it to remove all duplicated lines?

Rather than reading the contents of the entire file before removing duplicates, uniq compares each line only to the line immediately before it as it reads. When considering large data files, reading through the entire file and comparing each line to every other line would likely take a prohibitive amount of processing effort and time. By sorting the list first, all duplicates become adjacent, so uniq can remove them in a single pass down the list.

To remove all duplicated lines, you can combine uniq with sort in a pipe: sort salmon.txt | uniq first sorts the list so that duplicates are consecutive, and uniq then recognizes and removes the adjacent duplicate lines. The single command sort -u salmon.txt does the same thing in one step. (Note that uniq -u does something different: it discards every line that is duplicated, rather than keeping one copy of each.) The output of these commands is:

coho
steelhead
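The difference between running uniq alone and sorting first shows up with a stand-in salmon.txt whose duplicates are out of order (the lesson's exact file contents are assumed here):

```shell
# Stand-in salmon.txt with out-of-order duplicates
printf 'coho\nsteelhead\ncoho\nsteelhead\nsteelhead\n' > salmon.txt

uniq salmon.txt          # collapses only adjacent repeats: coho steelhead coho steelhead
sort salmon.txt | uniq   # sorting makes all duplicates adjacent: coho steelhead
```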

4. What text passes through each of the pipes and the final redirect in the pipeline cat animals.txt | head -5 | tail -3 | sort -r > final.txt?

This pipeline tells the shell to take the first 5 lines of animals.txt, then the last 3 of those (lines 3 through 5), reverse-sort them, and redirect the result into the file final.txt, which contains the output text below:

2012-11-06,rabbit
2012-11-06,deer
2012-11-05,raccoon
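To trace the text passing through each pipe, each stage can be run on its own. The animals.txt below is reconstructed from the final output above; only lines 3-5 are implied by final.txt, so the first two rows are assumptions:

```shell
# First five lines of animals.txt, reconstructed from the output above
# (the first two rows are assumed; only lines 3-5 are implied by final.txt)
printf '2012-11-05,deer\n2012-11-05,rabbit\n2012-11-05,raccoon\n2012-11-06,rabbit\n2012-11-06,deer\n' > animals.txt

cat animals.txt | head -5             # first pipe: the first five lines
cat animals.txt | head -5 | tail -3   # second pipe: lines 3-5
cat animals.txt | head -5 | tail -3 | sort -r > final.txt   # reverse-sorted into final.txt
cat final.txt
```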

5. What other command(s) could be added to cut -d , -f 2 animals.txt in a pipeline to find out what animals the file contains (without any duplicates in their names)?

A deduplicated list of animals can be generated using the pipeline command: cut -d , -f 2 animals.txt | sort -u. The sort -u command sorts the output of the preceding cut command and keeps only the unique lines; appending | sort | uniq instead produces the same result. The output is:

bear
deer
fox
rabbit
raccoon
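Both variants can be checked against a reconstructed animals.txt (the animal names come from the output above; the dates are illustrative assumptions):

```shell
# Reconstructed animals.txt; the dates are illustrative
printf '2012-11-05,deer\n2012-11-05,rabbit\n2012-11-05,raccoon\n2012-11-06,rabbit\n2012-11-06,deer\n2012-11-06,fox\n2012-11-07,rabbit\n2012-11-07,bear\n' > animals.txt

cut -d , -f 2 animals.txt | sort -u       # sort and deduplicate in one step
cut -d , -f 2 animals.txt | sort | uniq   # equivalent two-command pipeline
```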


Thu-Mai is a SILS PhD student and a data archivist at the Odum Institute. She will solve all of your data management problems. Find Thu-Mai Christian on Twitter, Github, and on the web.