Natasha's Final Project

by Natasha Vazquez

22 Jun 2017

Here is my final project:

For my final project I decided I wanted to do the data analysis project. I started by downloading test data for inspiration on how to get started with the code. There were some Github posts and Reddit subs on free datasets online. Using that as a springboard I decided to use WHO data because I’m into public health. The first data file that seemed interesting to me had data broken down by region and year, so I decided to make a tool based on those.

This was the final, revised milestone list:

  • read user file (csv format)
  • read headers in user file
  • pull relevant variable based on user input (based on file headers)
  • create menu for 3 analysis options
  • create analysis one skeleton code
  • menu/user instructions
  • accept user input for necessary information
  • present available regions from data
  • present available years from data
  • present available variables from data
  • create analysis two skeleton code
  • create analysis three skeleton code
  • create visualization for analysis one
  • create visualization for analysis two
  • create visualization for analysis three
  • incorporate regular expressions to allow flexibility on user input

Stretch goals:

  • implement capability to do multi-region analysis for percent change
  • make the descriptive statistics analysis flexible to compute either one region or multiple regions so it can be one menu option
  • allow percent change (analysis one) to be computed at the intervening intervals between the user’s chosen end and start date

First, I sat down and mapped out all the things I would need to do to get started on the data analysis. Those were: open the file; read the lines of the file; and pull the header row of the file (first row in the file). Then I got down to coding how to find the specific columns for region and year. Because I wanted this program to be a little flexible on the input data file, I created the program such that the it looks for the titles “region” and “year” in the header row of the data file; this means that the user doesn’t have to reorganize/revise the data file too much before opening it in the program. After getting all that set up, I added in the user interface that allowed the user to control what kind of analysis they wanted and worked on the code for the types of analyses. This also involved a lot of thought on aesthetics/how the text-based interface and results are presented to the user. As part of this, I worked on how to present the results of the analysis to the user. I was able to take the data, add it to a list of lists, and present it in a tabular format. After thinking about different text-based ways to present the data, I decided to do a graph using matplot. At the time I began working on that code, Trinket wasn’t supported Pygal for Python 3, so that’s why I went with matplot. I decided to plot all data over time because that made the most sense since the types of data I anticipate using the program for are over time; also, there would only be two data points for a percent change analysis and nothing to really show for the descriptive stats. Also, as a note, graphing the data actually helped me to find an issue with my data file! I was able to visually see that something wasn’t quite right and was able to go in and fix that.

I ended the process by cleaning up my code and doing some additional testing. This included going through and leaving comments in my code for readability, because I hadn’t done so everywhere when I first began. I also went through and tried to handle errors as much as possible; for example, after talking with my group during class one day (specifically talking about Aaron’s code), I decided to add some instructions for the user as well as code that would check to make sure that the user only entered a CSV file. Also, after discussing some problems I was encountering in my program with Elliott, I learned that I had done for loops non-idiomatically in some places, so I went through and revised those where I could.

The main change in my milestone list over time was that I originally thought my program would have 3 separate types of analyses, but I ended up merging the second and third analysis as a result of implementing my stretch goal of doing multi-region analysis. There were some smaller goals/changes made along the way; for example, after talking to my group for the first time I decided to not only present the regions and years represented in the data file, but also the variables (i.e., all the remaining headers). Another example is that after I created the graph, I realized that if the user did a multi-region analysis, they would need a legend; so I read through a lot of the examples available on matplot documentation and tested out some of the code on my program. The only milestone I wasn’t able to successfully “check off” was the last one: incorporating regular expressions to allow flexibility on user input. While I was able to get this working for the set up of the file (i.e., finding the location of the region and year columns regardless of case), I had some trouble getting it to work with the user input. I also didn’t get to my last stretch goal. Overall, I’m proud of the work I did on this project! I got to use and explore a lot (if not all) of the concepts we learned about in class and also learned a little about matplot.

Problems/challenges I faced during this project:

  • Controlling flow of the program/remember what functions were returning.
    • For example, when I was working on the beginning set up of the program (opening the file and finding the headers), I kept forgetting that I needed to first run the function that populates the list of headers (which would) before doing any other work with the headers. -Another example of this is with building the graph in matplot. Because it takes a while for the graph to build and print out to the user, I tested out different places in the code of where it should occur to have the best presentation possible. I ended up adding it into the codes for the analyses, BEFORE printing out the data. This allows the printout of the data to pop up ahead of the graph without too much of a delay in what’s shown to the user. I also decided to added a time.sleep() command after doing the analyses so that the prompts don’t show up too quickly (i.e., before the graph shows up). -Changing data sets. -When I first started building the program, I just used one data set, so I was only testing out the functions on that data set. Once I introduced another dataset I was faced with some issues since the datasets were formatted a little differently. However, that ultimately made my program stronger because now it should be able to handle different datasets. -Breaking up the code into modules -While this was fairly straightforward for the most part, there was one function/variable (getting the variable to result) that logically should have worked in one of two different modules (setup or analysis), but I could only get the program to work in the analysis module so that’s where it ended up, even though it arguably “fit in” better with the functions in the setup module. -Working in a custom class -Because I had already done a lot of the work on the project before we learned about extending or creating classes, I had to kind of force in a custom class into the code. But I ended up extending the class List to Data, that allowed me to do some functions on lists that I knew I’d be doing fairly frequently in the code.
  • Implementing flexibility in user input -Having not done much work with regular expressions previously, I re-read the chapter on regex as well as read much of the Python documentation for regular expressions.

Skills/attitudes I used during this project:

  • Phone a friend (getting fresh eyes on code issues I was having; see project update 1 post for example)
  • Reading up on the methods on Google or the respective documentation (i.e., Python, matplotlib, etc.)
  • Trying things out on smaller trinkets first (this was particularly useful if it could parse out the code, not as useful if it was too inter-connected to other pieces of code)
  • Printing the output as I go along to make sure everything was working as is.
  • Writing out logic/flow on pen and paper when I was stuck.

Lessons learned from this project:

  • Don’t over-diagnose problems: -After talking through an issue I was having with Elliott, because I had “over-diagnosed” an issue, I overlooked the issue that was causing the problem! -Read more of the documentation of a module before using it if you can
    • If I had read more of the documentation for matplot I might’ve been able to figure out the issue I was having with clearing the figure -Make sure to know how variables are stored -I was trying to compare two variables that were stored differently (one was stored as a string and the other was stored as a )

If I were to continue work on this project, my next steps would be to:

  • Attain those last two goals!
    • I really would like to implement some flexibility on the user input so that they wouldn’t have to waste so much time ensuring they’re typing in everything exactly correctly, but I don’t like I know enough about regular expressions at this point to have that work in a simple way.
    • I would also like to attain my stretch goal of calculating the percent change for the intervening years between the start and end date the user entered. I think that would be a cool/useful feature.
  • Simplify the code.
    • I still think there are places I could possibly simplify the code.
  • Add a menu option that allows the user to open a new file. -In its current state, the user would have to exit the program then enter a new file.
Natasha is an MS student in information science at UNC Chapel Hill and a public health analyst in the Center for Communication Science at RTI International. She likes traveling and beer-ing. Find Natasha Vazquez on Twitter, Github, and on the web.