Erica's Final Project

by Erica Brody

29 Apr 2016

My project can be found at this GitHub repo: https://github.com/ericabrody/data_explore.git

Introduction

I chose to create a data analysis program because I think it is much more likely that I will use Python for data analysis than for video game creation in the future. At the beginning, I was not feeling particularly confident in my ability to create such a program; however, I committed myself to breaking the task down into small chunks and focusing on one chunk at a time. So, while my confidence was at its lowest, I just focused on selecting a dataset and getting it into Trinket.

Data Selection

Initially, I viewed this project as an opportunity to look at some “fun” data about an unfamiliar topic; however, I ultimately chose a familiar dataset that I had used extensively. I spent a lot of time considering a variety of datasets from kaggle.com, the National Bureau of Economic Research (http://www.nber.org/data/), and a catalog of government public-use health datasets, https://catalog.data.gov/dataset?_organization_limit=0&organization=hhs-gov#topic=health_navigation. The data.gov site was really fantastic, because I could search the many available datasets by a variety of facets, including data format and topic. While reviewing this repository was fun, I struggled to decide on a single dataset. In the end, I decided to use data from the Behavioral Risk Factor Surveillance System – BRFSS (http://www.cdc.gov/brfss/annual_data/annual_2014.html), reasoning that it might be helpful to use familiar data for a project where I was feeling uncertain about my ability to create the needed functionality.

Data Pre-processing

Pre-processing the BRFSS data involved a lot more time and work than I expected, because of the formats the data were available in and because the file was quite large. The data are available as fixed-width and SAS transport files. I considered using the fixed-width format, but, using my resources, my husband told me that working with fixed-width data is very tricky and prone to error. I have experience with SAS, but not with transport files, and I had trouble figuring out how to use SAS on the UNC virtual server. So, again using my resources, I talked to a friend at the Odum Institute who is a SAS programmer, and he helped me transform the 2014 and 2013 data files into csv format. But the resulting dataset was huge. I didn’t want to go back to my SAS friend, and I realized that I could use the desktop version of SAS on one of the Davis Library computers to pull out a small subset of variables and only the cases from Connecticut and North Carolina – the two states where I have lived the longest and states that are likely to have different health risk profiles.

Unfortunately, the file was still too big to use with Trinket, so after investing so much effort in getting the file into its current form, I decided to switch to Cloud 9 so I could move on from pre-processing. I used the GitHub and Cloud 9 information in the class notes, plus a little help from Eric, to set up my new GitHub repo, which was remarkably easy, and to connect it to Cloud 9 and start working with Python.

User interface functionality

I started building the program without a clear vision of the user interface functionality. Since I have a lot of experience with data analysis, my initial thoughts for the program were quite ambitious. I thought that the user might need to do data cleaning to remove cases with missing data and “refused” or “don’t know” responses. The user would want to see frequency tables of all the variables to figure this out. Then, somehow, my program would need the functionality to remove cases based on the user’s requirements. This seemed very complicated and would have made for a laborious experience, since it would require a lot of back and forth between the user and the program before the user got to see what they wanted to see.

On the day before the second project update, I focused on the user interface and realized that it could be MUCH simpler than I had envisioned, since I didn’t need to recreate a data analysis program like SAS. I just needed to show users some statistics. The resulting interface is perhaps a bit more simplistic than ideal, really limiting what users can do and see in the data; however, my new plan felt doable, and my confidence in my ability to deliver a working product increased.

Python Programming

I started building the program in one Python module (main.py), and after I had a few functions the program was getting messy, so I added funct.py to contain the functions and helpers.py to contain a data dictionary. I think this helped my thinking process on the programming, since it cleared things up, like working on a clean (vs. cluttered) desk.
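As a rough illustration of that split (the names imported here are placeholders for this write-up, not my exact code), main.py just pulls in what it needs from the other two modules:

```python
# main.py -- a minimal sketch of the three-module layout described above.
# The names inside funct.py and helpers.py are illustrative placeholders.
from funct import frequency, print_table
from helpers import data_dictionary   # e.g., maps variable codes to labels

def main():
    print(data_dictionary)             # stand-in for the real menu-driven program

if __name__ == "__main__":
    main()
```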

I started out by reading in the 2014 dataset as a list of lists, because this is the data structure I have used for many years. However, when I was thinking about how to bring in multiple datasets and accommodate a “bring your own data” option, I was concerned about how that would work. I also noticed that the order of the variables in my 2013 and 2014 datasets was different, so the order of variables in a “bring your own” dataset could also differ, which would require tedious back and forth with the user about which column each variable is in. Then I remembered that dictionaries are not in any particular order. I had also started reading the documentation for Python’s csv module, since I was using a csv file and was curious about the functionality available. There I found DictReader, which seemed like an easy way to read the dataset in as a dictionary of dictionaries with the variable names as keys.
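Here is a minimal sketch of that approach; the file name and the "SEQNO" id column are assumptions for illustration, not the actual BRFSS layout:

```python
import csv

def load_data(filename, id_field="SEQNO"):
    """Read a csv file into a dictionary of dictionaries keyed by an id column."""
    data = {}
    with open(filename, newline="") as f:
        reader = csv.DictReader(f)      # each row becomes a dict keyed by column name
        for row in reader:
            data[row[id_field]] = row   # outer dict keyed by the respondent id
    return data

# dataset = load_data("brfss2014_subset.csv")
```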

To prepare to use my dictionary-of-dictionaries data structure and build my confidence with this aspect of Python, I reviewed the dictionary chapter in our text, the dictionary exercises we did in class, and the dictionary information in the book Getting Started with Python, which introduced the get method of dictionaries; I found get very helpful.
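For example (with made-up response codes), get lets you tally counts without first checking whether a key exists:

```python
# counts.get(response, 0) returns 0 instead of raising a KeyError the first
# time a response value is seen, which makes tallying very compact.
counts = {}
for response in ["1", "2", "1", "9"]:
    counts[response] = counts.get(response, 0) + 1
print(counts)   # {'1': 2, '2': 1, '9': 1}
```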

I built the data analysis functionality first, since I thought those parts would be the hardest; I had less experience using Python for data analysis, and I assumed that the programming for the user interface would be easier since I had done that before in my turtle game and numerous homework assignments.

The first analysis milestone I tackled was getting a frequency distribution of a single variable and printing out a table and histogram of the results. Working on these three functions helped me create the process I would use to build all of the data analysis functions in Python. Using pen and paper, I drew out an example of the structure of the data I was working with and a sketch of what I wanted to get out of the function, i.e., the data structure of a “result” dataset, a nicely formatted printout of the data, or a visualization of the data (e.g., a histogram). This helped me really focus on what I needed to do. In addition, I got each function working with a specific dataset and variable(s) first, and then I used those lines of code to create a function that would work with any of the variables in the dataset. I saved some of the code that I wrote to work with specific variables in “earlycode.py” for your reference.
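A rough sketch of what those three functions might look like, built on the dictionary-of-dictionaries structure above (an illustration, not my exact code):

```python
def frequency(data, varname):
    """Count how many times each response value of one variable occurs."""
    counts = {}
    for row in data.values():
        value = row.get(varname, "")
        counts[value] = counts.get(value, 0) + 1
    return counts

def print_table(counts):
    """Print a simple two-column frequency table, sorted by response value."""
    for value in sorted(counts):
        print("{:>10} {:>8}".format(value, counts[value]))

def print_histogram(counts, scale=1):
    """Print a text histogram: one row of asterisks per response value."""
    for value in sorted(counts):
        print("{:>10} {}".format(value, "*" * (counts[value] // scale)))
```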

Having such a methodical approach to the programming was really helpful. It allowed me to break the coding process down into really small bits that I could time-box effectively. This process also helped me feel more comfortable with the programming tasks. I spent a lot of time getting the first three functions to work, with a lot of breaks to clear my head, but they served as a good foundation for the rest of the work. When I would get very frustrated with a problem or just not know how to start, I would search the Python documentation on the internet to try to figure it out. For example, I googled “Python sorting,” Google offered the autocomplete option “Python sorting dictionary by value,” and that led me to this page (https://docs.python.org/3/howto/sorting.html). I had to read the documentation a few times, and I ended up using some code with “lambda” without being totally sure how lambda worked, but it helped me print my frequency table the way I wanted, with the response values in order rather than the highest value first. Once, when I got totally frustrated trying to format the output for the univariate statistics of the continuous variables, Eric recommended that I look up Python format strings, and I found some helpful resources – https://pyformat.info and http://www.python-course.eu/python3_formatted_output.php.
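Two small illustrations of those pieces, with made-up numbers: sorting dictionary items with a lambda key, and lining up output with a format string:

```python
counts = {"2": 40, "1": 125, "9": 3}

# sorted() with a lambda key orders the (value, count) pairs by the response
# value itself, instead of leaving them in whatever order the dict produces.
for value, count in sorted(counts.items(), key=lambda item: item[0]):
    print("{:>8} {:>8}".format(value, count))

# Format strings also help line up continuous statistics, e.g. printing a
# (made-up) mean rounded to one decimal place in a fixed-width field.
print("Mean BMI: {:6.1f}".format(27.4568))
```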

I also had to do a lot of trial and error. Every time I got some code together to run, I would get at least one error – a missing colon or something similar, and often bigger errors. I appreciated that Cloud 9 is fast, so I could keep rerunning the program with small changes. Throughout the process, my ability to figure out what the errors meant, or at least how to troubleshoot them, got better. I printed a lot of ugly output to find mistakes.

I think the most valuable problem-solving approach I used was to walk away from the project when I got stuck. Surprisingly, I had to walk away from both the computer and my paper-and-pencil work. In lieu of the E. Hauser shower approach, I took a lot of little naps. Or tried to: once I lay down and closed my eyes, my brain would keep working on the problem, and frequently I would get up after about 10 minutes with an idea of how to fix the code. Even if that initial idea didn’t work, it would get me one step closer to the solution.

Doing all the hard work first made the last part of the coding, pulling all the functions together into the user interface, blissfully easy. It felt pretty amazing when I hooked up all the functions to the menus and the program worked, even when Eric tested it and tried to break it and produce an error.
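For a sense of what that hookup looks like, here is a minimal menu loop; the option labels are invented, and the frequency/print functions are the illustrative ones sketched earlier, not the actual program:

```python
def run_menu(data):
    """A simple text menu that ties the analysis functions together."""
    while True:
        print("\n1. Frequency table")
        print("2. Histogram")
        print("3. Quit")
        choice = input("Choose an option: ").strip()
        if choice == "1":
            print_table(frequency(data, input("Variable name: ").strip()))
        elif choice == "2":
            print_histogram(frequency(data, input("Variable name: ").strip()))
        elif choice == "3":
            break
        else:
            print("Please enter 1, 2, or 3.")   # handle unexpected input gracefully
```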

I am a library science student, focused on health information. I am dabbling in information science this semester. Find Erica Brody on Twitter, GitHub, and on the web.