Friday, May 20, 2011

How to Mine 23andMe Data: Part 3

A Note on Choice of Language:


I'm going to cheat a little bit. Taking my own advice from my post, "Bioinformatics Programming Like Experts," I've found it much simpler to answer my next few questions using R. R has many sophisticated statistical methods built in, so applying them to data is trivial.


What I've Done: Principal Component Analysis:


I've performed principal component analysis on my family's 23andMe data. In a nutshell, principal component analysis re-expresses multi-dimensional data along a new set of perpendicular axes, called principal components, ordered by how much of the variance in the data each one captures. Thus, the first principal component is the direction which accounts for the most variation in a dataset and the last principal component is the direction which accounts for the least. The process of obtaining these numbers is very involved.
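To make the idea concrete, here is a minimal sketch of PCA in plain Python. It is not the R code behind this post. It assumes each genotype has already been encoded as a minor-allele count (0, 1 or 2), and the toy matrix and every function name are invented for illustration. It finds each component by power iteration, which is one simple way to get the directions of largest variance.

```python
import math
import random

# Hypothetical toy data: rows are people, columns are SNPs, values are
# minor-allele counts (0, 1 or 2). These numbers are invented for illustration.
people = ["me", "sister", "mom", "dad", "grandfather"]
genotypes = [
    [1, 1, 0, 2, 1, 1, 0, 1],
    [1, 0, 1, 1, 1, 2, 0, 1],
    [2, 0, 0, 2, 0, 2, 0, 0],
    [0, 1, 1, 1, 2, 1, 1, 2],
    [0, 2, 1, 0, 2, 1, 2, 2],
]


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def center(rows):
    """Subtract each column's mean so every SNP column averages to zero."""
    means = [sum(column) / len(rows) for column in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def leading_direction(rows, iterations=500, seed=0):
    """Power iteration on X^T X: a unit vector along the largest variance."""
    rng = random.Random(seed)
    width = len(rows[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(iterations):
        projections = [dot(row, v) for row in rows]               # X v
        w = [dot(projections, column) for column in zip(*rows)]   # X^T (X v)
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            return [0.0] * width  # no variance left to explain
        v = [x / norm for x in w]
    return v


def pca_scores(rows, n_components=2):
    """Return one tuple of component scores per row (per person)."""
    x = center(rows)
    per_component = []
    for _ in range(n_components):
        v = leading_direction(x)
        scores = [dot(row, v) for row in x]
        per_component.append(scores)
        # Deflate: remove the part of every row that lies along v, so the
        # next pass finds the next-largest direction of variance.
        x = [[a - s * b for a, b in zip(row, v)] for row, s in zip(x, scores)]
    return list(zip(*per_component))


scores = pca_scores(genotypes)
for name, (pc1, pc2) in zip(people, scores):
    print(f"{name:12s} PC1={pc1:7.3f}  PC2={pc2:7.3f}")
```

The sign of each component is arbitrary, so a plot made this way may be mirrored relative to one made with another tool. For real data a library routine is the better choice. In R, the built-in `prcomp` function does the whole computation in one call.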


What Are We Looking At?:


To get to the punchline in simple terms: if we plot each person's "principal component 1" value against their "principal component 2" value, similar data points will "cluster."






Since I didn't include a legend in the image above, here is who each data point corresponds to (and rough coordinates for folks who can't see color too well):

  • Red = Me (-700, 1000)
  • Purple = Sister (-550, 100)
  • Pink = Mom (-1750, -1000)
  • Green = Dad (1300, 1000)
  • Blue = Grandfather (1700, -1500)

Brief Explanation of the Image:

My sister and I each inherited roughly half of our DNA from my mother and half from my father, so our data points fall between the Mom and Dad data points. Based on the location of the Grandfather data point, it's obvious whether he is paternal or maternal.
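One way to see why the children land between the parents is sketched below. It assumes autosomal SNPs encoded as allele counts (0, 1 or 2), and the symbols are introduced here for illustration. Each parent passes on one of their two alleles at random, so a child's expected allele count at a SNP is the average of the parents' counts. A principal component score is a linear function of the genotype, so the same averaging carries over to the plot.

```latex
% g_x  : vector of allele counts (0, 1 or 2 per SNP) for person x  (assumed encoding)
% \mu  : per-SNP mean over all samples
% w_k  : unit direction of principal component k
\mathbb{E}\left[g_{\text{child}}\right]
  = \tfrac{1}{2}\left(g_{\text{mom}} + g_{\text{dad}}\right)
\quad\Longrightarrow\quad
\mathbb{E}\left[w_k^{\top}\left(g_{\text{child}} - \mu\right)\right]
  = \tfrac{1}{2}\left[\,w_k^{\top}\left(g_{\text{mom}} - \mu\right)
                      + w_k^{\top}\left(g_{\text{dad}} - \mu\right)\right]
```

This is only approximate. With five samples, the mean and the component directions themselves depend on the children's data, and any one child deviates from the expectation, so the points sit near the midpoint rather than exactly on it.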

Monday, May 16, 2011

Bioinformatics Programming Like Experts

When I (formally) learned how to program at university, I was taught a number of "rules." One of these rules was to pick one language of implementation for the program I am working on and stick with it! If you start programming in C#, then you'd better stick with C# all the way through!


Recently, I've asked myself if this rule is worth breaking (and when it is appropriate to do so). This rule is just something I originally took as a tenet and didn't bother to question.


It occurred to me the other day that every language has its strengths and weaknesses. For example, I find R (or Matlab) is probably the best language to use when it comes to matrix operations and statistical tests. Python is very quick to program in and easy to understand. C/C++ is great to program in if speed is important (or Java if one is so inclined). Bash is great for file manipulations, parsing, and match operations. MySQL is a standard choice for databases.


Picking a language and sticking with it for the implementation of a program will likely make it efficient for one type of function and much less so for the others. With all of the great languages out there, it would be a shame to not leverage them and their strengths. 


On the other hand, it will also be easier to share the program if it is written in one language and does not have a slew of dependencies. 


Essentially, by mixing languages we're trading ease of use for efficiency.


How I use programming languages for modern bioinformatics:
Python - application development, proof-of-concepts, complicated scripts
C/C++/Java - rewriting slow functions so I can call them from my Python code, speeding up the program as a whole
Bash - parsing and file manipulation operations, simple scripts, one-offs
R - data visualization and statistical tests
MySQL - filtering and manipulating data in a database


It is important to have a handle on a language in each category. Many of my heavier pipelines contain at least three of the languages above (and sometimes use one more than once).
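As a small illustration of mixing languages in one pipeline, here is a sketch in which Python hands a counting job to a shell one-liner and then parses the result. It is not taken from any real pipeline. The function name and sample data are made up, and it assumes a POSIX system where `sh`, `sort` and `uniq` are available.

```python
import subprocess


def count_values(lines):
    """Count duplicate lines by piping them through `sort | uniq -c`."""
    result = subprocess.run(
        ["sh", "-c", "sort | uniq -c"],   # the shell does the counting
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        check=True,                        # raise if the shell command fails
    )
    counts = {}
    for line in result.stdout.splitlines():
        # Each output line looks like "      3 AG": a count, then the value.
        count, value = line.split(None, 1)
        counts[value] = int(count)
    return counts


# Hypothetical genotype calls, invented for illustration.
genotype_calls = ["AG", "AA", "AG", "GG", "AG"]
counts = count_values(genotype_calls)
print(counts)
```

The trade-off discussed above shows up even here: the shell pipeline is short and fast on big files, but the script now only runs where those command-line tools exist.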


Just for the record, I don't have anything against Perl. I've just found myself drifting farther and farther away from it over the years.