Sunday, May 12, 2013

The Guerilla Guide to R

Update: Okay. I've uploaded a new template and things seem to be fine now.

Update: I am aware the table of contents is not being displayed in bullet form as I intended. The web template I'm using seems to be buggy. It also seems to think this page is in Indonesian...Working on it!

Table of Contents:
  1. Reading/Writing Files
    1. How to write lines of text into a file
    2. Trimming a huge (3.5GB) csv file to read into R
    3. Quickly reading large dataframes into R
    4. How can I tell if my R dataset is going to be too large?
    5. Standard logging library for R 
  2. Dataframes
    1. How do you write a CSV in R with matrix names dimnames(M) intact?
    2. How to join dataframes in R (inner, outer, left, right)?
    3. Dropping Columns in Dataframes 
    4. Dropping factor levels in a subsetted dataframe
    5. Remove rows with NA in your dataframe 
    6. Creating an R dataframe row by row 
    7. Fastest way to merge/join dataframes
  3. Lists and Vectors
    1. How to correctly use lists in R? 
    2. Levels - what sorcery is this? 
    3. R function for testing if a vector contains a given element?
    4. Converting lists to dataframe 
    5. In R, what is the difference between [] and [[]] notations accessing the elements of a list 
    6. How to convert a factor to an integer/numeric without a loss of information 
    7. How to access the last value of a vector 
    8. Removing an element from a list 
    9. Extracting the last n characters from a string 
  4. Exceptions and Gotchas
    1. What is the biggest R gotcha you've run across?
    2. Exception handling in R
    3. Reading commandline parameters from an R script
    4. Assignment operators in R: '=' vs '<-'
    5. Why is not (explicitly) calling return faster or better, and thus preferable? 
    6. Why does TRUE == "TRUE" in R?
    7. What is the difference between '1L' and '1'?  
    8. Why does the number 1e99999 (31 9's) cause problems? 
    9. Examples of the perils of globals in R and Stata 
    10. How to count TRUE values in a logical vector 
  5. Sorting
    1. Letter "y" comes after "i" when sorting?
    2. How to sort a dataframe by column(s) in R?
    3. Fastest way to find the second (or third) highest/lowest value in a vector or column 
  6. Plotting
    1. Plotting 2 graphs in the same plot
    2. How to plot 2 histograms together in R
    3. Histogram with a logarithmic scale 
    4. The most underutilized visualization
    5. Rotating and spacing axis labels in ggplot2 
    6. Shading a kernel density plot between 2 points
    7. Choosing between qplot() and ggplot() in ggplot2 
    8. Intelligent point label placement in R 
    9. Plotting a correlation matrix 
    10. Plotting a 3D surface plot with contour map overlay, using R 
    11. Side-by-side plots using ggplot2 
    12. Ggplot2 cheat sheet 
    13. Plot a human body in 2D 
    14. Plotting 2 variables as lines using ggplot2 
    15. Getting rid of axis values in an R plot 
  7. Grouping Functions and Speed
    1. R grouping functions: sapply vs lapply vs apply vs tapply vs etc. 
    2. Why is '[' better than the subset function?
    3. Confused by ...() 
    4. Speeding up "group by" functions
    5. Options for caching/memoization/hashing in R 
  8. Random
    1. Making XKCD style plots in R 
    2. Developing geographic thematic maps with R
    3. What can Matlab do that R cannot do?
    4. Where to learn C code to speed up your R functions
    5. How to organize large R programs
    6. The difference between library() and require() 
    7. Unloading a package without restarting R
    8. What is the difference between R.exe, Rterm.exe, Rscript.exe, and Rcmd.exe?
    9. What is your preferred style for naming variables in R? 
    10. In R, what exactly is the problem with having objects with the same base name as functions?  
    11. Easter eggs in R 
    12. Drawing an excellent cow  
    13. Display a time clock on the R commandline 
    14. What are slots?  
  9. Case Studies
    1. How to determine the position of the sun at a given time of day, latitude and longitude 
    2. Implementing Model-View-Control (MVC) in R
    3. Scraping html tables into R dataframes 
    4. R + ggplot: time series with events 
    5. Speed up loop operation in R
    6. How to scrape the web for the list of R release dates

About:

Stack Overflow is awesome. Some of the world's most brilliant programmers frequent the website and answer tough questions. Wouldn't that make it a great place to learn from? Yeah, I think so too.

This is why I've collated The Guerilla Cookbook for R. It's basically a number of Stack Overflow links organized and ordered in a way to help R programmers learn their way to the next level. If you are proficient in R, I hope these resources will help you get closer to being amazing. If you are just getting started with R, I'd suggest adding this page to your bookmarks and returning when you are familiar with the basics of R programming.

The cool thing is, this "book" essentially writes itself since most of the experts (and peer-reviewers) are answering the questions. Most of the questions are "real-world" and are asked by novice or intermediate programmers. We can easily add/remove/reorganize the contents as necessary.


How Was The Content Selected?:

I personally searched through Stack Overflow to find my favorite questions and shared them here. 

The table of contents will have to be updated/reorganized over time as links are added and removed. But use whatever you can for now!
 

Wednesday, December 26, 2012

OWLify: A Python Library to Generate RDF/OWL Code


Why Another Library?:

I've been working with RDF/OWL data over the past few months. I've been moving away from high-performance databases and NoSQL and towards the semantic web. The semantic web allows us to represent rich, complex relationships between data--in ways we just aren't able to with current database technology. I've found RDF/OWL better suited for representing knowledge of the data I've been working with most recently.

Libraries That Already Exist:

We can query semantic web files with SPARQL. A Python library used to interface with SPARQL already exists. In fact, there is also a library that helps process RDF/OWL files in Python. However, I couldn't find anything well-suited to take a Python object and transform it into RDF/OWL. So I wrote OWLify.

Explain OWLify:

In order to quickly turn my data into an RDF/OWL file, I wrote a bare-minimum library I called OWLify. I only included the functions I needed at the time. I'm putting it up on GitHub for folks to fork and extend as needed. The interface is pretty simple. Creating new RDF-compliant functions should be very straightforward.

Example Code:

from OWLify import OWL

names = ['Janet', 'Marcy', 'Ed']
gender = ['Female', 'Female', 'Male']
colors = ['Red','Blue','Green']

out = OWL('http://www.nikhilgopal.com', 'owl_example.owl', 'http://www.nikhilgopal.com/properties')
out.start()
for n in range(len(names)):
    out.addClass(names[n])
    out.assertEquivalentTriple(names[n], 'hasColor', colors[n])
    out.assertSubClassTriple(names[n], 'hasGender', gender[n])
out.end()

Get the Code



Friday, August 10, 2012

The Bunch Design Pattern (Python)

This one is out of the Python Cookbook and credit goes to Alex Martelli if I'm not mistaken. This technique is useful when we would like to record data in some type of "forgiving" data structure.

We know we can accomplish this with a python dictionary, but what about when we want to nest an inconsistent data structure in another inconsistent data structure? This is when the bunch pattern is useful.

Example Program:


class Bunch:
    def __init__(self, *args, **kwargs):
        # Every keyword argument becomes an attribute on the instance.
        self.__dict__.update(kwargs)

struct = Bunch(type="flat", size="huge", family="chordata", genus=Bunch(level="medium", intensity="hot"), BOOL=True)

print struct
print struct.type
print struct.size
print struct.genus.level

Output:


<__main__.Bunch instance at 0x1004e1248>
flat
huge
medium



Pretty slick, huh? The beauty of this vignette is that it is highly extensible. The original thread where I stumbled upon this is located here.
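As one small illustration of that extensibility, here is a minimal sketch (written for Python 3, unlike the Python 2 example above) that adds a __repr__ method so that printing a bunch shows its fields instead of a memory address. The __repr__ format is my own choice for illustration and is not part of the original recipe.

```python
class Bunch(object):
    """Attribute-style record; keyword arguments become attributes."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def __repr__(self):
        # Sort the fields by name so the output is stable across runs.
        fields = ", ".join(
            "%s=%r" % (name, value)
            for name, value in sorted(self.__dict__.items())
        )
        return "Bunch(%s)" % fields


struct = Bunch(type="flat", genus=Bunch(level="medium", intensity="hot"))
struct.size = "huge"  # fields can also be added after construction
print(struct)
```

Nested bunches print recursively, because the outer __repr__ calls repr() on each value.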


Tuesday, April 24, 2012

A Refresher on "Big O" Notation (Python)

It's well known that asymptotic notation is used to convey the speed and efficiency of code blocks in computer programs. I haven't used it very much while working with Python, so I needed to refresh my memory before trying to use this great tool.


Cardinal Rule: Focus primarily on the largest term in the time complexity expression. As the input grows, every other term is essentially trumped by it.


O(n^4+n^2+n^3+nm+100) ~= O(n^4)
Update: this assumes m grows no faster than linearly in n.
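To see the cardinal rule in numbers, here is a rough Python 3 sketch that measures how much of the total the n^4 term accounts for as n grows. The function names are made up for illustration, and m is simply set equal to n to match the linear assumption.

```python
def full_cost(n, m):
    """The example expression: n^4 + n^2 + n^3 + n*m + 100."""
    return n**4 + n**2 + n**3 + n * m + 100


def dominant_share(n):
    """Fraction of the total contributed by the n^4 term, taking m = n."""
    return float(n**4) / full_cost(n, n)


for n in (10, 100, 1000):
    print(n, round(dominant_share(n), 4))
```

The share climbs toward 1 as n increases, which is why the lower-order terms get dropped.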


Trump Rules for Time Complexity:

  • Notes
    • Remember that we care most about the upper bound and are not so concerned with the lower (in general)
    • The smaller the upper bound number the better (and consequently, faster)
  • The Ladder
    • Constants are less than logarithms
    • Logarithms are less than polynomials
    • Polynomials are less than exponentials
    • Exponentials are less than factorials

Notation and Hierarchy (Smaller Is Better):

Constant Θ(1)
Logarithmic Θ(lg n)
Linear Θ(n)
Loglinear Θ(n lg n)
Quadratic Θ(n^2)
Cubic Θ(n^3)
Polynomial O(n^k)
Exponential O(k^n)
Factorial Θ(n!)
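The ladder is easier to believe with actual numbers. The following Python 3 sketch evaluates one representative function per rung at a given n; growth_table is a hypothetical helper name, and the exponential base k=2 is an arbitrary choice. The generic polynomial row is left out because quadratic and cubic are already instances of it.

```python
import math


def growth_table(n, k=2):
    """Return (name, value) pairs for each rung of the hierarchy at size n."""
    return [
        ("constant", 1),
        ("logarithmic", math.log(n, 2)),
        ("linear", n),
        ("loglinear", n * math.log(n, 2)),
        ("quadratic", n**2),
        ("cubic", n**3),
        ("exponential", k**n),
        ("factorial", math.factorial(n)),
    ]


for name, value in growth_table(20):
    print(name, value)
```

Note that the ordering only holds once n is large enough. At n = 2, for example, the cubic value (8) is still larger than the exponential one (4).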

Quick Examples:

[i for i in list] {linear}
Functions that generally operate on lists or generators (sum, map, filter, reduce, min, max, etc) tend to be linear in time complexity

[i+k for i in list for k in list] {quadratic}
[i+k for i in list1 for k in list2] O(len(list1)*len(list2)) {quadratic when the two lists are of similar length, since it's linear*linear}
Add 1 to the exponent for each nested loop. For example, [j+i+k+n for j in list1 for i in list1 for k in list1 for n in list1] has a time complexity of O(n^4).

Note: Some programmers trim the work done by nested loops over sorted lists by ensuring that no calculation is performed more than once. Each pass of the outer loop then leaves less for the inner loop to evaluate. This roughly halves the number of iterations (n(n-1)/2 instead of n^2), although the complexity class is still quadratic. For example:

list1 = [i for i in range(10)]
size = len(list1)
number=100
for n in range(size-1):
    for k in range(n+1, size):
        print number*(n+k)
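To check how much the triangular loop actually saves, this Python 3 sketch counts the inner-loop iterations using the same loop bounds as the example above and compares them with the full n^2. count_pairs is a name invented for this illustration.

```python
def count_pairs(size):
    """Count inner-loop iterations of the triangular nested loop."""
    iterations = 0
    for n in range(size - 1):
        for k in range(n + 1, size):
            iterations += 1
    return iterations


size = 10
# Triangular count, the closed form n(n-1)/2, and the full n^2 for comparison.
print(count_pairs(size), size * (size - 1) // 2, size * size)
```

The count matches n(n-1)/2, so the loop does a bit less than half the work of the full square but still grows quadratically.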


Wednesday, April 4, 2012

How to Improve Coding Style (In Python)

I'm finally cracking open the Python Style Guide. I've been programming python for years now so I thought I'd join the club.


In addition to all of the nifty tools available to speed up and optimize python code, there are a few utilities out there to help with coding style. PyLint is a program which analyzes source code and reports lines which do not follow the PEP 8 coding convention. There is another program called CloneDigger which looks through source code and points out duplicate code.


Summary of Guidelines to Improve Python Coding Style:

  • Variables
    • Global variables should be ALL_CAPS_WITH_UNDERSCORES
    • Non-public variables within classes should be prefixed with an underscore and lowercase (_private_list = [])
    • Public variables should be lowercase
    • Boolean variables should have an "is" or "has" prefix (is_full = True)
    • Avoid generic names
  • Classes
    • Should be named using the CamelCase convention
    • If a class will be a base class, prefix the classname with "Base"
  • Functions
    • Names of functions and methods should be lowercase and underscore separated (do_something_with_this)
    • Watch out for custom functions which share names with built-in functions
      • If this does happen and one can't find a better name, then add a trailing underscore to the custom function
    • Argument names and contents should be decided through an iterative design process
      • Also, do not use spaces around the "=" sign used to assign the default parameter for keyword arguments
    • Be careful with *args and **kw. These can cause problems if abused.
    • Don't implement "type" checking using the assert command
  • Conditionals
    • When checking whether an object is true, use "if object:" rather than "if object == True" or "if object is True"
    • When checking whether an object is of a certain "type" (integer, string, etc), do not use "if type(obj) == type(1):", use "if isinstance(obj, int):"
    • What's "False" in Python?:
      • None
      • False
      • Zero (of any numeric type)
      • Any empty sequence or mapping ({},'', [], ())
      • Instances of user-defined classes, if the class defines a __nonzero__() (__bool__() in Python 3) or __len__() method, when that method returns the integer zero or bool value False
  • Modules and Packages
    • Names usually have a lib suffix (mathlib)
    • Be careful to ensure that any process requiring several functions strung together is consolidated into an independent "pipeline" function
  • General Coding
    • Clarity over cleverness
    • Try to keep code as short as possible--short enough to fit on the screen without having to scroll rampantly
    • When a class starts to have about 10 or more methods, it is time to re-evaluate the contents of the class and possibly split that larger class into a number of smaller classes
    • Use lambda functions for functions that will only be run once or twice. Otherwise, create a defined function
    • Use spaces around arithmetic operators
    • Import statements should always be on separate lines
    • Avoid extraneous whitespace immediately inside parentheses, brackets, or braces. Also before colons
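Here is a short Python 3 sketch that pulls several of these guidelines into one place: constant and class naming, a non-public attribute, default arguments without spaces around "=", isinstance checks, and truthiness tests. Every name in it (MAX_ITEMS, BaseContainer, describe) is invented for illustration only.

```python
MAX_ITEMS = 3  # module-level global: ALL_CAPS_WITH_UNDERSCORES


class BaseContainer(object):  # CamelCase, "Base" prefix for a base class
    def __init__(self, items=None):  # no spaces around "=" for defaults
        self._items = list(items or [])  # non-public: leading underscore

    def __len__(self):
        # An instance holding zero items is falsy because __len__ returns 0.
        return len(self._items)

    def is_full(self):  # boolean-style names read as "is" or "has"
        return len(self._items) >= MAX_ITEMS


def describe(obj):
    """Return a short label, using isinstance instead of comparing types."""
    if isinstance(obj, int):
        return "integer"
    if not obj:  # truthiness test rather than "== True" or "== False"
        return "empty"
    return "non-empty"


print(describe(5), describe(BaseContainer()), describe(BaseContainer([1])))
```

One thing to watch: bool is a subclass of int in Python, so isinstance(True, int) is also true.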

Tuesday, April 3, 2012

Playing with Ruby on Rails

Below are some notes I made while playing with Ruby and RoR. Interestingly, I found that RoR is a great way to learn about the foundations of web frameworks. I now have a much stronger understanding not just of RoR, but also of Django, Pyramid, and Web2Py. Every web framework really has the same few components (or some variation thereof): Models, Views, Controllers, and Routes.

I walked through the free tutorial by CodeSchool:

The Basics:

CRUD
Create,Read,Update,Destroy

Zombie.new
z = Zombie.new(initialize)
z.save
Zombie.create
Zombie.find(3)
Zombie.update_attributes

Models:

class Tweet < ActiveRecord::Base # just means that class Tweet inherits from ActiveRecord::Base
end

validate data before it is saved in DB

class Tweet < ActiveRecord::Base
    validates_presence_of :status
end

t.errors -> returns errors
t.errors[:status] --> just error pertaining to status

Rails 3 new syntax for validation:

validates attribute, validation
validates :status, :presence => true
validates :status, :length => {:minimum => 3}

app/models/tweet.rb

class Tweet < ActiveRecord::Base
    belongs_to :zombie
end

app/models/zombie.rb
class Zombie < ActiveRecord::Base
    has_many :tweets
end

View:

web request -> 4 layers -> models, view, controllers, routing

<%...%> evaluate ruby
<%=...%> evaluate and print results

layouts/application --> make html page format and use <%= yield %>
/app/views/tweets/show.html.erb (code that will be yielded is in here)

Adding CSS
<%= stylesheet_link_tag :all %>
<%= javascript_include_tag :defaults %> # can replace prototype javascript library with jquery
<%= csrf_meta_tag %> #protects website from hackers

with URLs, it checks public folder first, then tries to execute inside rails

Adding a Link
<%= link_to "link text",  "link path (URL)"%>
<%= link_to tweet.zombie.name, zombie_path(tweet.zombie) %>
<%= link_to "Edit", edit_tweet_path(tweet) %>
<%= link_to "Delete", tweet, :method => :delete %>

Listing Zombies

Listing Tweets

Status  Zombie
<% Tweet.all.each do |tweet| %>
  <%= tweet.status %>  <%= tweet.zombie.name %>
<% end %>

Controllers:

class TweetsController < ApplicationController
  def show
    @tweet = Tweet.find(params[:id]) # using an instance variable
    # render :action => 'status' would render status.html.erb instead of the default
    respond_to do |format|
      format.html # show.html.erb
      format.xml  { render :xml => @tweet }
      format.json { render :json => @tweet }
    end
  end
end

Model calls go in the controller files. All request data is stored in a hash called params.

def index -> list all tweets
def show -> show a single tweet
def new -> show a new tweet form
def edit -> show an edit tweet form
def create -> create a new tweet
def update -> update a tweet
def destroy -> delete a tweet


Authorization:
def edit
  @tweet = Tweet.find(params[:id])
  if session[:zombie_id] != @tweet.zombie_id
    flash[:notice] = "sorry, you can't edit this tweet"
    redirect_to(tweets_path)
  end
end

class TweetsController < ApplicationController
  before_filter :get_tweet, :only => [:edit, :update, :destroy]
  before_filter :check_auth, :only => [:edit, :update, :destroy]

  def get_tweet
    @tweet = Tweet.find(params[:id])
  end

  def check_auth
    if session[:zombie_id] != @tweet.zombie_id
      flash[:notice] = "sorry, you can't edit this tweet"
      redirect_to(tweets_path)
    end
  end
end


Routing:

config/routes.rb
ZombieTwitter::Application.routes.draw do |map|
  resources :tweets #creates a "REST"ful resource
  match 'new_tweet' => "Tweets#new" # path => controller#action
  match 'all' => "Tweets#index", :as => "all_tweets" # now all tweets can be used as an object "link_to"
  match 'a' => redirect('/tweets')
  match 'google' => redirect('http://www.google.com')
  root :to => "Tweets#index"
  match 'local_tweets/:zipcode' => "Tweets#index"
  match 'local_tweets/:zipcode' => 'Tweets#index', :as => 'local_tweets'
end

<%= link_to "All Tweets", all_tweets_path %>
<%= link_to "Tweets in 32828", local_tweets_path(32828) %>


Wednesday, January 18, 2012

Word Frequency Count of Hacker News Front Page on SOPA Blackout Day

When I'm curious about what people are talking about on social news aggregation websites, I occasionally whip up an ad-hoc script to perform a quick analysis. I thought today was an especially interesting day to look at what the HNers are posting. I've filtered out common "words" such as consonants, conjunctions, etc. Here is what the data looks like at 1:43pm today:



8 SOPA
3 protest
3 PIPA
2 Wikipedia
2 white
2 Web
2 Today
2 "support"
2 SOPA/PIPA
2 out
2 million
2 I'm
2 blackout
2 /
1 Your
1 WTF
1 WSJ
1 With
1 with
1 Wired
1 Windows
1 Why
1 What
1 Were
1 website
1 We
1 Version
1 users-per-employee
1 tweets
1 Today's
1 today
1 Timeline
1 Tech
1 support
1 stop
1 Stack
1 "spoilering")
1 SOPA/PIPA:
1 Social
1 slower
1 Service
1 SEO:
1 SEO
1 Senator
1 Scalable
1 Saying
1 Rubio
1 Representatives
1 Reddit
1 Real-time
1 ratio
1 rate
1 Protests,
1 Problem
1 Pirating
1 Pirate
1 PIPA.
1 PC
1 Page
1 page
1 Out
1 out"
1 our
1 (or,
1 OCaml
1 (NYTM
1 NoSQL
1 No
1 Next
1 New
1 Mozilla
1 More
1 Modern
1 Masses
1 Marco
1 Mainstream
1 Magazine
1 lower
1 Life
1 I
1 hr
1 hours
1 Help
1 Have
1 has
1 .gov
1 Google's
1 Googlebot
1 gone
1 Fourier
1 Flickr
1 feedback)
1 Fax
1 Fast
1 FARK
1 Fall
1 Facebook
1 Experiments
1 DynamoDB
1 drops
1 down
1 Dear
1 Database
1 Daily
1 creative
1 crawl
1 Congressman
1 Community,
1 Communication
1 comments
1 comes
1 Building
1 Breaking
1 blacksout
1 Blacks
1 blacks
1 "blacking
1 Becomes
1 be
1 Barriers
1 AWS
1 awful
1 at
1 Assholes
1 as
1 Anti-SOPA
1 an
1 Amazon
1 all
1 Accounts
1 accident
1 About
1 7
1 4chan
1 250k
1 12
1 10.5
1 â
1 &
1

Nothing shocking, but it's clear that SOPA awareness day is working. In fact, I didn't even know what PIPA was until now.

[update]
If you are interested in the code, then I've posted it below. I literally put this together in 10 minutes, so don't expect picture-perfect code:

scrapeAndCount.py script:

import urllib2
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://news.ycombinator.com/').read())

print soup.prettify()


wordCount.py script:

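import sys

words = {}

# Count every space-separated token read from standard input.
for line in sys.stdin:
        for word in line.strip('\n').split(' '):
                words[word] = words.get(word, 0) + 1

# Print "count word" pairs; sorting is left to the shell pipeline.
for word in words:
        print words[word], word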


The pipeline:

python scrapeAndCount.py | grep -A2 "" | grep -B1 "\-\-" | grep -v "\-\-" | sed 's/^\s*//g' | python wordCount.py | sort -k1 -g -r | egrep -i -v "the|and|for|of|than|too|is|\sto$|\sa$|from|goes"
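For anyone who would rather keep the counting and filtering in one place, here is a rough Python 3 sketch of those two steps using collections.Counter. The stop-word set is an assumption lifted from the egrep pattern above, count_words is a made-up name, and the sample titles are invented. Unlike the egrep filter, which drops any output line containing one of those strings, this version only drops tokens that match a stop word exactly.

```python
from collections import Counter

# Assumed stop words, taken from the egrep filter in the pipeline.
STOP_WORDS = {"the", "and", "for", "of", "than", "too", "is", "to", "a", "from", "goes"}


def count_words(lines, stop_words=STOP_WORDS):
    """Count whitespace-separated tokens, skipping stop words (case-insensitive)."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            if token.lower() not in stop_words:
                counts[token] += 1
    return counts


# Invented sample titles, standing in for the scraped front page.
titles = ["SOPA protest goes dark", "Wikipedia and the SOPA blackout"]
for word, count in count_words(titles).most_common():
    print(count, word)
```

Counter.most_common() returns the words sorted by descending count, so the separate sort step is no longer needed.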