Wednesday, January 18, 2012

Word Frequency Count of Hacker News Front Page on SOPA Blackout Day

Every now and then, when I'm curious about what people are talking about on social news aggregation websites, I whip up an ad-hoc script for a quick analysis. I thought today was an especially interesting day to look at what the HNers are posting. I've filtered out common "words" such as articles, conjunctions, etc. Here is what the data looked like at 1:43pm today:



8 SOPA
3 protest
3 PIPA
2 Wikipedia
2 white
2 Web
2 Today
2 "support"
2 SOPA/PIPA
2 out
2 million
2 I'm
2 blackout
2 /
1 Your
1 WTF
1 WSJ
1 With
1 with
1 Wired
1 Windows
1 Why
1 What
1 Were
1 website
1 We
1 Version
1 users-per-employee
1 tweets
1 Today's
1 today
1 Timeline
1 Tech
1 support
1 stop
1 Stack
1 "spoilering")
1 SOPA/PIPA:
1 Social
1 slower
1 Service
1 SEO:
1 SEO
1 Senator
1 Scalable
1 Saying
1 Rubio
1 Representatives
1 Reddit
1 Real-time
1 ratio
1 rate
1 Protests,
1 Problem
1 Pirating
1 Pirate
1 PIPA.
1 PC
1 Page
1 page
1 Out
1 out"
1 our
1 (or,
1 OCaml
1 (NYTM
1 NoSQL
1 No
1 Next
1 New
1 Mozilla
1 More
1 Modern
1 Masses
1 Marco
1 Mainstream
1 Magazine
1 lower
1 Life
1 I
1 hr
1 hours
1 Help
1 Have
1 has
1 .gov
1 Google's
1 Googlebot
1 gone
1 Fourier
1 Flickr
1 feedback)
1 Fax
1 Fast
1 FARK
1 Fall
1 Facebook
1 Experiments
1 DynamoDB
1 drops
1 down
1 Dear
1 Database
1 Daily
1 creative
1 crawl
1 Congressman
1 Community,
1 Communication
1 comments
1 comes
1 Building
1 Breaking
1 blacksout
1 Blacks
1 blacks
1 "blacking
1 Becomes
1 be
1 Barriers
1 AWS
1 awful
1 at
1 Assholes
1 as
1 Anti-SOPA
1 an
1 Amazon
1 all
1 Accounts
1 accident
1 About
1 7
1 4chan
1 250k
1 12
1 10.5
1 รข
1 &
1

Nothing shocking, but it's clear that SOPA awareness day is working. In fact, I didn't even know what PIPA was until now.

[update]
If you are interested in the code, then I've posted it below. I literally put this together in 10 minutes, so don't expect picture-perfect code:

scrapeAndCount.py script:

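# Python 2 with BeautifulSoup 3
import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch the Hacker News front page and parse the HTML
soup = BeautifulSoup(urllib2.urlopen('http://news.ycombinator.com/').read())

# Pretty-print so that each tag and text node lands on its own line for grep
print soup.prettify()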


wordCount.py script:

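import sys

# Map each space-separated token to the number of times it appears
words = {}

for line in sys.stdin:
    for token in line.strip('\n').split(' '):
        words[token] = words.get(token, 0) + 1

# Print "count token" so the output can be sorted numerically downstream
for token in words:
    print words[token], token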


The pipeline:

python scrapeAndCount.py | grep -A2 "" | grep -B1 "\-\-" | grep -v "\-\-" | sed 's/^\s*//g' | python wordCount.py | sort -k1 -g -r | egrep -i -v "the|and|for|of|than|too|is|\sto$|\sa$|from|goes"
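
For comparison, here is a minimal sketch of how the counting, sorting, and filtering steps could be folded into one Python 3 function. It is not the script I ran. The STOP_WORDS set and the count_words name are made up for illustration, and unlike the egrep filter it compares whole tokens rather than substrings.

```python
from collections import Counter

# Hypothetical stop-word list, loosely based on the egrep filter above
STOP_WORDS = {"the", "and", "for", "of", "than", "too", "is", "to", "a", "from", "goes"}


def count_words(lines, stop_words=STOP_WORDS):
    """Return (count, word) pairs, most frequent first, skipping stop words."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            # Compare case-insensitively and as whole tokens, so "This" is
            # not dropped just because it contains "is"
            if token.lower() not in stop_words:
                counts[token] += 1
    # Sort by descending count, then alphabetically for a stable order
    return sorted(((n, w) for w, n in counts.items()), key=lambda p: (-p[0], p[1]))


if __name__ == "__main__":
    sample = ["Stop SOPA and PIPA", "Why SOPA is awful", "Today the Web goes dark"]
    for n, word in count_words(sample):
        print(n, word)
```

Matching whole tokens is a deliberate choice here: a substring filter such as "the|and|of" also throws away any line that merely contains those letters.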



Sunday, January 15, 2012

iOS Development: Types of Applications in Xcode, Model-View-Controller (MVC), and Basic Object Interactions

Types of Applications in Xcode:


Navigation-based Application - Exactly what it sounds like. Imagine the mail application. We select an option and it navigates to another page (or activity, if you come from the world of Android).


OpenGL ES Application - An application which uses OpenGL. This can be pretty heavy stuff.


Tab Bar Application - An application which is similar to a website with frames. There will be a tab pane at the bottom of the app with 4-5 options and choosing an option will change the activity the user is experiencing.


Utility Application - An interesting kind of application which "flips." These apps tend to have a miniature "i" icon in the bottom right corner of the screen. Upon tapping the icon, the application "flips" into another activity.


View-based Application - Probably the most commonly selected project.


Window-based Application - This sort of project tends to be very bare. I suppose this is the "expert" developer mode where he or she has to pick and choose what components are required for the application.




MVC (Model-View-Controller):


It's just a software design pattern. The idea is to keep the models and the views separate from each other. The philosophy is that a developer should be able to make modifications to one without affecting the other. The view never talks to the model. However, the model and the view both talk to the controller.


My little memory trick to remember this paradigm is that: (1) model, view and controller are three good friends and (2) controller just happens to be a control freak and acts as the go-between for all of the group's communication. I'm sure I can think of a better mnemonic but this will do for now. Views are easy to make and are essentially created in Interface Builder. Models can best be thought of as "your concept" for the application and I am going to generalize the controller as the code.
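
To make the "control freak" idea concrete, here is a minimal sketch in Python rather than Objective-C. The Model, View, and Controller classes and the user_entered method are names I made up for illustration, not anything from the iOS SDK. The point is only that the model and the view never reference each other.

```python
class Model:
    """Holds application data; knows nothing about how it is displayed."""
    def __init__(self):
        self.message = ""


class View:
    """Displays text; knows nothing about where the text comes from."""
    def __init__(self):
        self.label_text = ""

    def show(self, text):
        self.label_text = text


class Controller:
    """The go-between: the only object that touches both model and view."""
    def __init__(self, model, view):
        self.model = model
        self.view = view

    def user_entered(self, text):
        # Update the model first, then refresh the view from the model
        self.model.message = text
        self.view.show("Your text, " + self.model.message)


controller = Controller(Model(), View())
controller.user_entered("hello")
print(controller.view.label_text)
```

Swapping in a different View class would require no change to Model, which is the separation the pattern is after.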




Basic Interaction:


After starting a view-based application, we get all of that handy Objective-C stub code. The xib files (pronounced "nib") open up in Interface Builder. We can drag and drop elements (such as text boxes, buttons, text views, etc.) into the application's view.


The interactions between the elements we just placed are declared in the @interface section of the viewController.h file. Pretend we have a text field where a user enters text, a label to show whatever the user submitted in the text field, and a button to process the action. The text field and the label would be IBOutlets (references that let our code reach the interface elements) and the button would be wired to an IBAction (a method that runs when the button is used).


So let's crack open the header file.


@interface someViewController : UIViewController {
  IBOutlet UITextField *message;
  IBOutlet UILabel *label;
}


We also define @property (nonatomic, retain) for the objects we intend to use (under the @interface section).


@property (nonatomic, retain) IBOutlet UITextField *message;
@property (nonatomic, retain) IBOutlet UILabel *label;


The button requires a method declaration (IBAction).


- (IBAction) lightsCameraAction;


Now we go to the implementation (.m file).


@implementation someViewController


@synthesize message;
@synthesize label;


- (IBAction) lightsCameraAction {
  NSString *internalTemporaryStorage = [[NSString alloc] initWithFormat:@"Your text, %@", message.text];
  label.text = internalTemporaryStorage;
  [internalTemporaryStorage release];
}


The last step is to go back to Interface Builder and literally connect the objects to each other. We click on the "File's Owner" box and connect the label outlet to the label in the view, then connect the message outlet to the text field. Finally, we connect the lightsCameraAction method to the button (and be sure to set the button event to "Touch Up Inside").

Friday, January 13, 2012

iOS Development: iOS Syntax, Object Allocation, and Garbage Collection

Syntax and Object Allocation:


Coming from the world of Java and Android programming, I'm used to initializing an object in the following format:


Coffee cupOfJoe = new Coffee();


In Objective-C (which is actually a superset of the C language), the same line is written like so:


Coffee *cupOfJoe = [Coffee new];


However, the recommended format is this (which is equivalent to the above):


Coffee *cupOfJoe = [[Coffee alloc] init];


The reason for this is that we have more flexibility with the latter form. There are variations of the init method, such as initWithString:, that take arguments. This format is also followed for historical reasons (just crack open any supporting libraries to prove it to yourself).




No garbage collection:


Unlike Android (which comes complete with garbage collection since it is programmed in Java), we need to release our objects after we are done using them. 


[cupOfJoe release];


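We can also override the dealloc method in the implementation file. It is inherited from our superclass, but we can add to its functionality as long as we call [super dealloc] at the end. What is not recommended is calling dealloc directly; the runtime invokes it for us once the last owner releases the object.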


-(void) dealloc {
    NSLog(@"The world is about to explode!");
    [super dealloc];
}


The NSLog addition runs just before the object is deallocated.


Although I can't confirm it yet, I've heard that this lack of garbage collection applies only to iPhone applications, not to Mac desktop applications.




Autorelease Pools:


Typically, if we own an object, we are responsible for releasing it. However, this isn't always doable.


Sometimes we need an autorelease pool. 


NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];


Autorelease loosely translates to, "Release my object, but not immediately." This is useful for cases like below:


- (A *)coolFunction {
  A *a = [[A alloc] init];
  [a autorelease];  // a plain "release" here would free the object before the caller receives it
  return a;
}


N *n = [[N alloc] init];
A *coolResult = [n coolFunction];


We have an object allocated inside the method. If we called release as usual, the object would be deallocated before it could be returned! So instead of using release, we use autorelease, which defers the release until the pool is drained and lets the caller receive a valid object. This functionality shouldn't be abused.
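
The deferred release can be modeled with a toy sketch in Python. This is not how the Objective-C runtime is implemented. ToyObject, ToyPool, and cool_function are hypothetical names that only mimic the retain count bookkeeping described above.

```python
class ToyObject:
    """Toy stand-in for a manually reference-counted object."""
    def __init__(self):
        self.retain_count = 1       # alloc/init hands back an owned object
        self.deallocated = False

    def retain(self):
        self.retain_count += 1
        return self

    def release(self):
        self.retain_count -= 1
        if self.retain_count == 0:
            self.deallocated = True  # the point where dealloc would run


class ToyPool:
    """Collects objects whose release has been deferred."""
    def __init__(self):
        self.pending = []

    def autorelease(self, obj):
        self.pending.append(obj)     # remember to release it later
        return obj

    def drain(self):
        for obj in self.pending:
            obj.release()
        self.pending = []


pool = ToyPool()


def cool_function():
    a = ToyObject()
    return pool.autorelease(a)       # still alive when the caller gets it


result = cool_function()
alive_before_drain = not result.deallocated
pool.drain()
print(alive_before_drain, result.deallocated)
```

If the caller wanted to keep the object past the drain, it would call retain before the pool is drained, which mirrors the ownership rule from earlier in this post.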




As an interesting side note, the "NS" prefix on names such as NSObject and NSLog stands for NeXTSTEP, the operating system from NeXT (the Steve Jobs venture).

Thursday, January 12, 2012

Exploring Bonferroni Correction (or How Facebook Could Improve Friend Suggestions)



What is Bonferroni?

A statistical correction for multiple comparisons. The idea behind it is to work out how often a certain condition would be met by chance alone. This is very helpful for putting things in perspective when analyzing large data sets (which often produce a number of false positive "significant" results).



Potential Applications:
  • I've always wondered if Facebook cross-references the events people attend to suggest whether or not they should be friends. If this was the only metric they used to suggest friendship, how many friend suggestions would be false positives?
  • How often could potential matches on large dating websites happen by chance?
  • And of course, from my field, how many single-nucleotide polymorphisms (SNPs) could appear significant in a genome-wide association study by chance alone?
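
For the SNP case, the arithmetic behind the correction can be sketched in a few lines of Python. The number of SNPs and the significance level below are hypothetical values chosen only for illustration, and the function names are my own.

```python
def expected_false_positives(num_tests, alpha):
    """Expected number of tests passing the threshold by chance alone,
    assuming every null hypothesis is true."""
    return num_tests * alpha


def bonferroni_threshold(num_tests, alpha):
    """Per-test significance threshold that keeps the family-wise error
    rate at or below alpha."""
    return alpha / num_tests


NUM_SNPS = 1000000   # hypothetical number of SNPs tested
ALPHA = 0.05         # hypothetical significance level

print(expected_false_positives(NUM_SNPS, ALPHA))
print(bonferroni_threshold(NUM_SNPS, ALPHA))
```

The first number shows why an uncorrected threshold is a problem at this scale, and the second is the stricter per-test cutoff the correction asks for.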



Let’s use the Facebook example and make the following assumptions:
  • There are 1 billion active users
  • Everyone attends an event 1 day in 100
  • There are 200,000 registered events within our scope, which is enough to account for the 10 million people (1% of 1 billion) who attend an event on a given day.
  • We wade through 1000 days worth of event attendance records



What is the probability that two people were at the same event on two different days?

Assuming everyone randomly attends an event, the probability that someone attends an event on any given day is 0.01 (1/100). And when they do choose an event to attend, they choose one of the 2e+05 registered events at random. Just to be clear on notation, 2e+05 = 2 x 10^5.


  • The probability of any two people both deciding to attend an event on the same day is 0.0001 (1/100*1/100).
  • The probability that they will attend the same event is 0.0001/2e+05 (number of registered events) = 5e-10.
  • The chance that they will attend the same event on each of two different days is (5e-10)*(5e-10) = 2.5e-19 (note that the event can be a different one on each of the two days).

How many of these event attendance coincidences will indicate potential friendship? Let's just say a potential friendship is a pair of people and a pair of days, such that two people were at the same event on each of the two days.

  • The number of pairs of people is (10^9 choose 2), which is approximately 5e+17.
  • The number of pairs of days is (1000 choose 2) = 5e+05.

So, the expected number of attendance coincidences that look like potential friendships = (the numbers of pairs of people)*(the number of pairs of days)*(the probability that any one pair of people and pair of days is an indication of potential friendship).

(5e+17)*(5e+05)*(2.5e-19) = 62,500
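
The same calculation can be sketched in Python as a sanity check. The function and parameter names are my own, and it uses exact binomial coefficients, so the result comes out slightly below 62,500 (1000 choose 2 is 499,500, not a round 500,000).

```python
from math import comb  # Python 3.8+


def expected_coincidences(users, days, p_attend, num_events):
    """Expected number of (pair of people, pair of days) combinations in which
    both people attend the same event on both days, purely by chance."""
    # Both attend something on a given day, and both pick the same event
    p_same_event_one_day = (p_attend * p_attend) / num_events
    # The same thing happens independently on two different days
    p_both_days = p_same_event_one_day ** 2
    return comb(users, 2) * comb(days, 2) * p_both_days


estimate = expected_coincidences(
    users=10**9, days=1000, p_attend=0.01, num_events=2 * 10**5
)
print(round(estimate))
```

Changing num_events or p_attend shows how sensitive the false positive count is to the assumptions, since both terms are effectively squared.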

This means there will be about 62,500 pairs of people who look like they should be friends even though they aren't. However, considering that this is on the scale of a billion folks, 62,500 is only about 0.00625% of the user base, which doesn't seem terrible.


In my own experience, I know I've seen a number of suggested friends who I have nothing in common with except similar mutual friends. Maybe friend suggestions can be improved by incorporating this information. 


Of course, I deactivated my Facebook account in 2009 and haven't been back since. Perhaps they already leverage this information...