Wednesday, January 18, 2012

Word Frequency Count of Hacker News Front Page on SOPA Blackout Day

Every now and then, when I'm curious about what people are talking about on social news aggregation sites, I whip up an ad-hoc script for a quick analysis. Today seemed like an especially interesting day to look at what HNers are posting. I've filtered out common words such as articles, conjunctions, and prepositions. Here is what the data looked like at 1:43pm today:



8 SOPA
3 protest
3 PIPA
2 Wikipedia
2 white
2 Web
2 Today
2 "support"
2 SOPA/PIPA
2 out
2 million
2 I'm
2 blackout
2 /
1 Your
1 WTF
1 WSJ
1 With
1 with
1 Wired
1 Windows
1 Why
1 What
1 Were
1 website
1 We
1 Version
1 users-per-employee
1 tweets
1 Today's
1 today
1 Timeline
1 Tech
1 support
1 stop
1 Stack
1 "spoilering")
1 SOPA/PIPA:
1 Social
1 slower
1 Service
1 SEO:
1 SEO
1 Senator
1 Scalable
1 Saying
1 Rubio
1 Representatives
1 Reddit
1 Real-time
1 ratio
1 rate
1 Protests,
1 Problem
1 Pirating
1 Pirate
1 PIPA.
1 PC
1 Page
1 page
1 Out
1 out"
1 our
1 (or,
1 OCaml
1 (NYTM
1 NoSQL
1 No
1 Next
1 New
1 Mozilla
1 More
1 Modern
1 Masses
1 Marco
1 Mainstream
1 Magazine
1 lower
1 Life
1 I
1 hr
1 hours
1 Help
1 Have
1 has
1 .gov
1 Google's
1 Googlebot
1 gone
1 Fourier
1 Flickr
1 feedback)
1 Fax
1 Fast
1 FARK
1 Fall
1 Facebook
1 Experiments
1 DynamoDB
1 drops
1 down
1 Dear
1 Database
1 Daily
1 creative
1 crawl
1 Congressman
1 Community,
1 Communication
1 comments
1 comes
1 Building
1 Breaking
1 blacksout
1 Blacks
1 blacks
1 "blacking
1 Becomes
1 be
1 Barriers
1 AWS
1 awful
1 at
1 Assholes
1 as
1 Anti-SOPA
1 an
1 Amazon
1 all
1 Accounts
1 accident
1 About
1 7
1 4chan
1 250k
1 12
1 10.5
1 รข
1 &
1

Nothing shocking, but it's clear that SOPA awareness day is working. In fact, I didn't even know what PIPA was until now.

[update]
If you are interested in the code, then I've posted it below. I literally put this together in 10 minutes, so don't expect picture-perfect code:

scrapeAndCount.py script:

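import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch the Hacker News front page and parse the HTML.
soup = BeautifulSoup(urllib2.urlopen('http://news.ycombinator.com/').read())

# Print the re-indented markup so the grep and sed steps can work line by line.
print soup.prettify()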
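
For anyone who would rather pull the headlines out in Python than with grep and sed, here is a rough Python 3 sketch using only the standard library. It is not the script used for the numbers above. The class name "titleline", the sample markup, and the names TitleParser and extract_titles are all assumptions made for illustration, so check them against the real page source before relying on them.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first link after each title container.

    The container class name is an assumption about the page markup.
    """

    def __init__(self, title_class="titleline"):
        super().__init__()
        self.title_class = title_class
        self.want_link = False  # saw a title container, waiting for its <a>
        self.in_link = False    # currently inside the headline <a>
        self.buffer = []
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.title_class in classes:
            self.want_link = True
        if tag == "a" and self.want_link:
            self.want_link = False
            self.in_link = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_link:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link = False
            self.titles.append("".join(self.buffer).strip())


def extract_titles(html, title_class="titleline"):
    """Return the headline strings found in an HTML document."""
    parser = TitleParser(title_class)
    parser.feed(html)
    parser.close()
    return parser.titles


# Made-up sample markup, only to show the call.
SAMPLE = (
    '<span class="titleline"><a href="http://example.com/a">Stop SOPA</a>'
    '<span class="sitebit">(example.com)</span></span>'
)
print(extract_titles(SAMPLE))
```

Feeding it the fetched page instead of SAMPLE gives one string per headline, which can then go straight into the word counter.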


wordCount.py script:

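import sys

# Map each token to the number of times it appears on stdin.
words = {}

for line in sys.stdin:
    # Split on single spaces only, so punctuation stays attached to the words.
    for word in line.strip('\n').split(' '):
        try:
            words[word] += 1
        except KeyError:
            words[word] = 1

# Print "count word" pairs so that sort can order them by count.
for word in words:
    print words[word], word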
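
The same tally can be sketched more compactly in Python 3 with collections.Counter. This is an alternative, not the script that produced the numbers above, and count_words and the sample headlines are made up for illustration. One difference in behavior: split() with no argument skips empty tokens, while the script above splits on single spaces and so also counts an empty string for blank lines.

```python
from collections import Counter


def count_words(lines):
    """Count whitespace-separated tokens across an iterable of lines."""
    counts = Counter()
    for line in lines:
        # split() with no argument collapses runs of whitespace,
        # so empty-string tokens are never counted.
        counts.update(line.split())
    return counts


# Made-up sample input, only to show the call.
headlines = ["Stop SOPA", "SOPA and PIPA protest", "Wikipedia protest"]
for word, count in count_words(headlines).most_common():
    print(count, word)
```

most_common() already returns the pairs sorted by descending count, so the separate sort step is not needed in this variant.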


The pipeline:

python scrapeAndCount.py | grep -A2 "" | grep -B1 "\-\-" | grep -v "\-\-" | sed 's/^\s*//g' | python wordCount.py | sort -k1 -g -r | egrep -i -v "the|and|for|of|than|too|is|\sto$|\sa$|from|goes"
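
One thing to watch in the egrep step: unanchored alternatives such as "the" and "is" match anywhere in the line, so a word like "This" or "Another" is dropped along with the real stop words. A minimal Python 3 sketch of a whole-word filter plus the descending sort might look like the following. STOP_WORDS simply mirrors the words in the egrep pattern, and top_words and the sample counts are hypothetical names and data for illustration.

```python
# Same words as the egrep pattern, matched as whole tokens instead of substrings.
STOP_WORDS = {"the", "and", "for", "of", "than", "too", "is", "to", "a", "from", "goes"}


def top_words(counts, stop_words=STOP_WORDS):
    """Return (word, count) pairs, most frequent first, minus stop words."""
    kept = [(w, c) for w, c in counts.items() if w.lower() not in stop_words]
    # Sort by descending count, then alphabetically so ties have a stable order.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# Made-up counts, only to show the call.
sample_counts = {"SOPA": 8, "the": 5, "This": 1, "protest": 3, "A": 2}
for word, count in top_words(sample_counts):
    print(count, word)
```

Because the comparison is on the whole lowercased token, "This" survives the filter while "the" and "A" do not.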


