When I'm curious what people are talking about on social news aggregation sites, I occasionally whip up an ad-hoc script to run a quick analysis. Today seemed like an especially interesting day to look at what the HNers are posting. I've filtered out common "words" such as articles, conjunctions, prepositions, etc. Here is what the data looked like at 1:43pm today:
8 SOPA
3 protest
3 PIPA
2 Wikipedia
2 white
2 Web
2 Today
2 "support"
2 SOPA/PIPA
2 out
2 million
2 I'm
2 blackout
2 /
1 Your
1 WTF
1 WSJ
1 With
1 with
1 Wired
1 Windows
1 Why
1 What
1 Were
1 website
1 We
1 Version
1 users-per-employee
1 tweets
1 Today's
1 today
1 Timeline
1 Tech
1 support
1 stop
1 Stack
1 "spoilering")
1 SOPA/PIPA:
1 Social
1 slower
1 Service
1 SEO:
1 SEO
1 Senator
1 Scalable
1 Saying
1 Rubio
1 Representatives
1 Reddit
1 Real-time
1 ratio
1 rate
1 Protests,
1 Problem
1 Pirating
1 Pirate
1 PIPA.
1 PC
1 Page
1 page
1 Out
1 out"
1 our
1 (or,
1 OCaml
1 (NYTM
1 NoSQL
1 No
1 Next
1 New
1 Mozilla
1 More
1 Modern
1 Masses
1 Marco
1 Mainstream
1 Magazine
1 lower
1 Life
1 I
1 hr
1 hours
1 Help
1 Have
1 has
1 .gov
1 Google's
1 Googlebot
1 gone
1 Fourier
1 Flickr
1 feedback)
1 Fax
1 Fast
1 FARK
1 Fall
1 Facebook
1 Experiments
1 DynamoDB
1 drops
1 down
1 Dear
1 Database
1 Daily
1 creative
1 crawl
1 Congressman
1 Community,
1 Communication
1 comments
1 comes
1 Building
1 Breaking
1 blacksout
1 Blacks
1 blacks
1 "blacking
1 Becomes
1 be
1 Barriers
1 AWS
1 awful
1 at
1 Assholes
1 as
1 Anti-SOPA
1 an
1 Amazon
1 all
1 Accounts
1 accident
1 About
1 7
1 4chan
1 250k
1 12
1 10.5
1 รข
1 &
1
Nothing shocking, but it's clear that SOPA awareness day is working. In fact, I didn't even know what PIPA was until now.
[update]
If you are interested in the code, then I've posted it below. I literally put this together in 10 minutes, so don't expect picture-perfect code:
scrapeAndCount.py script:
import urllib2
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://news.ycombinator.com/').read())
print soup.prettify()
wordCount.py script:
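import sys

# Count how often each space-separated token appears on stdin.
words = {}
for line in sys.stdin:
    for token in line.strip('\n').split(' '):
        try:
            words[token] += 1
        except KeyError:
            words[token] = 1

# Print "count token" pairs; sorting is left to the shell pipeline.
for token in words:
    print words[token], token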
The pipeline:
python scrapeAndCount.py | grep -A2 "" | grep -B1 "\-\-" | grep -v "\-\-" | sed 's/^\s*//g' | python wordCount.py | sort -k1 -g -r | egrep -i -v "the|and|for|of|than|too|is|\sto$|\sa$|from|goes"
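If you'd rather not chain greps, the counting, sorting, and filtering steps can be folded into one small function. This is only a rough Python 3 sketch, not what I ran above: the name count_words and the STOP_WORDS set are made up for illustration, the stop words are copied from the egrep pattern, and they are matched as whole tokens instead of as substrings, so the results can differ slightly from the pipeline. It expects a list of title strings and does no scraping.

```python
from collections import Counter

# Assumption: stop words copied from the egrep pattern above, matched as
# whole tokens (case-insensitively) rather than as substrings.
STOP_WORDS = {"the", "and", "for", "of", "than", "too", "is", "to", "a", "from", "goes"}


def count_words(titles, stop_words=STOP_WORDS):
    """Return (word, count) pairs, most frequent first."""
    counts = Counter()
    for title in titles:
        # split() with no argument splits on any run of whitespace,
        # so no empty tokens are counted.
        for token in title.split():
            if token.lower() not in stop_words:
                counts[token] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # Made-up sample titles, not the scraped data.
    sample = [
        "SOPA protest goes dark",
        "Why the SOPA blackout matters",
        "A protest for the Web",
    ]
    for word, count in count_words(sample):
        print(count, word)
```

Matching whole tokens avoids dropping words that merely contain a stop word, such as "Today" containing "to", which is a side effect of the substring-based egrep filter.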


