8 SOPA
3 protest
3 PIPA
2 Wikipedia
2 white
2 Web
2 Today
2 "support"
2 SOPA/PIPA
2 out
2 million
2 I'm
2 blackout
2 /
1 Your
1 WTF
1 WSJ
1 With
1 with
1 Wired
1 Windows
1 Why
1 What
1 Were
1 website
1 We
1 Version
1 users-per-employee
1 tweets
1 Today's
1 today
1 Timeline
1 Tech
1 support
1 stop
1 Stack
1 "spoilering")
1 SOPA/PIPA:
1 Social
1 slower
1 Service
1 SEO:
1 SEO
1 Senator
1 Scalable
1 Saying
1 Rubio
1 Representatives
1 Reddit
1 Real-time
1 ratio
1 rate
1 Protests,
1 Problem
1 Pirating
1 Pirate
1 PIPA.
1 PC
1 Page
1 page
1 Out
1 out"
1 our
1 (or,
1 OCaml
1 (NYTM
1 NoSQL
1 No
1 Next
1 New
1 Mozilla
1 More
1 Modern
1 Masses
1 Marco
1 Mainstream
1 Magazine
1 lower
1 Life
1 I
1 hr
1 hours
1 Help
1 Have
1 has
1 .gov
1 Google's
1 Googlebot
1 gone
1 Fourier
1 Flickr
1 feedback)
1 Fax
1 Fast
1 FARK
1 Fall
1 Facebook
1 Experiments
1 DynamoDB
1 drops
1 down
1 Dear
1 Database
1 Daily
1 creative
1 crawl
1 Congressman
1 Community,
1 Communication
1 comments
1 comes
1 Building
1 Breaking
1 blacksout
1 Blacks
1 blacks
1 "blacking
1 Becomes
1 be
1 Barriers
1 AWS
1 awful
1 at
1 Assholes
1 as
1 Anti-SOPA
1 an
1 Amazon
1 all
1 Accounts
1 accident
1 About
1 7
1 4chan
1 250k
1 12
1 10.5
1 รข
1 &
1
Nothing shocking, but it's clear that SOPA awareness day is working. In fact, I didn't even know what PIPA was until now.
[update]
If you are interested in the code, then I've posted it below. I literally put this together in 10 minutes, so don't expect picture-perfect code:
scrapeAndCount.py script:
import urllib2
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://news.ycombinator.com/').read())
print soup.prettify()
wordCount.py script:
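import sys

# Map each space-separated token to the number of times it appears on stdin
words = {}
for i in sys.stdin.readlines():
    for k in i.strip('\n').split(' '):
        try:
            words[k] += 1
        except KeyError:
            # First time this token has been seen
            words[k] = 1

# Emit "count word" so the output can be sorted numerically on the first column
for i in words:
    print words[i], i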
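For anyone on Python 3, here is a rough sketch of the same counting step using collections.Counter. It is not the script I ran; the function names and the two sample titles are made up for illustration. One difference to be aware of: split() with no argument throws away empty tokens, whereas split(' ') above keeps them.

```python
from collections import Counter


def count_words(lines):
    """Count whitespace-separated tokens across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # split() with no argument also discards empty tokens from repeated spaces
        counts.update(line.split())
    return counts


def format_counts(counts):
    """Return 'count word' rows, most frequent first."""
    return ["%d %s" % (n, word) for word, n in counts.most_common()]


# Made-up sample titles, only to show the call pattern
sample_titles = ["Stop SOPA", "SOPA blackout today"]
for row in format_counts(count_words(sample_titles)):
    print(row)
```

Since most_common() already orders by count, this version would not need the sort stage of the pipeline.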
The pipeline:
python scrapeAndCount.py | grep -A2 "" | grep -B1 "\-\-" | grep -v "\-\-" | sed 's/^\s*//g' | python wordCount.py | sort -k1 -g -r | egrep -i -v "the|and|for|of|than|too|is|\sto$|\sa$|from|goes"
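One thing to note about the last stage: egrep -i -v matches substrings, so a pattern like "the" also drops a row such as "1 Other". Below is a minimal Python 3 sketch of a whole-word filter instead, assuming that whole-word matching is what you actually want. The STOP_WORDS set and the filter_rows name are hypothetical; the words themselves are taken from the egrep pattern above.

```python
# Stop words taken from the egrep pattern; matched as whole words, case-insensitively
STOP_WORDS = {"the", "and", "for", "of", "than", "too", "is", "to", "a", "from", "goes"}


def filter_rows(rows, stop_words=STOP_WORDS):
    """Drop (count, word) rows whose word is a stop word, then sort by count, highest first."""
    kept = [(n, w) for n, w in rows if w.lower() not in stop_words]
    return sorted(kept, key=lambda row: row[0], reverse=True)


# Made-up rows, only to show the call pattern
rows = [(1, "the"), (8, "SOPA"), (1, "Other"), (3, "PIPA")]
for n, w in filter_rows(rows):
    print(n, w)
```

With this approach "Other" survives the filter, while "the" and "The" are both removed.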