Digg.com Ubuntu popular headline analysis

I was curious what the most popular keywords were in the Ubuntu headlines, since some of them seemed identical.
So I saved the top 10 pages of results for the search term Ubuntu, sorted by Most Diggs.
With all of the pages in a directory, I cut out the headlines and stripped the HTML with the following command:

$ cat *.html|grep news-body|sed -e 's/<[^<>]*>//g' > diggubuntuheadlines.txt
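To see what the sed part is doing on its own, here is a minimal sketch. The sample HTML line is made up for illustration (I'm assuming a headline wrapped in a few tags, which may not match Digg's real markup), and strip_tags is just a name I picked for the wrapper.

```shell
# strip_tags: delete anything that looks like an HTML tag, i.e. a "<",
# then any run of characters that are not "<" or ">", then a ">".
strip_tags() {
  sed -e 's/<[^<>]*>//g'
}

# Hypothetical sample line, not copied from a real Digg page.
printf '<h3 class="news-body"><a href="/x">Ubuntu 7.04 released</a></h3>\n' | strip_tags
```

Only the text between the tags is left. Entities such as &amp; are not decoded, so they would pass through untouched.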

Now I have a list of all the headlines. Unfortunately, though, this also returns headlines from articles that just mention Ubuntu, so I killed the lines that didn’t contain the word Ubuntu.

$ grep -i ubuntu diggubuntuheadlines.txt > diggubuntuheadlines2.txt

Now I want to pull out a list of the unique words in the file with the number of occurrences of each word, sorted by the most occurrences descending. Thanks to this short Perl script posted by planetscape, I have a solution.

I paste the contents into a file, change the first line to read #!/usr/bin/perl, save it as countwords.pl, then chmod +x the file.

Next I pipe the contents of the file into the script, and save the output.

$ cat diggubuntuheadlines2.txt | ./countwords.pl > diggheadlinecount.txt
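If you don't have the Perl script handy, a similar count can be sketched with standard tools. This is not planetscape's script, just an approximation under a few assumptions of mine: words are lowercased, anything that isn't a letter counts as a separator, and count_words is a name I made up.

```shell
# count_words: read text on stdin, print "count word" lines,
# most frequent word first.
count_words() {
  tr '[:upper:]' '[:lower:]' |   # fold everything to lowercase
  tr -cs '[:alpha:]' '\n' |      # turn each run of non-letters into one newline
  grep -v '^$' |                 # drop the empty line a leading separator leaves
  sort | uniq -c |               # group identical words and count them
  sort -rn |                     # highest count first
  awk '{print $1, $2}'           # trim the padding uniq -c adds
}

# Usage (assumes the filtered headline file from earlier):
#   count_words < diggubuntuheadlines2.txt > diggheadlinecount.txt
```

Splitting on non-letters means a word like "doesn't" comes out as "doesn" and "t", and digits are dropped entirely.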

Well, I guess that’s enough foreplay. What’s the verdict?

117 ubuntu
25 to
22 linux
20 windows
19 a
14 in
14 dell
12 with
12 on
12 for
11 the
9 and
8 install
7 vista
7 of
7 how
6 your
6 you
6 from
5 released
5 pcs
5 out
5 new
5 is
5 guide
5 feisty
5 by
4 without
4 what
4 users
4 than
4 s
4 has
4 free
4 best
3 xp
3 video
3 ultimate
3 time
3 switching
3 should
3 running
3 run
3 over
3 os
3 official
3 mythtv
3 more
3 microsoft
3 media
3 logo
3 like
3 know
3 installing
3 get
3 fawn
3 fast
3 edition
3 edgy
3 dock
3 boot
3 based
3 as
3 anything
3 about
2 x
2 world
2 will
2 way
2 vs
2 vote
2 using
2 up
2 tutorial
2 top
2 this
2 there
2 t
2 support
2 studio
2 stickers
2 side
2 shuttleworth
2 review
2 read
2 powered
2 pic
2 pc
2 password
2 osx
2 online
2 one
2 officially
2 now
2 need
2 multimedia
2 mount
2 mce
2 mark
2 make
2 magazine
2 looks
2 look
2 laptop
2 it
2 installed
2 gifting
2 full
2 eye
2 ever
2 dual
2 distribution
2 desktop
2 days
2 core
2 completely
2 compiz
2 cheap
2 center
2 cd
2 candy
2 breezy
2 box
2 books
2 beryl
2 be
2 are
2 applications
2 almost
1 year
1 xps
1 xorg
1 xgl
1 write
1 writabable
1 wpics
1 would
1 working
1 wireless
1 winxp
1 wins
1 wine
1 why
1 whole
1 while
1 wga
1 wep
1 welcome
1 web
1 weapons
1 we
1 was
1 warranty
1 warcraft
1 want
1 wall
1 voted
1 vmware
1 virus
1 victorious
1 versus
1 validates
1 uses
1 user
1 useful
1 us
1 unmount
1 ui
1 ugly
1 tweaks
1 tweaking
1 tutorials
1 try
1 truth
1 triple
1 tricks
1 transparent
1 transform
1 today
1 tips
1 tier
1 thursday
1 thinks
1 things
1 their
1 ten
1 technical
1 tad
1 system
1 switches
1 switch
1 supported
1 super
1 sun
1 strip
1 story
1 still
1 sticker
1 steps
1 stable
1 squad
1 spread
1 spotted
1 spiffing
1 software
1 smoke
1 single
1 simple
1 shrink
1 shirt
1 shift
1 shell
1 server
1 searched
1 seamless
1 screwup
1 screenshots
1 screen
1 satanic
1 root
1 rom
1 rising
1 right
1 reviewit
1 repository
1 reported
1 release
1 redesign
1 really
1 readable
1 ran
1 ram
1 quietly
1 purchase
1 progress
1 products
1 preview
1 prettier
1 preinstalled
1 prebuilt
1 pre
1 posters
1 possibly
1 popularity
1 popular
1 pm
1 player
1 picture
1 physics
1 photoshop
1 performance
1 perfectly
1 partition
1 part
1 parliament
1 or
1 onto
1 office
1 offers
1 offering
1 ntfs
1 nrg
1 notebooks
1 not
1 non
1 next
1 network
1 n
1 mod
1 million
1 might
1 mdf
1 mcgee
1 mcdonalds
1 marketplace
1 manufacturers
1 makes
1 macbook
1 mac
1 looking
1 links
1 lifehacker
1 life
1 less
1 just
1 issue
1 iso
1 introducing
1 internet
1 interface
1 instlux
1 installer
1 installation
1 insane
1 inaccurate
1 impressed
1 immediately
1 images
1 image
1 if
1 i
1 hungry
1 howto
1 house
1 hours
1 hot
1 holy
1 hippo
1 heron
1 hell
1 hardy
1 happen
1 guy
1 gui
1 growing
1 great
1 gnu
1 gnome
1 glass
1 girl
1 getting
1 gets
1 genuine
1 fusion
1 french
1 forces
1 followup
1 fixed
1 first
1 firefox
1 finally
1 few
1 father
1 faster
1 fantastic
1 extended
1 explains
1 explained
1 expensive
1 expect
1 existing
1 excellent
1 exactly
1 everything
1 everyone
1 engine
1 embargo
1 eft
1 easyubuntu
1 easy
1 easier
1 dvddecrypter
1 dvd
1 dualview
1 drops
1 drivers
1 download
1 door
1 doesn
1 does
1 do
1 disturbing
1 distributing
1 dismissed
1 diggers
1 demo
1 debian
1 customs
1 customization
1 cst
1 cs
1 cracking
1 could
1 converts
1 controls
1 confirmed
1 conf
1 computers
1 complete
1 comparison
1 community
1 commercial
1 coming
1 com
1 colors
1 click
1 cleartext
1 cleaning
1 circle
1 choose
1 card
1 canonical
1 building
1 build
1 bug
1 booting
1 black
1 bittorrent
1 billboard
1 better
1 been
1 beautiful
1 basics
1 badger
1 awesome
1 award
1 available
1 at
1 artwork
1 arrives
1 arrived
1 april
1 apps
1 any
1 an
1 american
1 amd
1 amazing
1 alumni
1 after
1 advantages
1 administrator

No surprises here, but it may be helpful when you go to write your next Digg headline. :)

Until next time

-LightningCrash
