Sometimes I interview people on IRC, and one of the questions I ask is:
Have you ever heard of Austria? If you have, what are the first three words that come to your mind?
Because I have text files (IRC logs) of all these conversations, they are practically asking to be analysed, so we'll use some basic shell scripting to extract the relevant lines from the logs. Automating that is not trivial because the answers are in no way uniform. Instead, we are just going to grab the whole section and manually clean out the answers to the other questions.
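The "grab the whole section" step could be sketched roughly like this. It is only a sketch under assumptions: the function name extract_section, the logs/*.log location and the output file answers_raw.txt are made up for illustration, and it assumes a grep that supports the -A option (GNU and BSD grep do; it is not part of POSIX).

```shell
# extract_section N FILE...
# Print every line that mentions the Austria question, plus the N lines
# that follow it. N is a guess at how long an answer runs in the log.
extract_section() {
    lines_after=$1
    shift
    # -h: no file name prefix, -i: ignore case, -A: lines of trailing context
    grep -h -i -A "$lines_after" 'heard of Austria' "$@"
}

# Hypothetical usage (paths are assumptions, adjust to your own logs):
# extract_section 5 logs/*.log > answers_raw.txt
```

When there is more than one match, grep separates the groups with a line containing "--". Those separators, and the replies that belong to other questions, still have to be cleaned out by hand.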
Once we have a list of words which looks like this:
schnitzel AKG red, white Vienna German, Europe
we are just going to use tr, sort and uniq to do some basic text analysis:
cat answers.txt | tr -d '[:punct:]' | tr ' ' '\n' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn
We remove the punctuation, replace each space with a line break, convert everything to lowercase and sort alphabetically so that identical words end up next to each other. uniq -c then collapses the duplicates and prefixes each word with the number of times it occurred (its frequency), and in the last step sort -rn orders the result numerically by that count, highest first.
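To make the individual steps easier to follow, here is the same pipeline wrapped in a small function with one comment per stage. The function name word_freq and the sample input are made up for illustration; the commands themselves are the ones from the one-liner above.

```shell
# word_freq: read answers on stdin, print "count word" lines,
# most frequent word first.
word_freq() {
    tr -d '[:punct:]' |   # drop commas, periods and other punctuation
    tr ' ' '\n' |         # put every word on its own line
    tr 'A-Z' 'a-z' |      # fold case so "Vienna" and "vienna" match
    sort |                # group identical words next to each other
    uniq -c |             # collapse each group and prefix it with its count
    sort -rn              # order by that count, highest first
}

# Made-up sample input, just to try the function out:
printf 'Vienna, beer\nvienna schnitzel\nVienna\n' | word_freq
```

For this sample, vienna ends up on the first line with a count of 3. The amount of padding in front of the count depends on the uniq implementation.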
Sometimes you don't want to split the lines if there's a space between the words; use the following command in that case:
cat answers.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn
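A quick way to see what this variant does differently, assuming each answer sits on its own line. The function name phrase_freq and the sample input are hypothetical.

```shell
# phrase_freq: count whole lines instead of single words.
# Punctuation and spaces are left alone, so "red, white" stays one entry.
phrase_freq() {
    tr 'A-Z' 'a-z' |   # fold case
    sort |             # group identical lines
    uniq -c |          # count each distinct line
    sort -rn           # highest count first
}

# Made-up sample input:
printf 'Red, white\nred, white\nVienna\n' | phrase_freq
```

Here "red, white" is counted twice as a single phrase rather than being split into "red" and "white".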
The end result should look something like this:
10 vienna
6 europe
5 hitler
3 kangaroo
3 german
3 beer
3 australia
3 alps
2 wien
2 terminator
2 sydney
2 schnitzel
2 mozart
2 mountains
2 country
2 arnold
2 apfelstrudel
If that piques your interest, check out the following link for a more in-depth explanation: