# Notes for
#   Filtering with UNIX Filter Utilities
#   ====================================
# For:  talk at Oakland Perl Mongers Meeting
# Date: Tuesday 2002-12-10
# By:   George Woolley
#
# 0. Overview
# -----------
# Metaphors for UNIX filtering
#   building blocks
#   legos
#   limitations of metaphors
#
# Data Example:
#   64.51.123.2 - - [05/Mar/2002:13:20:44 -0800] "GET /george/resume/GW_resume.css HTTP/1.1" 200 1534
#
#   assuming blank as your delimiter:
#     1 = ip
#     6 = action
#     7 = url
#     9 = status
#
# Command Line Example:
#   cat *.log | grep " /robots.txt " | cut -d" " -f1 | sort | uniq > robots.list
#
#   Intent: To get a list of all the ip addresses
#   which accessed /robots.txt during the period covered by
#   the log files in the current directory.
#
# Some Thoughts:
#   regarding how to use
#     know your data.
#     you can add a lego, test, add another lego, etc.
#     cross check your conclusions.
#   regarding when to use
#     can be good for research before writing a perl filter.
#     can often find (out) what you want from the command line.
#     if it gets too complicated, it may be easier to write a program.
#     just because you can do it at the command line doesn't mean you have to.
#   regarding imperfection
#     it (i.e. the results) don't gotta be perfect.
#     it's not always critical to do it the "best" way.
#     you don't need to know all the possibilities to be effective.
#   regarding man pages
#     check the man page to recall an option.
#     check the man page to see if the option you want exists.
#     check the man page just for fun.
#
# Focus:
#   on Apache type log files, blank delimited
#   with some hints of other possibilities.
#   you need to use your imagination.
#   and know your data or be learning about it.
#
# Convention:
#   ... means imagine something useful going here
#     probably ending or starting with a pipe.
#   and when you see 200203.log
#     you can think ... too.
#
# 1. Getting Data to Start With
# -----------------------------
#   cat 200203.log
#   cat *.log
#   cat 20020[56].log
#   cat */*.log
#   cat */*/*/*.log      [looks weird, but can be very useful]
#
#   head 200203.log
#   tail 200203.log
#
#   cat north_american.html
#   ls
#   ps
#
#   cat 200203.log | grep " /robots.txt "
#   grep " /robots.txt " 200203.log
#
# 2. Including and Excluding Lines
# --------------------------------
# Including:
#   grep " /robots.txt " 200203.log
#   grep " 200 " 200203.log
#   grep " /maca" 200203.log
#
# Excluding:
#   grep -v " /robots.txt " 200203.log
#   ... egrep -v "\.jpg|\.jpeg|\.gif|\.png"
#
# Both:
#   ... grep " /maca" | grep -v "robots.txt "
#   ... grep " /maca" | grep -v "robots.txt " | egrep -v "\.jpg|\.jpeg|\.gif|\.png"
#
# Combining with Getting Data:
#   cat 200203.log | grep " /robots.txt "      # longer
#   grep " /robots.txt " 200203.log            # shorter
#
# 3. Changing Lines
# -----------------
#   ... cut -d" " -f1 ...      # to keep only the first field (ip)
#   ... cut -d" " -f7 ...      # to keep only the seventh field (url)
#   or write a perl filter [also for any other step]
#
# 4. Organizing and Summarizing Lines
# -----------------------------------
# Sort:
#   ... sort ...
#   ... sort -n ...
#
# Summarizing:
#   ... uniq
#   ... uniq -c
#   ... wc -l
#
# Some Combinations:
#   ... sort | uniq
#   ... sort | uniq -c
#
# 5. Toolbox
# ----------
# Example of Toolbox (mine, sort of):
# #####################################################################################
# # -- Commands ----------------------------------------------------------------------#
# #  are built mostly from Blocks, Connectors below                                   #
# #  modifier                          &                                              #
# #  short cuts                        !! !# various-aliases [# here means number]    #
# # -- Blocks ------------------------------------------------------------------------#
# #  getting started                   cat head tail more ls, ps                      #
# #  including and excluding lines     grep, egrep                                    #
# #  changing lines                    cut, perl filters                              #
# #  organizing and combining lines    sort, uniq, wc                                 #
# #  other                             more                                           #
# # ...... block subparts ............................................................#
# #  options                           -# (head, tail)                                #
# #                                    -f -v (grep, egrep)                            #
# #                                    -d and -f (for cut)                            #
# #                                    -n -r -t (sort), -c (uniq)                     #
# #  regular expressions               ^ $ . * [^-] |                                 #
# #  paths                             / * ? digits letters _ .                       #
# # -- Connectors --------------------------------------------------------------------#
# #  pipe                              |                                              #
# #  redirection                       > >>                                           #
# #  command connector                 ;                                              #
# #####################################################################################
#
# Some Thoughts about this Toolkit:
#   metaphor is useful for some
#     don't take too seriously
#   your toolkit would probably look different from above because
#     different data
#     different psyche
#     different experience
#   besides that's probably not an accurate representation of my toolkit
#
#   can repeat a type of block
#   can put in different orders
#   can make your own blocks
#
#   mixed metaphor
#     tool kit, blocks, etc.
#     who cares? is it helpful? is it revealing? is it fun?
#
# 6. Example
# ----------
# This whole file can be sourced.
# Since everything above is commented out,
# the effect will be to run the example below:
#
#####
# Function:
#   determine page hits in my domain
#   for whatever log entries are in .log files.
# Assumes:
#   alias ppl='perl -p -e'
# Limitations:
#   assumes robots and no one else will access robots.txt
#   assumes agents will access robots.txt*
#   assumes anyone using my portal is me
#   assumes can match anywhere within the entry
#   depends on specifics of my site
#     e.g. no frames, use of portal
#
#   * = turns out to be seriously untrue
#
# File Name Conventions for Lists
#   include the word list
#     X = exclusion
#     list = list (in alphabetic order unless otherwise indicated)
#   field included
#     D = top level directory
#     I = of isps
#     U = of urls
#   C = with count
#   N = in numeric order (by count)
# File Name Conventions for files of lines
#   page = page hit
#   lines= log lines
#   X = with exclusion lists applied
#   x = temporary file where x is a digit
#
################################################
#        1         2         3         4         5         6         7
#234567890123456789012345678901234567890123456789012345678901234567890

# 6a. aliases
alias cut1='cut -d" " -f1'
alias cutt2='cut -d" " -f2'     # i.e. cut -d"" -f2
alias cut7='cut -d" " -f7'
alias ppl='perl -p -e'
alias unc='sort | uniq -c'
alias unq='sort | uniq'
alias sorn='sort -r -n'

# 6b. construct exclusion lists
cat *.log | grep "robots.txt" | cut1 | sort | uniq > robot.XlistI
cat *.log | grep "portal" | cut1 | unq > local.XlistI
cat *.log | grep "resume" | cut1 | unc > resume.listIC
cat resume.listIC | grep " [1-9][0-9]" | cutt2 > resume.XlistI

# 6c. determine page hit lines
cat *.log | grep "GET " | grep "\.html " | grep " 200 " > page.lines
cat page.lines | ppl 's#/ #/index.html#' > page.lines1
cat page.lines1 | grep -v -frobot.XlistI | grep -v -flocal.XlistI > page.lines2
cat page.lines2 | grep -v -fresume.XlistI > page.linesX

# 6d. create url reports
cat page.linesX | cut7 | unc > page.listUC
cat page.listUC | sorn > page.listUCN
cat page.linesX | cut7 | ppl 's#^/(([^/]+)/)?.*#/$2#' | unc > page.listDC
cat page.listDC | sorn > page.listDCN
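
# 6e. trying the blocks on a tiny sample (a hedged sketch, not from the talk)
#   The "add a lego, test, add another lego" advice is easier to follow
#   on data small enough to check by eye, so this step builds a
#   four-line log and runs a few of the pipelines from the notes on it.
#   Assumptions: the log lines are made up (addresses come from the
#   reserved documentation ranges), sample.log and the demo_* names are
#   hypothetical, and mktemp and perl are assumed to be installed.
#   No aliases are used, so it should also run in a non-interactive shell.
demo_dir=$(mktemp -d)
cat > "$demo_dir/sample.log" <<'EOF'
192.0.2.10 - - [05/Mar/2002:13:20:44 -0800] "GET /robots.txt HTTP/1.1" 200 120
198.51.100.7 - - [05/Mar/2002:13:21:02 -0800] "GET /george/index.html HTTP/1.1" 200 1534
192.0.2.10 - - [05/Mar/2002:13:21:30 -0800] "GET /robots.txt HTTP/1.1" 200 120
203.0.113.5 - - [05/Mar/2002:13:22:10 -0800] "GET /missing.html HTTP/1.1" 404 210
EOF

# include lines, keep field 1 (ip), then sort and drop duplicates
demo_robots=$(grep " /robots.txt " "$demo_dir/sample.log" | cut -d" " -f1 | sort | uniq)
echo "$demo_robots"

# cross check: count the matching lines with a different block
demo_count=$(grep " /robots.txt " "$demo_dir/sample.log" | wc -l)
echo $demo_count

# summarize field 9 (status), most frequent first
demo_status=$(cut -d" " -f9 "$demo_dir/sample.log" | sort | uniq -c | sort -r -n)
echo "$demo_status"

# perl filter as a block: ips with at least 2 hits.
# uniq -c pads its counts with blanks on many systems, so this splits
# on whitespace (perl -a) instead of cutting on a single blank.
demo_heavy=$(cut -d" " -f1 "$demo_dir/sample.log" | sort | uniq -c | perl -lane 'print $F[1] if $F[0] >= 2')
echo "$demo_heavy"

rm -r "$demo_dir"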