I had to parse an access_log of a website in order to generate a sitemap. More precisely, a list of all URLs on the site. After playing around, I found a solution using sed, grep, sort, and uniq. The good thing is that each of these tools is available by default on most Linux distributions.
I had the access log file under access_log (if you have it under a different name/location, just substitute it in the following code). My first attempt parsed out all the URLs that were accessed by POST or GET and sorted the output.
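The lines in the log follow the usual Apache format, roughly like this (a made-up example; the exact fields depend on your server's configuration):

127.0.0.1 - - [10/Oct/2023:13:55:36 +0200] "GET /blog/article?id=42 HTTP/1.1" 200 2326 "-" "Mozilla/5.0"

The part I want is the request path sitting between the method and the protocol version, which is what the second capture group in the following command grabs: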
sed -r "s/.*(GET|POST) (.*?) HTTP.*/2/" access_log | sort
After doing so, it turned out that I didn’t need the query string (the part after the ‘?’ in the URL), and I could discard URLs consisting only of ‘/’. So I altered the code to be:
sed -r "s/.*(GET|POST|HEAD|PROPFIND) ([^?]*?)(?.*?)? HTTP.*/2/"
access_log | grep -v "^/$" | sort
This time I also took care of URLs accessed by methods other than just POST and GET, namely HEAD and PROPFIND.
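As an aside, if you are not sure which request methods actually show up in your own log, a quick awk one-liner can count them (this assumes the usual Apache layout where the quoted request method is the sixth whitespace-separated field, so the printed keys keep a leading double quote):

awk '{count[$6]++} END {for (m in count) print m, count[m]}' access_log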
After I got this list, I thought it would be nice to have all the duplicate URLs stripped out. A quick search turned up a nice command-line utility called uniq that does just that and is part of the coreutils package. Since uniq only collapses adjacent duplicate lines, the input has to be sorted first, which it already is here.
sed -r "s/.*(GET|POST|HEAD|PROPFIND) ([^?]*?)(?.*?)? HTTP.*/2/"
access_log | grep -v "^/$" | sort | uniq
So the final solution uses sed to extract the URL part that I wanted, grep to discard URLs consisting only of ‘/’, and sort and uniq to sort the results and remove duplicate lines.
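As a small variant, sort can drop the duplicates itself via its -u flag, so the uniq step can be folded away; this should produce the same list:

sed -r "s/.*(GET|POST|HEAD|PROPFIND) ([^?]*)(\?[^ ]*)? HTTP.*/\2/" access_log | grep -v "^/$" | sort -u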
It’s nice how one can integrate different command-line utilities to do this task in a one-liner.