I had to parse an access_log of a website in order to generate a sitemap. More precisely, a list of all URLs on the site. After playing around I found a solution using sed, grep, sort, and uniq. The good thing is that each of these tools is available by default on most Linux distributions.
I had the access log file under access_log (if you have it under a different name or location, just substitute it in the following code). My first attempt parsed out all the URLs that were accessed via POST or GET and sorted the output:
sed -r "s/.*(GET|POST) (.*?) HTTP.*/\2/" access_log | sort
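As a quick sanity check, here is what that sed command does to a sample log line in the common/combined log format (the sample line below is made up for illustration, not taken from the actual log):

```shell
# A made-up access_log line in combined log format; the fields are
# client IP, identity, user, timestamp, request, status, and size.
echo '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /about.html HTTP/1.1" 200 2326' \
  | sed -r "s/.*(GET|POST) (.*?) HTTP.*/\2/"
# prints: /about.html
```

The regex anchors on the request method and the trailing " HTTP", so only the URL between them survives as the replacement (`\2`).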
After doing so, it turned out that I didn't need the query string (the part after the '?' in the URL) and that I could discard URLs consisting only of '/'. So I altered the code to be:
sed -r "s/.*(GET|POST|HEAD|PROPFIND) ([^\?]*?)(\?.*?)? HTTP.*/\2/" \
    access_log | grep -v "^/$" | sort
This time I also took care of URLs accessed by methods other than just POST and GET (namely HEAD and PROPFIND).
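To see the two new filters in action, here is the refined pipeline run over a few hand-made sample lines (again invented for illustration): the query string is cut off and the bare "/" hit disappears.

```shell
# Made-up sample lines: one URL with a query string, one bare "/",
# and one HEAD request. The query string is stripped and "/" is dropped.
printf '%s\n' \
  '1.2.3.4 - - [10/Oct/2023:13:55:36 +0000] "GET /blog?page=2 HTTP/1.1" 200 512' \
  '1.2.3.4 - - [10/Oct/2023:13:55:37 +0000] "GET / HTTP/1.1" 200 100' \
  '1.2.3.4 - - [10/Oct/2023:13:55:38 +0000] "HEAD /feed.xml HTTP/1.1" 200 0' \
  | sed -r "s/.*(GET|POST|HEAD|PROPFIND) ([^\?]*?)(\?.*?)? HTTP.*/\2/" \
  | grep -v "^/$" | sort
# prints:
#   /blog
#   /feed.xml
```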
After I got this list, I thought it would be nice to have all the duplicate URLs stripped out. A quick search revealed that there is a nice command-line utility called uniq that does just that and is part of the GNU coreutils package:
sed -r "s/.*(GET|POST|HEAD|PROPFIND) ([^\?]*?)(\?.*?)? HTTP.*/\2/" \
    access_log | grep -v "^/$" | sort | uniq
So the final solution uses:
sed to extract the URL part that I wanted.
grep to discard URLs consisting of only '/'.
sort to order the results.
uniq to collapse each run of duplicate lines into a single line. (Plain uniq is what we want here; uniq -u would instead discard every URL that occurs more than once.)
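As an aside, sort can deduplicate on its own with its -u flag, so the trailing sort | uniq pair could be collapsed into a single step. A minimal sketch with toy input:

```shell
# sort -u sorts and drops duplicates in one pass, so these two
# pipelines produce the same deduplicated list.
printf '%s\n' /a /b /a /c | sort | uniq
printf '%s\n' /a /b /a /c | sort -u
# each prints /a, /b, /c (one per line)
```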
It’s nice how one can combine different command-line utilities to do this task in a one-liner.