Friday, March 17, 2006

UNIX : Cat every nth (third, fourth, fifth ...) line of a text file using AWK

I was surprised to discover that there's no standard utility in Unix to get every N-th line of a file. Or, rather, I was surprised to fail to find it, if there is such a thing. Many people don't realize that "tac" is the way to list out a file in reverse ("cat" spelled backwards, get it?) but it's there.

I have a very large file, upwards of millions of lines. I need to sample it, and since the size of the file varies (it might also be only tens of thousands of lines), N will change: I might take every 10th line, or every 100th.

Awk handles this quite nicely, actually:

awk '{if (count++%3==0) print $0;}' filename


Does the trick! Where '3' is your N, of course. If you dropped this into a shell script or something you could get creative and pass it in as a parameter. But it's short enough for my purposes that I don't mind writing it out.
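
For anyone who does want N as a parameter, here is a minimal sketch of how that might look. The function name every_nth is made up for this example; the awk program is the same one as above, with the 3 replaced by a variable passed in through awk's standard -v option. It assumes N is a positive integer (an N of 0 would make awk divide by zero).

```shell
# every_nth N [FILE...]: print lines 1, N+1, 2N+1, ... of the input.
# "every_nth" is a hypothetical name chosen for this sketch.
every_nth() {
    n=$1
    shift
    # -v hands the shell value to awk as the awk variable n.
    # With no FILE arguments, awk reads standard input.
    awk -v n="$n" '{ if (count++ % n == 0) print $0 }' "$@"
}
```

Usage would be something like `every_nth 100 filename`, or at the end of a pipeline with no filename at all.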


15 comments:

Anonymous said...

Thanks, Duane! I needed to do the same thing with sampling a file, found your blog on Google, and used your AWK command, and voila! I didn't waste time looking for the command for UNIX. :)

Anonymous said...

According to http://sed.sourceforge.net/sedfaq3.html, you can also use sed:
sed -n '2~5p' file
which prints every 5th line, starting with line 2.
One may be faster than the other, which could be interesting on large files.

Anonymous said...

I was googling to find an awk command to print every 4th line and wow, your one-liner works like a charm. Thank you for posting it!!!

Anonymous said...

Duane's brain is insane. Thanks for the useful awk post.

Andy said...

Also found this as the first hit on Google. Thanks. :)

KA6AH said...

Thanks, that works!

KA6AH said...

Thanks, that helps!

Chris Samuel said...

The example of:

awk '{if (count++%3==0) print $0;}'

actually prints out the 1st, 4th, 7th, etc. lines rather than lines 3, 6, 9, etc., because it's post-incrementing count.

To get it to print out lines 3, 6, 9, etc, you need to pre-increment count thus:

awk '{if (++count%3==0) print $0;}'

The alternative sed version is:

sed -n '3~3p'

They both have identical execution times on the system I'm playing with. You're more likely to find that the I/O performance of the input file has more bearing on the execution time than anything else. :-)

Hope that helps!

Chris

PS: Any reason why the HTML code and pre tags are banned?
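
To see the post-increment versus pre-increment difference side by side, a small sketch like this can be run on ten numbered lines. The function names post_increment and pre_increment are invented for the illustration; the awk programs themselves are the two quoted in the comment above.

```shell
# Post-increment: count is tested first and bumped afterwards,
# so the very first line (count == 0) already matches.
post_increment() { awk '{ if (count++ % 3 == 0) print $0 }' "$@"; }

# Pre-increment: count is bumped first and then tested,
# so the first match is on line 3.
pre_increment() { awk '{ if (++count % 3 == 0) print $0 }' "$@"; }

# Feed both the numbers 1 to 10, one per line.
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | post_increment   # prints 1, 4, 7, 10
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | pre_increment    # prints 3, 6, 9
```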

Anonymous said...

Thanks!

Ádám said...

Thanks a lot!

For example, I can print every 5th line after a grep match:

nawk 'c-->0;$0~s{if(b)for(c=b+1;c>1;c--)print r[(NR-c+1)%b];print;c=a}b{r[NR%b]=$0}' b=0 a=5 s="string" file1 | awk '{if (++count%5==0) print $0;}'

b=before, a=after

Anonymous said...

Thanks, this was very useful!!

Anonymous said...

Thanks, this helped me a lot.

Anonymous said...

Easiest way:
awk 'NR%3==1' file

Also the sed way:
sed -n '1~3p' file
only works with the GNU version of sed.

Gopi said...

sed -n '3~3p'

The above command is not working. It's saying Unrecognized command:3~3P

Can you please help me with this?
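
An error like that is most likely what a non-GNU sed prints, since the first~step address form is a GNU sed extension. A rough portable stand-in using plain awk might look like the sketch below. The function name step_lines is hypothetical, and the sketch assumes STEP is a positive integer.

```shell
# step_lines FIRST STEP [FILE...]: approximate GNU sed -n 'FIRST~STEPp'
# using only awk. "step_lines" is a made-up name for this sketch.
step_lines() {
    first=$1
    step=$2
    shift 2
    # NR is awk's built-in line number. A line is printed when it is
    # at or past FIRST and a whole number of STEPs away from it.
    awk -v first="$first" -v step="$step" \
        'NR >= first && (NR - first) % step == 0' "$@"
}
```

So `step_lines 3 3 file` should select the same lines as GNU `sed -n '3~3p' file`, that is, lines 3, 6, 9, and so on.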

Thomas Knudsen said...

Shorter version:

awk '0==NR%10' filename

Explanation:

awk programs are pattern-action pairs.

Here the pattern checks whether NR (awk's built-in record number indicator) is divisible by 10 (for printing every 10th record).

The action part is left out, activating the default action, "print".
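
As a quick illustration of that default action, the two hypothetical helper functions below should behave identically: the first relies on the implied print, the second spells the action out.

```shell
# Pattern only: with no action given, awk prints the matching record.
pattern_only() { awk '0 == NR % 10' "$@"; }

# Same pattern with the default action written out explicitly.
explicit_action() { awk 'NR % 10 == 0 { print $0 }' "$@"; }
```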