Text processing on the command line

Details: Category: General; Created: 05 October 2014; Last Updated: 22 April 2022

Today I found a nice article with the same title in c't 22/2014 (a German computer magazine). Its examples show how easily you can combine various command-line tools in Linux, following the Unix philosophy, to manipulate, split and join text files without any programming. Applications include processing CSV files and log files, word frequency analysis and much more. Recently I had a similar problem with text files, which I solved using most of the tools mentioned in the article. But I was surprised to read how much more powerful the tools become when you use their special parameters. It's a good idea to read the man pages of the tools. Below I have summarized all the tools together with their most powerful parameters.

cut

cut can be used to extract parts of lines.

cut -c5-

This deletes the first 4 characters of each line. Note that the arguments of -c (character) and -f (field) can be ranges. The trailing - includes all characters from the 5th to the end of the line in the output. Without the -, the output contains only the 5th character.
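The range forms can be tried on a single line. This is a small sketch; the sample text is made up for illustration.

```shell
# Sample line, invented for this example.
line='2014-10-05 boot ok'

from_fifth=$(printf '%s\n' "$line" | cut -c5-)    # character 5 to end of line
only_fifth=$(printf '%s\n' "$line" | cut -c5)     # just the 5th character
year=$(printf '%s\n' "$line" | cut -c-4)          # characters 1 to 4
month_day=$(printf '%s\n' "$line" | cut -c6-10)   # characters 6 to 10

printf '%s\n' "$from_fifth" "$only_fifth" "$year" "$month_day"
```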

cut -f3,4 -d' '

Extracts fields 3 and 4 from each line of a text file. The -d parameter defines the character that separates the fields, here a space (the default is a tab). If you use the CSV separator, you can quickly extract columns from a CSV file.
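For a CSV file this might look as follows. The data is hypothetical, and the sketch assumes simple CSV without quoted fields, because cut splits on every comma, including commas inside quotes.

```shell
# Hypothetical CSV content, used only for illustration.
csv='id,name,color
4711,VW,yellow
4712,Ferrari,red'

# Extract columns 2 and 3, using the comma as field delimiter.
name_color=$(printf '%s\n' "$csv" | cut -d',' -f2,3)
printf '%s\n' "$name_color"
```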

sort

To join the lines of text files, the files usually have to be sorted first. Helpful parameters are -b to ignore leading blanks, -i to ignore non-printable characters, -u to eliminate duplicate lines (uniq can also be used for this), -k to define the field to sort on and -t to define the field separator.
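A short sketch of -u, -t and -k on made-up data. LC_ALL=C is set here only so that the sort order does not depend on the locale of the machine.

```shell
# Hypothetical colon-separated records; the first one appears twice.
data='4712:Ferrari:red
4711:VW:yellow
4713:Fiat:red
4711:VW:yellow'

# -u drops the duplicate line while sorting.
unique=$(printf '%s\n' "$data" | LC_ALL=C sort -u)

# -t sets the separator, -k2,2 sorts on the second field only.
by_name=$(printf '%s\n' "$unique" | LC_ALL=C sort -t':' -k2,2)

# The n and r modifiers sort field 1 numerically in reverse order.
by_id_desc=$(printf '%s\n' "$unique" | LC_ALL=C sort -t':' -k1,1nr)

printf '%s\n\n%s\n\n%s\n' "$unique" "$by_name" "$by_id_desc"
```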

uniq

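This tool handles duplicates in text files. Because it only compares adjacent lines, the input usually has to be sorted first. Without parameters it collapses each run of identical lines into one line (like -u in sort). Use -c to count how often each line occurs, -d to print only the lines that occur more than once, or -u to print only the lines that occur exactly once.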

sort a.dat | uniq -c | sort -n

The command above creates a frequency list of words, provided each input line contains a single word.
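The variants of uniq can be compared side by side on a small word list. The words are invented for this sketch, and the final awk step is an addition that only strips the padding uniq -c puts in front of the counts.

```shell
# One word per line, invented for this example.
words='red
yellow
red
blue
red
yellow'

# uniq only compares adjacent lines, so sort first.
sorted=$(printf '%s\n' "$words" | LC_ALL=C sort)

collapsed=$(printf '%s\n' "$sorted" | uniq)     # one copy of every line
only_once=$(printf '%s\n' "$sorted" | uniq -u)  # lines that occur exactly once
repeated=$(printf '%s\n' "$sorted" | uniq -d)   # lines that occur more than once

# Frequency list, least frequent first; awk normalizes the spacing.
counts=$(printf '%s\n' "$sorted" | uniq -c | sort -n | awk '{print $1, $2}')

printf '%s\n' "$counts"
```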

paste

Combines the lines of multiple files, that is, the nth lines of all files are joined into one output line.

paste -d'-' a.dat b.dat

If a.dat contains the two lines one and two, and b.dat contains A and B, this returns

one-A
two-B

join

Joins the lines of two files. The files need a common key to join on, and the lines of both files have to be sorted by that key. sort accomplishes this.

If, for example, you have a.dat and b.dat

4711 VW
4712 Ferrari

and

4711 yellow
4712 red

the following command

join a.dat b.dat

returns a list in which the type and color of the cars are joined.

4711 VW yellow
4712 Ferrari red

Other useful parameters are -t to define the field separator, -i to ignore case when comparing keys and -e to replace missing fields with a given string. If the key is not the first field of a line, you have to use -1 and -2 to name the key field in the first and second file.
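The following sketch combines these parameters. The file names cars.dat and colors.dat and their contents are hypothetical; the key is field 1 in the first file and field 2 in the second. It additionally uses -a 1 to keep lines of the first file that have no partner and -o to fix the output fields, since -e fills in fields that are missing from that list.

```shell
tmp=$(mktemp -d)

# Hypothetical input files, both already sorted by their key field.
printf '4711;VW\n4712;Ferrari\n4713;Fiat\n' > "$tmp/cars.dat"
printf 'yellow;4711\nred;4712\n' > "$tmp/colors.dat"

# Key is field 1 in cars.dat and field 2 in colors.dat, separator is ';'.
joined=$(join -t';' -1 1 -2 2 "$tmp/cars.dat" "$tmp/colors.dat")

# -a 1 keeps unpaired lines of file 1, -o selects key, car type and color,
# -e fills the missing color with the string 'unknown'.
joined_all=$(join -t';' -1 1 -2 2 -a 1 -e 'unknown' -o 0,1.2,2.1 \
  "$tmp/cars.dat" "$tmp/colors.dat")

printf '%s\n\n%s\n' "$joined" "$joined_all"
rm -rf "$tmp"
```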

grep

Uses regular expressions to search for lines in a file and extract them. -v inverts the search and prints all lines NOT matching the regex. -i ignores case. -w matches whole words only.

grep "4711" a.dat

finds just the line containing VW
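The effect of -w, -i and -v can be seen on a slightly larger list. The extra entries are invented for this sketch.

```shell
# Hypothetical car list; note the key 14711 and the capitalized Red.
cars='4711 VW yellow
4712 Ferrari red
4713 Fiat Red
14711 Trabant blue'

exact=$(printf '%s\n' "$cars" | grep -w '4711')      # whole word, so 14711 does not match
any_red=$(printf '%s\n' "$cars" | grep -i 'red')     # matches red and Red
not_red=$(printf '%s\n' "$cars" | grep -v -i 'red')  # every line without red/Red

printf '%s\n\n%s\n\n%s\n' "$exact" "$any_red" "$not_red"
```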

yes

The purpose of this tool is to print y (or any string you pass to it) over and over until it is stopped. But it can also be used to create test data very easily.

yes "Hello world" | head -100

prints 100 lines with the text "Hello world". Redirect the output with > to create a file.
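Writing the test data to a file could be sketched like this. The file name test.dat is an arbitrary choice.

```shell
tmp=$(mktemp -d)

# Without an argument, yes prints the letter y.
first=$(yes | head -n 1)

# 100 identical lines of test data, written to a file.
yes 'Hello world' | head -n 100 > "$tmp/test.dat"

# Count the lines; tr removes padding that some wc versions add.
line_count=$(wc -l < "$tmp/test.dat" | tr -d ' ')

printf '%s %s\n' "$first" "$line_count"
rm -rf "$tmp"
```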

nl

This tool is usually used to prefix all lines with line numbers. But you can insert any string in front of the lines with a simple trick:

nl -s "echo " a.dat | cut -c7-

Use -s to append additional text after the line number, then delete the line number with cut. This can also be done with awk or sed.
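A minimal sketch of the trick next to a sed equivalent. The file names in the input are made up. It relies on the default number width of 6 characters in nl, and on input without empty lines, which nl does not number by default.

```shell
# Hypothetical input: one file name per line.
files='a.txt
b.txt'

# nl writes a 6-character number column followed by the -s string;
# cut then removes the number column again.
with_nl=$(printf '%s\n' "$files" | nl -s 'echo ' | cut -c7-)

# The same result with sed: insert the string at the start of each line.
with_sed=$(printf '%s\n' "$files" | sed 's/^/echo /')

printf '%s\n' "$with_nl"
```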

tr

If some characters should be deleted or replaced by other characters, tr is one choice. Use -d to delete characters. Character classes such as [:upper:], [:punct:], [:space:] etc. can be used. This can also be done with awk or sed.

tr -d '[:punct:]'

This command deletes all punctuation characters. The quotes keep the shell from interpreting the brackets as a file name pattern.

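tr -d '\r'

That's the way to convert Windows line endings (\r\n) to the Linux line end character (\n), similar to dos2unix. The other direction cannot be done with tr, because tr replaces each character with exactly one other character and cannot insert an extra \r; use unix2dos, sed or awk for that.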
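A round trip between both line ending styles might look like this. Since tr maps characters one to one, the sketch uses awk (an assumption, any tool that can append a character would do) to add the \r again. The file names are hypothetical.

```shell
tmp=$(mktemp -d)

# A small file with Windows line endings (\r\n).
printf 'one\r\ntwo\r\n' > "$tmp/dos.txt"

# Windows to Linux: delete every carriage return.
tr -d '\r' < "$tmp/dos.txt" > "$tmp/unix.txt"

# Linux to Windows: print each line followed by \r\n.
awk '{ printf "%s\r\n", $0 }' "$tmp/unix.txt" > "$tmp/dos_again.txt"

unix_bytes=$(wc -c < "$tmp/unix.txt" | tr -d ' ')
dos_bytes=$(wc -c < "$tmp/dos_again.txt" | tr -d ' ')

# cmp -s is silent and only reports via its exit status.
if cmp -s "$tmp/dos.txt" "$tmp/dos_again.txt"; then roundtrip=same; else roundtrip=differ; fi

printf '%s %s %s\n' "$unix_bytes" "$dos_bytes" "$roundtrip"
rm -rf "$tmp"
```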

tr -s ' '

Squeezes multiple consecutive space characters into one space character.
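This is handy in front of cut, which treats every single space as a field separator. A small sketch with an invented line:

```shell
# Columns padded with several spaces, invented for this example.
line='4711    VW      yellow'

squeezed=$(printf '%s\n' "$line" | tr -s ' ')

# Without squeezing, field 2 is the empty string between the first two spaces.
raw_second=$(printf '%s\n' "$line" | cut -d' ' -f2)

# After squeezing, field 2 is the second column.
second=$(printf '%s\n' "$line" | tr -s ' ' | cut -d' ' -f2)

printf '%s\n%s\n' "$squeezed" "$second"
```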

fold

Wraps lines after a given number of characters (set with -w, 80 by default). With -s the line is broken at a space instead of in the middle of a word.
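The difference between a plain fold and fold -s shows up with a narrow width. The sample text is made up, and the sed step is an addition that removes the trailing space fold -s leaves at each break.

```shell
# Four words of four letters each, invented for this example.
text='aaaa bbbb cccc dddd'

# Plain fold cuts after exactly 7 characters, even inside a word.
hard=$(printf '%s\n' "$text" | fold -w 7)

# fold -s breaks after the last space that fits into the width.
soft=$(printf '%s\n' "$text" | fold -s -w 7 | sed 's/ *$//')

printf '%s\n\n%s\n' "$hard" "$soft"
```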

fmt

Formats lines and paragraphs much more intelligently than fold: it breaks lines only between words and refills short lines within a paragraph.

fmt -0 a.dat

This way each line is split into lines containing only one word.

The c't article describes a nice one-liner which uses most of the tools mentioned above. The same task can also be solved with a program. But why should you write a program when there are well-tested Linux tools available which you can combine to achieve the task much faster?

fmt -0 artikel.txt | tr -d '[:punct:]' | grep -w -i -v -f stopwords.txt | sort | uniq -c | sort -n

This command extracts all words from the file artikel.txt which are not listed in the stop word list and calculates the frequency of each word. Just paste the result into wordle.net and you get a nice word cloud of the article.
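To try the pipeline without the original article, it can be run on a tiny sample. The contents of artikel.txt and stopwords.txt below are invented. As a deliberate substitution, the sketch splits the text into words with tr instead of fmt -0, and the final awk step only strips the padding in front of the counts.

```shell
tmp=$(mktemp -d)

# Invented sample text and stop word list (one word per line, no empty lines).
printf 'The cat and the dog. The dog and the bird!\n' > "$tmp/artikel.txt"
printf 'the\nand\n' > "$tmp/stopwords.txt"

# One word per line, punctuation removed, stop words dropped,
# then counted and sorted by frequency.
freq=$(tr -s '[:space:]' '\n' < "$tmp/artikel.txt" \
  | tr -d '[:punct:]' \
  | grep -w -i -v -f "$tmp/stopwords.txt" \
  | LC_ALL=C sort | uniq -c | sort -n \
  | awk '{print $1, $2}')

printf '%s\n' "$freq"
rm -rf "$tmp"
```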