Text Manipulation Cheat Sheet


26 Jun 2017


14 Feb 2019

I recently had the "opportunity" to perform some text manipulation to get data from a huge log file into a spreadsheet. I had done this sort of work a long time ago as a developer, but had forgotten over the years. So I decided to compile a text manipulation cheat-sheet for macOS.

Text Manipulation Utilities

First up, a few of the basic shell utilities: cat, head and tail:

cat filePrint the contents of a plain text file called file.
head -n5 filePrints the first five lines from the file.
tail -n+1 filePrints the last n files from the end of the file, where +1 means "relative to the first line" i.e. to print the last lines of the file starting from the second line.

which prints out the contents of a plain text file.

The shell in macOS is called "bash" and is access via the Terminal App. If you are confused by this statement, then please stop reading!

Next, grep which prints out lines of text matching a given text. grep can find lines that match a "pattern" called a regular expression (regex).

If you are not familiar with regex, there are many on-line tools and tutorials to get you started, e.g. Apple's Regular Expressions Unfettered guide, regular-expressions.info, regexr.com and regex101.com.

grep -A1 text filePrints the lines after the line containing text found in file.
grep -B1 text filePrints the lines before the line containing text.
grep -v text filePrints the lines not containingtext.
grep -i text fileIgnores case when finding text.
grep "^text" filePrints lines starting withtext found in file.
grep "^text\|^test" filePrints lines starting withtextortest.
grep "te.t" filePrints lines matching te followed by any single character (this is what the . means), followed by a t, e.g. it will match text and test, but not tet or tenant.
grep "te.*t" filePrints lines matching te followed by zero or more instances of any character (this is what the * means), followed by a t, e.g. it will match text, test, tet, tenant.
grep -Eo \d{2,4} fileExtended regular expression -E and print only matching parts -o, in this case, print out only two, three or four digit numbers matched.

When using -A and -B, a line with two dashes -- will be used to separate groups of matches.

Next, cut which simply cuts out a portion of the line.

cut -c5-10 filePrint the text cut from the 5th to the 10th character inclusive from file.
cut -f1,3 fileFor tab-delimited content, print the first and third fields.
cut -d" " -f2-5 fileFor space-delimited content, print the 2nd, 3rd, 4th and 5th fields.

Plus, tr which "translates" characters read from standard input (i.e. not from a file like the commands above):

tr " " "\t"Replaces every space character with a tab character
tr "\n" ","Replaces newlines (\n) with a comma, concatenating lines in the process.
tr -s " "Replaces multiple spaces with a single space character.
tr -s " " "\t"Replace multiple spaces with a single tab character.
tr -d " "Deletes all spaces.

Another useful command is pbcopy which copies the content of standard input to the clipboard, ready to paste your data into a word processors or spreadsheet for further analysis (pbpaste prints out the content of the clipboard).

tr and pbcopy read from "standard input" which means they take text typed in from the terminal, instead of reading from a file. That is, tr does not read from a file, instead you'll have to "pipe" (that's this symbol |) input to it, e.g. echo hello world | tr h H would replace all occurrences of lower case h with uppercase.

And finally, there is sort, which sorts text or numbers on separate rows i.e. separated by a newline \n):

sort fileSorts text in ascending order, A-Z then a-z
sort -r fileReverses the sort -r and ignores case -f
sort -n fileSorts numbers, ascending (smallest to largest)
sort -rn fileSorts numbers, descending (largest to smallest)

What if your data is delimited by a space and not a new line? Well, start combining commands above, e.g. echo "d a c b" | tr " " "\n" | sort.

Stringing Them Together

By stringing various commands together, you can easily manipulate the text output e.g. to extract columns, or transform text. I'm going to be reading from file only once using the cat file command to print the file, and then pass the printed output from one command to the next via what is called a "pipe" (|). You can read more about Pipes and Redirection on Apple's Shell Input and Output guide.

In this example, I have a text file called iostat.log containing the output of a couple of runs of the iostat (to get I/O statistics):

              disk0               disk2         cpu    load average
    KB/t  tps  MB/s     KB/t  tps  MB/s    us sy id   1m   5m   15m
   54.20    4  0.25     9.08    0  0.00    14  8 78  3.20 3.05 3.06
              disk0               disk2         cpu    load average
    KB/t  tps  MB/s     KB/t  tps  MB/s    us sy id   1m   5m   15m
   54.20    4  0.22     9.08    0  0.00    14  8 78  3.20 3.05 3.06

So, if I just want the third and fourth column (MB/s and KB/t):

cat iostat.log | tr -s " " "\t" | cut -f4,5

Note the output has a tab character between the columns (which does not get displayed nicely in HTML), which is great for copy-and-pasting into a spreadsheet:

cpu     load
MB/s    KB/t
0.25    9.08
cpu     load
MB/s    KB/t
0.22    9.08

Assuming I had lots of other lines in the file, and I just wanted the data lines after the "header" starting with "MB/s" and "KB/t":

cat iostat.log | grep -A1 "MB/s"

But now the output contains both the line with "MB/s" and the delimiter "--". So, to exclude these lines and combining the earlier command to just get the third and fourth columns nicely separated with tabs:

cat iostat.log | grep -A1 "MB/s" | grep -v "MB/s\|--" | tr -s " " "\t" | cut -f4,5

Which prints out:

0.22    9.08
0.22    9.08

Now, I could save this output to a file called results.txt using "redirection" (>) which you can read more about on the Apple Pipes and Redirection guide:

  • To write to a file called results.txt (over-writing each time), cat iostat.log | grep -A1 "MB/s" | grep -v "MB/s\|--" | tr -s " " "\t" | cut -f4,5 > results.text
  • Or, to keep appending to the same file use >> instead, cat iostat.log | grep -A1 "MB/s" | grep -v "MB/s\|--" | tr -s " " "\t" | cut -f4,5 >> results.text

However, since I want to copy the output data to a spreadsheet, I use pbcopy instead. I find it easier and faster to then swipe/tab over to my spreadsheet, and paste (Command-V) the data! Compare this to the many steps to import a tet file into a spreadsheet, and then still land up copy-and-pasting to an existing spreadsheet. So:

cat iostat.log | grep -A1 "MB/s" | grep -v "MB/s\|--" | tr -s " " "\t" | cut -f4,5 | pbcopy

There you have it - a cheat-sheet for text manipulation, which hopefully saves someone a lot of Googling. My intent is not to cover every possibility, or provide really optimized all-in-one commands, but rather to keep it simple so that anyone can string together the commands to get the output required.

It's better to have a solution that works as expected and is clear, even if it's less than optimal.

Updated 14 Feb 2019: Added a few more examples and sort.