I sometimes need to search a huge CSV file (13 MB), containing 21,000 rows and roundabout 40 columns, which Numbers takes half a minute to open. All I need to do is search and display the results of a few columns only... so I wrote a bash script to do this for me - and I was quite specific in that I wanted the results in color!

My file is a simple CSV file, using a comma (,) delimiter. It does not have commas in the data that require quoting. However, some text is quoted. The main data field in my case was the second column. Take the sample below:

id,name,category,cost,unit,quantity
0,"Blue widget",parts,1.00,box,1000
1,"Green widget",parts,1.02,box,1000
2,"Green sprocket",parts,2.00,box,1000
The script

Without further ado, this is my script:

#!/bin/bash
[[ -z $1 ]] && echo $0 include_term [exclude_term] && exit
file=$(ls -r1 *.csv | head -n1)
if [[ -z $2 ]]; then echo -e "Searching \033[31m$file\033[0m for \033[34m$1\033[0m\n"
else  echo -e "Searching \033[31m$file\033[0m for \033[34m$1\033[0m excluding \033[36m$2\033[0m\n"; fi
while read -r line; do
 IFS=',' read -ra part <<< "${line//\"/}"
 if [[ -z $2 ]]; then echo ${part[1]} | grep -ih --color "$1"
 else echo ${part[1]} | grep -iv "$2" | grep -ih --color "$1"
 fi
 [[ $? -eq 0 ]] && echo -e "  \033[34m${part[3]}\033[0m per ${part[4]}\n"
done < <(grep -ih "$1" "$file")

I saved it as searchcsv.sh and then in Terminal, changed the permissions to make to executable, chmod +x searchcsv.sh.

How it works

I thought I'd explain it line by line, since it looks like too much gobbledygook:

[[ -z $1 ]] && echo $0 include_term [exclude_term] && exit

Checks for one search term - a regular expression pattern, and one optional exclusion term, another regular expression. If not present, displays the instruction and then exits. The funny && syntax is just my alternative to an "if" block: I check exit codes instead. It's functionally equivalent to if [[ -z $1 ]]; then echo $0 include_term [exclude_term]; exit; fi

file=$(ls -r1 *.csv | head -n1)

Gets a list of all CSV files, sorted by most recent, and stores the filename of the first CSV in a variable called file.

if [[ -z $2 ]]; then 
 echo -e "Searching \033[31m$file\033[0m for \033[34m$1\033[0m\n"
else
 echo -e "Searching \033[31m$file\033[0m for \033[34m$1\033[0m excluding \033[36m$2\033[0m\n";
fi

A block to display what the script is searching for (either without an regular expression to exclude, or with). The escape codes starting with \033 set the font color.

while read -r line; do
 ...
done < <(grep -ih "$1" "$file")

This loop uses grep to find lines in the CSV file that match the given regular expression (-i case insensitive and -h without printing the filename). Each line is read into the variable line.

 IFS=',' read -ra part <<< "${line//\"/}"

There's no simple 100% reliable way to parse a complex CSV, but since mine is only comma-delimited, I'm fine to split the line into an array called part, separated with the IFS variable denoting the delimiter. Now I can retrieve the first item in the array with $part[0], use 1 for the second item, and so on. I also strip out the quotes from the line using ${line//\"/}. The <<< operator functions much like the pipe but does not create a subshell just to pass the variable.

 if [[ -z $2 ]]; then 
  echo ${part[1]} | grep -ih --color "$1"
 else
  echo ${part[1]} | grep -iv "$2" | grep -ih --color "$1"
 fi

Here, if there is no regular expression of something to exclude, then I repeat the grep on the second array item (which is very wasteful I know) but I simply wanted to print the results in --color! The else is where I do yet another grep but this time -iv is a case insensitive "invert-match" (i.e. selected lines are those not matching the pattern).

 [[ $? -eq 0 ]] && echo -e "  \033[34m${part[3]}\033[0m per ${part[4]}\n"

Finally, I check the exit code of the last grep with $?. If its zero, then I did find a match, so I print the results. Again, in color.

An astute reader might ask why the original grep at the while loop was not sufficient to find the data I wanted? Well, it's that the grep in the loop would've found matches in all columns, but I only wanted to match in the second column, i.e. ${part[1]}.

The alternative - reading each line and splitting into parts before doing the grep - is awfully slow. Shell is not meant to parse an entire file in an array variable.

Conclusion

So to search, I simply enter ./searchcsv.sh "blue" "sprocket".

I could also use what is effectively an AND in the first part with .*, e.g. "blue.*widget", or an OR with \ e.g. "blue\|widget"

Sample searchcsv.sh script output

Related Posts

Newer Post Older Post