I sometimes need to search a huge CSV file (13 MB), containing 21,000 rows and roundabout 40 columns, which Numbers takes half a minute to open. All I need to do is search and display the results of a few columns only... so I wrote a bash script to do this for me - and I was quite specific in that I wanted the results in color!
My file is a simple CSV file, using a comma (,
) delimiter. It does not have commas in the data that require quoting. However, some text is quoted. The main data field in my case was the second column. Take the sample below:
id,name,category,cost,unit,quantity
0,"Blue widget",parts,1.00,box,1000
1,"Green widget",parts,1.02,box,1000
2,"Green sprocket",parts,2.00,box,1000
The script
Without further ado, this is my script:
#!/bin/bash
[[ -z $1 ]] && echo $0 include_term [exclude_term] && exit
file=$(ls -r1 *.csv | head -n1)
if [[ -z $2 ]]; then echo -e "Searching \033[31m$file\033[0m for \033[34m$1\033[0m\n"
else echo -e "Searching \033[31m$file\033[0m for \033[34m$1\033[0m excluding \033[36m$2\033[0m\n"; fi
while read -r line; do
IFS=',' read -ra part <<< "${line//\"/}"
if [[ -z $2 ]]; then echo ${part[1]} | grep -ih --color "$1"
else echo ${part[1]} | grep -iv "$2" | grep -ih --color "$1"
fi
[[ $? -eq 0 ]] && echo -e " \033[34m${part[3]}\033[0m per ${part[4]}\n"
done < <(grep -ih "$1" "$file")
I saved it as searchcsv.sh
and then in Terminal, changed the permissions to make to executable, chmod +x searchcsv.sh
.
How it works
I thought I'd explain it line by line, since it looks like too much gobbledygook:
[[ -z $1 ]] && echo $0 include_term [exclude_term] && exit
Checks for one search term - a regular expression pattern, and one optional exclusion term, another regular expression. If not present, displays the instruction and then exits. The funny &&
syntax is just my alternative to an "if" block: I check exit codes instead. It's functionally equivalent to if [[ -z $1 ]]; then echo $0 include_term [exclude_term]; exit; fi
file=$(ls -r1 *.csv | head -n1)
Gets a list of all CSV files, sorted by most recent, and stores the filename of the first CSV in a variable called file
.
if [[ -z $2 ]]; then
echo -e "Searching \033[31m$file\033[0m for \033[34m$1\033[0m\n"
else
echo -e "Searching \033[31m$file\033[0m for \033[34m$1\033[0m excluding \033[36m$2\033[0m\n";
fi
A block to display what the script is searching for (either without an regular expression to exclude, or with). The escape codes starting with \033
set the font color.
while read -r line; do
...
done < <(grep -ih "$1" "$file")
This loop uses grep
to find lines in the CSV file that match the given regular expression (-i
case insensitive and -h
without printing the filename). Each line is read into the variable line
.
IFS=',' read -ra part <<< "${line//\"/}"
There's no simple 100% reliable way to parse a complex CSV, but since mine is only comma-delimited, I'm fine to split the line into an array called part
, separated with the IFS
variable denoting the delimiter. Now I can retrieve the first item in the array with $part[0]
, use 1
for the second item, and so on. I also strip out the quotes from the line using ${line//\"/}
. The <<<
operator functions much like the pipe but does not create a subshell just to pass the variable.
if [[ -z $2 ]]; then
echo ${part[1]} | grep -ih --color "$1"
else
echo ${part[1]} | grep -iv "$2" | grep -ih --color "$1"
fi
Here, if there is no regular expression of something to exclude, then I repeat the grep
on the second array item (which is very wasteful I know) but I simply wanted to print the results in --color
! The else is where I do yet another grep
but this time -iv
is a case insensitive "invert-match" (i.e. selected lines are those not matching the pattern).
[[ $? -eq 0 ]] && echo -e " \033[34m${part[3]}\033[0m per ${part[4]}\n"
Finally, I check the exit code of the last grep
with $?
. If its zero, then I did find a match, so I print the results. Again, in color.
An astute reader might ask why the original grep
at the while
loop was not sufficient to find the data I wanted? Well, it's that the grep
in the loop would've found matches in all columns, but I only wanted to match in the second column, i.e. ${part[1]}
.
The alternative - reading each line and splitting into parts before doing the grep
- is awfully slow. Shell is not meant to parse an entire file in an array variable.
Conclusion
So to search, I simply enter ./searchcsv.sh "blue" "sprocket"
.
I could also use what is effectively an AND in the first part with .*
, e.g. "blue.*widget"
, or an OR with \
e.g. "blue\|widget"