My heart sinks whenever I witness people resorting to Excel to compare large files. It's more or less OK when the files to be compared are below 10K rows... anything bigger than that and the time it takes to select the rows to compare (and the comparison itself) becomes too much of a frustration.
The alternative is to use the command line of course.
Here is how to compare two CSV files, file1.csv and file2.csv, column by column, on a Linux operating system.
step 1. copy all lines containing the search string from the first input file
grep 'searchString' file1.csv > f1.csv
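As a side note, grep treats the pattern as a regular expression by default. If searchString contains characters like dots that should be taken literally, the -F flag matches it as a plain string, and -i makes the match case-insensitive; two variants of step 1 using these standard grep options:
grep -F 'searchString' file1.csv > f1.csv
grep -i 'searchString' file1.csv > f1.csv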
step 2. extract the relevant columns (here columns 3,4 and 5)
cut -d',' -f3,4,5 f1.csv > cols_f1.csv
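Two things worth knowing about cut: a contiguous list of columns can also be written as a range (-f3-5), and cut does a naive split on every comma, so quoted fields that contain embedded commas will be split incorrectly. The range form of step 2:
cut -d',' -f3-5 f1.csv > cols_f1.csv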
step 3. sort on the 2nd column (for example)
sort -k2 -t"," cols_f1.csv > sorted_f1.csv
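Note that -k2 tells sort to start the key at field 2 and run it to the end of the line. To sort on field 2 only, close the key with -k2,2, and append n if the field is numeric; both are standard sort options:
sort -t"," -k2,2 cols_f1.csv > sorted_f1.csv
sort -t"," -k2,2n cols_f1.csv > sorted_f1.csv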
step 4. remove duplicates
uniq sorted_f1.csv > uniq_f1.csv
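The sort in step 3 is what makes step 4 work: uniq only removes adjacent duplicate lines, so on unsorted input it would keep repeats that are not next to each other. When it is useful to see how many times each line occurred, uniq -c prefixes every distinct line with its count:
uniq -c sorted_f1.csv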
Quite a bit of typing here... and that's only to extract the columns from the first file. Fortunately, all of these commands can be piped together.
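Steps 1 to 4 for the first file become a single line:
grep 'searchString' file1.csv | cut -d',' -f3,4,5 | sort -k2 -t"," | uniq > uniq_f1.csv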
And the same pipeline for the second file:
grep 'searchString' file2.csv | cut -d',' -f3,4,5 | sort -k2 -t"," | uniq > uniq_f2.csv
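Optionally, a quick sanity check before comparing: wc -l shows how many distinct rows each side ended up with.
wc -l uniq_f1.csv uniq_f2.csv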
And finally, the last step: show all lines unique to file1, all lines unique to file2 and all lines common to both files, arranged in a 3-column output.
comm uniq_f1.csv uniq_f2.csv
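comm prints three columns: lines only in the first file, lines only in the second file, and lines common to both. Individual columns can be suppressed with -1, -2 and -3, so for example:
comm -12 uniq_f1.csv uniq_f2.csv
comm -3 uniq_f1.csv uniq_f2.csv
The first prints only the lines present in both files, the second only the lines that differ. One caveat: comm expects its inputs to be sorted in plain lexical order, and step 3 sorted on field 2 rather than the whole line, so comm may warn about unsorted input; re-sorting the files with a plain sort (or, with GNU comm, passing --nocheck-order) gets around that.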