Checking validity of input data
===============================

When analysing large amounts of data we are constantly confronted with a
question: are the input and intermediate data in the right form, and do they
conform to the input requirements of the processing programs? Deviations from
the correct input can already be present in the input data, or can be caused
by mistakes in the processing pipeline. Such deviations can certainly distort
computation results and lead to incorrect conclusions. Here are three simple
methods that let you check, and in one case even validate, the inputs and
outputs of your programs. The first three examples apply to data in tabular
form, the form that is most naturally obtained from GNU/Linux and Unix tools
and the one which we used in our experiments. An example of such a file is
given below:

saulius@varanas 3rd-assignment/ $ head -n 4 frequency.lst
860 10.1038/ncomms15123
364 10.1016/j.str.2016.06.010
152 10.1016/j.tube.2014.12.003
86 10.1021/jm200642w

Method 1. Count the columns
---------------------------

Count the number of columns in each of your data files. The number of columns
must be the same in all lines, except maybe in the file header or in comments
(which you can easily filter out with 'grep'). The number of columns must
also be suitable for the next processing program. The number of columns is
easily obtained by printing out the 'awk' NF ("Number of Fields") variable:

saulius@varanas 3rd-assignment/ $ awk '{print NF}' frequency.lst | uniq -c | head
27618 2
137709 4
30364 2

As you see, in this case some lines (in fact, a lot – 137709 of them!) have
four columns, which is already suspicious. You can print out such lines with
'awk':

saulius@varanas 3rd-assignment/ $ awk '{if( NF > 2 ) print NR, $0}' frequency.lst | head -2
27619 1 ==> ./outputs/downloads/pdb/9x/9xim.biblst <==
27620 1 ==> ./outputs/downloads/pdb/9x/9xia.biblst <==

The NR variable holds the current line number, so you also obtain the
positions of the "strange" lines in the file.

As a historical digression we can note that this method is reminiscent of the
one used by the Masoretes to count the letters in each line of hand-written
manuscript copies
(https://en.wikipedia.org/wiki/Masoretic_Text#Numerical_Masorah), so that
various copying mistakes could be detected. I use this method every time I
get data tables, and can highly recommend it to you.

Method 2. Check a random sub-sample of your data
------------------------------------------------

Inspecting the head and the tail of your (large) data tables is a good habit,
but what if the faulty lines are in the middle? In this case, inspecting a
random sample of your data lines may help, especially if the faulty lines are
not too rare. The 'shuf' GNU tool will give you such a sample:

saulius@varanas 3rd-assignment/ $ shuf frequency.lst | head -n 4
1 ==> ./outputs/downloads/pdb/4h/4hr4.biblst <==
1 10.1016/J.STR.2005.07.025
1 ==> ./outputs/downloads/pdb/2f/2fsg.biblst <==
2 10.1110/PS.03518104

As you see, the problematic lines show up immediately. Of course you will not
find them that easily if there are just one or two such lines among hundreds
of thousands. To increase the probability of detecting faulty lines, you can
run the 'shuf ...' pipeline several times.

Method 3. Check lines using regular expressions
-----------------------------------------------

Since your data must follow a certain syntax, it is nearly always possible to
write a regular expression that matches *correct* lines.
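In our frequency table, for instance, a correct line is an occurrence count
followed by a DOI, and every DOI starts with the prefix '10.', a numeric
registrant code and a slash. One way to read the pattern used in the
transcripts below is piece by piece (the commented layout is for explanation
only; 'grep -P' expects the pieces joined into the single pattern
'^\s*[0-9]+\s+10\.[0-9]+/'):

^\s*      optional leading whitespace
[0-9]+    the occurrence count
\s+       the separator between the count and the DOI
10\.      the '10.' prefix that starts every DOI
[0-9]+    the numeric DOI registrant code
/         the slash that separates the DOI prefix from its suffix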
Then you can invert your selection to get the incorrectly formatted lines,
and you can select the correct lines for further processing, all with 'grep'.
Perl Compatible Regular Expressions (PCRE) are worth considering because of
their power and ease of use. They are supported by the 'perl' and GNU
'grep -P' commands:

saulius@varanas 3-homework-assignment/ $ head -2 frequency.lst
860 10.1038/ncomms15123
364 10.1016/j.str.2016.06.010

saulius@varanas 3-homework-assignment/ $ grep -P '^\s*[0-9]+\s+10\.[0-9]+/' frequency.lst | head -2
860 10.1038/ncomms15123
364 10.1016/j.str.2016.06.010

saulius@varanas 3-homework-assignment/ $ grep -v -P '^\s*[0-9]+\s+10\.[0-9]+/' frequency.lst | head -2
1 ==> ./outputs/downloads/pdb/9x/9xim.biblst <==
1 ==> ./outputs/downloads/pdb/9x/9xia.biblst <==

saulius@varanas 3-homework-assignment/ $ grep -v -P '^\s*[0-9]+\s+10\.[0-9]+/' frequency.lst | tail -2
1 ==> ./outputs/downloads/pdb/10/100d.biblst <==
1 10.1126

saulius@varanas 3-homework-assignment/ $ grep -v -P '^\s*[0-9]+\s+10\.[0-9]+/' frequency.lst | grep -v == | tail -2
1 10.1126

With this check we see that there is a line with a wrongly formatted DOI,
present just once in the whole file. Thus regular expressions, although they
take some time and ingenuity to compose, allow you to filter your data very
thoroughly.

We can use 'find' to figure out where the misformatted DOI line comes from:

saulius@varanas 01-darbas/ $ find ~/GNU-type-OS/data/rsync-demo/saulius-grazulis.lt/outputs/ -name '*.biblst' | xargs grep -lP '10.1126(\s|$)'
/home/saulius/GNU-type-OS/data/rsync-demo/saulius-grazulis.lt/outputs/downloads/pdb/1u/1u04.biblst

Fetching the record from the PDB shows that the misformatted data item indeed
comes from the PDB itself:

saulius@varanas 01-darbas/ $ curl -sSL https://www.pdb.org/pdb/files/1u04.cif | grep _DOI
_citation.pdbx_database_id_DOI   10.1126

Method 3 expanded: check data against a schema
----------------------------------------------

Regular expressions are just the simplest form of grammars that allow you to
check whether your data conform to some specific syntax. They work for any
data in textual form; for example, FASTA or PDB files can also be validated
using regexps. For more structured formats, like XML, CIF or JSON, more
elaborate checks exist:

a) for XML files, you can use an XML Schema to check the data:

saulius@varanas 01-darbas/ $ curl -sSL https://www.pdb.org/pdb/files/1u04.xml > 1u04.xml
saulius@varanas 01-darbas/ $ grep schemaLocation 1u04.xml
xsi:schemaLocation="http://pdbml.pdb.org/schema/pdbx-v50.xsd pdbx-v50.xsd">
saulius@varanas 01-darbas/ $ xmllint --schema http://pdbml.pdb.org/schema/pdbx-v50.xsd --noout 1u04.xml
1u04.xml validates

Note that although XML Schema allows you to specify regexps that check the
structure of values (such as DOIs) in .xml files, the current schema fails to
discover the faulty DOI.

b) for CIF files (Crystallographic Information Files), validation against CIF
dictionaries serves the same function as 'xmllint' validation against an XML
Schema; see our on-going work in cod-tools
(https://github.com/cod-developers/cod-tools);

c) for JSON files, a similar schema language is being developed
(http://json-schema.org/); validation can be performed using the Perl
JSON::Validator module. On Ubuntu and Linux Mint, install the
libjson-validator-perl package ('apt install libjson-validator-perl') and
then use a Perl wrapper (ours can be fetched from
svn://saulius-grazulis.lt/scripts/json-validator, e.g. with
'svn co svn://saulius-grazulis.lt/scripts' or
'svn cat svn://saulius-grazulis.lt/scripts/json-validator').
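If you just want a quick check from the command line, a minimal sketch of
such a validation run can also be written directly as a 'perl' one-liner on
top of the JSON::Validator module; the file names 'schema.json' and
'data.json' below are hypothetical placeholders for your own schema and data
files:

perl -MJSON::Validator -MJSON::PP -E '
    my $jv = JSON::Validator->new;
    $jv->schema( shift @ARGV );                     # first argument: the JSON schema (file or URL)
    my $data = decode_json( do { local $/; <> } );  # second argument: the data file to check
    my @errors = $jv->validate( $data );
    say $_ for @errors;                             # each error reports the offending JSON path and a message
    exit( @errors ? 1 : 0 );                        # non-zero exit status when the data do not validate
' schema.json data.json

The non-zero exit status makes such a check easy to use in shell pipelines
and Makefile rules, in the same spirit as the 'grep' and 'xmllint' checks
above.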