There are many problems with the exchange and collation of information from websites with important data to share. Even in highly technical areas with computer-literate users, such as bioinformatics, artistic or proprietary formats impede the free usage of information because the results can only be understood by a human user or a limited set of automated programs. In some cases, such as SEC, FCC or FDA filings in PDF format, it is possible for private filers to submit information in a legally correct way that defeats complete usage of the submitted information. The common offense with FDA filings is to use a scanned PDF format which doesn't OCR easily and, in the case of FCC filings, the improper usage of Adobe restrictions makes sharing information difficult. Most SEC filings seem to be available in text format or allow easy extraction of text from other formats.
This situation seems surprising as you would
think that machine-to-machine communications would be an obvious
feature to implement on automated data processing equipment.
Today, however, the web interface is all most managers
think about. To illustrate one problem, consider this bioinformatics
site that offers a protein search function, but as of today absolutely REQUIRES Flash 9,
Protein Structure Lookup
You may think that a 3D structure database would require some graphics but this system
does seem to allow structure lookup based on various metrics (AFAIK, there are no
databases that are graphical in nature). So, a text-based API would seem
to be reasonable and a requirement for a recent proprietary graphics system would
seem to be limiting the site's utility.
The FCC requires filings but doesn't seem to care if the filings can be easily used
by the public. I can't see how any human could manually scan these for all information
of interest and automated analysis (search/process) would seem to be desirable.
After trying to extract and analyze text from these results,
FCC Filings Search
it becomes obvious that a large fraction of the documents are "protected."
It appears that these protections are legal and in this context ethical to defeat,
PDF Security Situation
but most users would not have the ability to do so.
The FDA presents data that is similarly limited, but the limitation is that many important
documents are submitted in a scanned format that can't be searched. Look up some of the
medical or related reviews from this database,
Drugs at FDA- Scanned text for medical reviews
and even try to OCR some of them with public utilities such as gocr. Clearly,
there is a lot of interesting data in the complete clinical trial filings but
it can't be easily collated.
I have had a hard time selling the need for an API to the bioinformatics community.
So far, I have gotten limited support from most webmasters. But, if you do write,
I would continue to offer eutils as a good example,
NCBI eutils API versus web interface
The above link explains that this differs from the web interface and
shows a simple API. While investigating a similar problem for SEC filings,
I was pointed to more complicated protocols such as SOAP,
or various things based on exchange of XML data (not something I
think is suitable for simple tasks, but it is an example with a
lot of canned support).
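To make the contrast with a web form concrete, here is a minimal sketch of what an eutils-style request looks like from a script. The base address and the `db`, `term` and `retmax` parameters are the ones NCBI documents for its esearch utility as far as I know, but check the current NCBI documentation before relying on them. The function name `esearch_url` and the sample query are my own illustration and do not come from any site.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (verify against the NCBI documentation).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an esearch request URL; the whole query is plain name=value pairs."""
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_BASE + "/esearch.fcgi?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical query, shown only to illustrate the shape of the request.
    print(esearch_url("protein", "insulin AND human[orgn]"))
```

The point is that the entire request is one URL that any program can build and fetch, with no forms, plugins or page layout involved.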
So, you may cut/paste a 1,000 to 100,000 character long protein sequence and get back hundreds of interesting things (similar to words isolated from a text document), and different servers may give different results. What do you expect a typical user to do with this data? In my case, I had a more complicated task as I was looking for incorrectly translated DNA sequences that could lead to accidental epitope matches with normal proteins. So, I had to use things like fake ribosomes (a homemade document generator) on some edited chromosome sequences, upload the hypothetical translation product to a series of servers to isolate the epitopes (or words) of interest, collate all the results, and then do various additional searches with these peptides (or words). How do you do that with a web interface at every point? Offhand, what is the use case for a web interface on an epitope server that will routinely return hundreds of hits with no uniform way to select interesting and uninteresting results?
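The collation step described above is trivial once each server's answer is available to a program. The following is only a sketch under assumed conditions: the server names and peptide strings are made-up placeholders, and it assumes each server's hits have already been reduced to a plain list of peptide strings.

```python
from collections import defaultdict


def collate_hits(results_by_server):
    """Map each reported peptide to the sorted list of servers that reported it."""
    seen = defaultdict(set)
    for server, peptides in results_by_server.items():
        for peptide in peptides:
            # Normalize case and whitespace so the same peptide matches across servers.
            seen[peptide.strip().upper()].add(server)
    return {pep: sorted(servers) for pep, servers in seen.items()}


def rank_by_agreement(collated):
    """Order peptides by how many servers reported them, most agreed-upon first."""
    return sorted(collated, key=lambda pep: (-len(collated[pep]), pep))


if __name__ == "__main__":
    # Placeholder data; real servers and peptides would come from the actual queries.
    results = {
        "server_a": ["ACDEFGHIK", "lmnpqrstv"],
        "server_b": ["LMNPQRSTV"],
    }
    collated = collate_hits(results)
    for pep in rank_by_agreement(collated):
        print(pep, collated[pep])
```

Ranking by agreement between servers is just one possible way to pick out interesting results; the real selection criteria would depend on the task.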
Unfortunately, I had to write scripts to extract data from the HTML from several of these sites. In many cases, I could use the Linux tool "lynx" to render the HTML so I didn't have to parse the HTML source, but you can't always rely on this. As HTML is meant to be interactive, there is rarely a guarantee from the site that it will be stable enough for robust parsing. A simple API with a text response option would have addressed my needs using results that the site probably had to generate internally to make the HTML results.
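For comparison, here is roughly how little code a plain text response would need. This is a sketch against a hypothetical format that I am inventing for illustration: one hit per line as peptide, start position and score separated by tabs, with comment lines beginning with "#". No actual site is known to return exactly this.

```python
import csv
import io


def parse_hits(text):
    """Parse tab-separated 'peptide, start, score' lines, skipping blanks and '#' comments."""
    hits = []
    lines = (ln for ln in io.StringIO(text) if ln.strip() and not ln.startswith("#"))
    for peptide, start, score in csv.reader(lines, delimiter="\t"):
        hits.append({"peptide": peptide, "start": int(start), "score": float(score)})
    return hits


if __name__ == "__main__":
    # Made-up response body in the assumed format.
    sample = "# peptide\tstart\tscore\nACDEFGHIK\t12\t0.91\nLMNPQRSTV\t40\t0.35\n"
    for hit in parse_hits(sample):
        print(hit)
```

A parser like this keeps working when the site changes its page layout, which is exactly what scraping rendered HTML cannot promise.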
Now, I don't expect a completely uniform information format between all sites (ANSI needn't issue a standard for epitope information exchange), but there are formats ranging from simple text (my preference) to XML (with lots of canned support) that can be easily parsed and remain stable even as user artwork is changed to accommodate fashion fads.
I hadn't realized that the API idea would need to be "sold," but I guess I underestimated how many people think that the graphics are the model. Given the shortage of skilled people in the computer area, and the skill set that many professionals in the area possess, I guess I should have known better :)
If you are still not convinced, it may be worthwhile to continue this discussion and see if you have more objections that I can't satisfactorily address, because I consider it to be an important problem. marchywka@hotmail.com