
Is it Art? Or Information?

There are many problems with the exchange and collation of information from websites with important data to share. Even in highly technical areas with computer-literate users, such as bioinformatics, artistic or proprietary formats impede the free use of information because the results can only be understood by a human user or a limited set of automated programs. In some cases, such as SEC, FCC or FDA filings in PDF format, it is possible for private filers to submit information in a legally correct way that defeats complete use of the submitted information. The common offense with FDA filings is to use a scanned PDF format which doesn't OCR easily and, in the case of FCC filings, the improper use of Adobe restrictions makes sharing information difficult. Most SEC filings seem to be available in text format or allow easy extraction of text from other formats.

This situation seems surprising, as you would think that machine-to-machine communication would be an obvious feature to implement on automated data processing equipment. Today, however, the web interface is all most managers think about. To illustrate one problem, consider this bioinformatics site that offers a protein search function but, as of today, absolutely REQUIRES Flash 9,
Protein Structure Lookup
You may think that a 3D structure database would require some graphics, but this system does seem to allow structure lookup based on various metrics (AFAIK, there are no databases that are graphical in nature). So a text-based API would seem to be reasonable, and requiring a recent proprietary graphics system would seem to limit the site's utility.

The FCC requires filings but doesn't seem to care if the filings can be easily used by the public. I can't see how any human could manually scan these for all information of interest, so automated analysis (search/process) would seem to be desirable. After trying to extract and analyze text from these results,
FCC Filings Search
it becomes obvious that a large fraction of the documents are "protected." It appears to be legal, and in this context ethical, to defeat these protections,
PDF Security Situation
but most users would not have the ability to do so.

The FDA presents data that is similarly limited, but the limitation is that many important documents are submitted in a scanned format that can't be searched. Look up some of the medical or related reviews from this database,
Drugs at FDA- Scanned text for medical reviews
and even try to OCR some of them with public utilities such as gocr. Clearly, there is a lot of interesting data in the complete clinical trial filings but it can't be easily collated.

The PDF extraction problem

Most problems extracting information from PDF files are fixed with this PDF utility,
PDF Information Export utilities
but the public distribution doesn't allow easy extraction of inappropriately "protected" contents. This also does not address scanned PDF files containing textual information. There really exists a need for good OCR software but I don't have any right now.

The HTML parsing problem

Parsing formatted output is a very tedious and often unrewarding task. Output formats change with little notice as there is no reason for an interactive site to keep the output stable. There are at least two solutions, however.

Complain to the Webmaster or Author or Administrator

I have had a hard time selling the need for an API to the bioinformatics community. So far, I have gotten limited support from most webmasters. But, if you do write, I would continue to offer eutils as a good example,
NCBI eutils API versus web interface
The above link explains that this differs from the web interface and shows a simple API. While investigating a similar problem for SEC filings, I was pointed to more complicated protocols such as SOAP, or various things based on the exchange of XML data (not something I think is suitable for simple tasks, but it is an example with a lot of canned support).
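
To give a feel for how little is involved on the client side, here is a minimal Python sketch of talking to an eutils-style API. It only builds an ESearch request URL and parses a reply; it does not touch the network. The sample reply and its IDs are made up for illustration, and the search term is just an example, so treat this as a sketch under those assumptions rather than a reference client.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch request URL; nothing is fetched here."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def parse_id_list(xml_text):
    """Return the record IDs listed in an ESearch XML reply."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./IdList/Id")]


# Made-up reply in the general shape ESearch returns; the IDs are not real hits.
SAMPLE_REPLY = """<eSearchResult>
  <Count>2</Count>
  <IdList><Id>111111</Id><Id>222222</Id></IdList>
</eSearchResult>"""

if __name__ == "__main__":
    print(esearch_url("protein", "insulin"))
    print(parse_id_list(SAMPLE_REPLY))
```

The point is that the request is an ordinary URL and the reply has a fixed structure, so nothing here depends on how the web pages happen to look this month.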

An example: Epitope Prediction Servers
Let me point to a specific situation I recently encountered. There are several web sites with epitope prediction servers. Epitopes are small pieces of proteins, but for the sake of this discussion, just think of a protein as a long string of letters and the epitopes as carefully chosen words from that text. The usage pattern is that you upload a protein, select a few parameters, and eventually you get back a list of possible epitopes along with some metrics.

A typical Epitope Server
A similar Epitope Server

So, you may cut/paste a protein sequence 1000 to 100000 characters long and get back hundreds of interesting things (similar to words isolated from a text document), and different servers may give different results. What do you expect a typical user to do with this data? In my case, I had a more complicated task, as I was looking for incorrectly translated DNA sequences that could lead to accidental epitope matches with normal proteins. So, I had to use things like fake ribosomes (a homemade document generator) on some edited chromosome sequences, upload the hypothetical translation product to a series of servers to isolate the epitopes (or words) of interest, collate all the results, and then do various additional searches with these peptides (or words). How do you do that with a web interface at every point? Offhand, what is the use case for a web interface on an epitope server that will routinely return hundreds of hits with no uniform way to separate interesting from uninteresting results?
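
The collation step is trivial once the results are in machine-readable form. The Python sketch below merges (peptide, score) lists from several servers and keeps the peptides that more than one server reported. The server names, peptides and scores are all invented for illustration, and real servers use different scoring scales, so this only shows the bookkeeping, not a scoring method.

```python
from collections import defaultdict


def collate(results_by_server):
    """Map each peptide to the servers that reported it and their scores."""
    merged = defaultdict(dict)
    for server, hits in results_by_server.items():
        for peptide, score in hits:
            # Normalize case so the same peptide matches across servers.
            merged[peptide.upper()][server] = score
    return dict(merged)


def consensus(merged, min_servers=2):
    """Peptides reported by at least min_servers servers, sorted by name."""
    return sorted(p for p, by_server in merged.items()
                  if len(by_server) >= min_servers)


# Invented hits from two hypothetical servers.
EXAMPLE = {
    "server_a": [("ACDEFGHIK", 0.91), ("LMNPQRSTV", 0.75)],
    "server_b": [("acdefghik", 12.0), ("WYACDEFGH", 8.5)],
}

if __name__ == "__main__":
    merged = collate(EXAMPLE)
    print(consensus(merged))
```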

Unfortunately, I had to write scripts to extract data from the HTML of several of these sites. In many cases, I could use the Linux tool "lynx" to render the HTML so I didn't have to parse the HTML source, but you can't always rely on this. As HTML is meant for interactive display, there is rarely a guarantee from the site that it will be stable enough for robust parsing. A simple API with a text response option would have addressed my needs, using results that the site probably had to generate internally anyway to make the HTML results.
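
For readers who have not done this kind of scraping, here is a rough Python sketch of what it looks like. It uses the standard library's html.parser rather than lynx, and the results page is a hypothetical one I made up; no real server's layout is implied. Note how much code is spent on markup that carries no information.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html_text):
    parser = TableRows()
    parser.feed(html_text)
    parser.close()
    return parser.rows


# Hypothetical results page; the layout and values are invented.
SAMPLE_PAGE = """<html><body><h1>Results</h1>
<table>
<tr><th>Peptide</th><th>Start</th><th>Score</th></tr>
<tr><td><b>ACDEFGHIK</b></td><td>17</td><td>0.91</td></tr>
<tr><td>LMNPQRSTV</td><td>240</td><td>0.75</td></tr>
</table></body></html>"""

if __name__ == "__main__":
    for row in extract_rows(SAMPLE_PAGE):
        print("\t".join(row))
```

A script like this breaks as soon as the site moves the results out of a table or adds a second table to the page, which is exactly the stability problem described above.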

Now, I don't expect a completely uniform information format between all sites (ANSI needn't issue a standard for epitope information exchange), but there are formats ranging from simple text (my preference) to XML (with lots of canned support) that can be easily parsed and remain stable even as user artwork is changed to accommodate fashion fads.
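
As a small illustration of "easily parsed", the Python sketch below reads the same records from a tab-separated text reply and from an XML reply using only the standard library. Both reply formats, including the field names peptide, start and score, are assumptions of mine and not any server's actual output.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical replies carrying the same two invented records.
TEXT_REPLY = "peptide\tstart\tscore\nACDEFGHIK\t17\t0.91\nLMNPQRSTV\t240\t0.75\n"

XML_REPLY = """<epitopes>
  <epitope peptide="ACDEFGHIK" start="17" score="0.91"/>
  <epitope peptide="LMNPQRSTV" start="240" score="0.75"/>
</epitopes>"""


def parse_text(reply):
    """Tab-separated text with a header line, one record per line."""
    reader = csv.DictReader(io.StringIO(reply), delimiter="\t")
    return [dict(row) for row in reader]


def parse_xml(reply):
    """One <epitope> element per record, fields stored as attributes."""
    return [dict(node.attrib) for node in ET.fromstring(reply).iter("epitope")]


if __name__ == "__main__":
    print(parse_text(TEXT_REPLY) == parse_xml(XML_REPLY))
```

Either parser is a couple of lines, and neither one cares what the web page looks like.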

I hadn't realized that the API idea would need to be "sold", but I guess I underestimated how many people think that the graphics are the model. I think that I recognize the shortage of skilled people in the computer area, and the skill set that many professionals in the area possess, so I guess I should have known better :)

If you are still not convinced, it may be worthwhile to continue this discussion and see if you have objections that I can't satisfactorily address, because I consider this to be an important problem. marchywka@hotmail.com