Data formats

 

The DNA sequence information resulting from custom sequencing is delivered as several files for each sequence. The file formats are briefly described below.

All files except the .scf files contain plain ANSI text. They can be viewed with any text viewer, text editor, or word processor, and with most e-mail applications. For the best view of these data, configure the application to display a fixed-width font such as Courier.

 

.scf files

Files in Standard Chromatogram Format (SCF) contain all relevant sequence information: the processed raw sequencing data, the base sequence called in a manner compatible with phred, and a quality value for every base that indicates the confidence of that base call. SCF is widely accepted as a standard file format for exchanging and viewing DNA sequence data, and .scf files can be viewed and processed by many applications used for DNA sequence analysis. For Windows-based PCs, the simple SCF viewer Chromas is available from the web.
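Because .scf files are binary, a quick look at the file header is often the easiest way to check that a file is intact. The following minimal Python sketch reads the leading header fields. It assumes the layout of the public SCF specification (a 128-byte big-endian header beginning with the magic bytes ".scf"); the function name parse_scf_header is hypothetical, and the layout has not been verified against the files delivered here.

```python
import struct

# First 56 bytes of the 128-byte SCF header, big-endian (assumed from the
# public SCF specification): magic, 8 unsigned ints, version, 4 unsigned ints.
_SCF_HEADER = struct.Struct(">4s8I4s4I")
_FIELDS = (
    "magic", "samples", "samples_offset", "bases",
    "bases_left_clip", "bases_right_clip", "bases_offset",
    "comments_size", "comments_offset", "version",
    "sample_size", "code_set", "private_size", "private_offset",
)


def parse_scf_header(data: bytes) -> dict:
    """Return the leading SCF header fields as a dictionary."""
    if len(data) < _SCF_HEADER.size:
        raise ValueError("data is too short to contain an SCF header")
    header = dict(zip(_FIELDS, _SCF_HEADER.unpack_from(data)))
    if header["magic"] != b".scf":
        raise ValueError("missing '.scf' magic bytes; not an SCF file")
    # The version is stored as four ASCII characters, e.g. "3.00".
    header["version"] = header["version"].decode("ascii")
    return header
```

For a file on disk, the first 128 bytes can be read in binary mode (open(path, "rb")) and passed to the function; the "bases" field then gives the number of called bases.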

 

.fas files

Text files in FASTA format have the extension .fas. They contain the complete DNA sequence but no per-base quality values. Above the DNA sequence, a header line gives information about the sequence data. The header line starts with ">", followed by the sequence name. After the sequence name, separated by a semicolon, the length and position of the reliable, high-quality region within the overall sequencing result are indicated. The following lines list all computed (or "called") bases. Lower-quality bases at the 5' and 3' ends are given in lowercase letters; bases judged reliable are given in capital letters. The high-quality region marked by the capital letters, with the low-quality 5' and 3' ends stripped off, is also available in a separate .seq file, see below.
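The upper/lowercase convention makes it easy to recover the high-quality region programmatically. The following Python sketch splits the header at the first semicolon and locates the capitalized region. The function name read_fas and the returned keys are hypothetical. Since the exact layout of the text after the semicolon is not specified here, it is returned unparsed.

```python
def read_fas(text: str) -> dict:
    """Parse a single-record .fas file given as a string."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a header line starting with '>'")

    # Sequence name before the semicolon, remaining header text after it.
    name, _, info = lines[0][1:].partition(";")
    sequence = "".join(lines[1:])

    # The high-quality region runs from the first to the last capital letter.
    upper = [i for i, base in enumerate(sequence) if base.isupper()]
    start, end = (upper[0], upper[-1] + 1) if upper else (0, 0)

    return {
        "name": name.strip(),
        "info": info.strip(),       # left unparsed (format is an unknown here)
        "sequence": sequence,       # all called bases, case preserved
        "hq_start": start,          # 0-based, inclusive
        "hq_end": end,              # 0-based, exclusive
        "high_quality": sequence[start:end],
    }


# Made-up record for illustration only; the header text is not a real example.
record = read_fas(">sample1; made-up header text\nacgtACGTAC\nGTttag\n")
print(record["name"], record["hq_start"], record["hq_end"])
print(record["high_quality"])
```

The "high_quality" value corresponds to what the text describes as the content of the matching .seq file.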

 

.seq files

The .seq files contain only the high-quality sequence data, as determined by the phred base-calling software. The complete sequence data, including the lower-quality 5' and 3' regions, are given in the .fas and .scf files.

 

.map files

A restriction map is computed from the high-quality portion of the sequence data, as given in the .seq files. It indicates the single-cutting enzymes, i.e. the restriction enzymes that cut this sequence exactly once.
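The idea of a single-cutter map can be sketched in a few lines of Python. This is an illustration only, not the software that produces the .map files: the enzyme table is a small hand-picked sample of well-known palindromic recognition sites, the function names are hypothetical, and the reported position is the 1-based start of the recognition site rather than the exact cut position.

```python
# Small illustrative enzyme table (an assumption; real maps use far more
# enzymes). All sites are palindromic, so one strand is sufficient.
ENZYMES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
    "NotI": "GCGGCCGC",
}


def find_sites(sequence: str, site: str) -> list:
    """Return 1-based start positions of every occurrence of site."""
    sequence = sequence.upper()
    positions = []
    index = sequence.find(site)
    while index != -1:
        positions.append(index + 1)
        index = sequence.find(site, index + 1)  # allow overlapping matches
    return positions


def single_cutters(sequence: str, enzymes: dict = ENZYMES) -> dict:
    """Map each enzyme that cuts exactly once to its site position."""
    result = {}
    for name, site in enzymes.items():
        positions = find_sites(sequence, site)
        if len(positions) == 1:
            result[name] = positions[0]
    return result


# EcoRI occurs once, BamHI twice, HindIII and NotI not at all.
print(single_cutters("TTGAATTCAAGGATCCTTGGATCCAA"))
```

In this made-up sequence only EcoRI is reported, because BamHI has two recognition sites and the other enzymes have none.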