PATH: Instructional Server> Computer Basics>

DATA TYPES & LANGUAGES


Computers are used to represent and manipulate many different types of data, including (but not limited to):

Many different languages have been developed to represent each of these types of data. One language, for example, was defined to store whole numbers (usually called integers in the computer industry). A different language was developed to store numbers that might have decimal precision, such as 8.45 (such numbers are called real or floating point numbers). Each language has its strengths and weaknesses. Some are simple, small, and fast; others are large, complex and powerful. In fact, almost every new program that is developed defines a new language for the type of data it manipulates. Many languages are copyrighted, motivating competing software companies to develop newer languages of their own to avoid paying royalties.


COMMON CHARACTERISTICS OF LANGUAGE

All languages, whether written, spoken, gestural, etc., have some common fundamental characteristics.

All of the meanings above are reasonable, given a specific context. But without knowing the context, we can not be sure which standard should be applied when reading an "X" to determine the idea it represents. Recognition would be much easier if the symbol "X" had only one meaning. Unfortunately, different languages were developed at different times, by different people. The "X" has proved to be a useful symbol in many different languages. So, we must accept the need for recognition of context between languages. Being conscious of these concepts of multiple standards and context will help you greatly to minimize your own confusion regarding the many computer languages that have been developed over the years for a variety of different purposes.


WAYS THAT SPECIFIC LANGUAGES DIFFER

Remember that for each type of data (characters, numbers, graphics, etc.), there can be more than one data format defined. For the most part, the basic characteristics of a data format depend on the type of data being represented. A language that represents the small set of characters found on most keyboards will not be as complex as one used to represent the millions of possible colors in a graphic image. The standards that define the language also impose limitations. One of the data formats used in computers to store whole numbers can only store values between 0 and 255. There is no way to store numbers outside of that range in that data format. A different data format for storing whole numbers can store a broader range of values (between negative 32768 and positive 32767), but it still has limits. Later sections of this document describe computer data formats for some of the popular types of data, such as characters and numbers, and illustrate their strengths and limitations.
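The limits of a whole-number format follow from how many bits it uses. The short Python sketch below is only an illustration: the function names are made up for this example, and the signed format is assumed to use the common "two's complement" scheme, which this document does not name. An 8-bit format has 256 patterns, so its values run from 0 through 255.

```python
def unsigned_range(bits):
    """Smallest and largest whole numbers an unsigned format of `bits` bits can hold."""
    return 0, 2 ** bits - 1


def signed_range(bits):
    """Range of a signed format of `bits` bits (assumes two's complement)."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


print(unsigned_range(8))   # (0, 255)
print(signed_range(16))    # (-32768, 32767)

# A value outside the range simply cannot be stored in that format.
try:
    (256).to_bytes(1, "big")   # try to fit 256 into one 8-bit byte
except OverflowError:
    print("256 does not fit in a single 8-bit byte")
```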


DATA REPRESENTATION

Computers represent data by simply turning a group of circuits ON or OFF in a pattern. To illustrate this technique, the following convention will be used:

A circuit that is ON will be illustrated as a switch in the up position: On Switch
One that is OFF will look like a switch in the down position: Off Switch

In the United States, most computers represent the character "A" as the pattern: Off On Off Off Off Off Off On

Each character has its own unique pattern. A simple notation was developed to describe the position of each switch. The numeral zero (0) is used to portray a switch that is off, and the numeral one (1) is used to indicate a switch that is on. Thus, the coding pattern for the letter "A" can be represented simply by writing: 01000001. This notation is called "binary notation" because it uses only two numerals (the prefix "bi" means two). Each numeral in a pattern is referred to as a "bit" (short for "binary digit"). When these bits are grouped into a pattern for the purpose of representing a character, the group of bits is known as a "byte".
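The relationship between switch settings and binary notation can be sketched in a few lines of Python. This is a minimal illustration, not part of any standard: the function name is hypothetical, and it relies on Python's built-in ord(), which returns the same numeric code as ASCII for the standard text characters.

```python
def switch_pattern(char):
    """Return the 8-bit binary notation and the ON/OFF switch settings for one character."""
    bits = format(ord(char), "08b")          # e.g. "A" -> "01000001"
    switches = ["On" if bit == "1" else "Off" for bit in bits]
    return bits, switches


bits, switches = switch_pattern("A")
print(bits)                 # 01000001
print(" ".join(switches))   # Off On Off Off Off Off Off On
```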

American PC's usually have a byte size of 8 bits because eight bits can be combined to form 256 different patterns of ON and OFF switch settings. Why? A look at the data format used to code character data will explain.
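To see where the figure of 256 comes from, the patterns can simply be counted. The following Python sketch (the function name is an assumption made for this example) builds every possible ON/OFF pattern for a given number of switches and counts them.

```python
from itertools import product


def count_patterns(bits):
    """Count every distinct pattern of 0s and 1s that `bits` switches can form."""
    return sum(1 for _ in product("01", repeat=bits))


print(count_patterns(7))   # 128
print(count_patterns(8))   # 256

# With only two switches, the four patterns are easy to list by hand.
print(["".join(p) for p in product("01", repeat=2)])   # ['00', '01', '10', '11']
```

Each extra switch doubles the number of patterns, which is why 8 switches give 2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 = 256.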

Character Data

The table below provides a framework for discussion of the data type known as "characters" by illustrating the wide variety of symbols that a typical personal computer is required to represent. The broad diversity of computer users in the world requires the definition of many different character sets. The one illustrated in the table below is specific to American users.

A TYPICAL AMERICAN CHARACTER SET

Symbol Category           Example(s)                              Quantity
STANDARD "TEXT":
Alphabetic (Letters)      "A" - "Z" and "a" - "z"                       52
Numeric (Numerals)        "0" - "9"                                     10
Special (Punctuation)     @,#,$,%,&,(comma),.,*,(space)                 34
Control (Layout)          Carriage Return, Tab, Backspace               32
                                                    Sub-total:         128
"EXTENDED CHARACTERS":
Scientific/Technical      ±, ÷, µ, Ø, ß, etc.                       varies
International             é,ä,ç,æ,ô,£,¥,¿, etc.                     varies
Line Graphics             [line graphics characters], etc.          varies
                                                    Sub-total:         128
Total number of physical character representations required:          256

So 256 is the quantity of characters indicated as typical in an American character set. The table below provides examples of the standard coding patterns used for a few of the characters in the previous table. A standard for these coding patterns has been defined in the computer industry. The standard is referred to as ASCII, the "American Standard Code for Information Interchange". This standard is used for the representation of characters on almost all personal computers in the United States and for most data communications systems (such as e-mail) world-wide. Devices that translate input into the ASCII form, or that translate output from the ASCII form, are known as ASCII devices.

EXAMPLES OF REPRESENTING CHARACTERS WITH THE ASCII CODE:

Category     Symbol     Switch Settings                   Binary Code

Uppercase:   A          Off On  Off Off Off Off Off On    0 1 0 0 0 0 0 1
             B          Off On  Off Off Off Off On  Off   0 1 0 0 0 0 1 0
Lowercase:   a          Off On  On  Off Off Off Off On    0 1 1 0 0 0 0 1
             b          Off On  On  Off Off Off On  Off   0 1 1 0 0 0 1 0
Numeral:     0          Off Off On  On  Off Off Off Off   0 0 1 1 0 0 0 0
             1          Off Off On  On  Off Off Off On    0 0 1 1 0 0 0 1
             2          Off Off On  On  Off Off On  Off   0 0 1 1 0 0 1 0
Special:     [space]    Off Off On  Off Off Off Off Off   0 0 1 0 0 0 0 0
             #          Off Off On  Off Off Off On  On    0 0 1 0 0 0 1 1
Control:     [tab]      Off Off Off Off On  Off Off On    0 0 0 0 1 0 0 1
             [return]   Off Off Off Off On  On  Off On    0 0 0 0 1 1 0 1

Don't concern yourself with trying to learn the ASCII codes. The average user has little need for that, since all of the peripheral devices, such as keyboards, display screens, and printers, serve as ASCII translators for us. When you press the character "A" on a keyboard, the appropriate ASCII pattern is generated and sent to the program that is currently executing. Similarly, when the ASCII code for the numeral five is sent to a screen, it will translate that pattern and display the image of a "5". Most standard input and output devices for personal computers are manufactured to be ASCII translators. If you want to know the ASCII code for a character, an ASCII table can be found easily [on the web] and often in the appendix of computer books and manuals.
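The translation that keyboards and screens perform can be imitated in Python. The sketch below is only an illustration of the idea, with hypothetical function names; real devices do this work in hardware or in driver software.

```python
def key_to_code(key):
    """Imitate a keyboard: turn one typed character into its 8-bit ASCII pattern."""
    return format(key.encode("ascii")[0], "08b")


def code_to_symbol(code):
    """Imitate a screen: turn an 8-bit ASCII pattern back into a character."""
    return bytes([int(code, 2)]).decode("ascii")


print(key_to_code("A"))             # 01000001
print(code_to_symbol("00110101"))   # 5
```

Because both functions follow the same standard, a pattern produced by one can always be read back by the other.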

Actually, the original ASCII code was standard only for the 128 characters identified as "TEXT" in the TYPICAL AMERICAN CHARACTER SET table above. The characters identified as "Extended Characters" did not have a uniform standard, which means that they would not always translate well between different brands or types of equipment. Newer 8-bit character sets (often called extended ASCII, or ASCII-8) retain the original 128 coding patterns and add another 128 patterns for the extended characters. All of the standard text characters are represented by code patterns that place a zero in the new (eighth) bit position (see the examples above). Thus, standard text really only requires seven bits to be represented. That is why many data communications standards (such as e-mail) only use 7 data bits in their transmission of character data. It is faster than transmitting eight data bits per character, but it does not allow for the extra 128 extended characters, which all use code patterns in which the high (leftmost, eighth) bit is set to 1.
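Whether a piece of text needs the eighth bit can be checked by looking at the leftmost bit of each byte. The Python sketch below is a rough illustration. It assumes the extended characters are stored using the Latin-1 character set, which is only one of several 8-bit sets and is not named in this document; the function name is also made up for the example.

```python
def needs_eighth_bit(text, encoding="latin-1"):
    """Return True if any character's code has the high (leftmost) bit set to 1."""
    return any(byte & 0b10000000 for byte in text.encode(encoding))


print(needs_eighth_bit("Hello"))   # False: standard text fits in 7 bits
print(needs_eighth_bit("café"))    # True: the extended character needs 8 bits

# The code for the extended character starts with a 1 (assuming Latin-1).
print(format("é".encode("latin-1")[0], "08b"))   # 11101001
```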

Numeric Data

Keep in mind that the ASCII language standard applies only to character data. Other data types, such as numbers and instructions, each have their own data formats. For example, numbers normally are not represented as numerals (characters) inside a computer because a computer does not perform its arithmetic using character data. Numbers are more efficiently manipulated in a computer using a language that defines a coding pattern for each value (as opposed to each individual digit). This technique treats each number as a single piece of data instead of as separate digits that must be manipulated as a group. The data formats used for representing numbers in computers all involve the use of the binary numbering system (also known as "Base 2"). A detailed explanation of the binary numbering system is beyond the scope of this document, but basically, it is very similar to the decimal number system, with some minor differences.
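The difference between storing a number as characters and storing it as a value can be seen by comparing the two byte patterns. This Python sketch is only an illustration; the variable names are arbitrary, and the value is stored in a single unsigned byte for simplicity.

```python
# The number sixty-five stored as character data: one ASCII byte per numeral.
as_text = "65".encode("ascii")
print([format(byte, "08b") for byte in as_text])    # ['00110110', '00110101']

# The same number stored as a single value in one byte.
as_value = (65).to_bytes(1, "big")
print([format(byte, "08b") for byte in as_value])   # ['01000001']
```

The character form takes two bytes and must be handled digit by digit, while the value form is one piece of data that the processor can use in arithmetic directly.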

The decimal numbering system has ten digits in it. Why? Because humans (the creatures who defined it) have ten digits (their fingers) at the ends of their limbs. So the use of ten symbols to represent any number was easy for humans to relate to. The ten decimal numerals allow us to write any value between zero and nine (inclusive) with only one digit. But, if we want to indicate a value greater than nine, we have to use more than one numeral. We "place" the numerals side by side, with the position of each numeral indicating a power of ten (the highest ones on the left). We increase by powers of ten in each place because every new place could have any one of ten possible digits in it, which would allow us to represent ten times as many values as the previous places (to the right).

Binary works the same way, except that there are only two digits in the binary system, "0" and "1". While each column in the decimal system is worth ten times the column to its right, each column in the binary system is worth only double the value of the column to its right. The values of the columns in the decimal system are (from right to left): 1, 10, 100, 1000, etc. The values of the columns in the binary system are (from right to left): 1, 2, 4, 8, etc. The following table compares the decimal system for representing numbers to the binary system for the values zero through fifteen.

Value       Decimal   Binary            Switch Pattern
Fifteen     0 1 5     0 0 0 0 1 1 1 1   Off Off Off Off On  On  On  On
Fourteen    0 1 4     0 0 0 0 1 1 1 0   Off Off Off Off On  On  On  Off
Thirteen    0 1 3     0 0 0 0 1 1 0 1   Off Off Off Off On  On  Off On
Twelve      0 1 2     0 0 0 0 1 1 0 0   Off Off Off Off On  On  Off Off
Eleven      0 1 1     0 0 0 0 1 0 1 1   Off Off Off Off On  Off On  On
Ten         0 1 0     0 0 0 0 1 0 1 0   Off Off Off Off On  Off On  Off
Nine        0 0 9     0 0 0 0 1 0 0 1   Off Off Off Off On  Off Off On
Eight       0 0 8     0 0 0 0 1 0 0 0   Off Off Off Off On  Off Off Off
Seven       0 0 7     0 0 0 0 0 1 1 1   Off Off Off Off Off On  On  On
Six         0 0 6     0 0 0 0 0 1 1 0   Off Off Off Off Off On  On  Off
Five        0 0 5     0 0 0 0 0 1 0 1   Off Off Off Off Off On  Off On
Four        0 0 4     0 0 0 0 0 1 0 0   Off Off Off Off Off On  Off Off
Three       0 0 3     0 0 0 0 0 0 1 1   Off Off Off Off Off Off On  On
Two         0 0 2     0 0 0 0 0 0 1 0   Off Off Off Off Off Off On  Off
One         0 0 1     0 0 0 0 0 0 0 1   Off Off Off Off Off Off Off On
Zero        0 0 0     0 0 0 0 0 0 0 0   Off Off Off Off Off Off Off Off

The binary numbers are easy to read if you remember that each place in the binary system is worth double the column to its right. So in the 8-bit binary numbers listed above, the place values are (from left to right):

    128  64  32  16  8  4  2  1

Thus, the value five is written in the binary system as 00000101, and the binary number 01000001 represents 65 (64+1).
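Reading and writing 8-bit binary numbers by place value can be sketched in Python as follows. The function names are hypothetical, and the sketch handles only whole numbers from 0 through 255, the range that eight places can hold.

```python
PLACE_VALUES = [128, 64, 32, 16, 8, 4, 2, 1]   # from left to right


def binary_to_decimal(bits):
    """Add up the place values of every position that holds a 1."""
    return sum(value for value, bit in zip(PLACE_VALUES, bits) if bit == "1")


def decimal_to_binary(number):
    """Work from the largest place down, writing 1 wherever the place value fits."""
    digits = []
    for value in PLACE_VALUES:
        if number >= value:
            digits.append("1")
            number -= value
        else:
            digits.append("0")
    return "".join(digits)


print(binary_to_decimal("01000001"))   # 65
print(decimal_to_binary(5))            # 00000101
```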

Program Instructions

Program instructions also require a language (set of standardized coding patterns). The language used for instructions is based on the type of processor that a computer system contains. Each brand and model of processor has its own unique machine language. The bit pattern that represents the instruction to multiply on an Intel® brand of processor (used in many popular PC's) will mean something entirely different to a Motorola® processor (used in early Apple® Macintosh™ computers). This is why programs are not universally transportable between all computers.

Machine language is not humanly readable because it is not coded using text characters, but rather as its own unique bit patterns that are recognized by processors. Some programming languages called high-level languages have been developed to allow programmers to write commands composed from standard text. Such languages are usually independent of processor type, since they are not written in a language specific to the processor. A program written in a high-level language must be translated into a specific processor's machine language before the processor can properly interpret the instructions. There are many of these languages, such as BASIC, C, COBOL, FORTRAN, Java, and Pascal.

Graphics Data

Graphics languages are used by artists, photographers, architects and other designers to compose pictures and diagrams with computers. Graphic data stores information about the position and color of each pixel (picture element, or dot). Image quality (or resolution) is based on the size and spacing of the pixels and on the quantity of colors available (called the palette or color depth). For any graphics language, the better the resolution and color depth, the more data it will take to represent the image and the bigger the resulting storage file will be.
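A rough feel for how resolution and color depth affect file size can be had from simple arithmetic. The Python sketch below assumes an uncompressed image that stores every pixel with the same number of bits; real graphics formats add header information and often compress the data, so actual file sizes will differ. The function name and the sample dimensions are made up for this example.

```python
def raw_image_bytes(width, height, bits_per_pixel):
    """Approximate storage for an uncompressed image, rounded up to whole bytes."""
    total_bits = width * height * bits_per_pixel
    return (total_bits + 7) // 8


print(raw_image_bytes(640, 480, 1))    # 38400   (2 colors)
print(raw_image_bytes(640, 480, 8))    # 307200  (256 colors)
print(raw_image_bytes(640, 480, 24))   # 921600  (millions of colors)
```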

Binary Files

In common computer terminology, any file that requires 8-bit coding (including files that contain extended characters - see the earlier table) is referred to as a "binary file". Aside from extended character data, binary files could also be: machine language program files, graphic data files, audio data files, etc.

IDENTIFYING DATA TYPES

It is important to know which language was used to store a file so that you can use the proper software to transmit, decipher, or view the data. Most users employ a technique when naming their files of adding an extension (suffix) to the end of each file name that helps to identify the program or data format that was used to store the file. The computer industry has standardized the use of many filename extensions so that users will be able to identify the language that was used to store the file by simply seeing its name. The use of standardized filename extensions is not a rule; it is simply a good habit for users to adopt.

Because there are so many different types of files, there are many different extensions used to identify them. Remember that the extension can help you to recognize what program can be used to manipulate a file, but it is not a guarantee of data format. The extension is just a part of the file's name, and files can be named incorrectly or contrary to standards. Some of the most common extensions (and the file types that they imply) are listed below.

Extension     File Type, Language, or Program
AU & AIF      Audio and Audio Interchange File Format, respectively
BAT           A BATch of hand-typed instructions for the Operating System, in ASCII text
BMP           Bit-Mapped Graphics - a graphic language used by the MS-Paint program
COM           DOS machine language COMmands (for IBM-PC compatibles)
DOC           Typically, a "document" created by some word processor (often MS-Word®)
EXE           EXEcutable machine language (processor dependent)
GIF           Graphics Interchange Format - graphic data files
HTML & HTM    HyperText Markup Language - the text-based language behind web pages
JPEG & JPG    Joint Photographic Experts Group - compressed picture data format
MPEG & MPG    Moving Picture Experts Group - compressed video data format
PCX           PiCture eXchange format - graphic data files
PDF           [Adobe® Portable Document Format (PDF) files]
RA            RealAudio® - real-time streaming audio data
RTF           Rich Text Format - a common data format that makes it easier to transfer data between different brands of word processing software
TAR           Tape ARchive - a collection of files grouped in one file
TXT           Text - symbols such as letters, numerals, or punctuation (see the definition above)
WAV           Microsoft® WAVeform - audio data file
XLS           Electronic spreadsheet data stored by the Microsoft® Excel program
Z             UNIX compressed file
ZIP           A collection of specially compressed files
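A program can use the same convention to guess at a file's type. The Python sketch below looks up a filename's extension in a small table. The table holds only a handful of the entries listed above, the function name is hypothetical, and the result is only a hint, since a file can be given any name.

```python
import os

# A few entries from the table of common extensions (not a complete list).
KNOWN_EXTENSIONS = {
    "TXT": "ASCII text",
    "RTF": "Rich Text Format",
    "HTM": "HyperText Markup Language",
    "HTML": "HyperText Markup Language",
    "GIF": "Graphics Interchange Format",
    "Z": "UNIX compressed file",
}


def implied_type(filename):
    """Return the file type that the extension suggests, or "unknown"."""
    _, extension = os.path.splitext(filename)
    return KNOWN_EXTENSIONS.get(extension.lstrip(".").upper(), "unknown")


print(implied_type("letter.txt"))      # ASCII text
print(implied_type("archive.tar.Z"))   # UNIX compressed file
print(implied_type("README"))          # unknown
```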

For a more detailed list of extensions and a discussion of their meanings and the program associated with them, look at the web sites [http://filext.com/] and [http://www.FileInfo.Net/]. To learn more about this topic, view the pages about [Data Formats] from [PC Webopaedia]. To learn whether your computer can display a specific data format, look at the [WWW Viewer Test Page] from the University of Wisconsin - Madison. It lists many different data formats and links to information about them.

Text Files:

ASCII text files are the simplest files to work with, because most programs can read them and almost all terminal devices can output them. They can be coded using only 7 bits per byte. So they can be transmitted over almost any data communications link, regardless of whether it uses a 7 bit or an 8 bit data byte size. The most common filename extension used to identify ASCII text files is "TXT", although some people use "ASC" also. In truth, an ASCII file could be given any extension or no extension, but most users will use the standard ones as a courtesy to other users.
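One simple way to tell whether data could travel over a 7-bit link is to check that every byte has a value below 128. The Python sketch below shows the idea on two small samples; the function name is made up for this example, and a real program would read the bytes from a file instead.

```python
def is_seven_bit(data):
    """Return True if every byte fits in 7 bits (a value from 0 through 127)."""
    return all(byte < 128 for byte in data)


plain_text = "Hello, world!\r\n".encode("ascii")
binary_data = bytes([0x48, 0xE9, 0xFF])   # contains bytes with the high bit set

print(is_seven_bit(plain_text))    # True
print(is_seven_bit(binary_data))   # False
```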

Are word processed documents text files?

Some files that you might expect to be text files are actually binary files. This is the case with almost all word processed documents. Word processing software must store much more than simple typed characters in its documents. Data related to layout and print enhancements, such as typeface, type size, boldfacing, underlining, and italics, must also be stored. Since the standard 128 ASCII text characters were not enough to represent all of these concepts, an additional 128 codes have been defined by each word processor to represent them. This means that word processed document files require 8 bits per byte to store their data and can only be interpreted by the word processing program that created them. The document files are said to be proprietary (the property of the word processing program that created them).

Rich Text Files - RTF

Users have difficulty transferring word processed files between different brands of software because most word processors use proprietary data formats. The computer industry developed a common data format to address this problem and to make it easier to transfer text-based data in a common 7-bit data format. Rich Text Format or RTF is a language standard that stores information about a document's text enhancements using only the standard text characters that are found in the ASCII code. Thus, RTF files are true text files. However, they contain extra characters called escape sequences that represent the text's enhancements. When a word processing program opens (reads) an RTF file, it interprets these escape sequences and applies the enhancements they represent to the text. If you were to open an RTF file with a pure text editor (such as the Notepad program in Windows®), you would actually see the escape sequences. You probably would not recognize what they meant, but you would be able to read them because they are all stored as plain text. Most modern word processors can read and write Rich Text Format. Simply start to save a document using the menu choice "Save as" and then look for a menu named "File type" (or something similar). On it you will probably see "RTF - Rich Text Format".
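A very small RTF document shows what these escape sequences look like. The Python sketch below uses a hand-written sample, not the output of any particular word processor, and the function name and the simple search pattern are assumptions made for this illustration; real RTF files contain many more escape sequences.

```python
import re

# A tiny hand-written RTF sample: \b marks bold text and \i marks italic text.
rtf_document = r"{\rtf1\ansi This is {\b bold} and {\i italic} text.}"


def escape_sequences(rtf_text):
    """Find the backslash-prefixed escape sequences in a piece of RTF text."""
    return re.findall(r"\\[a-z]+\d*", rtf_text)


print(escape_sequences(rtf_document))   # ['\\rtf1', '\\ansi', '\\b', '\\i']

# Every byte of the document fits in 7 bits, so it is a true text file.
print(all(byte < 128 for byte in rtf_document.encode("ascii")))   # True
```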

Web Pages

Web pages are documents that are stored on large Internet computers called web servers and can be read by many different brands of software on computers located anywhere in the world. For this reason, the computer industry developed a common data format that was text-based (in a 7-bit data format). HyperText Markup Language or HTML is a language standard that stores information about a document's text enhancements, links to other documents, and embedded multimedia objects using only the standard text characters that are found in the ASCII code. Like RTF files, HTML files are true text files. They also contain escape sequences, called tags, that represent the data mentioned above. When a web browsing program opens (reads) an HTML file, it interprets the tags and applies the enhancements they represent to the web page. If you were to open an HTML file with a pure text editor (such as the Notepad program in Windows®), you would see the text and the tags. You probably would not recognize what they meant, but you would be able to read them because they are all stored as plain text. All web browsers can read HTML. That is their job. Nowadays, most word processors can also read and write HTML. Simply start to save a document using the menu choice "Save as" and then look for a menu named "File type" (or something similar). On it, look for "HTML - Web Document". Some programs offer a separate menu item of "Save as a Web Page" on their File Menu. For more information about HTML, see the web page entitled Web Page Authoring Overview.
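The way a browser separates tags from text can be imitated with Python's built-in html.parser module. The sketch below is only an illustration: the sample page and the class name are made up for this example, and a real browser does far more than collect tag names.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the names of the opening tags and the pieces of ordinary text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


page = ('<html><body><p>This is <b>bold</b> text with a '
        '<a href="other.html">link</a>.</p></body></html>')

collector = TagCollector()
collector.feed(page)
collector.close()

print(collector.tags)   # ['html', 'body', 'p', 'b', 'a']
print(collector.text)   # ['This is', 'bold', 'text with a', 'link', '.']
```

The whole sample page, tags included, is made up of standard text characters, which is why it can be opened in any plain text editor.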

Are all program files binary files?

No. Remember that high-level languages are composed initially of command words that are typed using ordinary text. Before these files get translated into a processor's machine language, they are simple text files. They can not be executed until they are translated. But they can be transmitted as text files, using the faster 7 bit protocol. This is a perfect situation for Internet users. They can copy a program file that was written in the original high-level source code (text), which is not specific to any processor, and then translate it into the machine code for their own style of processor. For this reason, many public domain programs are stored at Internet sites in their original source code instead of as executable machine code specific to that host's processor.

Some examples of tell-tale filename extensions found on program source code files are:

Extension     File Type, Language, or Program
BAS           BASIC programming language source code
C             C programming language source code
CPP           C++ programming language source code
COB           COBOL programming language source code
FOR           FORTRAN programming language source code
JS            JavaScript programming language source code
PAS           Pascal programming language source code
PL            Perl scripting language source code

Some non-character data is purposely represented in text (7 bit) form to take advantage of the ease with which text can be manipulated and its faster transmission speed. One example of this is the special printer language known as PostScript that was written to control high quality laser printers. PostScript files can only be interpreted by PostScript printers or special interpreting software. But they can be transmitted over the Internet using ASCII (7 bit) protocol, because they are specially coded files made up solely of ASCII characters. PostScript files are identified by the filename extension "PS". For more information on other data formats and file types, see:

[http://en.wikipedia.org/wiki/File_format]

or

[http://www.umdnj.edu/idsweb/idst3400/overview.htm]