 

What is data type and how is it implemented?

I started programming two years ago, and one question has nagged at the back of my mind the whole time I program; until now I have simply silenced it.

I understand the basics of microprocessor architecture and low-level programming, and I understand that at the hardware level there is no such thing as a data type. It's just an abstraction to limit the way data is processed and to control memory resources.

So I know this is a deep and somewhat unclear question but hopefully you'll understand the piece in the puzzle missing for me to make sense of the link between high level programming and what actually goes in the hardware.

So my question is: what exactly is a data type, and how, where, and when is it implemented?

Ethienne asked Jan 25 '14 22:01




2 Answers

A data type is an element of the semantics of a language. It is a set of rules about what kind of information can be represented by a variable in the language, and the transformations that apply to those types of information.

It is implemented in the compiler or interpreter of the language. In a compiled language, it is implemented at compile time. In an interpreted language, it is implemented at run time: some of the rules are applied during the "initial parsing pass", and some are applied as the data itself is manipulated according to the semantics of the language during execution.


Elaboration in response to OP's comment:

A concrete example of what is going on might be the processing of this code, in C:

int i = "foo";

The C compiler first lexes this and concludes that it has a keyword followed by an identifier, followed by an operator, followed by a constant. Syntactically, it determines that this is an initialisation statement. It then comes to the semantic analysis and determines that it is being asked to assign a string constant to an integer variable. At this point, it concludes that this is not allowed semantically, because an integer datatype is not allowed to have a string value. The C compiler issues a diagnostic to this effect. When that diagnostic is an error (some compilers report only a warning by default for this particular conversion), it produces no output code: no assembly, no binary.

The effect of the datatype was to cause compilation to cease.

The implementation of the datatype is in the C compiler itself - in the code/logic of the compiler.

You can't "see" datatypes in the "assembly code" of a program itself. They exist in the mechanism that implements the language (compiler or interpreter), not in the resulting program.

Thus there is no such thing as "a piece of assembly code illustrating a datatype".

GreenAsJade answered Sep 30 '22 09:09


Well thinking about types in C

In The History of the C Language it says that one of the reasons Dennis Ritchie created C was that B (the language much of UNIX was written in prior to C) had really weak typing, so he "turned" the B language into the C language by adding types and structures.

A drawback of the B language was that it did not know data-types. (Everything was expressed in machine words). Another functionality that the B language did not provide was the use of “structures”. The lag of these things formed the reason for Dennis M. Ritchie to develop the programming language C.

I'll try to cover this quickly.

Looking at a typical 32-bit x86 register, eax for instance, you have:

00-00-00-f0h <- A bit-mask just to add some bits

expands to:

                                   ****  <-- [nybble] 4 bits
0000-0000 0000-0000 0000-0000 1111-0000b
                    ^^^^^^^^^             
                    ^ah       ^^^^^^^^^  <-- [byte] 8 bits
                              ^al
                    ^^^^^^^^^^^^^^^^^^^^ <-- [word] 16 bits
                    ^ax
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ <-- [dword] 32 bits
^eax

dword, word, and byte are the sizes you can actually manipulate with instructions. In a way, they serve as really basic types in assembly-level programming, but they are just sizes, and that isn't enough. We would like types that represent all kinds of things, not just sizes of data: characters, for instance. How can we tell whether a bit pattern is a number or a string of characters? Better yet, how can we tell whether a number is signed or unsigned? You can't. A particular bit pattern only makes sense in whatever context you're using it in. This can lead to bugs and confusing code, so higher-level languages implement types to help the data retain its meaning and to help prevent hard-to-find bugs.

In C, say we have the string of characters "hello world", referred to through a variable of type char *. If we open the program in a debugger and examine some of the instructions and the memory, perhaps we can make some sense of this.

Using GDB to examine the first 8 instructions in the main function, we get:

(gdb) x/8i $eip
=> 0x4015d3 <main+3>:   and    esp,0xfffffff0
   0x4015d6 <main+6>:   sub    esp,0x10
   0x4015d9 <main+9>:   call   0x401ff0 <__main>
   0x4015de <main+14>:  mov    DWORD PTR [esp+0xc],0x409064
   0x4015e6 <main+22>:  mov    eax,0x0
   0x4015eb <main+27>:  leave
   0x4015ec <main+28>:  ret
   0x4015ed <main+29>:  nop

Take notice of this instruction: mov DWORD PTR [esp+0xc],0x409064

What is this address (0x409064) that is being moved onto the stack (at esp+0xc)?

If we examine that address, we get:

(gdb) x/s 0x409064
0x409064 <__register_frame_info+4231268>:       "hello world"

That's the address where our string starts in memory. So when we create a variable of type char * in C, we are really storing a pointer to the data on the stack; when we reference that variable, we just need to grab the address off the stack. The good thing about addresses is that, on this 32-bit target, each one needs no more than 32 bits (a dword) on the stack, regardless of the size of the data it points to.

I could assume that C does the same thing when we create a single character, e.g. char ch = 'a'. Let's check:

(gdb) x/8i $eip
=> 0x4015d3 <main+3>:   and    esp,0xfffffff0
   0x4015d6 <main+6>:   sub    esp,0x10
   0x4015d9 <main+9>:   call   0x402000 <__main>
   0x4015de <main+14>:  mov    DWORD PTR [esp+0xc],0x409064
   0x4015e6 <main+22>:  mov    BYTE PTR [esp+0xb],0x61
   0x4015eb <main+27>:  mov    eax,0x0
   0x4015f0 <main+32>:  leave
   0x4015f1 <main+33>:  ret

No, it doesn't store a pointer on the stack.

Well, that changes things. Let's quickly examine the stack at a point after the variables have been stored on it:

Note: GDB calls a "word" what I called a dword, so when I ask for 5 hex words (x/5xw) I mean 5 hex dwords, which is what I get.

(gdb) x/5xw $esp
0x28fea0:       0x00401f80      0x00000000      0x61000023      0x00409064
0x28feb0:       0x00000023

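Look at the last two dwords on the first line, 0x61000023 and 0x00409064:

0x00409064 is the address of our data (char *).

0x61000023: this dword needs to lose a few bytes to make sense. Ignoring the low three bytes (000023), we are left with 0x61, the ASCII value for 'a'. x86 is little-endian, so that top byte is the one stored at the highest address of the dword, esp+0xb.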

The compiler has stored the 'a' (0x61) as the data itself, right next to our string pointer on the stack: esp+0xb holds the char and esp+0xc holds the char *. As you can see, types in C (much as in assembly) are closely related to sizes, and a lot of the work is done by the compiler. If the size of a type is hard to determine, C seems to use pointers (which are the size of a register); otherwise, if the size can be determined, the compiler just puts the data right on the stack.

(By "determine" I mean "control".)

And from all that I've only examined chars!

I'm sure there are lots and lots of other ways types are implemented in C alone, never mind all the other languages that exist and all the different ways they might do it.

Anyway, I hope that helps you out some and that I didn't mess anything up.


Extra Info:

Doing a quick search for compiler design, I found this pdf.

For information about any language, I feel I should refer you to its standard;
here is C's standard.

Another quick way to find information about a language:
do a Google search for [x language's] documentation.

For information specifically about types, I found this paper.

How I found that last paper is another good way to find information:
do a wiki search for whatever you're looking for, and check the bottom of the page for further reading and whatever references are on the page.

Now about the assembly code part:

You can and should use debuggers and examine how things work yourself. This guide, Beej's Quick Guide to GDB, looks like a pretty good start to GDB.

A quick way: passing the -S flag when you compile a C program with gcc gives you the assembly code listing for the program.

E.g. gcc -S file.c will give you file.s filled with the assembly code; add the -masm=intel flag to change the syntax from AT&T's to Intel's.

Just remember the compiler doesn't try to write your programs so a human will understand them, so things will probably look a little crazy to you at first!

James answered Sep 30 '22 09:09