I started programming 2 years ago, and there's one question that nags at the back of my mind whenever I program and that I keep silencing.
I understand the basics of microprocessor architecture and low level programming and I understand there is no such thing as a data type. It's just an abstraction to limit the way data is processed and control memory resources.
So I know this is a deep and somewhat unclear question but hopefully you'll understand the piece in the puzzle missing for me to make sense of the link between high level programming and what actually goes in the hardware.
So my question is: what exactly is a data type, and how, where and when is it implemented?
A data type, in programming, is a classification that specifies which type of value a variable has and what type of mathematical, relational or logical operations can be applied to it without causing an error.
Most modern computer languages recognize five basic categories of data types: Integral, Floating Point, Character, Character String, and composite types, with various specific subtypes defined within each broad category.
A data type is an attribute associated with a piece of data that tells a computer system how to interpret its value. Understanding data types ensures that data is collected in the preferred format and the value of each property is as expected.
A data type is an element of the semantics of a language. It is a set of rules about what kind of information can be represented by a variable in the language, and the transformations that apply to those types of information.
It is implemented in the compiler or interpreter of the language. In a compiled language, it is implemented at compile time. In an interpreted language, it is implemented at run time: some of the rules are applied during the "initial parsing pass", and some are applied as the data itself is manipulated according to the semantics of the language during execution.
Elaboration in response to OP's comment:
A concrete example of what is going on might be the processing of this code, in C:
int i = "foo";
The C compiler first lexes this and concludes that it has a keyword followed by an identifier, followed by an operator, followed by a constant. Syntactically, it determines that it is an initialisation statement. It then comes to the semantic analysis and determines that it is being asked to assign a string constant to an integer variable. At this point, it concludes that this is not allowed semantically, because an integer datatype is not allowed to have a string value. The C compiler issues a diagnostic to this effect and, when it treats the violation as an error (some compilers only warn by default), produces no output code: no assembly, no binary.
The effect of the datatype was to cause compilation to cease.
The implementation of the datatype is in the C compiler itself - in the code/logic of the compiler.
You can't "see" datatypes in the "assembly code" of a program itself. They exist in the mechanism that implements the language (compiler or interpreter), not in the resulting program.
Thus there is no such thing as "a piece of assembly code illustrating a datatype".
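To make that a bit more tangible, here is a minimal sketch, assuming a C11 compiler. TYPE_NAME and the type_of_* functions are hypothetical names made up for this illustration. _Generic picks a branch from the static type of its argument, so the choice is made entirely by the compiler; the compiled functions simply return a fixed string.

```c
/* TYPE_NAME is a hypothetical helper macro, not part of any standard header.
   _Generic (C11) selects exactly one branch from the static type of x.
   The selection happens during compilation, not while the program runs. */
#define TYPE_NAME(x) _Generic((x), \
    char:    "char",               \
    int:     "int",                \
    double:  "double",             \
    default: "other")

/* The size of a type is also a compile-time fact. */
_Static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");

/* Hypothetical demo functions: each one ends up returning a constant string. */
const char *type_of_int_literal(void)   { return TYPE_NAME(42); }
const char *type_of_char_cast(void)     { return TYPE_NAME((char)'a'); }
const char *type_of_double(void)        { return TYPE_NAME(1.5); }

/* In C, a character constant such as 'a' has type int, not char. */
const char *type_of_char_constant(void) { return TYPE_NAME('a'); }
```

Calling type_of_char_constant() gives "int", which is a nice reminder that the type is decided by the rules of the language, not by anything stored in the byte 0x61 itself.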
Well, thinking about types in C:
In The History of the C Language it says one of the reasons Dennis Ritchie made C was because B (the language much of UNIX was written in prior to C) had really weak typing, so Dennis Ritchie "turned" the B language into the C language by adding types and structures.
A drawback of the B language was that it did not know data-types. (Everything was expressed in machine words). Another functionality that the B language did not provide was the use of “structures”. The lag of these things formed the reason for Dennis M. Ritchie to develop the programming language C.
I'll try and quickly cover this..
Looking at a typical x86 32-bit register, eax for instance, you have:
00-00-00-f0h <- A bit-mask just to add some bits
expands to;
                              ****      <-- [nybble] 4 bits
0000-0000 0000-0000 0000-0000 1111-0000b
                    ^^^^^^^^^
                    ^ah       ^^^^^^^^^ <-- [byte] 8 bits
                              ^al
                    ^^^^^^^^^^^^^^^^^^^ <-- [word] 16 bits
                    ^ax
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ <-- [dword] 32 bits
^eax
dword, word and byte are the sizes you can actually manipulate with instructions. These serve, in a way, as really basic types in assembly-level programming, but they are just sizes, and that isn't enough: we would like types to represent all kinds of things, not just sizes of data, for instance characters. How can we tell whether a bit pattern is a number or a string of characters? Better yet, how can we tell whether a number is signed or unsigned? You can't: a particular bit pattern only makes sense in the context you're using it in. This can lead to bugs and confusing code, so higher-level languages implement types to help the data retain its meaning and to help prevent hard-to-find bugs.
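Here is a small sketch of that idea, reading one and the same bit pattern under different types. The helper names (as_unsigned8, as_signed8, as_float, as_chars) are made up for this example, and as_float assumes the usual 32-bit IEEE 754 float. memcpy is used so that the bytes are copied as they are, without any value conversion.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: each reads the same bits under a different type. */

/* 1111-0000b read as an unsigned byte. */
uint8_t as_unsigned8(uint8_t bits) { return bits; }

/* The same byte read as a signed (two's complement) byte. */
int8_t as_signed8(uint8_t bits) {
    int8_t v;
    memcpy(&v, &bits, sizeof v);    /* copy the byte, no value conversion */
    return v;
}

/* 32 bits read as a float; assumes a 32-bit IEEE 754 float. */
float as_float(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* 32 bits read as four characters; out must have room for 5 chars.
   The order of the characters depends on the machine's byte order. */
void as_chars(uint32_t bits, char out[5]) {
    memcpy(out, &bits, 4);
    out[4] = '\0';
}
```

With the pattern from the register diagram, as_unsigned8(0xF0) is 240 while as_signed8(0xF0) is -16. Likewise as_float(0x3F800000) is 1.0f, although the same bits read as an unsigned integer are 1065353216. Nothing in the bits says which reading is the right one; only the type in the source code does.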
In C, say we have the string of characters "hello world", of type char *. If we open this in a debugger and examine first some of the instructions and then the memory, perhaps we can make some sense of this.
Using GDB to examine the first 8 instructions in the main function, we get:
(gdb) x/8i $eip
=> 0x4015d3 <main+3>: and esp,0xfffffff0
0x4015d6 <main+6>: sub esp,0x10
0x4015d9 <main+9>: call 0x401ff0 <__main>
0x4015de <main+14>: mov DWORD PTR [esp+0xc],0x409064
0x4015e6 <main+22>: mov eax,0x0
0x4015eb <main+27>: leave
0x4015ec <main+28>: ret
0x4015ed <main+29>: nop
Take notice of this: mov DWORD PTR [esp+0xc],0x409064
What's this address (0x409064) being moved onto the stack (at esp+0xc)?
Well if we examine that address we get;
(gdb) x/s 0x409064
0x409064 <__register_frame_info+4231268>: "hello world"
That's the address where our string starts in memory. So when we create a char * in C, we are really storing a pointer to the data on the stack; when we reference that variable, we just need to grab the address off the stack. The good thing about addresses is that, on this 32-bit target, we don't need more than 32 bits (a dword) for each address on the stack, regardless of the type's size.
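The same observation can be sketched from inside C, without a debugger. The function names below are hypothetical, chosen just for this example; the point is only that the char * variable is pointer-sized no matter how long the string it points to is.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helpers: a char * variable holds an address,
   while the characters themselves live somewhere else. */

/* Size of the pointer variable itself, e.g. 4 bytes on 32-bit x86. */
size_t pointer_size(void) {
    const char *s = "hello world";
    return sizeof s;
}

/* Size of the character data: 11 characters plus the terminating '\0'. */
size_t string_storage(void) {
    return sizeof "hello world";
}

/* strlen follows the pointer and counts characters up to the '\0'. */
size_t string_length(void) {
    const char *s = "hello world";
    return strlen(s);
}
```

On the 32-bit build examined above I would expect pointer_size() to be 4, matching the DWORD PTR store, while string_storage() is 12 on any platform.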
I could assume that C does the same thing when we create a single char, i.e. char ch = 'a'. Let's check:
(gdb) x/8i $eip
=> 0x4015d3 <main+3>: and esp,0xfffffff0
0x4015d6 <main+6>: sub esp,0x10
0x4015d9 <main+9>: call 0x402000 <__main>
0x4015de <main+14>: mov DWORD PTR [esp+0xc],0x409064
0x4015e6 <main+22>: mov BYTE PTR [esp+0xb],0x61
0x4015eb <main+27>: mov eax,0x0
0x4015f0 <main+32>: leave
0x4015f1 <main+33>: ret
No, it doesn't store a pointer on the stack.
Well, that changes things. Let's quickly examine the stack at a point after the variables have been stored on it.
Note: gdb calls a word what I called a dword, so when I ask for 5 hex words (x/5xw) I mean 5 hex dwords, which is what I get.
(gdb) x/5xw $esp
0x28fea0: 0x00401f80 0x00000000 0x61000023 0x00409064
0x28feb0: 0x00000023
Look at the last two dwords on the first line, 0x61000023 and 0x00409064:
0x00409064 is the address of our data (char *).
0x61000023: this dword needs to lose a few bytes to make sense. Ignoring 000023, we are left with 0x61, the ASCII value for 'a'.
The compiler has stored the 'a' (0x61) as the data itself, right next to the pointer to our string on the stack: esp+0xb holds the char and esp+0xc holds the char *. As you can see, types in C (similar to assembly) are closely related to sizes, and a lot of the work is done by the compiler. If the size of a type is hard to determine, C seems to use pointers (which are the size of a register); otherwise, if it's a type whose size can be determined, the compiler just puts the data right on the stack.
(By determine I mean control.)
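To spell out why the 0x61 shows up at the top of that dword: x86 is little-endian, so the byte at the highest address (esp+0xb) is the most significant byte of the dword that starts at esp+0x8. A minimal sketch, where le_byte_at is a hypothetical helper written for this example:

```c
#include <stdint.h>

/* Hypothetical helper: returns the byte found at offset `index`
   (0 = lowest address, 3 = highest) of a dword as it is laid out in
   little-endian memory. It uses shifts, so the result does not depend
   on the byte order of the machine running this code.
   index is assumed to be in the range 0..3. */
uint8_t le_byte_at(uint32_t dword, unsigned index) {
    return (uint8_t)(dword >> (8u * index));
}
```

For the dword dumped at esp+0x8, le_byte_at(0x61000023, 3) is 0x61, which is the char stored at esp+0xb, and le_byte_at(0x61000023, 0) is 0x23, one of the leftover bytes that has nothing to do with our variables.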
And from all that I've only examined chars!
I'm sure there are lots and lots of other ways that types are implemented in C alone, never mind all the other languages that exist and all the different ways they might do it.
Anyway, I hope that helps you out some and that I didn't mess anything up.
Extra Info:
Doing a quick search for compiler design, I found this pdf.
For information about any language, I feel I should refer you to its standard; here is C's standard.
Another quick way to find information about a language is to do a Google search for [x language's] documentation.
For information specifically about types, I found this paper.
How I found the last paper is another good way to find information: do a wiki search for whatever you're looking for, and check the bottom of the page for further reading and whatever references are on the page.
Now about the assembly code part;
You can and should use debuggers and examine how things work yourself.
This guide, Beej's Quick Guide to GDB, looks like a pretty good start to GDB.
A quick way: including the -S flag when you compile a C program with gcc will give you the actual assembly code listing for the program, i.e. gcc -S file.c will give you file.s filled with the assembly code. Add the -masm=intel flag to change the syntax from AT&T's to Intel's.
Just remember that the compiler doesn't try to write your programs so a human will understand them, so things will probably look a little crazy to you at first!
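If you want something small to try this on, here is a sketch of a file.c with the same two kinds of locals examined earlier. The function name locals_demo is made up, and the volatile qualifiers are my own addition so that the stores stay visible in the listing even if optimisation is turned on. The exact instructions and offsets you get will depend on your compiler, version and flags.

```c
/* file.c: a hypothetical test file with a char and a char * local. */
char locals_demo(void) {
    volatile char ch = 'a';                  /* a single byte of data */
    const char *volatile s = "hello world";  /* a pointer to the string data */
    (void)s;                                 /* read s so it is not reported as unused */
    return ch;
}
```

Compiling it with gcc -S -masm=intel file.c and opening file.s should let you compare how the two locals are stored.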