 

Are Bytecode and Assembly Language the same thing?

The question might seem odd, but I am still trying to grasp the concepts of virtual machines. I have read several answers, but I still don't understand whether Java bytecode (and MSIL as well) is the same as assembly language. As far as I understand, both bytecode and assembly get compiled to machine code, so in terms of abstraction they are at the same level, i.e. one step above machine code. So is bytecode just an assembly language, i.e. a human-readable form of machine code? If yes, then why is assembly language still used? Why not program in bytecode (which is portable across different machines) instead of assembly language (which is specific to a single machine architecture)? Thanks

Gianluca John Massimiani asked Aug 22 '16 16:08




3 Answers

Bytecode and assembly language are not the same thing, but they are closely related.

Bytecode is a simplified binary language, similar to machine code. A bytecode specification describes how a program must be encoded so that the virtual machine will correctly understand and execute it. In the same way, a processor specification describes the so-called Instruction Set Architecture (ISA), which shows how a program must be encoded in binary machine code so that the processor will correctly understand and execute it. So bytecode is a machine-friendly representation of the program in the form of a sequence of bits.

The problem with bytecode is that while it is extremely convenient for machines to handle, it is extremely inconvenient for humans. Assembly language provides a text-based, and thus human-friendly, equivalent of bytecode. More precisely, an assembly language establishes a 1-to-1 mapping between the binary instructions of a bytecode and their text equivalents, giving a programmer a convenient way to read, understand and write programs in that particular bytecode (for a particular processor or virtual machine). In other words, bytecode and assembly language describe the program at the same level of abstraction but in different terms.

The strict 1-to-1 mapping between a bytecode instruction and a statement in assembly language allows easy and unambiguous conversion of a program from binary form to text form and vice versa. As you may have noticed, there are plenty of disassemblers that let engineers look under the hood of already compiled applications by converting them from binary bytecode into assembly language text.

Converting assembly text into bytecode requires a translation step, usually called assembling. But in contrast to high-level programming languages, this translation is very simple. The assembler consumes the program text statement by statement. Assembly languages usually require each statement to be placed on a separate line of the program text, so the assembler consumes that text line by line. From each line it extracts a sequence of words and punctuation characters, ignoring comments, and uses that sequence as a key into a mapping table to find the equivalent sequence of binary bytes that represents the same instruction. That sequence of bytes is placed into the bytecode of the program. In fact, it is to eliminate the overhead of text parsing that Java ships binary bytecode: the JIT compiles machine code from the bytecode, not from assembly text.
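The line-by-line table lookup described above can be sketched in a few lines of Python. This is only an illustration: the four-instruction stack machine, its opcode numbers and the `;` comment syntax are invented for the example and do not correspond to Java bytecode or to any real ISA.

```python
# Toy instruction set for a hypothetical stack VM (not Java bytecode or a real ISA).
# Each mnemonic maps to one opcode byte; PUSH carries a one-byte operand.
OPCODES = {"PUSH": 0x01, "ADD": 0x02, "MUL": 0x03, "PRINT": 0x04}
HAS_OPERAND = {"PUSH"}
MNEMONICS = {code: name for name, code in OPCODES.items()}


def assemble(source):
    """Translate the text line by line, one statement per line."""
    out = bytearray()
    for line in source.splitlines():
        words = line.split(";", 1)[0].split()  # drop the comment, split into words
        if not words:
            continue  # blank or comment-only line
        mnemonic = words[0].upper()
        out.append(OPCODES[mnemonic])  # table lookup: text -> opcode byte
        if mnemonic in HAS_OPERAND:
            out.append(int(words[1]) & 0xFF)
    return bytes(out)


def disassemble(code):
    """Inverse mapping: opcode bytes back to text statements."""
    lines, i = [], 0
    while i < len(code):
        name = MNEMONICS[code[i]]
        i += 1
        if name in HAS_OPERAND:
            lines.append(f"{name} {code[i]}")
            i += 1
        else:
            lines.append(name)
    return lines


program = """
PUSH 2      ; first operand
PUSH 3
ADD
PRINT
"""
binary = assemble(program)
print(binary.hex(" "))        # 01 02 01 03 02 04
print(disassemble(binary))    # ['PUSH 2', 'PUSH 3', 'ADD', 'PRINT']
```

Because the mapping is a plain table in both directions, the disassembler recovers the statements exactly, minus the comments.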

Also, in contrast to high-level languages, assembling does not require complex syntax analysis (building an abstract syntax tree) or semantic analysis, nor does it optimize the produced bytecode. Assemblers are very simple in comparison to modern compilers. And in contrast to high-level programming languages, an assembly language is always tied to a particular bytecode, and thus to a particular processor or virtual machine. High-level languages were initially introduced as a means of making programs portable, and hence they are designed to be general enough. In contrast, programs in assembly languages are not portable, but they give programmers full access to all the features of the respective processor or virtual machine, many of which are not accessible from a high-level language.

The idea behind programming languages such as Java and C# is to preserve the portability of high-level languages while minimizing the overhead of the interpretation/compilation required to execute the program. This is why they employ virtual machines and bytecode.

Note that the same bytecode can be supported by multiple assembly languages, because there can be multiple dictionaries that map the same bytecode instructions 1-to-1 to different text strings. Each assembly language can provide its own sequence of words to describe the same binary instruction. For example, take a look at the x86 assemblers. Intel uses one notation, Microsoft's assembler another, and the GNU assembler yet another (AT&T syntax by default). But all of them assemble to the same machine code.

ZarathustrA answered Sep 29 '22 18:09


No.

Java bytecode is a binary programming language, not a "human-readable form", unless you consider a bunch of numbers readable, or you use a disassembler to turn it back into bytecode text mnemonics, or ultimately into Java source itself.
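To see the "bunch of numbers" versus the disassembled mnemonics side by side, here is a small sketch using Python's own bytecode and its standard `dis` module. Python bytecode is a different bytecode from Java's and is used here only as an analogy; for Java class files, the JDK's `javap -c` plays the disassembler role. The exact numbers and mnemonic names printed depend on the Python version, so no expected output is shown.

```python
import dis


def add(a, b):
    return a + b


# The binary form: a bytes object, readable only as a list of numbers.
raw = add.__code__.co_code
print(list(raw))

# The disassembler maps each instruction back to a text mnemonic.
names = [ins.opname for ins in dis.get_instructions(add)]
print(names)
```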

Assembly is usually a set of text mnemonics for the actual instructions of the target machine, mapped 1:1, so one instruction in the assembler source translates directly into one machine code instruction. Some exceptions exist with some CPUs and assemblers: for example, many RISC assemblers will translate "load register with immediate value" into multiple instructions as needed, because the native machine code can load only some of the bits in one instruction and the whole value has to be composed by several instructions.
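The RISC exception can be illustrated with the MIPS `li` pseudo-instruction, which is commonly expanded into `lui` (load upper 16 bits) followed by `ori` (OR in the lower 16 bits). The function below is a simplified sketch of that expansion, not the logic of any particular assembler; real assemblers may pick other sequences, for example for negative values.

```python
def expand_li(reg, value):
    """Expand a MIPS-style `li reg, value` pseudo-instruction for a 32-bit value."""
    value &= 0xFFFFFFFF
    hi, lo = value >> 16, value & 0xFFFF
    if hi == 0:
        # Fits in a 16-bit immediate: one real instruction is enough.
        return [f"ori {reg}, $zero, {lo:#x}"]
    out = [f"lui {reg}, {hi:#x}"]                  # load the upper 16 bits
    if lo:
        out.append(f"ori {reg}, {reg}, {lo:#x}")   # OR in the lower 16 bits
    return out


print(expand_li("$t0", 0x1234))        # ['ori $t0, $zero, 0x1234']
print(expand_li("$t0", 0x12345678))    # ['lui $t0, 0x1234', 'ori $t0, $t0, 0x5678']
```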

Java bytecode is a fairly high-level abstraction compared to the machine code of most CPUs, with very little overlap in instructions and memory model. The only similarity is that bytecode is stored in binary form, just like machine code.


edit:

The JVM is in principle an interpreter, i.e. it translates the bytecode into machine code on the fly. In other languages that work is done by the compiler at compile time.

Modern JVMs are not classic pure interpreters. They use a "JIT" (Just In Time) compiler to compile small pieces of Java bytecode into native machine code just ahead of its execution, using caches to avoid compiling already known .class files a second time. They also track performance data at runtime to tell the JIT compiler which bytecode should be optimized heavily (code that runs often, or an inner loop) and which should just be compiled as quickly as possible, without a focus on performance.
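The interplay of interpretation, runtime counters and cached compilation can be sketched with a toy model. Everything here is an assumption made for illustration: the tiny stack bytecode, the `HOT_THRESHOLD` value, and the "native code", which is just a Python lambda standing in for machine code. Real JVMs are far more elaborate.

```python
HOT_THRESHOLD = 3  # hypothetical; real JVMs use far more elaborate heuristics


def interpret(code, x):
    """Slow path: decode and execute one instruction at a time."""
    stack = [x]
    for op, arg in code:
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop()


def compile_to_native(code):
    """Stand-in for a JIT: translate the whole program once into a Python function."""
    expr = ["x"]
    for op, arg in code:
        if op == "PUSH":
            expr.append(repr(arg))
        else:
            b, a = expr.pop(), expr.pop()
            expr.append(f"({a} {'+' if op == 'ADD' else '*'} {b})")
    return eval(f"lambda x: {expr.pop()}")


class Method:
    def __init__(self, code):
        self.code, self.calls, self.compiled = code, 0, None

    def __call__(self, x):
        if self.compiled is not None:
            return self.compiled(x)          # cached translation, no decoding
        self.calls += 1
        if self.calls >= HOT_THRESHOLD:      # the counter says this method is hot
            self.compiled = compile_to_native(self.code)
        return interpret(self.code, x)


f = Method([("PUSH", 2), ("MUL", None), ("PUSH", 1), ("ADD", None)])  # x * 2 + 1
results = [f(n) for n in range(5)]
print(results, f.compiled is not None)   # [1, 3, 5, 7, 9] True
```

The first three calls go through the interpreter; from the fourth call on, the cached translation runs instead and the bytecode is never decoded again.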

So with a modern JVM it's hard to talk about an interpreter; it's a sophisticated and complex piece of software. C# often goes one step further, sometimes delivering parts of the binaries pre-compiled into machine code for common platforms (keeping the bytecode form only as a fallback for uncommon platforms).

None of this, nor anything similar, happens with machine code. It just executes on the CPU.

Ped7g answered Oct 18 '22 22:10


An assembly language is a human-readable text language designed to be assembled into a binary. Each source line maps directly to one chunk of binary output (e.g. one variable-length x86 instruction), without depending on previous lines. (I'm not sure if Java bytecode asm is context-sensitive; I haven't used it).

e.g. mov eax, 1234 assembles to the same 5 bytes regardless of what other source lines surround it. (Ignoring named constants and assembler macros, of course).
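As a worked illustration of those 5 bytes: in 32-bit or 64-bit mode, `mov eax, imm32` is encoded as the opcode byte `0xB8` followed by the 32-bit immediate in little-endian order. The helper name below is made up for this sketch.

```python
import struct


def encode_mov_eax_imm32(value):
    # Opcode 0xB8 is `mov eax, imm32`; the immediate follows as 4 little-endian bytes.
    return struct.pack("<BI", 0xB8, value & 0xFFFFFFFF)


encoded = encode_mov_eax_imm32(1234)
print(encoded.hex(" "))   # b8 d2 04 00 00
```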

The default meaning of "assembly language" (the one described in the assembly tag wiki) is CPU machine-code assembly language, where the bytes being assembled into the output file are instructions and data for a native executable for some kind of CPU / microprocessor.

Other kinds of assembly languages exist, like Java bytecode assembly, where the bytes assembled into the output file are in Java .class format and can be run by a JVM. (@Ped7g's answer expands on this point, about how a JVM can optimize while translating Java bytecode into native machine code. This process is definitely not like assembling.)

It's all just a text language that causes the assembler to assemble bytes into the output file.


You could have an assembly language for any kind of binary file format, even non-executable ones. A simple example: an assembly language for a bitmap still-image file format, where you can use named colours (like midnight blue) for each pixel. The assembler would assemble bits (instead of only whole bytes like normal assembly languages) into the output file.
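A minimal sketch of such a bitmap assembler, assuming an invented 2-bit-per-pixel format with a fixed 4-colour palette (the format, the palette and the colour spellings are hypothetical, chosen only to show bits rather than whole bytes being assembled):

```python
# Hypothetical 2-bit-per-pixel format: each pixel is an index into this palette.
PALETTE = {"black": 0b00, "white": 0b01, "midnightblue": 0b10, "red": 0b11}


def assemble_pixels(source):
    """Pack one named colour per word into 2 bits each, most significant bits first."""
    bits, nbits, out = 0, 0, bytearray()
    for name in source.split():
        bits = (bits << 2) | PALETTE[name]
        nbits += 2
        if nbits == 8:          # a full byte has been assembled
            out.append(bits)
            bits, nbits = 0, 0
    if nbits:
        out.append(bits << (8 - nbits))  # pad the last byte with zero bits
    return bytes(out)


row = "midnightblue midnightblue white red black"
print(assemble_pixels(row).hex(" "))   # a7 00
```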

In a more complex case, you could imagine an H.264 assembly language, where you use a text syntax to describe the coding of headers and each macroblock.

In this case, you'd design the assembler to do the final CABAC or CAVLC compression of the assembled macroblock data into a bitstream, instead of describing that as part of the assembly language. It would be like an x86 assembler that produced gzipped binaries: assemble into a deflate stream.


One key feature of an assembly language is that it's close enough to the machine-code format that a disassembler can turn a binary back into asm that looks like what was assembled in the first place (but without any comments, label names, or macros, of course).

This is why C and Java are considered higher level languages than the binary/assembly their compilers produce as output.

Peter Cordes answered Oct 18 '22 21:10