  • C Source code to Executable Instructions

    What does this post help you with ?

    Microprocessors only deal with 1s and 0s, and they execute instructions at a certain speed. How that speed is measured, and processor performance in general, are topics of their own. Here we keep it simple and follow the journey of the source code.

    How does a piece of code written in C become instructions that a processor executes ?

    Let’s assume we are using a Linux-based system with GCC to make this journey. The idea is to give a quick introduction. This is material I have gathered over a period of time, so there may be some mistakes (disclaimer).

    High Level Steps

    Compile

    You need a compiler like GCC. What the compiler does: it reads the C code and creates assembly code. Assembly is a human-readable language that maps closely to the actual microprocessor instructions. It is still another language (sometimes called ASM), not the binary instructions that run on the microprocessor. Why is this step here? The bottom line is that a high-level language needs to be translated to a lower-level language. Along the way the compiler can optimize the code and build the same source for multiple hardware targets (cross compilation), and there are many more reasons.

    GCC has 15 million+ lines of code (mostly written in C, about 4 GB, as of Jan 2025).

    The compiler does a lot of transformations and optimizations before it emits assembly code for the target hardware (the processor and the “Digital Computer Architecture”).

    Compiler internals – a compiler is logically split into front-end, middle-end, and back-end components.

    Compiler Front End – reads the source code (the PARSER) and generates an intermediate representation for the optimizations that follow.

    AST – the Abstract Syntax Tree is what the code gets “parsed” into. The AST represents the entire program, with nodes and edges representing the syntactic elements of C.
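
    To make the idea of nodes and edges concrete, here is a minimal sketch of a tree for the expression 1 + 2 * 3. The names (node_kind, ast_node, ast_eval, ast_demo) are made up for illustration and are not GCC's own data structures, which are far richer.

```c
#include <stddef.h>

/* Hypothetical node kinds for a toy expression language (not GCC's tree codes). */
enum node_kind { NODE_NUM, NODE_ADD, NODE_MUL };

struct ast_node {
    enum node_kind kind;
    int value;                     /* used only when kind == NODE_NUM */
    const struct ast_node *left;   /* edges to child nodes; NULL for leaves */
    const struct ast_node *right;
};

/* Walk the tree bottom-up and compute the value of the expression. */
int ast_eval(const struct ast_node *n)
{
    switch (n->kind) {
    case NODE_NUM: return n->value;
    case NODE_ADD: return ast_eval(n->left) + ast_eval(n->right);
    case NODE_MUL: return ast_eval(n->left) * ast_eval(n->right);
    }
    return 0; /* not reached for well-formed trees */
}

/* Build the tree for 1 + 2 * 3 by hand and evaluate it. */
int ast_demo(void)
{
    struct ast_node one   = { NODE_NUM, 1, NULL, NULL };
    struct ast_node two   = { NODE_NUM, 2, NULL, NULL };
    struct ast_node three = { NODE_NUM, 3, NULL, NULL };
    struct ast_node mul   = { NODE_MUL, 0, &two, &three };
    struct ast_node add   = { NODE_ADD, 0, &one, &mul };
    return ast_eval(&add);
}
```

    Notice that the multiplication sits below the addition in the tree, so operator precedence is captured by the shape of the tree itself. Calling ast_demo() from a main function gives 7.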

    The intermediate representations in GCC are GENERIC, GIMPLE (a machine-independent representation), and RTL (Register Transfer Language). RTL is lower level and is also used to write the machine descriptions, which depend on the hardware architecture (x86_64, AArch64, RISC-V, ARM, etc.).

    Compiler Middle End – runs hundreds of optimizations on the intermediate representation
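
    One of the simplest optimizations to picture is constant folding: arithmetic on constants is done once at compile time instead of every time the program runs. The sketch below shows the idea by hand; the function names are made up for illustration, and the second function is only my picture of what the compiler effectively reduces the first one to.

```c
/* Written the way a person might write it: the arithmetic is spelled out. */
int seconds_per_day(void)
{
    return 60 * 60 * 24;
}

/* What the function above is effectively reduced to after constant folding. */
int seconds_per_day_folded(void)
{
    return 86400;
}
```

    If you generate assembly for this file with gcc -O2 -S, the two function bodies should look the same, each just loading the constant 86400.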

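    Compiler Back End – the code generator that turns the optimized intermediate representation (RTL) into the actual assembly code for the target. The linker is not part of this; it is a separate program that runs later (see Link below).

    The intermediate representations connect these three logical layers, since each layer hands its output to the next. The control flow graph is one of the data structures the optimizations work on.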

    Concepts are good, but how do I “see” things? Write a simple helloworld in C first, then use the -S option to make GCC stop after generating the assembly code.

    gcc -o helloworld.asm -S helloworld.c
    gedit helloworld.asm

    Assemble

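    Use the assembler to turn the assembly code into machine instructions. The assembler takes one assembly file and converts it into one object file containing those instructions. If your code is spread over different files and folders, each file is assembled into its own object file. The key point: the assembler converts “just” your code. Calls to code you did not write, such as printf, are left as unresolved references for the linker to fill in.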

    Time to check things out for yourself

    as --version
    as -o helloworld.obj helloworld.asm
    objdump -d helloworld.obj
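
    On Linux the object file that the assembler writes is in the ELF format, and every ELF file starts with the same four bytes: 0x7f followed by the letters E, L, F. Here is a minimal sketch that checks a file for that signature. The function name is_elf_file is made up for illustration; the file name you pass in is whatever you produced above.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if the file at `path` starts with the ELF magic bytes, else 0. */
int is_elf_file(const char *path)
{
    static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };
    unsigned char head[4];
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return 0;                  /* missing or unreadable file */
    size_t n = fread(head, 1, sizeof head, f);
    fclose(f);
    return n == sizeof head && memcmp(head, magic, sizeof magic) == 0;
}
```

    Calling is_elf_file("helloworld.obj") from a main function should return 1 for the object file, and 0 for helloworld.asm, which is plain text.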

    Link

    Use the linker to do the actual linking and generate the binary executable. Why is this needed? Your code does not include much of the other code that is needed to actually do what you asked. For example, you want to print “Hello World” on the screen. You did not write code to call the display driver, or even to make an operating system call to do that; you simply said print! The linker has to find the code (let’s say libraries here) that must be combined with yours to make it do the actual thing. This is where libc comes into play. Libc is a massive library that does a lot of this for you. It is worth mentioning because, at least for me, that was a missing link: I want to know all the dots that connect to make a line, especially the last one.
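
    To get a feel for what libc hides, here is a rough sketch of printing the same message one level lower, by handing the bytes to standard output with write. The function name say_hello_with_write is made up for illustration. Note that write is itself a small function that lives in libc and wraps the operating system call, so even this version needs the linker to bring in libc.

```c
#include <string.h>
#include <unistd.h>

/* Sends the greeting straight to file descriptor 1 (standard output).
 * Returns the number of bytes written, or -1 on error. */
long say_hello_with_write(void)
{
    const char msg[] = "Hello World\n";
    return (long)write(STDOUT_FILENO, msg, strlen(msg));
}
```

    This skips the formatting and buffering that printf gives you, which is part of why almost nobody writes it this way.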

    Here is what you can do on your computer:

    gcc -v -o helloworld helloworld.obj

    The linker here is actually not GCC. GCC just calls a program called collect2, which is a wrapper that in turn runs the real linker, ld. Here is the actual call it made for me:

    /usr/lib/gcc/x86_64-linux-gnu/11/collect2 -plugin /usr/lib/gcc/x86_64-linux-gnu/11/liblto_plugin.so -plugin-opt=/usr/lib/gcc/x86_64-linux-gnu/11/lto-wrapper -plugin-opt=-fresolution=/tmp/ccnGfceH.res -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s --build-id --eh-frame-hdr -m elf_x86_64 --hash-style=gnu --as-needed -dynamic-linker /lib64/ld-linux-x86-64.so.2 -pie -z now -z relro -o helloworld /usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/Scrt1.o /usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/crti.o /usr/lib/gcc/x86_64-linux-gnu/11/crtbeginS.o -L/usr/lib/gcc/x86_64-linux-gnu/11 -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/11/../../.. helloworld.obj -lgcc --push-state --as-needed -lgcc_s --pop-state -lc -lgcc --push-state --as-needed -lgcc_s --pop-state /usr/lib/gcc/x86_64-linux-gnu/11/crtendS.o /usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/crtn.o

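    So here the linker links against the libc shared object (.so) file (the -lc option above), along with startup files such as Scrt1.o and crti.o. Without libc, this program could not be linked into a binary.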

    There are some good files to review in the GCC source code: https://github.com/gcc-mirror/gcc.git

    git clone https://github.com/gcc-mirror/gcc.git
    cd gcc

    Look for these files:

    rtl.def       ./gcc/rtl.def
    gimple.def    ./gcc/gimple.def
    passes.def    ./gcc/passes.def

    RTL-based machine descriptions are in *.md files (not to be confused with Markdown files).

    Hope this helps you with whatever you were trying to get out of this page. If you are reading this line, leave a comment below if you found an inaccuracy, would like to know something else, or just want to say something.