
jnk0le / cortexm-AES

Licence: MIT license
high performance AES implementations optimized for cortex-m microcontrollers

Programming Languages

assembly
C++
c


cortexm AES

FIPS 197 compliant software AES implementation optimized for real world cortex-m microcontrollers.

build

Repository root directory is expected to be the only include path.

If the repo is added as an eclipse linked folder, the root folder has to be added to the ASM, C and CPP include paths (-I) (project properties -> C++ build -> settings)

Includes also have to start from root (e.g. #include <aes/cipher.hpp>)

No cmake yet.

notes

  • Do not use ECB cipher mode for any serious encryption.
  • Do not blindly trust the constant-time behaviour of LUT based ciphers: it depends on many factors that are unknown or implementation defined, such as section placement or pipeline surprises. You need to verify it yourself, especially where the .data section ends up.
  • LUT tables have to be placed in a deterministic memory section, usually TCMs or non-waitstated SRAMs (by default they land in the .data section).
  • FLASH memory is unsafe even on the simplest cortex-m0(+), as there might be a prefetcher with a few-entry cache (as on stm32f0/l0).
  • None of the currently available implementations protects against power/EMI analysis attacks.
  • Do not use the cortex-m3 and cortex-m4 implementations on cortex-m7: they are slower there and will introduce timing leaks.
  • Unrolled ciphers might perform slower than looped versions due to (usually LRU) cache pressure and flash waitstates (e.g. STM32F4 with 1K ART cache and up to 8WS).
  • Input/output buffers might have to be word aligned due to the use of ldm, stm, ldrd and strd instructions.
  • For optimization gimmicks refer to the pipeline cycle test repo.
  • The included unit tests don't cover timing leaks (performance differences between runs are not necessarily data dependent).
  • asm functions (and CM*.h headers) can be extracted and used as C only code, but that may require extra boilerplate code (structures etc.)
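
The word-alignment note can be handled defensively on the caller side. The following is a minimal sketch, not part of this library: the names `is_word_aligned` and `aligned_block` are hypothetical, and it assumes a 16-byte AES block that may be copied into a word-aligned staging buffer before it is handed to a cipher function.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Returns nonzero when the pointer sits on a 4-byte boundary. */
static inline int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}

/* Hypothetical helper: returns `in` unchanged when it is already word
 * aligned, otherwise copies the 16-byte block into `staging` (which is
 * word aligned by its type) and returns that instead. */
const uint8_t *aligned_block(const uint8_t *in, uint32_t staging[4])
{
    if (is_word_aligned(in))
        return in;
    memcpy(staging, in, 16);
    return (const uint8_t *)staging;
}
```

The copy costs a few cycles per block, so it is probably only worth it when the buffer origin is not under the caller's control.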

base implementations

cortex-m0/m0+

CM0_sBOX

Uses simple sbox with parallel mixcolumns

Forward mixcolumns is done as follows (according to this or this paper, it can be done with 3 xor + 3 rotations or, as used here, with 4 xor + 2 rotations):

tmp = s0 ^ s1 ^ s2 ^ s3
s0` ^= tmp ^ gmul2(s0^s1) // s1^s2^s3^gmul2(s0^s1)
s1` ^= tmp ^ gmul2(s1^s2) // s0^s2^s3^gmul2(s1^s2)
s2` ^= tmp ^ gmul2(s2^s3) // s0^s1^s3^gmul2(s2^s3)
s3` ^= tmp ^ gmul2(s3^s0) // s0^s1^s2^gmul2(s3^s0)
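
The formula is lane-wise, so it holds whether s0..s3 are single bytes or packed words. Below is a small portable C sketch under one assumed layout (not confirmed by the sources): `s[i]` holds row i of up to four columns, one column per byte lane. The function name `mixcolumns_rows` is hypothetical, and the assembly implementation may arrange the data differently.

```c
#include <stdint.h>

/* Multiply four packed GF(2^8) bytes by 2 (AES polynomial 0x11b). */
static inline uint32_t gmul2(uint32_t in)
{
    uint32_t mask = in & 0x80808080u;
    return ((in & 0x7f7f7f7fu) << 1) ^ ((mask - (mask >> 7)) & 0x1b1b1b1bu);
}

/* Assumed layout: s[i] = row i of four columns, one column per byte lane.
 * All four outputs are computed from the original input values. */
void mixcolumns_rows(uint32_t s[4])
{
    uint32_t s0 = s[0], s1 = s[1], s2 = s[2], s3 = s[3];
    uint32_t tmp = s0 ^ s1 ^ s2 ^ s3;

    s[0] = s0 ^ tmp ^ gmul2(s0 ^ s1); /* s1^s2^s3^gmul2(s0^s1) */
    s[1] = s1 ^ tmp ^ gmul2(s1 ^ s2); /* s0^s2^s3^gmul2(s1^s2) */
    s[2] = s2 ^ tmp ^ gmul2(s2 ^ s3); /* s0^s1^s3^gmul2(s2^s3) */
    s[3] = s3 ^ tmp ^ gmul2(s3 ^ s0); /* s0^s1^s2^gmul2(s3^s0) */
}
```

As a sanity check, the commonly used MixColumns test column db 13 53 45 maps to 8e 4d a1 bc when placed in any byte lane.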

Inverse mixcolumns is implemented as:

S{2} = gmul2(S{1})
S{4} = gmul2(S{2})
S{8} = gmul2(S{4})

S{9} = S{8} ^ S{1}
S{b} = S{9} ^ S{2}
S{d} = S{9} ^ S{4}
S{e} = S{8} ^ S{4} ^ S{2}

out = S{e} ^ ror8(S{b}) ^ ror16(S{d}) ^ ror24(S{9})
	
s0{e}^s1{b}^s2{d}^s3{9} | s1{e}^s2{b}^s3{d}^s0{9} | s2{e}^s3{b}^s0{d}^s1{9} | s3{e}^s0{b}^s1{d}^s2{9}
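
The steps above translate almost directly into C. This is a sketch, assuming one column is packed into a 32-bit word with s0 in the least significant byte (which is what makes the ror8/ror16/ror24 lane selection in the last line work). `inv_mixcolumn` and `ror32` are illustrative names, not identifiers from this repository.

```c
#include <stdint.h>

/* Multiply four packed GF(2^8) bytes by 2 (AES polynomial 0x11b). */
static inline uint32_t gmul2(uint32_t in)
{
    uint32_t mask = in & 0x80808080u;
    return ((in & 0x7f7f7f7fu) << 1) ^ ((mask - (mask >> 7)) & 0x1b1b1b1bu);
}

/* Rotate right; only called with n = 8, 16 or 24 here. */
static inline uint32_t ror32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}

/* Inverse MixColumns on one packed column (s0 in the low byte). */
uint32_t inv_mixcolumn(uint32_t s1)
{
    uint32_t s2 = gmul2(s1);
    uint32_t s4 = gmul2(s2);
    uint32_t s8 = gmul2(s4);

    uint32_t s9 = s8 ^ s1;      /* x9  = x8 ^ x1      */
    uint32_t sb = s9 ^ s2;      /* x11 = x9 ^ x2      */
    uint32_t sd = s9 ^ s4;      /* x13 = x9 ^ x4      */
    uint32_t se = s8 ^ s4 ^ s2; /* x14 = x8 ^ x4 ^ x2 */

    return se ^ ror32(sb, 8) ^ ror32(sd, 16) ^ ror32(s9, 24);
}
```

With this packing, the column 8e 4d a1 bc is the word 0xbca14d8e and maps back to db 13 53 45, i.e. 0x455313db.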

gmul2() is implemented as:

mask = in & 0x80808080;
out = ((in & 0x7f7f7f7f) << 1) ^ ((mask - (mask >> 7)) & 0x1b1b1b1b);
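
To see why the mask trick works: `mask - (mask >> 7)` turns every byte whose top bit was set into 0x7f, so ANDing with 0x1b1b1b1b applies the reduction polynomial only to those lanes. A minimal C sketch with a byte-wise reference follows; `gmul2_ref` is a hypothetical helper added only for comparison.

```c
#include <stdint.h>

/* Packed version: doubles four GF(2^8) bytes at once. */
uint32_t gmul2(uint32_t in)
{
    uint32_t mask = in & 0x80808080u; /* top bit of every byte */
    return ((in & 0x7f7f7f7fu) << 1) ^ ((mask - (mask >> 7)) & 0x1b1b1b1bu);
}

/* Byte-wise reference: the textbook xtime() applied to each lane. */
uint32_t gmul2_ref(uint32_t in)
{
    uint32_t out = 0;
    for (unsigned lane = 0; lane < 4; lane++) {
        uint8_t b = (uint8_t)(in >> (8 * lane));
        uint8_t r = (uint8_t)(b << 1);
        if (b & 0x80u)
            r ^= 0x1bu; /* reduce modulo x^8 + x^4 + x^3 + x + 1 */
        out |= (uint32_t)r << (8 * lane);
    }
    return out;
}
```

Unlike the reference, the packed version has no data dependent branch.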

CM0_FASTMULsBOX

Faster than CM0_sBOX only when running on a core with a single cycle multiplier (used for predicated reduction in mixcolumns multiplication)

Implemented similarly to CM0_sBOX but with gmul2() implemented as:

out = ((in & 0x7f7f7f7f) << 1) ^ (((in & 0x80808080) >> 7) * 0x1b);

// or equivalent sequence to perform shifts first in order to avoid extra moves
out = ((in << 1) & 0xfefefefe) ^ (((in >> 7) & 0x01010101) * 0x1b);
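
Both forms rely on the fact that each byte lane of the multiplicand is 0 or 1, so multiplying by 0x1b yields 0x00 or 0x1b per lane without carrying into the neighbouring lane. A short C sketch of the two variants (function names are illustrative only):

```c
#include <stdint.h>

/* Mask first, then shift: per-lane top bit (0 or 1) times 0x1b. */
uint32_t gmul2_mul(uint32_t in)
{
    return ((in & 0x7f7f7f7fu) << 1) ^ (((in & 0x80808080u) >> 7) * 0x1bu);
}

/* Shift first, then mask: same result, different operation order. */
uint32_t gmul2_mul_shift_first(uint32_t in)
{
    return ((in << 1) & 0xfefefefeu) ^ (((in >> 7) & 0x01010101u) * 0x1bu);
}
```

Whether this beats the subtract-and-mask version depends on the multiplier latency of the target core, as noted above.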

performance

Cipher function STM32F0 (0ws/1ws) - CM0_sBOX STM32F0 (0ws/1ws) - CM0_FASTMULsBOX STM32L0 (0ws/1ws) - CM0_sBOX STM32L0 (0ws/1ws) - CM0_FASTMULsBOX
setEncKey<128> 399/415 (sBOX)
setEncKey<192> 375/389 (sBOX)
setEncKey<256> 568/586 (sBOX)
encrypt<128> 1666/1680 1587/1600
encrypt<192> 2000/2016 1905/1920
encrypt<256> 2334/2352 2223/2240
setDecKey<128> 0 0 0 0
setDecKey<192> 0 0 0 0
setDecKey<256> 0 0 0 0
decrypt<128> 2567/2580 2387/2400
decrypt<192> 3099/3114 2879/2894
decrypt<256> 3631/3648 3371/3388

STM32F0 is cortex-m0 (prefetch enabled for 1ws, no prefetch leads to ~45% performance degradation)

STM32L0 is cortex-m0+ (prefetch enabled for 1ws)

specific function sizes

Function | code size in bytes | stack usage in bytes | notes
CM0_sBOX_AES_128_keyschedule_enc | 80 | 16 | uses sbox table
CM0_sBOX_AES_192_keyschedule_enc | 88 | 20(24) | uses sbox table
CM0_sBOX_AES_256_keyschedule_enc | 164 | 32 | uses sbox table
CM0_sBOX_AES_encrypt | 508 | 40 | uses sbox table
CM0_sBOX_AES_decrypt | 712 | 40 | uses inv_sbox table
CM0_FASTMULsBOX_AES_encrypt | 480 | 36(40) | uses sbox table, requires single cycle multiplier
CM0_FASTMULsBOX_AES_decrypt | 672 | 40 | uses inv_sbox table, requires single cycle multiplier

code sizes include pc-rel constants and their padding

The extra 4 bytes of stack (values in parentheses) come from aligning the stack to 8 bytes on ISR entry.

cortex-m3/m4

CM3_1T

Can be used on cortex-m3 and cortex-m4

CM3_1T_unrolled

Same as CM3_1T but uses unrolled enc/dec functions

CM3_1T_deconly

Same as CM3_1T. Uses sbox table in key expansions instead of Te2 to reduce pressure on SRAM for decryption only use cases

CM3_1T_unrolled_deconly

Same as CM3_1T_deconly but uses unrolled enc/dec functions

CM3_sBOX

TBD

CM4_DSPsBOX

performance

Cipher function STM32F1 (0ws/2ws) - CM3_1T STM32F1 (0ws/2ws) - CM3_sBOX STM32F4 (0ws/5ws) - CM3_1T STM32F4 - CM4_DSPsBOX
setEncKey<128> 302/358 302 302
setEncKey<192> 276/311 276 277
setEncKey<256> 378/485 379 381
encrypt<128> 646/884 645 852
encrypt<192> 766/1049 765 1020
encrypt<256> 886/1217 887 1188
encrypt_unrolled<128> 603/836 602/779 -
encrypt_unrolled<192> 713/990 712/922 -
encrypt_unrolled<256> 823/1148 822/1067 -
setDecKey<128> 813/1101 0 811 0
setDecKey<192> 987/1341 0 987 0
setDecKey<256> 1163/1580 0 1164 0
decrypt<128> 651/901 650 1249
decrypt<192> 771/1072 770 1505
decrypt<256> 891/1242 892 1759
decrypt_unrolled<128> 606/847 604/785 -
decrypt_unrolled<192> 716/1003 714/928 -
decrypt_unrolled<256> 826/1159 824/1073 -

results assume that input, expanded round key and stack lie in the same memory block (e.g. SRAM1 vs SRAM2 and CCM on f407)

specific function sizes

Function | code size in bytes | stack usage in bytes | notes
CM3_1T_AES_128_keyschedule_enc | 100 | 24 | uses Te2 table
CM3_1T_AES_192_keyschedule_enc | 100 | 32 | uses Te2 table
CM3_1T_AES_256_keyschedule_enc | 178 | 44(48) | uses Te2 table
CM3_1T_AES_keyschedule_dec | 92 | 12(16) | uses Te2 and Td2 table
CM3_1T_AES_keyschedule_dec_noTe | 86 | 12(16) | uses sbox and Td2 table
CM3_1T_AES_encrypt | 434 | 44(48) | uses Te2 table
CM3_1T_AES_decrypt | 450 | 44(48) | uses Td2 and inv_sbox table
CM3_1T_AES_128_encrypt_unrolled | 1978 | 40 | uses Te2 table
CM3_1T_AES_128_decrypt_unrolled | 1996 | 40 | uses Td2 and inv_sbox table
CM3_1T_AES_192_encrypt_unrolled | 2362 | 40 | uses Te2 table
CM3_1T_AES_192_decrypt_unrolled | 2380 | 40 | uses Td2 and inv_sbox table
CM3_1T_AES_256_encrypt_unrolled | 2746 | 40 | uses Te2 table
CM3_1T_AES_256_decrypt_unrolled | 2764 | 40 | uses Td2 and inv_sbox table
CM3_sBOX_AES_128_keyschedule_enc | 100 | 24 | uses sbox table
CM3_sBOX_AES_192_keyschedule_enc | 100 | 32 | uses sbox table
CM3_sBOX_AES_256_keyschedule_enc | 178 | 44(48) | uses sbox table
CM4_DSPsBOX_AES_encrypt | 494 | 44(48) | uses sbox table
CM4_DSPsBOX_AES_decrypt | 630 | 44(48) | uses inv_sbox table

The extra 4 bytes of stack (values in parentheses) come from aligning the stack to 8 bytes on ISR entry.

cortex-m7

TBD

performance

Cipher function | STM32H7 - CM7_1T | STM32H7 - CM7_DSPsBOX
setEncKey<128> | 141 | 141
setEncKey<192> | 131 | 131
setEncKey<256> | 180 | 180
encrypt<128> | 292 | 400
encrypt<192> | 346 | 478
encrypt<256> | 400 | 556
setDecKey<128> | 357 | 357
setDecKey<192> | 433 | 433
setDecKey<256> | 509 | 509
decrypt<128> | 293 | (1T)
decrypt<192> | 347 | (1T)
decrypt<256> | 401 | (1T)

cm7 runtime cycles are slightly biased by the caller and the code around it (numbers are from the current ECB unit test)

specific function sizes

Function | code size in bytes | stack usage in bytes | notes
CM7_1T_AES_128_keyschedule_enc | 132 | 24 | uses Te2 table
CM7_1T_AES_192_keyschedule_enc | 124 | 32 | uses Te2 table
CM7_1T_AES_256_keyschedule_enc | 208 | 36(40) | uses Te2 table
CM7_1T_AES_keyschedule_dec | 180 | 32 | uses Te2 and Td2 table
CM7_1T_AES_keyschedule_dec_noTe | 180 | 32 | uses sbox and Td2 table
CM7_1T_AES_encrypt | 408 | 40 | uses Te2 table
CM7_1T_AES_decrypt | 400 | 40 | uses Td2 and inv_sbox table
CM7_sBOX_AES_128_keyschedule_enc | 132 | 24 | uses sbox table
CM7_sBOX_AES_192_keyschedule_enc | 124 | 32 | uses sbox table
CM7_sBOX_AES_256_keyschedule_enc | 208 | 36(40) | uses sbox table
CM7_DSPsBOX_AES_encrypt | 466 | 40 | uses sbox table

The extra 4 bytes of stack (values in parentheses) come from aligning the stack to 8 bytes on ISR entry.

cortex-m55

no hardware available yet, TBD

RI5CY

RI5CY as in GAP8, not the later CV32E40P that is more constrained.

TBD

modes implementations

generic

cortex-m0/m0+

cortex-m3/m4

cortex-m7

specific function sizes

Function | code size in bytes | stack usage in bytes | notes
CM7_1T_AES_CTR32_enc | 860 | 72 (+1 arg passed on stack) | uses Te2 table

cortex-m55

RI5CY

implementations (this part will be replaced later)

CM3_1T

cortex m3 and cortex m4 optimized implementation. Uses a single T table per enc/dec cipher and additional inv_sbox for final round in decryption.

Originally based on "Peter Schwabe and Ko Stoffelen" AES implementation available here.

32 bit LDR opcodes are aligned to 4 byte boundaries to work around a weird undocumented "feature" of cortex-m3/4 that prevents some pipelining of neighbouring loads. Other architecture specific optimizations are applied as well.

LUT tables have to be placed in non-cached and non-waitstated SRAM memory with 32 bit wide access that does not cross different memory domains (e.g. AHB slaves). FLASH memory simply cannot be used, since vendors usually implement some kind of cache, wide prefetch buffers and waitstates, which would make the cipher slower than a boxless one anyway.

CM7_1T

cortex m7 optimized implementation. Uses a single T table per enc/dec cipher and additional inv_sbox for final round in decryption.

Based on cortex m3/4 one, with carefully reordered instructions for cortex-m7 pipeline, to increase IPC and avoid data dependent stalls when accessing 2x32 bit DTCM (separate single ported SRAMs) on even/odd words.

The timing issue can be visualized by the following snippet (immediate vs register offset doesn't matter):

	tick = DWT->CYCCNT;
	asm volatile(""
			"movw r12, #:lower16:AES_Te0 \n"
			"movt r12, #:upper16:AES_Te0 \n"
			"ldr r0, [r12, #0] \n"
			"ldr r1, [r12, #8] \n"
			"ldr r2, [r12, #16] \n"
			"ldr r3, [r12, #24] \n"
			""::: "r0","r1","r2","r3","r12");
	tick = DWT->CYCCNT - tick - 1;

	printf("4 even loads, cycles: %lu\n", tick);

	tick = DWT->CYCCNT;
	asm volatile(""
			"movw r12, #:lower16:AES_Te0 \n"
			"movt r12, #:upper16:AES_Te0 \n"
			"ldr r0, [r12, #0] \n"
			"ldr r1, [r12, #4] \n"
			"ldr r2, [r12, #8] \n"
			"ldr r3, [r12, #12] \n"
			""::: "r0","r1","r2","r3","r12");
	tick = DWT->CYCCNT - tick - 1;

	printf("4 linear loads, cycles: %lu\n", tick);
	printf("This is why any two data dependent LDRs cannot be placed next to each other\n");
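
The address pattern of the two sequences can be modelled on the host. This is a sketch under one explicit assumption that goes beyond the text: that the DTCM bank is selected by bit 2 of the address (even vs odd 32-bit word). `dtcm_bank` and `same_bank_pairs` are hypothetical names, and the model says nothing about actual cycle counts, which have to be measured on hardware as in the snippet above.

```c
#include <stdint.h>

/* Assumption: bank 0 serves even 32-bit words, bank 1 serves odd ones. */
static inline unsigned dtcm_bank(uint32_t addr)
{
    return (addr >> 2) & 1u;
}

/* Count neighbouring load pairs that target the same bank. */
unsigned same_bank_pairs(const uint32_t *offsets, unsigned n)
{
    unsigned count = 0;
    for (unsigned i = 1; i < n; i++) {
        if (dtcm_bank(offsets[i - 1]) == dtcm_bank(offsets[i]))
            count++;
    }
    return count;
}
```

Under this model the offsets 0, 8, 16, 24 keep hitting the same bank, while 0, 4, 8, 12 alternate between the two. In a table lookup the offset depends on secret data, so the bank pattern of adjacent loads would too.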

Only DTCM memory can be used for LUT tables, since everything else is cached through the AXI bus. The timing effects of simultaneous access to DTCM memory by the core and DMA/AHBS are not yet known. (There were some changes in the r1p0 revision: "Improved handling of simultaneous AHBS and software activity relating to the same TCM", details are of course Proprietary&Confidential)

XXX_DSPsBOX

Utilizes dsp instructions to perform constant time, quad (gf)multiplications in the mixcolumns stage. The MixColumns stage is parallelized according to this or this paper, InvMixColumns is done through a more straightforward representation.

Base ciphers performance (in cycles per block, some numbers are outdated)

Cipher function STM32F1 (0ws/2ws) - CM3_1T STM32F4 (0ws/7ws) - CM3_1T STM32F4 (0ws/7ws) - CM4_DSPsBOX STM32H7 - CM7_1T STM32H7 - CM7_DSPsBOX
setEncKey<128>
setEncKey<192>
setEncKey<256>
encrypt<128> 302 411
encrypt<192> 358 491
encrypt<256> 414 571
enc_unrolled<128> - 281 -
enc_unrolled<192> - 333 -
enc_unrolled<256> - 385 -
setDecKey<128> 412* (1T)
setDecKey<192> 500* (1T)
setDecKey<256> 588* (1T)
decrypt<128> 304 (1T)
decrypt<192> 360 (1T)
decrypt<256> 416 (1T)
dec_unrolled<128> - 282 -
dec_unrolled<192> - 334 -
dec_unrolled<256> - 386 -

Results are averaged over 1024 runs plus one omitted (instruction) cache training run.

setDecKey<> counts the cycles required to perform the equivalent inverse cipher transformation on the expanded encryption key.

* pipeline performance not fixed yet

** Cortex-M7 results may differ depending on the code around the caller (encrypt<128> should have 299 retired "uop pairs"; this goes up by e.g. 9 cycles if unrolled code is also compiled in)

XXX_1T_CTR32

Implements counter mode caching. Do not use it if the IV/counter is secret, as it will lead to a timing leak of a single byte every 256 aligned counter steps.
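
The "every 256 aligned counter steps" behaviour can be illustrated with a counter increment that reports when anything above the lowest counter byte changes. This is only a sketch under assumptions: a 32-bit big-endian counter in bytes 12..15 of the block (a common CTR layout, not confirmed here for this library) and a hypothetical function name. The real caching logic lives in the assembly and may differ.

```c
#include <stdint.h>
#include <stdbool.h>

/* Increment the assumed 32-bit big-endian counter in bytes 12..15.
 * Returns true when a byte other than the lowest one changed, i.e.
 * when state derived from the upper bytes could no longer be reused. */
bool ctr32_increment(uint8_t block[16])
{
    uint32_t ctr = ((uint32_t)block[12] << 24) | ((uint32_t)block[13] << 16) |
                   ((uint32_t)block[14] << 8)  |  (uint32_t)block[15];
    uint32_t next = ctr + 1u; /* wraps modulo 2^32 */

    block[12] = (uint8_t)(next >> 24);
    block[13] = (uint8_t)(next >> 16);
    block[14] = (uint8_t)(next >> 8);
    block[15] = (uint8_t)next;

    return (ctr >> 8) != (next >> 8);
}
```

In 255 out of 256 steps only the last byte changes, which is what makes caching worthwhile. The remaining step is the one whose position in time reveals the low counter byte.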

Cipher modes performance (in cycles per byte, some numbers are outdated)

Cipher function STM32F1 (0ws/2ws) - CM3_1T STM32F4 (0ws/7ws) - CM3_1T STM32H7 - CM7_1T
CBC_GENERIC<128> enc(+dec)
CBC_GENERIC<192> enc(+dec)
CBC_GENERIC<256> enc(+dec)
CTR32_GENERIC<128>
CTR32_GENERIC<192>
CTR32_GENERIC<256>
CTR32<128> 32.97* 32.91* 15.21
CTR32<192> 40.47* 40.41* 18.58
CTR32<256> 47.97* 47.91* 21.96
CTR32_unrolled<128> 30.72* 14.52*
CTR32_unrolled<192> 37.59* 17.77*
CTR32_unrolled<256> 44.47* 21.02*

F407 results assume that input, expanded round key and stack lie in the same memory block (e.g. SRAM1 vs SRAM2 and CCM on f407)
