SoylentNews Comments | Validating UTF-8 Strings Using As Little As 0.7 Cycles Per Byte

Validating UTF-8 Strings Using As Little As 0.7 Cycles Per Byte

posted by janrinok on Wednesday May 16 2018, @11:51PM

from the as-easy-as-that dept.

Arthur T Knackerbracket has found the following story:

Most strings found on the Internet are encoded using a particular unicode format called UTF-8. However, not all strings of bytes are valid UTF-8. The rules as to what constitute a valid UTF-8 string are somewhat arcane. Yet it seems important to quickly validate these strings before you consume them.
In a previous post, I pointed out that it takes about 8 cycles per byte to validate them using a fast finite-state machine. After hacking code found online, I showed that using SIMD instructions, we could bring this down to about 3 cycles per input byte.
Is that the best one can do? Not even close.
Many strings are just ASCII, which is a subset of UTF-8. They are easily recognized because they use just 7 bits per byte, the remaining bit is set to zero. Yet if you check each and every byte with silly scalar code, it is going to take over a cycle per byte to verify that a string is ASCII. For much better speed, you can vectorize the problem in this manner:
Essentially, we are loading up a vector register, comparing each entry with zero and turning on a flag (using a logical OR) whenever a character outside the allowed range is found. We continue until the very end no matter what, and only then do we examine our flags.
We can use the same general idea to validate UTF-8 strings. My code is available.

If you are almost certain that most of your strings are ASCII, then it makes sense to first test whether the string is ASCII, and only then fall back on the more expensive UTF-8 test.
So we are ten times faster than a reasonable scalar implementation. I doubt this scalar implementation is as fast as it can be… but it is not naive… And my own code is not nearly optimal. It is not using AVX to say nothing of AVX-512. Furthermore, it was written in a few hours. I would not be surprised if one could double the speed using clever optimizations.
The exact results will depend on your machine and its configuration. But you can try the code.
The counter-rolling can actually be done logarithmically by shifting 1,2,4, etc., eg:
[4,0,0,0] + ([0,4,0,0]-[1,1,1,1]) = [4,3,0,0]
[4,3,0,0] + ([0,0,4,3]-[2,2,2,2]) = [4,3,2,1]
but in this case the distances didn’t seem big enough to beat the linear method.
The distances can even be larger than the register size I believe if the last value in the register is carried over to the first element of the next. It’s a good way to delineate inline variable-length encodings.

Original Submission

This discussion has been archived. No new comments can be posted.

Validating UTF-8 Strings Using As Little As 0.7 Cycles Per Byte | Log In/Create an Account | Top | 30 comments | Search Discussion

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.

(1)

HAKMEM HAKMEM (Score: 2, Interesting) by Anonymous Coward on Thursday May 17 2018, @12:02AM (1 child)

by Anonymous Coward on Thursday May 17 2018, @12:02AM (#680561)

Nice to see that the spirit of HAKMEM still exists. For the young and/or ignorant, there is an intro here,
http://www.catb.org/jargon/html/H/HAKMEM.html [catb.org]
with a link to the full document at the end.
- Re:HAKMEM (Score: 4, Interesting) by dbe on Thursday May 17 2018, @01:13AM
  
  by dbe (1422) on Thursday May 17 2018, @01:13AM (#680580)
  
  Nice ref, that made me think about the fast inverse sqrt method that relies on evil magic numbers and bit-shifts to get a good approximation...
  https://en.wikipedia.org/wiki/Fast_inverse_square_root [wikipedia.org]
  coming-up with this solution is enough to fry a few brains.
  -dbe
  
  Parent
New language paradigm needed New language paradigm needed (Score: 4, Interesting) by Snotnose on Thursday May 17 2018, @12:56AM (8 children)

by Snotnose (1623) on Thursday May 17 2018, @12:56AM (#680578)

C is not a low level language. [acm.org]

--
When the dust settled America realized it was saved by a porn star.
- Re:New language paradigm needed Re:New language paradigm needed (Score: 2) by JoeMerchant on Thursday May 17 2018, @02:37AM (3 children)
  
  by JoeMerchant (3937) on Thursday May 17 2018, @02:37AM (#680598)
  
  Assembly used to be a "high level language" - now, anything without auto-type conversion and garbage collection in the base language (because you can get libraries for ANYTHING in C) seems to be called "low level."
  
  --
  🌻🌻 [google.com]
  
  Parent
  - Re:New language paradigm needed (Score: 2, Insightful) by Anonymous Coward on Thursday May 17 2018, @04:02AM
    
    by Anonymous Coward on Thursday May 17 2018, @04:02AM (#680618)
    
    Garbage collection is a low level job. Ask any city employee involved with it. :)
    High level city jobs though, involve exorbitant salaries for sitting in a suit and looking pretty. Oh yes, and signing your city into more debt.
    
    Parent
  - Re:New language paradigm needed Re:New language paradigm needed (Score: 2) by kazzie on Thursday May 17 2018, @12:25PM (1 child)
    
    by kazzie (5309) on Thursday May 17 2018, @12:25PM (#680690)
    
    The terms "low" and "high" infer that they are evaluated against an external reference.
    In the 1970s, not many languages could be said to be of a higher level than C. Therefore C was considered a high-level language.
    Today, we have countless programming languages that have actually been written in C. By comparison to all these, C is considered a low-level language.
    
    Parent
    - Re:New language paradigm needed (Score: 3, Insightful) by JoeMerchant on Thursday May 17 2018, @12:54PM
      
      by JoeMerchant (3937) on Thursday May 17 2018, @12:54PM (#680698)
      
      IOW: moving target.
      
      --
      🌻🌻 [google.com]
      
      Parent
- Re:New language paradigm needed (Score: 0) by Anonymous Coward on Thursday May 17 2018, @03:15AM
  
  by Anonymous Coward on Thursday May 17 2018, @03:15AM (#680611)
  
  "C is a relatively low level language." So says Kernighan and Ritchie. I'm pretty sure they would know. And so it will remain no matter how clever the implementation of a CPU becomes.
  
  Parent
- Compilers cannot do your thinking for you (Score: 0) by Anonymous Coward on Thursday May 17 2018, @08:22AM
  
  by Anonymous Coward on Thursday May 17 2018, @08:22AM (#680660)
  
  but every year some people still do act greatly surprised by it.
  
  Parent
- Re:New language paradigm needed (Score: 0) by Anonymous Coward on Thursday May 17 2018, @05:35PM
  
  by Anonymous Coward on Thursday May 17 2018, @05:35PM (#680782)
  
  Sigh, the stupid pedants are out.
  
  Parent
- Re:New language paradigm needed (Score: 2) by LoRdTAW on Thursday May 17 2018, @08:02PM
  
  by LoRdTAW (3755) on Thursday May 17 2018, @08:02PM (#680863) Journal
  
  Tell that guy to shut up. From the K&R introduction (The only C book you need):
  C is a relatively "low level" language. This characterization is not pejorative; it simply means that C deals with the same sort of objects that most computers do, namely characters, numbers, and addresses. These may be combined and moved about with the arithmetic and logical operators implemented by real machines.
  C by itself is a very simple language that abstracts the nuances of machine architecture from the programmer. Remove the standard library and you are left with very little by way of language. Hell, the second edition of the K&R book is 288 pages cover to cover and wonderfully acknowledged by the authors in the preface: "C is not a big language and is not well served by a big book" But you can do a hell of a lot with it, good or bad depending on how well you understand it. The rest of the bullshit this idiot is spouting has to do with poor implementation.
  
  Parent
Dude! JavaScript, bro. Dude! JavaScript, bro. (Score: 1, Funny) by Anonymous Coward on Thursday May 17 2018, @01:26AM (3 children)

by Anonymous Coward on Thursday May 17 2018, @01:26AM (#680582)

Dafuq? U r using C code on your blog??????
Learn to code, dude. JavaScript.
- Re:Dude! JavaScript, bro. Re:Dude! JavaScript, bro. (Score: 3, Interesting) by Snotnose on Thursday May 17 2018, @02:22AM (2 children)
  
  by Snotnose (1623) on Thursday May 17 2018, @02:22AM (#680594)
  
  Modded you up from flamebait, even though I know you're kidding.
  My point is that while one part of my brain knows all about out of sequence ordering, long pipelines, parallel pipelines, vector math, etc etc etc; another part of my mind, the one I use when I program, is still stuck on the Z-80 architecture. Something I didn't realize until I ran across that article a week or three ago.
  
  --
  When the dust settled America realized it was saved by a porn star.
  
  Parent
  - Re:Dude! JavaScript, bro. (Score: 3, Funny) by JoeMerchant on Thursday May 17 2018, @02:41AM
    
    by JoeMerchant (3937) on Thursday May 17 2018, @02:41AM (#680600)
    
    part of my mind, the one I use when I program, is still stuck on the Z-80 architecture.
    I'm sorry, I thought it was incurable when I first encountered it, apparently I was right.
    Viva la 6502!
    
    --
    🌻🌻 [google.com]
    
    Parent
  - Re:Dude! JavaScript, bro. (Score: 0) by Anonymous Coward on Thursday May 17 2018, @05:41PM
    
    by Anonymous Coward on Thursday May 17 2018, @05:41PM (#680787)
    
    My point is you're a douche.
    
    Parent
what am I missing? what am I missing? (Score: 2) by shortscreen on Thursday May 17 2018, @06:07AM (10 children)

by shortscreen (2252) on Thursday May 17 2018, @06:07AM (#680643) Journal

how helpful are vector instructions going to be when you can already test for bit 7 in multiple bytes at once by using plain ALU instructions: AND [ESI],0x80808080
- Re:what am I missing? Re:what am I missing? (Score: 3, Interesting) by ledow on Thursday May 17 2018, @07:55AM (4 children)
  
  by ledow (5567) on Thursday May 17 2018, @07:55AM (#680656) Homepage
  
  To be honest, would all these optimisations not be wiped out by compiler optimisations anyway?
  I'm fairly sure there isn't a modern compiler on the planet that would actually organise a tight for loop test on consecutive bytes in such a linear fashion.
  And "whether the string is ASCII or not" is fairly useless as a test if you're running it against enough strings for cycle counts to actually matter. Especially when your next action is a much more complicated "interpret the UTF-8 including control characters" and passing through to some other process (font rendering, file writing, etc.).
  I would imagine such an optimisation would only be useful for people routinely processing literally billions of lines of fresh text in unknown formats constantly, and more likely it would be more in their interest to just forcibly convert everything into Unicode anyway and then write all code on the assumption of Unicode strings.
  It seems to me to be a case of premature optimisation and/or creating a highly optimised yet non-portable and difficult to maintain version of "is_utf8_text". This is the kind of thing that, say, Freetype or a C++ string library might include in an platform-specific header that overrides a more generic implementation but otherwise doesn't really make any difference at all. And which is removed after about five years when everyone realises the compiler generates equivalent or better code anyway.
  Reminds me of the kind of thing I used to see in emulator code all the time - hand-crafted assembler to translate one platform's high-performance code to native code. There was always a point at which the guy who understood it left, leaving them with "optimised by known buggy asm core" and "unoptimised but nobody cares because it works and can be updated easily C core".
  
  Parent
  - "Compiler optimization" "Compiler optimization" (Score: 2) by MichaelDavidCrawford on Thursday May 17 2018, @07:01PM (3 children)
    
    by MichaelDavidCrawford (2339) <mdcrawford@gmail.com> on Thursday May 17 2018, @07:01PM (#680836) Homepage Journal
    
    I've never found a compiler that I couldn't beat with hand optimized c or c++
    I don't even need to use assembly
    
    --
    Yes I Have No Bananas. [gofundme.com]
    
    Parent
    - Re:"Compiler optimization" Re:"Compiler optimization" (Score: 2) by wonkey_monkey on Friday May 18 2018, @08:02PM (2 children)
      
      by wonkey_monkey (279) on Friday May 18 2018, @08:02PM (#681350) Homepage
      
      I've never found a compiler that I couldn't beat with hand optimized c or c++
      Uh... and how did you turn that hand optimised C/C++ into executable code...?
      
      --
      systemd is Roko's Basilisk
      
      Parent
      - The problem with optimizing compilers The problem with optimizing compilers (Score: 2) by MichaelDavidCrawford on Friday May 18 2018, @10:13PM (1 child)
        
        by MichaelDavidCrawford (2339) <mdcrawford@gmail.com> on Friday May 18 2018, @10:13PM (#681420) Homepage Journal
        
        If you have stupid code, the best an optimizer can do is to make stupidity faster.
        Consider using the compiler to optimize bubble sort.
        Something I commonly do is defeat the cache. This because if you write one byte into a cache line, before that byte is written that line is filled by reading from the next layer of the cache. Now suppose you completely fill the cache with new values - that read from possibly main memory was useless.
        x86_64 has an assembly instruction that the programmer promises "I swear on a stack of bibles I will fill this cache line with entirely new values". That can speed things up quite a bit.
        PowerPC and POWER both have - strangely different from each other - assembly opcodes that sets a whole cache line to zero in just one cycle.
        Because of patents, every ISA has a different way to defeat the cache but most of them do offer that option.
        
        --
        Yes I Have No Bananas. [gofundme.com]
        
        Parent
        
        Re:The problem with optimizing compilers (Score: 2) by wonkey_monkey on Saturday May 19 2018, @05:56PM
        
        by wonkey_monkey (279) on Saturday May 19 2018, @05:56PM (#681620) Homepage
        
        That sounds less like "beating the compiler" and more like working with it by producing good code in the first place.
        x86_64 has an assembly instruction that the programmer promises "I swear on a stack of bibles I will fill this cache line with entirely new values". That can speed things up quite a bit.
        Where can I read up on that instruction? It sounds like it might be very useful to me.
        
        --
        systemd is Roko's Basilisk
        
        Parent
- Re:what am I missing? (Score: 0) by Anonymous Coward on Thursday May 17 2018, @08:19AM
  
  by Anonymous Coward on Thursday May 17 2018, @08:19AM (#680658)
  
  How about they help you test even more bytes at the same time? Say by a factor of 4-16x or so.
  
  Parent
- Comment Below Threshold (2 children)
  
  Re:what am I missing? Re:what am I missing? (Score: -1, Troll) by Anonymous Coward on Thursday May 17 2018, @09:53AM (2 children)
  
  by Anonymous Coward on Thursday May 17 2018, @09:53AM (#680673)
  
  how helpful are vector instructions going to be when you can already test for bit 7 in multiple bytes at once by using plain ALU instructions: AND [ESI],0x80808080
  We pack the register with multiple characters. SIMD is how we co-opt the GPU to do our work for us.
  
  Parent
  - Re:what am I missing? (Score: 0) by Anonymous Coward on Thursday May 17 2018, @04:01PM
    
    by Anonymous Coward on Thursday May 17 2018, @04:01PM (#680757)
    
    SIMD is how we co-opt the GPU to do our work for us.
    
    lolwhut
    
    Parent
  - Re:what am I missing? (Score: 0) by Anonymous Coward on Friday May 18 2018, @12:58AM
    
    by Anonymous Coward on Friday May 18 2018, @12:58AM (#680955)
    
    More like SIMD is how we use the CPU to its full potential so we don't need a GPU.
    
    Parent
- Re:what am I missing? (Score: 2) by wonkey_monkey on Friday May 18 2018, @08:22PM
  
  by wonkey_monkey (279) on Friday May 18 2018, @08:22PM (#681363) Homepage
  
  Because (I think) vector instructions let you check (up to) 64 bytes at once, not 4.
  
  --
  systemd is Roko's Basilisk
  
  Parent
Conways Game Of Life Conways Game Of Life (Score: 2) by MichaelDavidCrawford on Thursday May 17 2018, @06:58PM (3 children)

by MichaelDavidCrawford (2339) <mdcrawford@gmail.com> on Thursday May 17 2018, @06:58PM (#680834) Homepage Journal

I'd like to use SIMD for Warp Life [warplife.com] but it hasn't been a priority because it is already very fast.
Presently I'm working on the UI.

--
Yes I Have No Bananas. [gofundme.com]
- Re:Conways Game Of Life Re:Conways Game Of Life (Score: 2) by wonkey_monkey on Friday May 18 2018, @08:12PM (2 children)
  
  by wonkey_monkey (279) on Friday May 18 2018, @08:12PM (#681354) Homepage
  
  You seem like someone who might have some interesting answers to a question I have:
  I've written a compiler which takes a somewhat esoteric RPN-based language and turns the output into video. It can output RGB values based on pixel coordinates and frame number, which means it can output zooming Mandelbrots [youtube.com] (not realtime speed) and things like this [youtube.com].
  It can also refer back to its last output frame, which makes it ideal for simple cellular automata. I've implemented Life (~2000fps on a 360x360 grid) and made a nice fire effect [imgur.com].
  You seem like someone who might have some good ideas of other effects I could try to implement. So, any suggestions?
  
  --
  systemd is Roko's Basilisk
  
  Parent
  - Add a speed slider (Score: 2) by MichaelDavidCrawford on Friday May 18 2018, @10:09PM
    
    by MichaelDavidCrawford (2339) <mdcrawford@gmail.com> on Friday May 18 2018, @10:09PM (#681419) Homepage Journal
    
    Warp Life's was _highly_ nonlinear at first. After lots of thought and discussion I have a mostly-linear slide that uses a decaying average to determine what the actual speed of generation updates is.
    
    --
    Yes I Have No Bananas. [gofundme.com]
    
    Parent
  - Re:Conways Game Of Life (Score: 2) by cafebabe on Monday May 21 2018, @07:05PM
    
    by cafebabe (894) on Monday May 21 2018, @07:05PM (#682324) Journal
    
    I recommend rendering star-fields at warp. I also recommend using a variation of Warnock's algorithm to minimize processing overhead.
    
    --
    1702845791×2
    
    Parent

Moderator Help

SoylentNews

SoylentNews is people

Navigation

Sections

SoylentNews

Validating UTF-8 Strings Using As Little As 0.7 Cycles Per Byte

HAKMEM HAKMEM (Score: 2, Interesting) by Anonymous Coward on Thursday May 17 2018, @12:02AM (1 child)

Re:HAKMEM (Score: 4, Interesting) by dbe on Thursday May 17 2018, @01:13AM

New language paradigm needed New language paradigm needed (Score: 4, Interesting) by Snotnose on Thursday May 17 2018, @12:56AM (8 children)

Re:New language paradigm needed Re:New language paradigm needed (Score: 2) by JoeMerchant on Thursday May 17 2018, @02:37AM (3 children)

Re:New language paradigm needed (Score: 2, Insightful) by Anonymous Coward on Thursday May 17 2018, @04:02AM

Re:New language paradigm needed Re:New language paradigm needed (Score: 2) by kazzie on Thursday May 17 2018, @12:25PM (1 child)

Re:New language paradigm needed (Score: 3, Insightful) by JoeMerchant on Thursday May 17 2018, @12:54PM

Re:New language paradigm needed (Score: 0) by Anonymous Coward on Thursday May 17 2018, @03:15AM

Compilers cannot do your thinking for you (Score: 0) by Anonymous Coward on Thursday May 17 2018, @08:22AM

Re:New language paradigm needed (Score: 0) by Anonymous Coward on Thursday May 17 2018, @05:35PM

Re:New language paradigm needed (Score: 2) by LoRdTAW on Thursday May 17 2018, @08:02PM

Dude! JavaScript, bro. Dude! JavaScript, bro. (Score: 1, Funny) by Anonymous Coward on Thursday May 17 2018, @01:26AM (3 children)

Re:Dude! JavaScript, bro. Re:Dude! JavaScript, bro. (Score: 3, Interesting) by Snotnose on Thursday May 17 2018, @02:22AM (2 children)

Re:Dude! JavaScript, bro. (Score: 3, Funny) by JoeMerchant on Thursday May 17 2018, @02:41AM

Re:Dude! JavaScript, bro. (Score: 0) by Anonymous Coward on Thursday May 17 2018, @05:41PM

what am I missing? what am I missing? (Score: 2) by shortscreen on Thursday May 17 2018, @06:07AM (10 children)

Re:what am I missing? Re:what am I missing? (Score: 3, Interesting) by ledow on Thursday May 17 2018, @07:55AM (4 children)

"Compiler optimization" "Compiler optimization" (Score: 2) by MichaelDavidCrawford on Thursday May 17 2018, @07:01PM (3 children)

Re:"Compiler optimization" Re:"Compiler optimization" (Score: 2) by wonkey_monkey on Friday May 18 2018, @08:02PM (2 children)

The problem with optimizing compilers The problem with optimizing compilers (Score: 2) by MichaelDavidCrawford on Friday May 18 2018, @10:13PM (1 child)

Re:The problem with optimizing compilers (Score: 2) by wonkey_monkey on Saturday May 19 2018, @05:56PM

Re:what am I missing? (Score: 0) by Anonymous Coward on Thursday May 17 2018, @08:19AM

Comment Below Threshold (2 children)

Re:what am I missing? Re:what am I missing? (Score: -1, Troll) by Anonymous Coward on Thursday May 17 2018, @09:53AM (2 children)

Re:what am I missing? (Score: 0) by Anonymous Coward on Thursday May 17 2018, @04:01PM

Re:what am I missing? (Score: 0) by Anonymous Coward on Friday May 18 2018, @12:58AM

Re:what am I missing? (Score: 2) by wonkey_monkey on Friday May 18 2018, @08:22PM

Conways Game Of Life Conways Game Of Life (Score: 2) by MichaelDavidCrawford on Thursday May 17 2018, @06:58PM (3 children)

Re:Conways Game Of Life Re:Conways Game Of Life (Score: 2) by wonkey_monkey on Friday May 18 2018, @08:12PM (2 children)

Add a speed slider (Score: 2) by MichaelDavidCrawford on Friday May 18 2018, @10:09PM

Re:Conways Game Of Life (Score: 2) by cafebabe on Monday May 21 2018, @07:05PM