"I've been writing C for quite some time, but I'm afraid I never followed good conventions, and I never paid much attention to the optimization tricks of more experienced C programmers. Sure, I use const when I can, I use the pointer methods for manual string copying, and I even use register, for all the good that does with modern compilers. Now I'm trying to write a C-string handling library for personal use. I need speed, and I really don't want to use inline ASM. So I am wondering: what would other Soylenters do to write efficient, pure, standards-compliant C?"
As cool as the Java implementation is, it may not be what the original poster is looking for. Sure, it is very memory efficient, but optimizing for memory tends to reduce performance (just as optimizing for performance often increases memory usage).
Maintaining a "library" of final/immutable strings means that every time you create or modify a string, you have to search the list of existing strings for an existing copy. Likewise, destroying or modifying a string may mean removing an entry from the list. At that point you are implementing some sort of reference counting mechanism and/or garbage collection. These are interesting problems to solve and play with (well, at least interesting to certain code geeks), but maybe not exactly what you are looking for.
The big implication of all this is that anytime you modify a string, a new immutable string is (possibly) created. So think about appending text in a loop. For every iteration of that loop, a new immutable string is added (or found) in your library of strings. This can make certain text operations pretty expensive.
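To make that cost concrete, here is a minimal sketch of such a "library" in C. It is not how Java does it, just one plausible shape: a small chained hash table, with every name (intern_pool, intern, intern_concat) made up for illustration. Strings are never removed, so the reference counting / garbage collection problem is simply ignored here.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical intern pool: each distinct string is stored exactly once. */
typedef struct intern_node {
    struct intern_node *next;
    size_t len;
    char   text[];              /* C99 flexible array member, NUL-terminated */
} intern_node;

#define INTERN_BUCKETS 256u     /* arbitrary size chosen for the sketch */

typedef struct {
    intern_node *bucket[INTERN_BUCKETS];
    size_t count;               /* number of distinct strings stored */
} intern_pool;

/* 32-bit FNV-1a hash over n bytes. */
static uint32_t hash_bytes(const char *p, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= (unsigned char)p[i];
        h *= 16777619u;
    }
    return h;
}

/* Return the pool's single copy of s[0..n), adding it if it is new.
   Note the lookup happens on every call, even when the string exists.
   Returns NULL if allocation fails. */
const char *intern(intern_pool *pool, const char *s, size_t n)
{
    uint32_t h = hash_bytes(s, n) % INTERN_BUCKETS;
    for (intern_node *p = pool->bucket[h]; p != NULL; p = p->next)
        if (p->len == n && memcmp(p->text, s, n) == 0)
            return p->text;

    intern_node *node = malloc(sizeof *node + n + 1);
    if (node == NULL)
        return NULL;
    node->len = n;
    memcpy(node->text, s, n);
    node->text[n] = '\0';
    node->next = pool->bucket[h];
    pool->bucket[h] = node;
    pool->count++;
    return node->text;
}

/* "Modify" by building a + b and interning the result.
   The old string a stays in the pool untouched. */
const char *intern_concat(intern_pool *pool, const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    char *tmp = malloc(la + lb + 1);
    if (tmp == NULL)
        return NULL;
    memcpy(tmp, a, la);
    memcpy(tmp + la, b, lb + 1);        /* also copies b's terminator */
    const char *r = intern(pool, tmp, la + lb);
    free(tmp);
    return r;
}

/* Release every string in the pool. */
void intern_pool_free(intern_pool *pool)
{
    for (size_t i = 0; i < INTERN_BUCKETS; i++) {
        intern_node *p = pool->bucket[i];
        while (p != NULL) {
            intern_node *next = p->next;
            free(p);
            p = next;
        }
        pool->bucket[i] = NULL;
    }
    pool->count = 0;
}
```

Start with a zero-initialized pool (intern_pool pool = {0};). Interning the same text twice hands back the same pointer, so equality is a pointer compare. On the other hand, appending "x" four times in a loop leaves "x", "xx", "xxx" and "xxxx" all sitting in the pool, and each step paid for a temporary buffer, a hash and a lookup.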
I think all of this stuff is worth thinking through, but if you are just looking to write a library to help with or abstract common string operations, this is probably more work than you need to do. Either way, from your original post it sounds like you care more about CPU performance than memory usage, which probably means this is the wrong direction for you.
Either way, food for thought and a good mental exercise.
It really depends on the context of the application. If he only wants to manipulate strings every so often, then that seems like overkill. If he's going to be crunching strings like beta crunches F-bombs, then it might be a good idea to go beyond that.
I often say that Context Information is everything. And it really is.
deduplication is great for disks, but if I have something in memory, I want it to be fast.
it sounds 'elegant' for java to de-dupe strings but I'm not sure it's going to be faster. I never pick elegant over speed, personally. elegant is for school. simple and obvious is much better for production code. and having to search for instances of a string, just to de-dupe it, is not at all simple or fast.
As a Rubyist I challenge the idea that elegant is at odds with simple. The simpler code is generally more elegant. It is the more clever code that uses the excuse of being faster, not the more elegant. Elegant code is faster where the algorithm it uses is simpler. That is the essence of an elegant improvement: the code gets simpler and typically faster.
There are tradeoffs. Yes, you do extra work in some cases, but in others you do a lot less. For example, getting the length of the string becomes trivial rather than requiring a scan of the buffer (potentially wiping out your cache). Comparisons are, of course, trivial. You do more accounting when you cat the strings, but you do less scanning and there's no chance of overrunning the buffer. It can also be a win if you have good optimization.
Which is overall better and faster will strongly depend on what you're doing. I do suspect that on average final strings will be a bit slower but safer.
Any sane string implementation (that is, not C's) will have a length field (or, alternatively, an additional pointer to the end), so you don't have to scan the string to find out its length (an O(n) operation which, as you noted, will do bad things to your cache), but can just read the length (or calculate it in O(1) by simple pointer subtraction). This also means you won't arbitrarily restrict the characters which can be stored in your string (C applications are generally easily identified by their getting tripped up by a zero byte in positions supposed to contain text).
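In pure standard C, one way to get that is a small struct carrying the length alongside the buffer. A rough sketch, assuming a heap-allocated buffer that grows by doubling; the type and function names (lstr, lstr_init, lstr_append and so on) are invented here and not from any existing library:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted string. */
typedef struct {
    char  *data;   /* heap buffer: len bytes in use plus a trailing '\0' */
    size_t len;    /* bytes in use, not counting the trailing '\0' */
    size_t cap;    /* bytes allocated for data */
} lstr;

/* Initialize from n arbitrary bytes (zero bytes allowed). 0 on success. */
int lstr_init(lstr *s, const void *bytes, size_t n)
{
    assert(s != NULL && (bytes != NULL || n == 0));
    if (n == SIZE_MAX)
        return -1;                      /* n + 1 would overflow */
    s->data = malloc(n + 1);
    if (s->data == NULL)
        return -1;
    if (n > 0)
        memcpy(s->data, bytes, n);
    s->data[n] = '\0';                  /* keeps data usable as a C string */
    s->len = n;
    s->cap = n + 1;
    return 0;
}

/* O(1): read the stored length instead of scanning for '\0'. */
size_t lstr_len(const lstr *s)
{
    return s->len;
}

/* Append n bytes, growing the buffer as needed so it cannot be overrun.
   bytes must not point into s's own buffer. 0 on success. */
int lstr_append(lstr *s, const void *bytes, size_t n)
{
    assert(s != NULL && (bytes != NULL || n == 0));
    if (n > SIZE_MAX - s->len - 1)
        return -1;                      /* total size would overflow */
    size_t need = s->len + n + 1;
    if (need > s->cap) {
        size_t ncap = (s->cap > SIZE_MAX / 2) ? need : s->cap * 2;
        if (ncap < need)
            ncap = need;
        char *p = realloc(s->data, ncap);
        if (p == NULL)
            return -1;                  /* original buffer is still valid */
        s->data = p;
        s->cap = ncap;
    }
    if (n > 0)
        memcpy(s->data + s->len, bytes, n);
    s->len += n;
    s->data[s->len] = '\0';
    return 0;
}

/* Equality: differing lengths settle it without touching the bytes. */
int lstr_eq(const lstr *a, const lstr *b)
{
    return a->len == b->len && memcmp(a->data, b->data, a->len) == 0;
}

void lstr_free(lstr *s)
{
    free(s->data);
    s->data = NULL;
    s->len = s->cap = 0;
}
```

With this, lstr_init(&s, "ab\0cd", 5) stores all five bytes and lstr_len(&s) returns 5, where strlen would stop at 2. The trailing '\0' is only kept so that data can still be handed to functions expecting a C string when the contents have no embedded zero bytes.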