"I've been writing C for quite some time, but I never followed good conventions I'm afraid, and I never payed much attention to the optimization tricks of the higher C programmers. Sure, I use const when I can, I use the pointer methods for manual string copying, I even use register for all the good that does with modern compilers, but now, I'm trying to write a C-string handling library for personal use, but I need speed, and I really don't want to use inline ASM. So, I am wondering, what would other Soylenters do to write efficient, pure, standards-compliant C?"
Look at the java implementation of Strings. It is pretty interesting. Strings are final and are reused. So if you have 1000 instances of the String "yes" in your program at once, only one area of the memory will contain "yes".
There is other nifty tricks as well in the string copy operations.
Get the source code.
We call those "const" in my hood, nigga.
"const" don't have the copy on write feature. You can't even write them. So sorry, we are not talking about the same thing..
Not the same thing at all. Const can't be assigned at all (other than the initial assignment). Thi is entiorely different. There is a string pool containing one and only one copy of every string assigned anywhere. If you have char *a="hello " and char *b="there" and char *c = "hello there!", there will be 3 strings stored. If you concatenate a and b assigned to char *d, d will point to the same instance of "hello there" as c does (and it's reference count will be incremented).
Unlike a const, you can then do d = "Never Mind".
You mean whenever I change a string in Java, the JVM runs through ALL my strings in memory to make sure I don't have a duplicate string already? and there's no opting out??
Yikes, that sounds like a huge waste of CPU cycles. I'll happily trade memory to get that performance back, thanks.
It's done with hashes.
I'm not sure if this still applies with newer versions, but maybe it's unrelated: https://stackoverflow.com/questions/16123446/java- 7-string-substring-complexity [stackoverflow.com]
But as far as I know, Qt strings still behave this way.
String a = "1234567890";a = a.substring(0,2); // a is now "12"
1) No memory copy has occured.2) "1234567890" is still kept into memory with a reference count of 1
Now (apparently according to the link you posted):
1) Memory copy has occured.2) "1234567890" is eligible for garbage collection while a new area of memory has been assigned to "12".
IMHO, there is work to be done to combine the 2 approaches. Hint: Keep old behavior but revert back to new behavior in the background after a given TTL where "1234567890" hasn't been accessed, then transparently copy the substring "12" to a new memory area.
I know it sounds very complicated but it is pretty simple compared to the many flavors of garbage collection floating around ;-)
As cool as the java implementation is, it may not be what the original poster is looking for. Sure it is very memory efficient, but optimizing for memory tends to reduce performance (as optimizing for performance often increases memory usage).
Maintaining a "library" of final/immutable strings means that every time you create or modify a string, you will have to search the list of existing strings for an existing copy. Likewise any destroyed or modified strings may need to remove an entry from the list. At this point, you are implementing some sort of a reference counting mechanism and/or garbage collection. Interesting (well, at least interesting to certain code geeks) problems to solve and play with, but maybe not exactly what you are looking for.
The big implication of all this is that anytime you modify a string, a new immutable string is (possibly) created. So think about appending text in a loop. For every iteration of that loop, a new immutable string is added (or found) in your library of strings. This can make certain text operations pretty expensive.
I think all of this stuff is worth thinking through, but if you are just looking to write a library to help you with or abstract common string operations, this is probably more work than you need to do. Either way from your original post, it sounds like you care more about cpu performance than memory usage, which probably means this is the wrong direction for you.
Either way, food for thought and a good mental exercise.
It really depends on the context of application. If he wants to manipulate strings ever so often, then that seems like an overkill. If he's going to be crunching strings like beta crunches F-bombs, then it might be a good idea to go beyond that.
I often say that Context Information is everything. And it really is.
deduplication is great for disks, but if I have something in memory, I want it to be fast.
it sounds 'elegant' for java to de-dupe strings but I'm not sure its going to be faster. I never pick elegant of speed, personally. elegant is for school. simple and obvious is much better for production code. and not having to search for instances of a string, just to de-dupe it, is not at all simple or fast.
As a Rubyist I challenge the idea that elegant is at odds with simple. The simpler code is generally more elegant.It is the more clever code that uses the excuse of being faster, not the more elegant. Elegant code is faster where the algorithm it uses is simpler. That is the essence of an elegant improvement; the code gets simpler and typically faster.
There are tradeoffs. Yes, you do extra work in some cases, but in others you do a lot less. For example, getting the length of the string becomes trivial rather than scanning the buffer (potentially wiping out your cache) Comparisons are, of coure, trivial. You do more accounting when you cat the strings, but you do less scanning and there's no chance of overrunning the buffer. It can also be a win if you have good optimization.
Which is overall better and faster will strongly depend on what you're doing. I do suspect that on average final strings will be a bit slower but safer.
Any sane string implementation (that is, not C's) will have a length field (or, alternatively, an additional pointer to end), so you don't have to scan the string to find out its length (an O(n) operation which, as you noted, will do bad things to your cache), but can just read the length (or calculate it in O(1) by simple pointer subtraction). This also means you'll not arbitrarily restrict the characters which can be stored in your string (C applications are generally easily identified by them being tripped off by a zero byte in positions supposed to contain text).
In C, if the strings are constant (I mean, if they are hard-coded) their memory positions are reused, too. This is one of the most basic optimizations C compilers do.
Doesn't need to go as far as java lies. std::string in C++ is much nearer and has the same COW optimization.https://en.wikipedia.org/wiki/Copy-on-write [wikipedia.org]