SoylentNews is people

posted by hubie on Saturday February 11, @10:28AM

GitHub built a new code-focused search engine in Rust because popular text search engines couldn't scale enough:

The Rust programming language continues to grow in popularity and now developer platform GitHub has used it to build its new code-focused search engine, Blackbird.

Instead of perusing forums for answers, GitHub wants users to use its search engine, which is currently in beta.

[...] "At first glance, building a search engine from scratch seems like a questionable decision. Why would you do that? Aren't there plenty of existing, open source solutions out there already? Why build something new?" writes GitHub's Timothy Clem.

His short answer is that GitHub hasn't found success using general text search products to power code search.

"The user experience is poor, indexing is slow, and it's expensive to host. There are some newer, code-specific open source projects out there, but they definitely don't work at GitHub's scale," he writes.

[...] The Rust-written custom search engine, Blackbird, is more efficient and gives GitHub "substantial storage savings via deduplication and guarantees a uniform load distribution across shards", according to Pavel Avgustinov, VP of software engineering at GitHub.
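The summary does not say how Blackbird gets deduplication and an even load across shards. One common way to get both at once is to key each file by a hash of its content and pick the shard from that hash. The sketch below is only an illustration under that assumption; `ShardedStore`, `content_key`, `shard_for`, and the shard count are hypothetical names and values, not anything from GitHub.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count, chosen only for illustration


def content_key(blob):
    """Return a hex digest that identifies a blob by its content alone."""
    return hashlib.sha256(blob).hexdigest()


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a content key to a shard index.

    A cryptographic hash behaves like a uniform random value, so keys
    spread roughly evenly over the shards regardless of repository size.
    """
    return int(key, 16) % num_shards


class ShardedStore:
    """Toy content-addressed store: identical bytes are kept only once."""

    def __init__(self, num_shards=NUM_SHARDS):
        self.shards = [dict() for _ in range(num_shards)]

    def add(self, path, blob):
        key = content_key(blob)
        shard = self.shards[shard_for(key, len(self.shards))]
        # setdefault stores the blob the first time its key is seen;
        # later copies only record one more path pointing at it.
        entry = shard.setdefault(key, {"blob": blob, "paths": set()})
        entry["paths"].add(path)
        return key

    def stored_blobs(self):
        """Count the distinct blobs held across all shards."""
        return sum(len(shard) for shard in self.shards)


store = ShardedStore()
store.add("repo_a/LICENSE", b"MIT License ...")
store.add("repo_b/LICENSE", b"MIT License ...")  # same bytes, stored once
store.add("repo_a/main.rs", b"fn main() {}")
print(store.stored_blobs())
```

Because the shard depends only on the content hash, a fork that copies thousands of unchanged files adds paths but no new blobs, and a single huge repository cannot overload one shard.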

He argues GitHub's scale means it can't use a Unix 'grep' (global regular expression print) for search. In effect, it would be too slow when considering the possibility of processing hundreds of terabytes of code in memory. Queries would take too long.


This discussion was created by hubie (1068) for logged-in users only, but now has been archived. No new comments can be posted.
  • (Score: 3, Funny) by turgid on Saturday February 11, @10:50AM (5 children)

    by turgid (4318) Subscriber Badge on Saturday February 11, @10:50AM (#1291246) Journal

    Well if Microsoft says we should all be using Rust, I suppose I'd better start using it. For years they told us all we should be using C++.

    • (Score: 0) by Anonymous Coward on Saturday February 11, @03:33PM (4 children)

      by Anonymous Coward on Saturday February 11, @03:33PM (#1291274)

      That was before they decided that Windows 10 would be the last Windows ever.

      • (Score: 2) by turgid on Saturday February 11, @04:37PM (3 children)

        by turgid (4318) Subscriber Badge on Saturday February 11, @04:37PM (#1291290) Journal

        Is the new one going to be written in Rust?

        • (Score: 3, Funny) by Freeman on Monday February 13, @02:52PM (2 children)

          by Freeman (732) Subscriber Badge on Monday February 13, @02:52PM (#1291547) Journal

          Nah, it's too rusty. They need something more shiny to peddle.

          --
          Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
          • (Score: 2) by turgid on Monday February 13, @03:06PM (1 child)

            by turgid (4318) Subscriber Badge on Monday February 13, @03:06PM (#1291549) Journal

            And I suppose they didn't invent Rust so they can't control it, can they? Perhaps they'll create a Rust++ or a Rust# which is slightly incompatible and broken in subtle ways but PHB-friendly so millions of Microsoft developers the world over will be forced to learn it.

            • (Score: 3, Funny) by Freeman on Monday February 13, @05:59PM

              by Freeman (732) Subscriber Badge on Monday February 13, @05:59PM (#1291586) Journal

              Rust# has a nice ring too. Just try not to get tetanus from using it.

              --
              Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
  • (Score: 2, Informative) by Anonymous Coward on Saturday February 11, @01:29PM (17 children)

    by Anonymous Coward on Saturday February 11, @01:29PM (#1291256)

    He argues GitHub's scale means it can't use a Unix 'grep' (global regular expression print) for search. In effect, it would be too slow when considering the possibility of processing hundreds of terabytes of code in memory. Queries would take too long.

    Hasn't this bunch heard of indexing and caching? You don't have to support the fancier regex stuff. Prefix and suffix indexing has been around for ages.

    Also indexing hundreds of TB of text isn't that much nowadays for a global level company. Just a few dozen computers and hundreds of SSDs should do it.

    FWIW there are some papers out there on fast regex indexing/search.
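To make the indexing point concrete, here is a minimal sketch of one well-known approach to indexed substring search: an inverted index over three-character n-grams (trigrams), with a final verification pass over the candidates. It is not claimed to be what GitHub or any particular paper does; `TrigramIndex` and the sample file names are made up for illustration.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Toy inverted index: trigram -> set of document ids containing it."""

    def __init__(self):
        self.docs = {}
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for gram in trigrams(text):
            self.postings[gram].add(doc_id)

    def search(self, needle):
        grams = trigrams(needle)
        if not grams:
            # Needles shorter than three characters have no trigrams,
            # so this sketch falls back to scanning every document.
            candidates = set(self.docs)
        else:
            # A document can only match if it contains every trigram
            # of the needle, so intersect the posting lists.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams)
            )
        # Sharing all trigrams does not prove a match; verify each candidate.
        return sorted(d for d in candidates if needle in self.docs[d])


index = TrigramIndex()
index.add("a.rs", "fn parse_args() {}")
index.add("b.py", "def parse_config(): pass")
index.add("c.c", "int main(void) { return 0; }")
print(index.search("parse_"))  # -> ['a.rs', 'b.py']
```

The index only narrows the search to a small candidate set; the expensive exact match then runs on those few documents instead of the whole corpus. Regex support can be layered on by extracting trigrams that any match must contain.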

    • (Score: 3, Informative) by turgid on Saturday February 11, @01:55PM (4 children)

      by turgid (4318) Subscriber Badge on Saturday February 11, @01:55PM (#1291258) Journal

      GitHub is Microsoft, remember. They're not the sharpest implements in the box.

      • (Score: 0) by Anonymous Coward on Saturday February 11, @03:34PM (3 children)

        by Anonymous Coward on Saturday February 11, @03:34PM (#1291275)
        I suspect most of the smart experienced people in Microsoft have retired or moved to Microsoft Research. And that explains the dismal state of Windows (8, 10, 11).

        SQL Management Studio takes ages to launch nowadays. Teams search often doesn't search - it finds stuff but you can't click on it and go to the message and the context.

        Maybe Microsoft's ChatGPT stuff will give about as useless/inaccurate results but with more human-like prose.
        • (Score: 0) by Anonymous Coward on Saturday February 11, @06:57PM (2 children)

          by Anonymous Coward on Saturday February 11, @06:57PM (#1291300)

          > And that explains the dismal state of Windows (8, 10, 11).

          Was there a time when Windows was not in a dismal state?

          • (Score: 1, Informative) by Anonymous Coward on Sunday February 12, @10:56AM

            by Anonymous Coward on Sunday February 12, @10:56AM (#1291394)
            Windows XP and 7 were relatively good for the time.

            In contrast Desktop Linux has been dismal for ages. With developers wasting time on stuff like "wobbly windows".
          • (Score: 2) by maxwell demon on Sunday February 12, @11:25AM

            by maxwell demon (1608) Subscriber Badge on Sunday February 12, @11:25AM (#1291399) Journal

            Yes, before they started development, Windows was in the best state any Microsoft operating system has ever been in: Nonexistence.

            --
            The Tao of math: The numbers you can count are not the real numbers.
    • (Score: 4, Interesting) by HiThere on Saturday February 11, @02:26PM (10 children)

      by HiThere (866) on Saturday February 11, @02:26PM (#1291259) Journal

      My first take was that it was the work of some Rust fan who wanted to prove that the language had some merit. Well, it has, but so have Haskell and Erlang. I surveyed the languages out there for my current project and even tried a couple. I decided on C++. Rust didn't make the first cut. It's not that I couldn't do it in Rust (I'm not sure), it's that I didn't like it. (The one I liked was D, but C would actually have been best except that I need hash tables.)

      --
      Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
      • (Score: 3, Touché) by Rosco P. Coltrane on Saturday February 11, @03:23PM (1 child)

        by Rosco P. Coltrane (4757) on Saturday February 11, @03:23PM (#1291270)

        I surveyed the languages out there for my current project [...] Rust didn't make the first cut [...] (I'm not sure) [...] I didn't like it.

        With such impartial and technically-grounded arguments, I'm totally convinced!

        • (Score: 4, Informative) by HiThere on Saturday February 11, @08:58PM

          by HiThere (866) on Saturday February 11, @08:58PM (#1291310) Journal

          No single datapoint will be decisive. It depends on your project, your existing skills, and how much you feel like trying another language. But I saw little in Rust to recommend it over Go, D, or Erlang. Each of those would be better than Rust for some projects, and worse for others. C would have been best, because libraries generated in C are the most portable, but I needed a hash table, and didn't want to code my own or depend on some other (non-standard) library. (Also the glib hash table was...well, the documentation was difficult to parse. If I really need C I can rewrite later. It's not quite a trivial rewrite, as I use vector quite a lot, but pretty easy. But I'd probably use a hash table from one of my reference books rather than add an external library, even that one.)

          Rust? What languages can easily import libraries written in Rust? They may exist, or even be common, but I haven't run across references to them, which indicated to me that they'd be poorly documented.

          --
          Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
      • (Score: 3, Interesting) by RS3 on Saturday February 11, @03:42PM (6 children)

        by RS3 (6367) on Saturday February 11, @03:42PM (#1291278)

        I haven't worked with hash tables in programming yet. You mean like this?

        https://www.tutorialspoint.com/data_structures_algorithms/hash_data_structure.htm [tutorialspoint.com]

        • (Score: 3, Informative) by HiThere on Saturday February 11, @09:03PM (5 children)

          by HiThere (866) on Saturday February 11, @09:03PM (#1291313) Journal

          Hash tables: in Python the name is dictionary (dict), in C++ unordered_map. I don't remember what it's called in Java, but it's there. Most modern languages come with hash tables built in, but C dates from back when RAM was really precious. But I don't know why they haven't added them to the standard library.
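For anyone who has not met the data structure itself, below is a minimal sketch of one classic design, separate chaining: keys are hashed to a bucket, and each bucket holds a short list of key/value pairs. It is written in Python purely for readability and is not how any of the built-in implementations named in this thread work internally; `ChainedHashTable` and its method names are made up for illustration.

```python
class ChainedHashTable:
    """Toy hash table using separate chaining with bucket doubling."""

    def __init__(self, num_buckets=8):
        self._buckets = [[] for _ in range(num_buckets)]
        self._size = 0

    def _bucket(self, key):
        # hash() turns the key into an integer; the modulo picks a bucket.
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))
        self._size += 1
        # Keep chains short: grow once there are more than two
        # entries per bucket on average.
        if self._size > 2 * len(self._buckets):
            self._resize()

    def get(self, key, default=None):
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return default

    def _resize(self):
        pairs = [pair for bucket in self._buckets for pair in bucket]
        self._buckets = [[] for _ in range(2 * len(self._buckets))]
        for key, value in pairs:
            self._bucket(key).append((key, value))

    def __len__(self):
        return self._size


table = ChainedHashTable()
table.put("rust", 2015)
table.put("c", 1972)
table.put("c", 1973)  # same key again: the value is replaced
print(table.get("c"), len(table))
```

Lookups stay fast on average because each chain is kept short; the cost is the extra memory for buckets and the occasional rebuild when the table grows.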

          --
          Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
          • (Score: 2) by RS3 on Saturday February 11, @09:07PM (4 children)

            by RS3 (6367) on Saturday February 11, @09:07PM (#1291314)

            Would these [thoughtco.com] help?

            • (Score: 2) by HiThere on Sunday February 12, @12:24AM (3 children)

              by HiThere (866) on Sunday February 12, @12:24AM (#1291326) Journal

              Not really, as I don't want to depend on an external library, but I've bookmarked that link because it may be what I want for some other project.

              --
              Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
              • (Score: 2) by RS3 on Sunday February 12, @12:49AM (2 children)

                by RS3 (6367) on Sunday February 12, @12:49AM (#1291328)

                I wouldn't either. My thought was get some open source, check it over, and use it in your project, or compile it to your own static or dynamic library. I think? Maybe? Unless you're bound in some situation where you can't use open source. BTW, I don't do enough programming so maybe I'm completely off base here, but AFAIK, most C functions are in some kind of library, right? https://en.wikipedia.org/wiki/C_standard_library [wikipedia.org]

                maybe a little cleaner format: https://en.wikibooks.org/wiki/C_Programming/Standard_libraries [wikibooks.org]

                https://en.cppreference.com/w/cpp/header [cppreference.com]

                • (Score: 2) by HiThere on Sunday February 12, @01:47AM (1 child)

                  by HiThere (866) on Sunday February 12, @01:47AM (#1291337) Journal

                  Ah. OK. I plan to GPL the code after it's done, so that's not a problem. But I've already got source, I'd just need to adapt it.
                  But C++ is nearly as good as C for this purpose, as most things can take C++ libraries. (I could switch the vectors out pretty easily, as they're all vector. So if I need to do the conversion it's no big thing. I'd need to switch fstream to FILE*, and things like that.)

                  --
                  Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
                  • (Score: 2) by RS3 on Sunday February 12, @02:36AM

                    by RS3 (6367) on Sunday February 12, @02:36AM (#1291344)

                    Yay, he gets me! I'm terrible at concisely verbalizing my ideas.

                    The link I included to the open-source hash stuff was straight C, I believe (too tired to look...)

                    BTW, I like your tagline (did I mention that before?) My first comments on /. more than 20 years ago were about my worries with the trouble javascript can cause. It has far too much power / ability / functionality (hence most malware comes through javascript functionality...)

      • (Score: 2) by JoeMerchant on Saturday February 11, @04:38PM

        by JoeMerchant (3937) on Saturday February 11, @04:38PM (#1291291)

        Nevermind the drag-drop language for children jokes...

        Was this "from scratch" effort really clean room coding from requirements, or did they run a C++ -> Rust translator on the existing code base?

        --
        Україна досі не є частиною Росії Слава Україні🌻 https://news.stanford.edu/2023/02/17/will-russia-ukraine-war-end
    • (Score: 2) by Freeman on Monday February 13, @06:03PM

      by Freeman (732) Subscriber Badge on Monday February 13, @06:03PM (#1291587) Journal

      "Fast" regex indexing/search, sounds like an oxymoron. You need something to take 10x longer and be 10x harder to figure out what you did last time. Just do it in RegEx.

      --
      Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
  • (Score: 2) by Rosco P. Coltrane on Saturday February 11, @02:55PM (1 child)

    by Rosco P. Coltrane (4757) on Saturday February 11, @02:55PM (#1291262)

    That's surprising. They've been flogging their new acquisition so hard lately, it's hard to believe they came up with a search product that doesn't use it.

    • (Score: 1) by psa on Sunday February 12, @02:28AM

      by psa (220) Subscriber Badge on Sunday February 12, @02:28AM (#1291343) Homepage

      it's hard to believe they came up with a search product that doesn't use it.

      Yet. Doesn't use AI, yet. I'm sure it's on the roadmap.

  • (Score: 2) by turgid on Monday February 13, @06:35PM

    by turgid (4318) Subscriber Badge on Monday February 13, @06:35PM (#1291591) Journal

    Don't forget the Hammerite [hammerite.co.uk] to cover all those holes the borrow checker can't reach.
