Indexing Code at Scale with Glean
132 points by GavCo 9 days ago | 30 comments
  • tomas789 9 days ago |
    This is certainly a step in the right direction. Especially with the proliferation of AI-based assistants, there will be a greater need for readily available information about the codebase. This could easily take those copilots yet another level up.

    For example, my workflow now with Cursor is to keep relevant code in separate tabs even though I don’t work on those files. I found it makes the autocomplete better, as it seems to me that all the active tabs are fed to the model. That means less space for me and more distraction. Glean might help here.

  • jtokoph 9 days ago |
    I was really confused and surprised that Meta was using a commercial product for indexing instead of building in-house...until I realized that they weren't talking about the AI search indexing tool at glean.com
    • fintler 9 days ago |
      glean.com is pretty awesome. The responses it generates will have citations from our internal Jira, Wiki, Slack, Github, etc.

      It's also great for when I get pulled into a busy Slack channel and need a summary of what's been going on in there for the past week.

      • scrollaway 9 days ago |
        What's the pricing on it? Everything I see is "contact us".
        • staindk 8 days ago |
          Glean.com? We had an intro meeting with them, pricing only makes sense if you're in a first world country and have 100+ or maybe 150+ employees.

          I recall pricing started at 50k USD per year but may be remembering incorrectly. Please take this with a grain of salt as they may have changed their pricing models or whatever - I just get really annoyed at the "contact us" stuff so thought I'd try to help out here.

      • tomerbd 8 days ago |
        I'm a little bit confused: is it the open source one that also searches Jira, wiki, Slack, ...? https://github.com/facebookincubator/glean
        • dijit 8 days ago |
          I'm also confused, the link you shared is more akin to a sourcegraph alternative; but the parent is talking as if it's an LLM.

          I'm going to guess that there are two completely unrelated products that share a name.

          glean.com and your link (glean.software).

    • iandanforth 9 days ago |
      Yeah this naming is questionable. This definitely introduces confusion in the minds of consumers but I'm not sure if it's actionable. Any lawyers want to give some "I am not your lawyer" opinions?
      • loeg 9 days ago |
        Meta's tool was started by at least August 2021. The Glean commercial product wasn't launched until September 2021.
      • cma 9 days ago |
        It is a high burden to get a trademark on a 5 letter common English word. Usually can only be awarded after years in use and large popularity.
  • conqrr 9 days ago |
    Glean: https://glean.software/ System for collecting, deriving and querying facts about source code
    • lenkite 7 days ago |
      The article never mentions that Glean is written in Haskell
  • jepler 9 days ago |
    My mind just balks at the idea of having so much source that a 2020s computer could take hours to index it. ctags is nothing special (both in its optimization and in the level of detail it gets to: just global function identifiers) and looks like it runs at about 400MB/s on a single core of an i5-1235U. But still, it looks like ctags could process about 100TB in 4 hours across 16 threads on a workstation-class CPU...
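
    As a rough check of that estimate, here is the arithmetic as a small sketch. It assumes perfectly linear scaling across threads and decimal units (1 TB = 1,000,000 MB), both of which are simplifications; the function name is made up for illustration.

```python
# Back-of-the-envelope check of the ctags throughput estimate.
# Assumption: throughput scales linearly with thread count (optimistic).
MB_PER_SEC_PER_THREAD = 400  # single-core figure quoted in the comment
THREADS = 16
HOURS = 4


def scanned_terabytes(mb_per_sec: float, threads: int, hours: float) -> float:
    """Total data scanned, in decimal terabytes."""
    seconds = hours * 3600
    total_mb = mb_per_sec * threads * seconds
    return total_mb / 1_000_000  # 1 TB = 1,000,000 MB


total_tb = scanned_terabytes(MB_PER_SEC_PER_THREAD, THREADS, HOURS)
print(f"{total_tb:.2f} TB in {HOURS} hours")
```

    Under those assumptions the total comes to roughly 92 TB, so "about 100TB" is the right order of magnitude.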
    • DylanSp 9 days ago |
      It sounds like the indexing time/complexity is increased a lot by the amount of detailed data they're storing. They mention determining which `using` statement is used to resolve each symbol reference in C++ source, to enable dead code detection; that's going to require some sophisticated analysis.
      • menaerus 8 days ago |
        Correct, you need to build an AST representation of the code that you want to index. Essentially, it's a compiler frontend pass, which is why it takes so much longer than what ctags heuristics do. Now think millions of lines of code, multiple build configurations, the amount of RAM you need, etc. Multiple branches, or even smaller revisions/commits, are also a big computation problem.

        That said, Glean seems to be reusing the indexer from LLVM/clang for C and C++.

        > The C++ indexer ("the clang indexer") is a wrapper over clang. The clang indexer is a drop in replacement for the C++ compiler that emits Glean facts instead of code. The wrapper is linked against libclang and libllvm.

        [1] https://glean.software/docs/indexer/cxx

    • phyrex 9 days ago |
      It's a mono repo across a dozen languages (good luck with ctags) that tens of thousands of developers commit to every day. Even if you'd spend the hours indexing it locally, it would be out of date right away.
    • UltraSane 9 days ago |
      The whole point of indexing data is to perform very expensive computation once and leverage the result many many times and it works really well.
    • kllrnohj 9 days ago |
      You kinda said it yourself already - ctags is fast because it's producing almost nothing of value. Being fast at doing nothing isn't impressive.

      Try doing the same with C++ and more indexing options enabled, such as with something like universal-ctags, and a larger code base; say, Android's repository ought to do it. Are you still getting 400MB/s? Nope.

  • tonymet 9 days ago |
    my favorite feature of code indexing at FB was how well integrated it was. Web search, cli search and IDE search all used the search index, but would reference your local context. This was useful for reference, call stack, dead code search.

    e.g. search results from ide search would link back to your local file. CLI results would reference your local clone.

    A great example of a small feature resulting in great usability.

    • Nathanba 9 days ago |
      By IDE search do you mean that it was using glean even in your local vscode? Does glean therefore work in combination with LSPs, because scip says that code modifications are a non-goal and now I wonder why somebody would create such a big tool only for local code to still just use LSPs and never use the server version of code navigation (scip or glean).
      • tonymet 9 days ago |
        It used the server version and references mapped back to the local clone when you clicked on a result.
  • rockwotj 9 days ago |
    Are there any UIs for this available openly? Or for Glass? I am a former Googler and I know how awesome this kind of tooling is, and it’s so hard to achieve with OSS. I would love open source code search. This seems very close, but there is no UI layer (and it seems like Meta uses this for code review and for IDEs). A basic UI would be a good start.
    • pepeiborra 8 days ago |
      Some people use the Glass command line client to integrate with Emacs/Vim/VSCode. Internally we have an LSP server that queries Glass, but it's not open source and some work would be needed to extract it. The only non trivial thing it does is position mapping to account for local changes.
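
      To illustrate what position mapping means here, below is a minimal sketch using Python's standard `difflib`. It is not the internal LSP server's method, just one plausible approach: diff the indexed revision against the local file and translate line numbers through the unchanged regions. The function name `map_line` and the sample inputs are hypothetical.

```python
import difflib
from typing import Optional, Sequence


def map_line(indexed: Sequence[str], local: Sequence[str], lineno: int) -> Optional[int]:
    """Map a 0-based line number in the indexed revision to the local file.

    Returns None when the line was edited or deleted locally, since the
    index then has no trustworthy position for it.
    """
    matcher = difflib.SequenceMatcher(a=indexed, b=local, autojunk=False)
    for tag, i1, i2, j1, _j2 in matcher.get_opcodes():
        # Only unchanged ("equal") regions carry positions over safely.
        if tag == "equal" and i1 <= lineno < i2:
            return j1 + (lineno - i1)
    return None


indexed_src = ["int f();", "int g();", "int h();"]
local_src = ["// local comment", "int f();", "int g() noexcept;", "int h();"]

for n in range(len(indexed_src)):
    print(n, "->", map_line(indexed_src, local_src, n))
```

      A result from the server index that points at a line in the indexed revision can then be shifted to the matching line in the local clone, or dropped if the local edit touched it.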

      The integrations for code review and symbol search are both built for internal tools and not amenable to open sourcing.

      FWIW I agree that the lack of open source integrations is the main barrier for external adoption.

      • rockwotj 8 days ago |
        Yeah understood that the internal meta stuff would be too tied up with internal infra to be OSS, it’s the same with a lot of Google’s tooling here.

        I do wish there was a startup here. There is sourcegraph, which has ok code search (github has come a long way, but without indexing and understanding the build you can’t do it justice). There are also cool code review startups like Graphite, but they don’t work together. I remember how powerful it was to review a change then go use an xref to see how a function is used that is untouched in a code review, so does not show in the diff, which requires checking out the changes locally in OSS land and context switching to leave comments.

  • YetAnotherNick 9 days ago |
    There are already 3 popular products named Glean, on the .com, .ai and .co domains. This one is Glean on .software.
  • archy_ 9 days ago |
    When I read about these things, I can't help but wonder if anybody took a step back and thought "maybe we just have too much code"?

    At some point, perhaps you're just doing too much

    • dboreham 8 days ago |
      Career limiting thoughts.
      • sangnoir 8 days ago |
        Product and profit limiting too, if you're deleting profitable code for aesthetic reasons.
    • nthingtohide 8 days ago |
      There isn't too much code till the point we have automated asteroid mining.
  • PessimalDecimal 8 days ago |
    Google's equivalent to this is Kythe (https://kythe.io/). Earlier today I had noticed that Kythe ripped out its support for indexing Rust code and wondered what alternatives might exist. So it's interesting to see this right now! And it looks like it supports Rust (albeit via rust-indexer).