4.5M Suspected Fake Stars in GitHub
236 points by qianli_cs 12 days ago | 212 comments
  • mentalgear 8 days ago |
    In a world with so much fake PR stuff and AI slop, any project that tries to verify what's real and what's not is an excellent choice of topic, fostering integrity again in our industry.

    Here's the actual repo: https://github.com/hehao98/StarScout

  • hobs 8 days ago |

    > Allow it a week to finish all iterations and expect it to read >= 40TB of data. You can use nohup to put it as a background process. After a week, you can run the following command to collect the results into MongoDB and local CSV files:
    
    I just love the yolo nature of "well let's check in a week if that 40TB of data processing worked"
    • queuebert 8 days ago |
      Reminds me of this famous paper: https://arxiv.org/abs/astro-ph/9912202
      • krick 7 days ago |
        I suppose it must be a joke, but it's striking nevertheless how much clearer and better written this paper is. I absolutely hate these papers that have an abstract of "In this paper, we present a systematic, global, and longitudinal measurement study" and the conclusion of "In this paper, we have presented a systematic, global, and longitudinal measurement study". It's almost surreal that this, on the contrary, is not a joke, and the majority of papers are written in this idiotic manner.
  • prepend 8 days ago |
    I don’t like stars as a metric. Or at least as a comparator. If you brag about having a million stars, that says something, as a million is a lot.

    But if you brag that your project has a million and your competitor has half a million, that is so illogical that I would discount your project and think it’s run by dummies.

    Are there practical situations where people really need stars enough to buy them?

    • hiccuphippo 8 days ago |
      My only guesses are people showing popular repos for their CV or to appear legitimate to get access to another repo like what happened with the xz utils backdoor.
      • bdcravens 8 days ago |
        There's also the third category of projects receiving funding.
        • datadrivenangel 8 days ago |
          And for a while startups were using it as a traction metric for open core projects when pitching to VCs.
    • pembrook 8 days ago |
      Once you start trying to make a living from anything you do online, you start to realize that literally everything on the internet is gamed to extreme. Even this article was written and posted here for a reason.

      If your GitHub repo can in any way provide you with income (from just having something to talk about in an interview or an innocuous “buy me a coffee link”…all the way up to selling $100,000/yr enterprise support plans), you now have a strong incentive to game the system.

      And if it’s allowed by the system, then it’s a prisoner’s dilemma. Because if you DON’T do it, your competition will do it and eat your lunch.

      That’s why it’s so important to design high integrity ranking systems.

    • magic_smoke_ee 8 days ago |
      The "like" metric is dumbed down to self-amplifying popularity that hovers around meaninglessness. It would be more valuable to weight things based on whether people you respect also rate a particular item.
    • mrweasel 8 days ago |
      Github really wants to be a social network or something to that effect and I get the feeling that most developers don't care. If you log in it's pretty clear that the "front page" is supposed to be something like a feed, but I don't know anyone who uses it. Mine is completely blank and pointless. Stars, I suppose, are meant to be something akin to a like, maybe.

      I have plenty of Github projects bookmarked, but I never "starred" one... Why would I?

  • dang 8 days ago |
  • zitterbewegung 8 days ago |
    All metrics will be gamed at some point. I don't know exactly how you could even fight this.
    • jasoneckert 8 days ago |
      Neither do I.

      I believe the only thing anyone can do is take metrics of how the metrics are gamed, as this particular paper has done.

    • jazzyjackson 8 days ago |
      there are various reasons webs-of-trust don't take off, but I can imagine a system where the metrics I see are only aggregated from friends-of-friends, and any other signal is just considered untrustworthy and therefore not worth observing
      • drusepth 8 days ago |
        Do you still trust that system when your friends-of-friends are the ones gaming the system? Given the inherent network effects of manipulating webs of trust, I wouldn't be surprised if everyone had at least one friend-of-a-friend they shouldn't necessarily trust.
        • morkalork 8 days ago |
          Given all the obvious bots and sketchy recruiters that try to connect with me on LinkedIn, who all appear to have at least one mutual connection, it probably won't work.
          • jagged-chisel 8 days ago |
            Do we have a similar issue on GH? I think the nature of the service and its target audience affect this problem in a big way. You can follow anyone on GH, but there's no mutual connection option at all. LI has following and mutual connections. LI also has a much wider audience.

            How might a 'connection' look on GH? Will people freely connect, or will they appraise requests more closely?

      • wruza 8 days ago |
        I can imagine access to raw data instead of some stupid come-on-game-me-able predefined indicator, and that I can run some private statistical analysis over it. People would use (and share) different algorithms and gamers will at least wander through this collectively created mud without any understanding except for the defaultest measures.

        But of course this is too complex and “no one will use it” (tm). So we’ll better have a screwed up recommendation system that doesn’t work at all, cause that’s simpler!

      • codetrotter 8 days ago |
        I can only speak for me personally. For me the way that I use GitHub I don’t think the concept of “friends of friends” would be all that useful on GitHub.

        There are a handful of people that I know IRL that I follow on GitHub. And a few hundred that I follow in total. Out of the handful of people I know IRL, and who I follow on GitHub, only two or three of them are active there any given week. All of the other people I follow I have very little idea who they are. Usually I follow people I don’t know if I come across their profile and either the profile itself or their projects make me follow them. But I star way more different repos than the number of people I click follow on.

        For me, the main way of discovering new repos are:

        - Frontpage of HN, and comments in posts on HN.

        - Specific search results on Google when I have searched for libraries or programs that do specific things.

        - Libraries on crates.io that I think might be interesting to look into in the future.

        Maybe once or twice a month I happen to click on the main page of GitHub itself and see mentions of repos that have been committed to or starred or created by people I follow.

        So for me I don’t think “friends of friends” is a particularly great signal for things to look at. Most of the people I follow, I don’t know much about them.

        Likewise, for anyone that follows me it’s not necessarily any strong signal that I follow someone else in order to determine if activity from that someone else should be shown or weighted as more significant to my follower just because I happen to follow that other person.

        If you do want a strong signal for who to boost for my followers based on my own activity, go and look at the dependencies that I am using in my own projects. That’s a pretty good indicator that I put some amount of effort and interest into looking at something. This could be done by GitHub itself, parsing the Cargo.toml files of my projects and extracting the dependencies section and looking up which of those dependencies are hosted on GitHub.

      • kube-system 7 days ago |
        Maybe so, but in this case, I don't think 'stars' is a good candidate for one of those metrics. I think the people worried about 'fake stars' are doing it wrong, and should just ignore the metric entirely.
      • yencabulator 7 days ago |
        For my account, only count stars by the top 50% of the contributors to the projects I have starred?
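
        One possible reading of this rule, as a rough sketch: treat a user as trusted if they are in the top half of contributors (by commit count) to any repo I have starred, then count only trusted stargazers. The data shapes, user names, and the round-up-on-odd-counts choice are assumptions for illustration.

```python
def trusted_users(my_starred_repos, contributions):
    """Users in the top half of contributors (by commits) of any repo I starred.

    contributions maps repo name -> {user: commit_count}.
    """
    trusted = set()
    for repo in my_starred_repos:
        counts = contributions.get(repo, {})
        ranked = sorted(counts, key=counts.get, reverse=True)
        keep = (len(ranked) + 1) // 2  # top half, rounding up (an assumption)
        trusted.update(ranked[:keep])
    return trusted


def personalized_star_count(stargazers, trusted):
    """Number of distinct stargazers that are also in my trusted set."""
    return len(set(stargazers) & trusted)


# Hypothetical data.
contributions = {
    "lib_x": {"ann": 50, "bob": 10, "cy": 3, "dee": 1},
    "lib_y": {"eve": 7},
}
trusted = trusted_users(["lib_x", "lib_y"], contributions)
count = personalized_star_count(["ann", "eve", "bot1", "bot2"], trusted)
```

        With this made-up data the repo shows 2 stars to me instead of 4, because the two bot accounts never contributed to anything I starred.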
    • begueradj 8 days ago |
      It comes down to fighting against human nature. And that's a lost battle.

      Set any law you want, our nature will push us to circumvent it even legally.

      • thrance 8 days ago |
        Not nature no, it's all about incentives. Oftentimes it's financial, for github stars it's prestige and visibility.
      • mentalgear 8 days ago |
        Most people are happy living in a fair ecosystem - it's only the 1-2% of the population that seek control, money and power that start trying to exploit the system.

        Only if we let that minority keep manipulating the system without consequences does it become the driving market force that the rest of the population also feels they have to comply with, to go along, as has already happened in finance, academia, etc.

        • JumpCrisscross 7 days ago |
          > Most people are happy living in a fair ecosystem

          For varying and self-serving definitions of fair. (Almost everyone in the rich world is in an unfairly-advantaged minority.)

        • tcmart14 7 days ago |
          I'd push back against the 1-2%. I think the reality is, 1-2% is the group of people who will exploit the system and, more importantly, have the means to do so. But the number of people who would exploit the system is probably quite a bit higher; it just doesn't matter because they don't have the means to do so.
      • vouaobrasil 8 days ago |
        I don't really think so. The Amish have a nice system. Their society has many fewer bad actors compared to general society.

        Actually one of the keys is repeated contact. People who have to interact again and again will try and game the system less. Not sure how to build that into a star system but why give up so easily? Do programmers give up when you say "this algorithm can't be made any faster?"

        • JumpCrisscross 7 days ago |
          > one of the keys is repeated contact

          The other is hierarchy. You can't automate reputation scoring.

        • eddythompson80 7 days ago |
          I don't think it's just the Amish. Collectivist cultures in general have (or maybe are perceived to have, I don't know) fewer bad actors compared to individualistic cultures.

          It doesn't matter if people have to interact frequently if there are no real consequences to that interaction. The punishment in those collectivist cultures involves social shunning, shaming, etc. Individualistic cultures almost pride themselves on how much they can disregard social shunning and shaming. Shameless people are celebrities and elected officials. They are admired as opposed to shunned and ignored. A bad actor in an Amish community is expelled and loses access to what that community offers. That would be illegal in the general society unless their "bad act" was actually illegal. Discriminating against someone for being a dickhead who exploits loopholes and unregulated corner cases (without explicitly breaking the law) would be illegal in many contexts.

          > Not sure how to build that into a star system but why give up so easily? Do programmers give up when you say "this algorithm can't be made any faster?"

          I don't think people have given up. Online fraud detection is a massive industry as is. Spotify plays, YouTube views, Google search, Amazon reviews, reddit upvotes, twitter's retweets, facebook likes/shares, etc all fall exactly into the same bucket. There is even a significant dollar amount attached to many of those, more so than GitHub stars. All are frequently gamed/faked and it's a battle between the platforms and the adversary.

          • vouaobrasil 7 days ago |
            Good points. I'll only add that I just mentioned the Amish because it's the only culture ("subculture?") that I've read thoroughly about. But I think in collectivist cultures it is indeed much harder to be a bad actor. Perhaps we should have a little more shunning...
          • lupire 7 days ago |
            You are describing small communities, not collectivist ones.

            Large "Collectivist" communities have a body count in the hundreds of millions.

        • keybored 7 days ago |
          We’re talking about a stupid gamification and notoriety metric. I won’t be losing sleep over “human nature” failing to honor it.
          • vouaobrasil 7 days ago |
            Me neither. I don't give a darn about GitHub. It could burn in hell for all I care. But my comment was about this phenomenon in general, not how it affects some Microsoft service.
        • seventytwo 7 days ago |
          Amish culture doesn’t scale across humanity though. It’s a walled garden.
        • yencabulator 7 days ago |
          Repeated contact is just one mechanism, there are others that scale to country size. The end result is called a high trust society.

          https://en.wikipedia.org/wiki/High-trust_and_low-trust_socie...

          For example, Tokyo has a lot of people and they actively dislike interacting with strangers but if you leave your laptop unattended while peeing at a coffee shop, it's very unlikely to have been stolen.

      • banannaise 7 days ago |
        It's not fighting against human nature, it's fighting against the incentives of our economic system and the people who exploit them. What you're doing here is sometimes referred to as naturalizing - acting like something is the natural state of things, when it is specific to a present social system.
    • nwienert 8 days ago |
      Only show "Active Developer Stars" by default:

      - Only accounts that have a decent amount of activity (pushing code, commenting, etc)

      - Has set up SSH

      - Older than 2 years

      - Account active consistently for at least a year

      - Must have 2-factor enabled

      - Filled out profile

      etc
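
      The checklist above could be expressed as a simple predicate over account attributes. This is only a sketch: the `Account` fields and the thresholds (`min_events`, `min_active_days`) are hypothetical, and "active consistently for at least a year" is approximated by a count of active days in the last year.

```python
from dataclasses import dataclass


@dataclass
class Account:
    # All fields are hypothetical; no real GitHub API is implied.
    age_days: int
    active_days_last_year: int
    activity_events: int  # pushes, comments, etc.
    has_ssh_key: bool
    has_two_factor: bool
    profile_filled: bool


def is_active_developer(a: Account, min_events: int = 20, min_active_days: int = 30) -> bool:
    """True only if the account passes every item on the checklist."""
    return (
        a.activity_events >= min_events
        and a.has_ssh_key
        and a.age_days > 2 * 365
        and a.active_days_last_year >= min_active_days
        and a.has_two_factor
        and a.profile_filled
    )


def active_developer_stars(stargazers: list[Account]) -> int:
    """Count only the stars given by accounts that pass the checklist."""
    return sum(1 for a in stargazers if is_active_developer(a))


veteran = Account(3000, 120, 500, True, True, True)
fresh_bot = Account(10, 10, 40, True, True, True)
no_ssh = Account(3650, 200, 900, False, True, True)
shown = active_developer_stars([veteran, fresh_bot, no_ssh])
```

      In this made-up example only `veteran` counts, so the repo would show 1 star by default. Note that `no_ssh` is a long-lived, busy account that still gets filtered out.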

      • stevage 8 days ago |
        So now all the bots are pushing code, have SSH etc...
      • zitterbewegung 8 days ago |
        I've heard of people gaming GitHub stars by asking their friends to star their projects, which would get around all of your bullets. Hence why I said it would be hard to fight.
      • eddythompson80 8 days ago |
        All of those are very, very, easy to automate. There are plenty of bot accounts that have unintentionally checked the full list.
        • nwienert 7 days ago |
          You can find a set of requirements that aren't. Eg 2-factor can include phone number. And activity requirements can be based on repo maturity (not just pushing to random empty repos).

          And while some bot accounts may have them, I doubt many have most.

          Also, you argue on semantics but the general idea of setting up a legitimacy test that factors in various things is very easily doable, the factors can be kept private, and you definitely can find ones that are generally hard to game.

          • gruez 7 days ago |
            >You can find a set of requirements that aren't. Eg 2-factor can include phone number. And activity requirements can be based on repo maturity (no just pushing to random empty repos).

            Then you have people complaining about being "shadowbanned" (because there's no recourse if you're a person and the algorithm thinks you're not active enough), or that github is being anti-privacy (by requiring phone number). It's hard to win here.

            • wholinator2 7 days ago |
              I think the point is that these requirements are not published, and they are not requirements to use stars. Anyone can star, no one knows whether their account is contributing to the star count. Now, presumably you could star a thing and check if the number went up but maybe introduce slight randomness or delay to obfuscate even those details. I remember when reddit removed the total upvote/downvote counts from the ui
              • eddythompson80 7 days ago |
                The point is that this is not arguing on semantics nor is it as simple as just a "set of requirements" that they just follow. Battling fraud online is an entire business in itself. Take Spotify plays, YouTube views, Google search ranking, Amazon reviews, reddit votes, etc. These organizations have significantly more incentives than GitHub to reduce fraud in these metrics, and while they do, it's still really really hard and it's very easy to show how these metrics are gamed/faked all the time.

                It's not a matter of "here is a list of requirements that no one knows about, and here is slight randomness/delay to obfuscate".

                How much do you think it takes to pay an actual human from a poor country to come to work each day at 8am, create one github account after another, enter them in a database, and leave at 5pm?

                If you want to "study" how github handles stars because there is legitimate financial incentive for you in it, for $100 a day you can pay 10 or 20 of those people to create a few thousand accounts a day. Do it a few times a month, and throw these accounts in an automated system that creates random repos, pushes a few commits here and there, etc. Also "introduce some slight randomness or delay to obfuscate these events". Do some A/B testing to figure out how the 300k accounts under your control affect a repo star system, then advertise a "GitHub stars service" "$0.50 per guaranteed star on Github". Your average VC funded startup could get 10k stars for $5k. They probably give AWS 10 times that a month.

                Once github changes their requirements, do more testing, figure out what the requirements now are, then you're back in the game. If people do it all the time to Spotify, YouTube, Google, Amazon, Reddit, and Twitter, why do you think GitHub would somehow crack that nut?

                • wruza 7 days ago |
                  As someone working with people on the other end of this table, I can tell you there’s a limit of risk, clarity and tech complexity that they are ready to bear. And it’s pretty low. It all works for them only because threads like this usually end up with “it wouldn’t work anyway if I, a six figure guy, had all the time and budget in the world to defeat it, so let’s do nothing” type of non-solution. Which creates a defeatist spirit culture. Paying third-world workers is often economically and structurally unviable for the most low-hanging bot-like activities and it doesn’t even stay that cheap either once the demand grows due to technical barriers. I, being a lot less paranoid and defeatist, also tell these guys that it won’t work because this and that, but then it works, because the solutions the defending side comes up with are either laughable or from “so dumb, I feel I’m gonna faint” category. You won’t believe the elephants that can fly under their radar.

                  > people do it all the time to Spotify, YouTube, Google, Amazon, Reddit, and Twitter, why do you think GitHub would somehow crack that nut?

                  Because the listed projects do basically nothing, a bare minimum. They don’t even care as long as bots don’t play against their direct interest. Who cares at a media company, or a sales company, who exactly is at their top, as long as they are both not bad enough? Profits come either way. They all are shittiest examples of it who created, incorporated and are themselves part of this problem.

                  It’s akin to an immune system. Its goal is not to protect you from every HIV and cancer, but to avoid constant infections from stupid low-effort attacks. You don’t have to make it perfect, but it must be there. The more cryptic it is, the less welcoming it is to game it through basic means, the better.

              • JumpCrisscross 7 days ago |
                > the point is that these requirements are not published

                Well-connected people will get the tip off. And your PR team will have to keep batting down conspiracy theories, since if there's one thing the nutters love it's black boxes.

          • precommunicator 7 days ago |
            > Eg 2-factor can include phone number

            In GitHub organization settings you can require secure 2FA only, which kicks out anyone who uses SMS 2FA.

        • krick 7 days ago |
          Ironically, I suspect there are more "real" accounts that don't check the full list. Reminds me of most primitive captchas somehow.
      • the__alchemist 8 days ago |
        Hmm. I don't have SSH, but have many GH projects, and have been active for a decade. So, I would be filtered out as not an active dev, with the spammers?
        • nwienert 7 days ago |
          Sure, but at least stars would be net more useful.
    • uludag 8 days ago |
      I believe networks of human individuals can solve this to a good degree assuming a particular topology exists.

      Like, imagine a decent-sized group of professionals, all specializing in a similar field, and having lots of strong connections between each other where they have ample opportunities to share information. It would be hard for an outsider to come in and astroturf their product without immense effort (like hiring shills to attend conferences). In-person networks also obviously solve the problem of stars as reputation: reputation spreads naturally in these sorts of networks.

      I think the problem comes with algorithmic scale. Maybe a solution would be to have more community building activities (maybe preferably offline).

    • aydyn 8 days ago |
      Requiring real ID and showing _regional_ stars like Apple/Google would be a start.
      • eddythompson80 8 days ago |
        > Requiring real ID

        Yeah, people would love that for sure.

        > showing _regional_ stars like Apple/Google would be a start.

        What does that mean? I thought regions only impact ranking not the net amount of stars (assuming we're talking about Apple/Google Maps). Which as far as I know, github doesn't do ranking.

        • aydyn 7 days ago |
          People already use github as a professional portfolio. Facebook uses real ID, and how popular is it?

          > What does that mean? I thought regions only impact ranking not the net amount of stars (assuming we're talking about Apple/Google Maps). Which as far as I know, github doesn't do ranking.

          At least on iOS, reviews and ratings are by country. I don't actually know about Google Play though. (I don't have an Android to check since I am not poor)

      • stronglikedan 8 days ago |
        > Requiring real ID

        Sir, this is an HN.

        • aydyn 7 days ago |
          Oh thank god I thought we were at Wendy's
    • mentalgear 8 days ago |
      Doesn't mean we shouldn't fight back. That's exactly why we need research projects like these: to maintain the balance.
    • 1propionyl 8 days ago |
      Any metric that becomes a target ceases to be a good metric.

      The wrinkle is that measures that don't easily quantify are more resistant. For example, showing provable use by other reputable or trusted projects, or a significant amount of resources allocated to maintenance, or ...

      Really just anything that can't be reduced to a single number in a canonical way will in the long run prove far more useful for longer.

      This of course shifts some of the burden onto potential users to assess things more critically, and forecloses direct numerical comparison. But the idea that you could just look at a number and make such comparisons was faulty from the get go.

    • sedatk 8 days ago |
      Prioritize the stars given by accounts you follow in the UI. Done.
      • p1esk 8 days ago |
        I don’t want to follow anyone, but I do give stars to repos I like.
        • sedatk 8 days ago |
          Then you'll have to start following the creators of repos you like to build a web of trust.
    • awkward 8 days ago |
      I can see github platform internals caring about this for anomaly detection, but as a developer, who cares? I suppose a botnet could be making fake stars on a malware project or supply chain attack, but the problem there doesn't seem like it's the number of stars.
    • ozim 7 days ago |
      Make a website where you hire people to check out the libraries and publish scores.

      But people who have chops for that probably have high enough paying jobs not to care. As most likely no one would pay for reviews of libraries.

  • dzonga 8 days ago |
    do stars even count ?

    my determination to use a project is 1. the readme 2. the issues

    • tonymet 8 days ago |
      recent commits and community engagement are better indicators
      • Retric 8 days ago |
        I’d generally rather use a library that hasn’t needed to update in 5 years than something in active development.
        • insane_dreamer 8 days ago |
          the challenge is differentiating between "haven't needed to update it in 5 years because it's still solid and compatible with its ecosystem" vs "haven't updated it in 5 years because of any other reason"
        • sixothree 8 days ago |
          Sounds good in theory. But almost every time I use one of these projects, it's in "abandoned" status and definitely needs attention. There is 1 project I can point to that I use that does not actually need any maintenance and another that honestly makes me _extremely_ nervous to use because of lack of maintenance.
        • mardifoufs 8 days ago |
          Can you give me some examples? Because in my experience even very stable, very "foundational" libs and frameworks that I know about and use almost never go 5 years without any commit/change. There's always either a small bug fix, or some update to a build script, updated documentation, or something.

          The only repos where that's not the case are usually very niche, and in that case it becomes very hard to judge if the library is just very stable or a minefield of bugs and undesired behavior that no one else reported because no one else is using it.

          • Retric 7 days ago |
            Being able to tell if something doesn’t need to be updated is a separate question. There’s many signs such as lacking external dependencies, but the point is such software exists not that there is an easy heuristic for finding it.

            I think the best examples are the reference implementation of some algorithm. There’s generally room for improvement, but keeping it simple is the point.

        • tonymet 7 days ago |
          openssl?
    • renewiltord 8 days ago |
      It used to be a heuristic VCs would use to gauge popularity. You know how it is: if you have the revenue, talk about the revenue; if you only have the users, talk about the users; if you only have the stars, talk about the stars hehe
    • muglug 8 days ago |
      Sometimes projects get stars just because people like the personality or company behind the project.

      Case in point: https://github.com/facebook/hhvm/. It got 15,000 stars in its first few years, but roughly 10 non-Facebook companies actually ever used it in production, and today only one non-Facebook company uses it (I work at that company).

      • consumer451 8 days ago |
        Sometimes, they are surreal stars for surrealist languages that zero people actually use:

        https://github.com/TodePond/DreamBerd - 11.7k stars

        • humbugtheman 19 hours ago |
          rude. i use dreamberd
          • consumer451 19 hours ago |
            I didn’t mean to be, I genuinely didn’t know that people did that.

            May I ask how/in what context?

      • michaelmior 7 days ago |
        That doesn't mean that the stars are just because people like the company. People may find the technology interesting even if they have no intent of using it.
    • wildzzz 8 days ago |
      A star is just a bookmark for me. It says nothing beyond "I may want to look at this again". When comparing two similar projects, I may look at the star counts to see which one is more popular but it's probably the last metric I'd consider.
      • NewJazz 6 days ago |
        Exactly. I stopped starring repos a long time ago and now just bookmark them instead.
    • glaucon 7 days ago |
      I agree, I am also interested in : date of most recent substantive commit; date of first commit; number of contributors.

      I don't have hard and fast rules for how I interpret those values, it depends on my intentions but I find them useful things to consider.

      Going back to the readme, nothing turns me off faster than a skeletal readme, it doesn't have to be "War and Peace" but it needs to be more than just how to install it.

      • martinsnow 7 days ago |
        But first and foremost you must find the interesting repositories. I frequently use the search where i filter by stars, last push and language.
    • NewJazz 6 days ago |
      I like to look at open PRs as well as closed issues. Sometimes closed issues reveal a lot about the attitude and expertise of the maintainers.
  • attentionmech 8 days ago |
    I think number of clones is a much better metric (it's like proof of work, it needs compute to clone a repo). For me starring a repo is like bookmarking it, nothing else. They might as well just mark it as "Bookmarked" instead of "Starred".
    • nejsjsjsbsb 8 days ago |
      A better metric until it becomes a target. Once it is a target, getting a billion clones is trivial.

      Github should just stop showing star counts. Who cares about them.

      • attentionmech 8 days ago |
        I think it's like a "upvote" thing which shows whether historically users have found the repo interesting. Even if you hide stars, there needs to be a way for the collective hivemind of github users to help each other with what repos are high quality or not right?
        • rpdillon 7 days ago |
          You don't need to crowdsource everything. I've never used stars as a good metric because it's literally zero effort. Anybody who happens by can just star it, so all you can really conclude from star count is that this is interesting to this number of people.

          Two metrics that I think correlate extremely highly with quality: The number of commits in the repository and the date of the most recent commit. I've used a metric based on those two inputs for the past 15 years to evaluate repos and I am not disappointed. Depending on the nature of the project, I weigh the two attributes differently. Some projects are arguably, 'done', and so the date of the most recent commit is not very important in that case.
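
          The comment does not give a formula, so the following is just one way such a two-input metric might look. The log scaling, the half-life decay, and the `recency_weight` and `half_life_days` parameters are all assumptions made for illustration.

```python
import math
from datetime import date


def repo_score(
    commit_count: int,
    last_commit: date,
    today: date,
    recency_weight: float = 0.5,  # hypothetical knob: 0 ignores recency entirely
    half_life_days: float = 365.0,  # hypothetical: freshness halves every year
) -> float:
    """Combine commit volume and recency into one number (higher is better)."""
    volume = math.log10(1 + commit_count)  # diminishing returns on raw commits
    age_days = max(0, (today - last_commit).days)
    freshness = 0.5 ** (age_days / half_life_days)  # 1.0 today, 0.5 after one half-life
    return volume * ((1 - recency_weight) + recency_weight * freshness)


today = date(2024, 1, 1)
active = repo_score(999, date(2024, 1, 1), today)
stale = repo_score(999, date(2023, 1, 1), today)
done = repo_score(999, date(2023, 1, 1), today, recency_weight=0.0)
```

          Setting `recency_weight=0.0` models a project that is arguably done: the date of the last commit stops affecting the score, so `done` equals `active` here while `stale` is lower.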

          • michaelmior 7 days ago |
            I think "interesting to this number of people" is not a meaningless metric, but I would agree on the two other metrics you cite.
            • ryandrake 7 days ago |
              There is a big difference between “highest quality” and “most popular.” Online services constantly confuse the two because it’s easier to measure popularity.
              • michaelmior 7 days ago |
                I don't disagree. But I think there's at least some positive correlation. The highest quality is unlikely to be the least popular and vice versa.
        • LtWorf 7 days ago |
          Except that most people don't bother starring stuff, so the few who do are drowned by noise of fake stars.
      • ghxst 7 days ago |
        I sort by most amount of stars quite frequently when I am learning a new language and want to know what the most popular package is for something. What do you think would be a better metric for a use case like that?
        • arccy 7 days ago |
          number of actual imports in code
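
          A rough sketch of what counting imports could look like for Python sources, using the standard-library ast module. The function name and the choice to count each package once per file are assumptions for illustration:

```python
import ast
from collections import Counter


def count_imports(sources: list[str]) -> Counter:
    """Count how many source files import each top-level package."""
    counts: Counter = Counter()
    for source in sources:
        packages = set()
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                # "import os.path" counts toward "os".
                packages.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                # Skip relative imports, which point inside the same project.
                packages.add(node.module.split(".")[0])
        # A set is used so a package counts once per file.
        counts.update(packages)
    return counts
```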
          • flippyhead 7 days ago |
            CodeRank(tm)!
            • nejsjsjsbsb 7 days ago |
              This might work but biases against languages whose package managers are not used in the rank, as well as code that is used a lot but not referenced via code directly, e.g. drop-in DLLs.
          • james_marks 7 days ago |
            Goodhart's law - this would just cause imports in junk repos
        • michaelmior 7 days ago |
          I think it's a decent metric. I agree with the other comment that actual imports is probably a better metric, but that's not always as trivial to find.

          That said, the package repositories for many popular languages list stats of either declared dependencies or package downloads, which helps.

          • LtWorf 7 days ago |
            rdeps are completely broken in github. I wrote a library that I have used in other projects of my own and it was always at 0 users.

            Anyway if stuff is used by proprietary stuff it will also sit at 0.

            I now moved to codeberg where there is less spam, although it does have stars

            • michaelmior 7 days ago |
              I wasn't suggesting using GitHub. In fact, I don't think I was aware that GitHub even listed reverse dependencies. Where is that in the UI?
              • LtWorf 7 days ago |
                • michaelmior 7 days ago |
                  Ah yes. I forgot I actually use this feature in some of my projects. GitHub isn't always able to update this automatically but you can programmatically report dependencies.
                  • LtWorf 7 days ago |
                    Yeah, no way to automatically detect "import xxx"
                    • michaelmior 6 days ago |
                      Right. I believe they rely on things like requirements.txt, Cargo.toml, etc. And of course not all languages are supported. So if you use an unsupported language or package manager, you're out of luck.
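
                      To illustrate what reading declared dependencies from a manifest involves, here is a naive sketch for simple requirements.txt files. It is not how GitHub does it; the function name and parsing rules are assumptions, and it ignores URLs, environment markers, and included files:

```python
import re


def declared_dependencies(requirements_text: str) -> list[str]:
    """Extract package names from simple requirements.txt content."""
    names = []
    for line in requirements_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # Skip blank lines and option lines such as "-r other.txt" or "-e .".
        if not line or line.startswith("-"):
            continue
        # The name is the leading run of letters, digits, dots, dashes, underscores.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0).lower())
    return names
```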
                      • LtWorf 5 days ago |
                        It doesn't even work with normal requirements.txt by the way. At least not all the time.
        • burnte 7 days ago |
          Don't count any of my stars then, I thought it was a bookmark feature. Every repo I've starred is only starred to find later, not an endorsement from me.
          • ghxst 7 days ago |
            That's partly what I assume people star repos for, it really doesn't defeat my use case. It's a metric that indicates interest and popularity more so than approval or endorsements.
      • LtWorf 7 days ago |
        VC apparently.
      • knowitnone 7 days ago |
        I care. How do I know which is the better tool between 10 different tools?
        • nradov 7 days ago |
          You missed the point. Number of stars doesn't indicate anything about whether a particular tool is better (and never did). If you're using stars for that purpose then you're doing it wrong. Find another solution.
        • nejsjsjsbsb 7 days ago |
          Keep your requirements in mind, e.g. license, community, how it is funded, and bugs. Try a few out and check the ergonomics, and talk to others who have used it.
    • pan69 8 days ago |
      A similar thing happens on npmjs.com where it shows downloads for packages, which is often used as a metric of quality. However, every time a build pipeline runs and it pulls the package, that's a download.
      • attentionmech 8 days ago |
        Maybe with these rules:
        - Per user account we only count one clone
        - We don't count anonymous clones

        But I agree, this isn't without issues either
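
        A small sketch of those two rules applied to a log of clone events. The event shape (user, repo) with None for anonymous users is an assumption for illustration:

```python
from typing import Iterable, Optional


def count_clones(events: Iterable[tuple[Optional[str], str]]) -> dict[str, int]:
    """Count clones per repo: one per user account, anonymous clones ignored."""
    seen: set[tuple[str, str]] = set()
    counts: dict[str, int] = {}
    for user, repo in events:
        if user is None:
            # Rule 2: anonymous clones are not counted.
            continue
        if (user, repo) in seen:
            # Rule 1: each account counts once per repo.
            continue
        seen.add((user, repo))
        counts[repo] = counts.get(repo, 0) + 1
    return counts
```

        Note that this only deduplicates by account name, so it does nothing against someone who registers many accounts.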

      • michaelmior 7 days ago |
        I don't think it's a useless metric and it's one I use myself, but it can also be gamed pretty easily. So the more people making decisions based on downloads, the higher the likelihood of bots generating downloads just to juice the stats.
        • ozim 7 days ago |
          Well it is useful as first level filter just like GH stars.

          If it has no downloads/stars you don’t care. If it has a big amount, let’s take time to check it out.

          The fun part starts when the checking out is limited to some minimum and the library goes to prod because it solves something, where people might not even know if that library is any good at all.

      • LtWorf 7 days ago |
        And if your users know about "a cache" you won't get downloads. So it's more beneficial if your users are the kind of noobs who redownload all the crap every single time rather than having fast CI
    • Lerc 8 days ago |
      The weird thing is I forked Free Pascal to add an architecture of a VM I had written. It wasn't really useful to anyone else, but every now and again it earns a star from a random passer-by.
      • attentionmech 8 days ago |
        Even I am curious now. Can you share the fork with me? I want to see what you added there and how it's added.
        • Lerc 7 days ago |
          I just had a look, I didn't even push the changes back to GitHub after I made a local clone.

          The changes were very minor. My VM was an 8-bit AVR. I just needed to add a profile for an imaginary microcontroller with no peripherals, 64k RAM and 64k words ROM.

          So what was on GitHub is an unmodified fork, 16 years behind upstream, and has acquired 20 stars. 8 in the last year.

    • GZGavinZhao 7 days ago |
      *sad noises from NixOS/nixpkgs, llvm/llvm-project, and all other repos with an absurd commit log/branches that takes ages to do a full clone

      (just a joke that immediately came to mind, not intended to undermine OP's idea)

      • attentionmech 7 days ago |
        defaulting to a shallow clone (git clone --depth 1) in the CLI could be one option here.
    • simoncion 7 days ago |
      > (it's like proof of work, it needs compute to clone a repo)

      It's github's compute, so why do I (the person who's cloning the repo) care about the compute? I don't pay for it!

      • david_allison 7 days ago |
        I suspect GP is referring to counting the occurrences of `git clone` [on a fork?], rather than counting forks via the GitHub UI
        • simoncion 7 days ago |
          Oh. Is "number of times someone has cloned this repo" data you can query Github for that they don't expose on their GUI? I was entirely unaware that that was popularity-contest information that Github provided.
    • TZubiri 7 days ago |
      That is absolutely the wrong takeaway. The correct takeaway is that supply chain attacks and spam are real threats, and that these metrics can be gamed by malicious actors.

      The work in cloning a repo is negligible, and the requirement of work is not a security design guarantee in GitHub. The actual cost of liking projects is network: malicious actors need to create fake accounts and waste IP addresses and IP blocks in the process. Whether you are cloning or liking is just the last mile.

      To me the takeaway is not to trust a project based on its GitHub metrics, and by extension not to trust projects just because they are linked and liked on Hacker News, for example. And to be wary of how I introduce dependencies into my projects.

      Not just because of strictly malicious dependencies, but also because of trash dependencies that don't add value.

      • james_marks 7 days ago |
        > because of trash dependencies that don't add value

        And at best, will still need maintenance in the future. One of the top lessons I preach to juniors.

        • galangalalgol 7 days ago |
          I like the idea of granular permissions for libraries. When you include a dependency you whitelist the permissions it gets. Package managers could automate this if the language supports it. But making it about permissions instead of metrics makes it not arbitrary. This library gets no filesystem access, that one gets no network access. This one runs build time system commands... Austral is the only language I know of that supports such a thing. While it might be possible to bolt it on to Rust, I think it would take so much rework as to make it infeasible.
      • yieldcrv 7 days ago |
        if a supply chain attack is susceptible to that, it's purely the fault of the crowd that relies on those metrics
      • ATechGuy 7 days ago |
        For all speculation around supply chain attacks with fake Github stars, the article says:

        "our study does not find any evidence of fake stars being used for social engineering attacks"

    • robinsonb5 7 days ago |
      The weird thing is I've seen enough forks that have never seen any development that I'm pretty sure some people are using those as bookmarks rather than stars!
      • neom 7 days ago |
        I'm not a SWE but I use github still, I thought stars ARE bookmarks, what are stars then???? They're not for bookmarking????
        • diego_sandoval 7 days ago |
          I would think most people use it for bookmarking, but it seems like another portion of users use it as a "like" button.
          • notpushkin 7 days ago |
            It is kinda both. It also reposts the project for your GitHub followers.
        • attentionmech 7 days ago |
          They are currency of reputation and status. If you have enough stars, you get invited to private parties with elites. (I am just joking, they are bookmarks who got famous)
        • LtWorf 7 days ago |
          Nobody knows but since we are at the point where you can get VC money if you have enough, there is an incentive to get them.
      • Terr_ 7 days ago |
        AFAIK the "fork" option also helps guard against the original project getting deleted or somehow moved.
      • datadrivenangel 7 days ago |
        And forks on github have some bad ergonomics! Weird places where the upstream project still has control/influence over your fork. A full clone is better if you actually want control over the code fork.
    • kube-system 7 days ago |
      I would imagine those figures would mostly indicate which projects are most likely to be used in scripts or CI pipelines.
    • burnte 7 days ago |
      > They might as well just mark it as "Bookmarked" instead of "Starred".

      This is how I always interpreted the star feature and have used it as a bookmarking feature. I didn't know it was more akin to a like button!

      • dbaupp 7 days ago |
        People use stars differently.

        https://arxiv.org/pdf/1811.07643 is some investigatory research describing, among other things, 4 clusters of reasons for starring: to show appreciation, bookmarking, due to usage, due to third-party recommendation.

    • Suppafly 7 days ago |
      >For me starring a repo is liking bookmarking it, nothing else.

      Literally all I ever use the stars for, I don't know what they are 'supposed' to be used for if not that.

    • WA 7 days ago |
      For me, it’s "bookmarking obscure stuff". Why would I bookmark, say, React? I can find this easily. I only star stuff that has few stars and isn’t as easy to find later.
  • lprd 8 days ago |
    Do we need that type of metric anyways? Surely there are better ways to measure a repo's activity...
    • topspin 7 days ago |
      It seems like a conceptually simple problem to grade a repo given the vast number of metrics available, especially considering the advanced code analysis tools available today. I want a top-level analysis of some sort, based on: usage by other software (if applicable), activity, issue frequency and resolution, derivatives (forks, etc.), number of participants, code maturity, code testing, release frequency, license structure and many other parameters.

      There is an opportunity here for a third party to do this well.

      • ozim 7 days ago |
        Great idea but I don’t know anyone who will pay for that.

        Ones that care enough already have their internal tools and processes for security and checking/reviewing libraries.

        Ones that don’t care, well, won’t spend money on it.

        So any 3rd party would have to do it all with their own resources and without getting paid.

  • ocean_moist 8 days ago |
    The GitHub social media features are so weird. I get around 10 follow requests per week from random people who follow >2k people. Something off is happening there.
    • mattbruv 7 days ago |
      I have the same thing happen to me often. Sometimes I get a notification on my GitHub homepage that someone followed me a day or so ago, and when I click to view their profile it seems that they have already unfollowed me. For example, This guy did it, and he has 6K+ followers and is only following ~200: https://github.com/NobleMajo. It seems weird that he would follow me to unfollow me right away. I have a feeling that these accounts do this intentionally to harvest followers by prompting Github to show a ton of different people that he is following them in order to have them follow back in exchange. I think most people will follow someone back who follows them without really thinking about it. In my case I investigated who it was who followed me and realized he isn't actually following me and is probably harvesting followers. Why would someone waste time out of their life to do this? Who knows. Probably want to feel special or stand out from other people without doing anything to earn it.
  • medv 8 days ago |
    This means 4.5M fake accounts. GitHub does a good job of detecting bots, but room for improvements still exists.
    • elashri 8 days ago |
      That's not what the paper said. The numbers are much lower because not all stars are by unique accounts.

      > In total, StarScout identified 4.53 million fake stars across 22,915 repositories (before the postprocessing step designed to remove spurious ones), created by 1.32 million accounts; among these stars, 0.95 million are identified with the low activity signature and 3.58 million are identified by the clustering signature. In the postprocessing step, StarScout further identified 15,835 repositories with fake star campaigns (corresponding to 3.1 million fake stars coming from 278k accounts).
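
      A quick back-of-the-envelope check of the quoted figures. The numbers are the rounded values from the quote above; the per-account averages are my own derived arithmetic, not figures stated in the paper:

```python
# Rounded figures taken from the quoted passage.
total_fake_stars = 4_530_000
low_activity_stars = 950_000
clustering_stars = 3_580_000
accounts = 1_320_000

post_stars = 3_100_000
post_accounts = 278_000

# The two signatures add up to the headline total.
assert low_activity_stars + clustering_stars == total_fake_stars

# Average fake stars per account, before and after postprocessing.
stars_per_account = total_fake_stars / accounts          # about 3.4
post_stars_per_account = post_stars / post_accounts      # about 11.2

print(f"before postprocessing: {stars_per_account:.1f} stars per account")
print(f"after postprocessing:  {post_stars_per_account:.1f} stars per account")
```

      So 4.5M fake stars is far from 4.5M fake accounts: on average each flagged account gave several fake stars.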

  • ashvardanian 8 days ago |
    Not surprising at all, honestly. The incentive to farm stars is massive. According to the article, 10K stars can cost just $1K, whereas achieving those numbers organically often takes years of work, millions in R&D, and countless deployments. When this seemingly trivial metric becomes a key factor in unlocking capital from VCs, it’s no wonder people resort to shortcuts. In a way, the real surprise is that not everyone is buying stars.
    • deznu 7 days ago |
      What’s there to gain though?
  • halamadrid 8 days ago |
    Another interesting way - and I personally think it's fraudulent. This is how it goes - run hackathons or sponsor events in universities. There are a ton of colleges that are constantly seeking support to run events.

    Some companies take advantage of this by asking for stars in return for sponsorship. I have seen proposals that say for a $2000 sponsorship - 2000 stars guaranteed. The way it works is if a participant registers in the event they also have to show proof that they starred a specific repo that belongs to the company.

    • SOLAR_FIELDS 7 days ago |
      Ah, so is the quid pro quo here that if I want to run a hackathon in my city I can get Acme Corp to toss their brand name on it and fund it, but only if I get my participants to also star the repo? That does sound fraudulent. Do they put these agreements in writing?
      • kortilla 7 days ago |
        It’s sleazy but not fraudulent.
    • verst 7 days ago |
      At many hackathons I attended (as a mentor, company sponsor, or judge) approximately a quarter of participants had never used GitHub previously. Such participants often thought of GitHub stars as bookmarks. Many hackathon sponsors provide Hackathon samples or references on GitHub - or guidelines for winning the specific sponsor's prize. As expected, those repos were "bookmarked" by the students. In this case this is a misunderstanding of the purpose of GitHub stars by these new users, but certainly not a fraudulent action.
      • mintplant 7 days ago |
        I mean, that's not a misunderstanding, that is what they're for. Conferring clout is a secondary effect.

        https://docs.github.com/en/get-started/exploring-projects-on...

      • krick 7 days ago |
        Huh? These are bookmarks. I never actually use stars for any other purpose.
      • HEYGRANDMA 7 days ago |
        I've been on github for 10+ years and that's what I use stars for.
      • diath 7 days ago |
        What? That's exactly what I use stars for - to bookmark repositories so I can go through things I've found interesting at a later date, or to keep being updated on their releases.
      • cortesoft 7 days ago |
        Been using GitHub since literally the year it started, and stars have always been bookmarks.
        • Asraelite 7 days ago |
          It reminds me of how subscribing on YouTube and other platforms turned from "inform me when new videos come out" into "I like this channel".
      • lenkite 7 days ago |
        > In this case this is a misunderstanding of the purpose of GitHub stars by these new users

        Were we supposed to use stars some other way ?

        • jasonjmcghee 7 days ago |
          No. They are bookmarks.
      • verst 7 days ago |
        Let me expand on what I meant: Some folks treat GitHub stars as a repo quality or credibility endorsement.

        I also think stars should just be bookmarks - but some companies are obsessed with the star count (stargazers) as a sign of importance of their repo. Since that is obviously what sparked this research we are discussing to begin with.. just meant to point out that a lot of folks new to GitHub have no idea anyone could possibly care about the repo star count! :)

        • saurik 7 days ago |
          I do not understand how these two ideas are somehow different (and I even say that as someone who thinks caring about star counts is harmful to both sides): if there were a way to track/surveil how many users had bookmarked your website in their browser, it seems like it would be a sane and even rational (if also annoying and harmful and short-termist) thing to use that as a signal on both sides, to both optimize for as a website as well as as a potential user to judge the popular interest in a website... it certainly is a shitty heuristic, but even I will admit that, usually, a website that is bookmarked by a billion users is likely more "important" than one that is bookmarked by only a few users?

          But like, any indirect signal, by its very nature, both can be gamed and will be gamed once people care about it, and so you need to be extremely careful ever using such signals on an ongoing basis, certainly once too many people know about them; and, frankly, if you make such a website, you should be doing everything in your power to prevent such signals from existing in the first place (such as by not tracking what users bookmark in the first place; or, if you somehow must, then keeping that information to yourself and not exposing it to users... this is hard to have fortitude on, as it also requires not trying to use it yourself as a signal, as people will figure that out and optimize for it even if they can't see it).

    • wslh 7 days ago |
      That issue you are exposing is everywhere so if fraudulent is the word we can see that a lot of social media and Internet, in general, is also fraudulent. It is call-to-action and conversion everywhere.
    • est 7 days ago |
      > I personally think its fraudulent

      It's fraudulent only because GitHub uses "stars" to rank popular repos.

      I think a repo should be ranked by code quality and update activity rather than "stars"

      • demaga 7 days ago |
        And which method should be used to determine code quality?
        • PeeMcGee 7 days ago |
          An easy one would be to promote projects that look like actual useful projects that contain code. Most of the ultra-starred things I see are awesome lists or otherwise non-useful markdown listicles and blogs.
        • est 7 days ago |
          If that's a problem, then what method should be used to rank repos?

          If "stars" were the only metric we can find, then people should stop complaining about "fake" stars.

          • demaga 7 days ago |
            I'm with the other commenters on this issue. Stars are good enough as a marker, since you can use other markers (like # of contributors and recent commits) to make your own mind regarding quality of a repo.
            • bn-l 7 days ago |
              Also tests, coverage and tests passing!
            • est 7 days ago |
              > since you can use other markers

              Yesh, surely I can, but how about https://github.com/trending ?

              It still uses "stars" as the sole marker, no? I don't think it's "good enough"?

        • saagarjha 7 days ago |
          Definitely not stars lmao. Why would you want to use a measure of publicity as a measure of quality?
          • hulitu 5 days ago |
            > Definitely not stars lmao. Why would you want to use a measure of publicity as a measure of quality?

            It is Microsoft.

        • michaelt 7 days ago |
          It's simple.

          Code quality is (Number of stars + number of forks * median number of commits per fork + 10 * number of closed issues + 100 * number of open issues + 3 * number of dependent packages + 0.01 * number of installs in the last 30 days) * number of commits in the last 100 days * number of core contributors active in the last 7 days / percentage of lines of code with no test coverage / (1+the number of open CVEs in dependencies) * the vibe factor
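
          Taken literally, the formula transcribes to something like this. The parameter names are my own, and the vibe factor is left as an input since nobody can compute it:

```python
def code_quality(stars, forks, median_commits_per_fork, closed_issues,
                 open_issues, dependent_packages, installs_30d,
                 commits_100d, core_contributors_7d,
                 pct_untested_lines, open_cves_in_deps, vibe_factor):
    """Literal transcription of the formula above, evaluated left to right."""
    popularity = (stars
                  + forks * median_commits_per_fork
                  + 10 * closed_issues
                  + 100 * open_issues
                  + 3 * dependent_packages
                  + 0.01 * installs_30d)
    return (popularity * commits_100d * core_contributors_7d
            / pct_untested_lines
            / (1 + open_cves_in_deps)
            * vibe_factor)
```

          Writing it out shows the quirks: an open issue is worth ten closed ones, a week without an active core contributor zeroes the score, and a fully tested codebase divides by zero.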

  • simoncion 7 days ago |
    IMO, Github stars and number of "forks" are just as good a metric as "number of daily downloads" of a library or Docker image or similar.

    After noticing how many, many companies run many, many builds through their CI systems and (for a variety of reasons) end up re-downloading everything those builds require, regardless of whether or not it has changed since the last time they ran the build, I've come to the firm conclusion that these metrics are just plain bad if one uses them as a basis to make any significant decision.

  • semiinfinitely 7 days ago |
    sometimes I star my own github repos does that count as fake
    • bdangubic 7 days ago |
      it doesn't if you really like it :)
      • semiinfinitely 7 days ago |
        i do i really like my code thats why i typed it
        • saagarjha 7 days ago |
          I wish I loved all my children equally
  • openrisk 7 days ago |
    If you were wondering about fake forks, spoiler alert

    > accounts in Cluster 1 come from merchants that only sell stars, while accounts in Cluster 2 come from merchants selling stars and forks simultaneously

  • johncoltrane 7 days ago |
    "$PROJECT was bookmarked 666 times with GitHub's internal bookmarking mechanism" doesn't say much about a project.

    The fact that so many people give those bookmarks so much value that an entire ecosystem was built around "fake" bookmarks is mind boggling.

  • gitgud 7 days ago |
    GitHub Stars are just one of many signals that describe the quality of a project.

    If a project has 10,000 stars but 1 commit and a terrible README… then the star count doesn’t have as much weight…

    You can’t trust any signal in isolation (like star count), but looking at many signals together is quite reliable

    • tacker2000 7 days ago |
      This. Stars can be an initial indicator, but I also always skim the issues and the last commits, general activity, contributors, before deciding to use one lib over another, for example.
  • ivanjermakov 7 days ago |
    In my experience, open/closed issues ratio is much more important than star count.

    Star count is how interested people are in this project, does not signify much about its quality. I would not star the repo of a tool I use everyday, but would star some obscure project to try it out later.

    • jenadine 7 days ago |
      > In my experience, open/closed issues ratio is much more important than star count.

      What would you say is a good ratio?

      • ivanjermakov 7 days ago |
        Depends on the project, but generally 1/4 is a threshold for "there are people working on issues".

        On smaller solo-dev projects it's often single digit open to hundreds of closed - very good.
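
        As a sketch, the check could look like this. Reading the 1/4 figure as "open divided by closed should be at most 0.25" is my interpretation, and the function names are made up for illustration:

```python
def open_closed_ratio(open_issues: int, closed_issues: int) -> float:
    """Ratio of open to closed issues; infinite if nothing was ever closed."""
    if closed_issues == 0:
        return float("inf") if open_issues else 0.0
    return open_issues / closed_issues


def issues_look_worked_on(open_issues: int, closed_issues: int,
                          threshold: float = 0.25) -> bool:
    """True when the open/closed ratio is at or below the threshold."""
    return open_closed_ratio(open_issues, closed_issues) <= threshold
```

        A solo-dev project with 5 open and 300 closed issues has a ratio of about 0.017, far below the threshold.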

  • Der_Einzige 7 days ago |
    I wrote a whole benchmark which is not only resistant to this, but would automatically detect most fake stars!

    https://github.com/Hellisotherpeople/Bright

    • yencabulator 7 days ago |
      So, make 100 bot accounts that each have 100 dummy repos and have each star all of the others' repos, then start pumping those votes out of the network?

      Each bot has 10k incoming stars from users who each have 10k incoming stars.

      Remember, Google PageRank was bootstrapped with the N most visited websites.
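
      The numbers for that ring can be checked with a tiny simulation. The function name is made up, and it assumes bots do not star their own repos, which is why the result is 9,900 rather than a round 10k:

```python
from collections import Counter


def simulate_star_ring(bots: int = 100, repos_per_bot: int = 100) -> Counter:
    """Return incoming stars per bot when every bot stars all other bots' repos."""
    incoming: Counter = Counter()
    for starrer in range(bots):
        for owner in range(bots):
            if owner != starrer:
                # The starrer stars every repo owned by this other bot.
                incoming[owner] += repos_per_bot
    return incoming


incoming = simulate_star_ring()
print(incoming[0])             # incoming stars for one bot
print(sum(incoming.values()))  # total stars created inside the ring
```

      With the defaults, each bot ends up with 9,900 incoming stars and the ring contains 990,000 stars in total, all created by 100 accounts.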

      • Der_Einzige 7 days ago |
        This project already had no one caring about it 5 years ago. If GitHub adopted my metric, then what you described would fool it.

        I wrote that as a weekend project one day after seeing the "fake github stars" thing 5 years ago.

  • casenmgreen 7 days ago |
    I'm rather surprised it's only 4.5m.
  • knowitnone 7 days ago |
    github should be the ones doing this research, or at least sponsoring this but you know they won't because they don't care.
  • cvoss 7 days ago |
    I've never once starred a GH project, nor ever looked at or considered the star count of a project in order to evaluate it. What do people actually use this metric for? (This is not a rhetorical question. If you use it, I'd like to know why/how.)
    • wslh 7 days ago |
      How do you bookmark projects at GitHub? Basically, a library with fewer than 10 stars is not the same as a popular one with tens of thousands. It is true that you can game the system, but stats are a signal in a domain that you don't know well.
    • e-clinton 7 days ago |
      Starring projects is how I “bookmark” them so I can easily find them later. A high number of stars indicates many devs are interested in the project… if I had to choose a UI library with 10 stars versus one with 30,000 stars, it should be obvious why you’d pick the latter. More interest likely means a bigger community behind the product.
    • MattRix 7 days ago |
      I star repos as a way to bookmark them. I use the star counts as a rough metric for how popular a repo is. Something with a high star count will tend to have a larger community, more support, documentation, etc. Not always the case, but it can be a useful heuristic.
    • alexk101 7 days ago |
      I like to keep track of interesting projects. It's basically just a bookmark for me.
    • shepherdjerred 7 days ago |
      When I'm looking for a library I'll consider them in order of # of stars.

      e.g. I might search for "TypeScript ORM", open five popular repos, and then compare the two libraries with the most stars to make my decision.

      I also use stars to bookmark projects, and I'll usually unstar a project once I've tried it out/no longer need it.

    • bawolff 7 days ago |
      When repos are mirrored to github (so the commits aren't directly made by your account), your contributions to those mirrored repos aren't listed on your profile unless you star the repo.

      So i star repos that are developed outside github so my contributions show up on my profile.

  • kittikitti 7 days ago |
    I think a repository should have at least 1 star, meaning at least the author liked it.
  • a1o 7 days ago |
    Not directly related, but I have seen Ads in Stack overflow for GitHub repositories - not any notable repository, clearly just random people trying to get some stars for reasons.
  • cute_boi 7 days ago |
    The best solution is for GitHub to buy stars themselves and ban all those users after 1-2 months, so they wouldn't even have a clue.
    • imtringued 7 days ago |
      Cobra effect
      • yencabulator 7 days ago |
        What, the farmers will release their stars into the wild?
  • remram 7 days ago |
    What's a "real" star on GitHub? Most real users will click the star button after reading the README, it does not indicate that they like it or even tried it. Having a "star on GitHub" button in a judicious place on your project's website will already over-inflate your number of stars regardless of your project's quality. You don't have to be a professional developer to get a GitHub account either. The age of a project (or more precisely, how long it's been on GitHub) also has a massive impact on the stars.

    What was ever so good about the "N strangers clicked the icon" metric? Even when those users were human with a higher probability?

    > posing a security risk to all GitHub users

    Please tell me no one takes the "N strangers clicked the icon in the past" as a signal of "today's releases won't harm my computer".

  • nsoolo 7 days ago |
    It is very easy to distinguish false stars, the important thing of a repository is not the stars but the activity of the contributors, I have seen many times repositories with malicious payloads with 1.7K stars or more.
  • anshumankmr 7 days ago |
    Ha, guilty as charged, I used the github profile that I had from my previous two companies to give stars to some of my favourite repos on my personal account.
  • bawolff 7 days ago |
    What is even the point of stars in the first place.

    This is github not facebook. Who cares how many stars your open source repo has as long as it is useful to someone.

  • albert_e 7 days ago |
    split the current star function into a "star rating" and a "save bookmark" function

    let both users and repos opt out of / hide star ratings if they don't care about this popularity contest

    let bookmarks be always private to users -- so users can peacefully organize their bookmarks for what they are

    • victorbjorklund 7 days ago |
      True. I usually only star projects that are "cool. I wanna try this sometime but not cool enough to try now"
  • RecycledEle 6 days ago |
    When people can make money off Internet credit, the con-artists take over and get more fake credit than the real creators earn. Then the con-artists drive away the real creators.

    This has happened on many other sites, and now I suspect it's happening on GitHub.