Here's the actual repo: https://github.com/hehao98/StarScout
Allow it a week to finish all iterations and expect it to read >= 40TB of data. You can use nohup to run it as a background process. After a week, you can run the following command to collect the results into MongoDB and local CSV files:
I just love the yolo nature of "well, let's check in a week if that 40TB of data processing worked".

But if you brag that your project has a million stars and your competitor has half a million, that is so illogical that I would discount your project and think it’s run by dummies.
Are there practical situations where people really need stars enough to buy them?
If your GitHub repo can in any way provide you with income (from just having something to talk about in an interview or an innocuous “buy me a coffee link”…all the way up to selling $100,000/yr enterprise support plans), you now have a strong incentive to game the system.
And if it’s allowed by the system, then it’s a prisoner’s dilemma. Because if you DON’T do it, your competition will do it and eat your lunch.
That’s why it’s so important to design high-integrity ranking systems.
I have plenty of GitHub projects bookmarked, but I never "starred" one... Why would I?
https://www.bleepingcomputer.com/news/security/over-31-milli...
I believe the only thing anyone can do is measure how the metrics are gamed, as this particular paper has done.
How might a 'connection' look on GH? Will people freely connect, or will they appraise requests more closely?
But of course this is too complex and “no one will use it” (tm). So we’d better have a screwed-up recommendation system that doesn’t work at all, ’cause that’s simpler!
There are a handful of people that I know IRL that I follow on GitHub, and a few hundred that I follow in total. Out of the handful I know IRL and follow on GitHub, only two or three are active there in any given week. Of all the other people I follow, I have very little idea who they are. Usually, when I follow people I don’t know, it’s because I came across their profile and either the profile itself or their projects made me follow them. But I star way more repos than the number of people I click follow on.
For me, the main ways of discovering new repos are:
- Frontpage of HN, and comments in posts on HN.
- Specific search results on Google when I have searched for libraries or programs that do specific things.
- Libraries on crates.io that I think might be interesting to look into in the future.
Maybe once or twice a month I happen to click on the main page of GitHub itself and see mentions of repos that have been committed to or starred or created by people I follow.
So for me I don’t think “friends of friends” is a particularly great signal for things to look at. Most of the people I follow, I don’t know much about them.
Likewise, for anyone who follows me, the fact that I follow someone else is not necessarily a strong signal for determining whether activity from that someone should be shown to my follower, or weighted as more significant, just because I happen to follow that person.
If you do want a strong signal for who to boost for my followers based on my own activity, go and look at the dependencies that I am using in my own projects. That’s a pretty good indicator that I put some amount of effort and interest into looking at something. This could be done by GitHub itself, parsing the Cargo.toml files of my projects and extracting the dependencies section and looking up which of those dependencies are hosted on GitHub.
Set any law you want, our nature will push us to circumvent it even legally.
Only if we let that minority keep manipulating the system without consequences does it become the driving market force that the rest of the population also feels they have to comply with, to go along, as has already happened in finance, academia, etc.
For varying and self-serving definitions of fair. (Almost everyone in the rich world is in an unfairly-advantaged minority.)
Actually, one of the keys is repeated contact. People who have to interact again and again will try to game the system less. Not sure how to build that into a star system, but why give up so easily? Do programmers give up when you say "this algorithm can't be made any faster"?
The other is hierarchy. You can't automate reputation scoring.
It doesn't matter if people have to interact frequently if there are no real consequences to that interaction. The punishment in those collectivist cultures involves social shunning, shaming, etc. Individualistic cultures almost pride themselves on how much they can disregard social shunning and shaming. Shameless people are celebrities and elected officials; they are admired rather than shunned and ignored. A bad actor in an Amish community is expelled and loses access to what that community offers. That would be illegal in general society unless their "bad act" was actually illegal. Discriminating against someone for being a dickhead who exploits loopholes and unregulated corner cases (without explicitly breaking the law) would be illegal in many contexts.
> Not sure how to build that into a star system but why give up so easily? Do programmers give up when you say "this algorithm can't be made any faster?"
I don't think people have given up. Online fraud detection is a massive industry as it is. Spotify plays, YouTube views, Google search, Amazon reviews, Reddit upvotes, Twitter retweets, Facebook likes/shares, etc. all fall into exactly the same bucket. There is even a significant dollar amount attached to many of those, more so than GitHub stars. All are frequently gamed/faked, and it's a constant battle between the platforms and the adversaries.
Large "Collectivist" communities have body count in the hundreds of millions.
https://en.wikipedia.org/wiki/High-trust_and_low-trust_socie...
For example, Tokyo has a lot of people and they actively dislike interacting with strangers, but if you leave your laptop unattended while peeing at a coffee shop, it's very unlikely to be stolen.
- Only accounts that have a decent amount of activity (pushing code, commenting, etc.)
- Have set up SSH
- Are older than 2 years
- Have been active consistently for at least a year
- Have 2-factor enabled
- Have a filled-out profile
etc.
And while some bot accounts may have them, I doubt many have most.
Also, you're arguing semantics, but the general idea of setting up a legitimacy test that factors in various things is very much doable: the factors can be kept private, and you can definitely find ones that are generally hard to game.
Then you have people complaining about being "shadowbanned" (because there's no recourse if you're a person and the algorithm thinks you're not active enough), or that github is being anti-privacy (by requiring phone number). It's hard to win here.
It's not a matter of "here is a list of requirements that no one knows about, and here is slight randomness/delay to obfuscate".
How much do you think it takes to pay an actual human from a poor country to come to work each day at 8am, create one github account after another, enter them in a database, and leave at 5pm?
If you want to "study" how github handles stars because there is legitimate financial incentive for you in it, for $100 a day you can pay 10 or 20 of those people to create few thousands accounts a day. Do it few times a month, and throw these accounts in an automated system that creates random repos, pushes a few commits here and there, etc. Also "introduce some slight randomness or delay to obfuscate these events". Do some A/B testing to figure how the 300k accounts under your control affect a repo star system, then advertise a "GitHub stars service" "$0.50 per guaranteed star on Github". Your average VC funded startup could get 10k stars for $5k.They probably give AWS 10 times that a month.
Once github changes their requirements, do more testing, figure out what the requirements now are, then you're back in the game. If people do it all the time to Spotify, YouTube, Google, Amazon, Reddit, and Twitter, why do you think GitHub would somehow crack that nut?
> If people do it all the time to Spotify, YouTube, Google, Amazon, Reddit, and Twitter, why do you think GitHub would somehow crack that nut?
Because the listed platforms do basically nothing, a bare minimum. They don’t even care as long as bots don’t play against their direct interest. Who cares, at a media company or a sales company, who exactly is at the top, as long as they aren’t bad enough to matter? Profits come either way. They are all the shittiest examples of this, having created and incorporated the problem, and being part of it themselves.
It’s akin to an immune system. Its goal is not to protect you from every HIV and cancer, but to avoid constant infections from stupid low-effort attacks. You don’t have to make it perfect, but it must be there. The more cryptic it is, and the less welcoming it is to being gamed through basic means, the better.
Well-connected people will get the tip off. And your PR team will have to keep batting down conspiracy theories, since if there's one thing the nutters love it's black boxes.
In GitHub organization settings you can require secure 2FA methods only, which kicks out anyone who uses SMS 2FA.
Like, imagine a decent-sized group of professionals, all specializing in a similar field, with lots of strong connections between each other and ample opportunities to share information. It would be hard for an outsider to come in and astroturf their product without immense effort (like hiring shills to attend conferences). In-person networks also obviously solve the problem of stars as reputation: reputation spreads naturally in these sorts of networks.
I think the problem comes with algorithmic scale. Maybe a solution would be to have more community building activities (maybe preferably offline).
Yeah, people would love that for sure.
> showing _regional_ stars like Apple/Google would be a start.
What does that mean? I thought regions only impact ranking, not the net number of stars (assuming we're talking about Apple/Google Maps). And as far as I know, GitHub doesn't do ranking.
> What does that mean? I thought regions only impact ranking, not the net number of stars (assuming we're talking about Apple/Google Maps). And as far as I know, GitHub doesn't do ranking.
At least on iOS, reviews and ratings are by country; I don't actually know about Google Play though. (I don't have an Android to check, since I am not poor.)
Sir, this is an HN.
The wrinkle is that measures that aren't easily quantified are more resistant. For example, showing provable use by other reputable or trusted projects, or a significant amount of resources allocated to maintenance, or ...
Really, just about anything that can't be reduced to a single number in a canonical way will prove far more useful for longer.
This of course shifts some of the burden onto potential users to assess things more critically, and forecloses direct numerical comparison. But the idea that you could just look at a number and make such comparisons was faulty from the get-go.
But people who have the chops for that probably have high-enough-paying jobs not to care, as most likely no one would pay for reviews of libraries.
My determination of whether to use a project comes down to 1. the readme, 2. the issues.
The only repos where that's not the case are usually very niche, and in that case it becomes very hard to judge if the library is just very stable or a minefield of bugs and undesired behavior that no one else reported because no one else is using it.
I think the best examples are the reference implementation of some algorithm. There’s generally room for improvement, but keeping it simple is the point.
Case in point: https://github.com/facebook/hhvm/. It got 15,000 stars in its first few years, but roughly 10 non-Facebook companies actually ever used it in production, and today only one non-Facebook company uses it (I work at that company).
https://github.com/TodePond/DreamBerd - 11.7k stars
May I ask how / in what context?
I don't have hard and fast rules for how I interpret those values; it depends on my intentions, but I find them useful things to consider.
Going back to the readme: nothing turns me off faster than a skeletal readme. It doesn't have to be "War and Peace", but it needs to be more than just how to install it.
Github should just stop showing star counts. Who cares about them.
Two metrics that I think correlate extremely highly with quality: the number of commits in the repository and the date of the most recent commit. I've used a metric based on those two inputs for the past 15 years to evaluate repos, and it has not disappointed me. Depending on the nature of the project, I weigh the two attributes differently. Some projects are arguably 'done', and so the date of the most recent commit is not very important in that case.
That said, the package repositories for many popular languages list stats of either declared dependencies or package downloads, which helps.
Anyway, if stuff is only used by proprietary stuff, it will also sit at 0.
I’ve now moved to Codeberg, where there is less spam, although it does have stars.
But I agree, it’s not like this is without any issues either.
If it has no downloads/stars, you don’t care. If it has a big number, you take the time to check it out.
The fun part starts when the checking-out is limited to some bare minimum and it goes to prod because it solves something, even though people might not even know if that library is any good at all.
The changes were very minor. My VM was an 8-bit AVR. I just needed to add a profile for an imaginary microcontroller with no peripherals, 64k of RAM, and 64k words of ROM.
So what was on GitHub is an unmodified fork, 16 years behind upstream, and has acquired 20 stars. 8 in the last year.
(just a joke that immediately came to mind, not intended to undermine OP's idea)
It's github's compute, so why do I (the person who's cloning the repo) care about the compute? I don't pay for it!
The work in cloning a repo is negligible, and the requirement of work is not a security design guarantee in GitHub. The actual cost of liking projects is the network: malicious actors need to create fake accounts, wasting IP addresses and IP blocks in the process. Whether you are cloning or liking is just the last mile.
To me the takeaway is not to trust a project based on its GitHub metrics, and by extension not to trust projects just because they are linked and liked on Hacker News, for example. And to be wary of how I introduce dependencies into my projects.
Not just because of strictly malicious dependencies, but also because of trash dependencies that don't add value.
And, at best, they will still need maintenance in the future. One of the top lessons I preach to juniors.
"our study does not find any evidence of fake stars being used for social engineering attacks"
This is how I always interpreted the star feature and have used it as a bookmarking feature. I didn't know it was more akin to a like button!
https://arxiv.org/pdf/1811.07643 is some investigatory research describing, among other things, 4 clusters of reasons for starring: to show appreciation, bookmarking, due to usage, due to third-party recommendation.
That's literally all I ever use the stars for; I don't know what they are 'supposed' to be used for if not that.
There is an opportunity here for a third party to do this well.
Ones that care enough already have their internal tools and processes for security and checking/reviewing libraries.
Ones that don’t care won’t spend money on it.
So any third party would have to do it all with their own resources, without getting paid.
> In total, StarScout identified 4.53 million fake stars across 22,915 repositories (before the postprocessing step designed to remove spurious ones), created by 1.32 million accounts; among these stars, 0.95 million are identified with the low activity signature and 3.58 million are identified by the clustering signature. In the postprocessing step, StarScout further identified 15,835 repositories with fake star campaigns (corresponding to 3.1 million fake stars coming from 278k accounts).
Some companies take advantage of this by asking for stars in return for sponsorship. I have seen proposals that say: for a $2000 sponsorship, 2000 stars guaranteed. The way it works is that if a participant registers for the event, they also have to show proof that they starred a specific repo that belongs to the company.
https://docs.github.com/en/get-started/exploring-projects-on...
Were we supposed to use stars some other way?
I also think stars should just be bookmarks, but some companies are obsessed with the star count (stargazers) as a sign of the importance of their repo. Since that is obviously what sparked the research we are discussing to begin with... just meant to point out that a lot of folks new to GitHub have no idea anyone could possibly care about the repo star count! :)
But like, any indirect signal, by its very nature, both can be gamed and will be gamed once people care about it, so you need to be extremely careful ever using such signals on an ongoing basis, certainly once too many people know about them. Frankly, if you make such a website, you should be doing everything in your power to prevent such signals from existing in the first place, such as by not tracking what users bookmark at all; or, if you somehow must, keeping that information to yourself and not exposing it to users. This is hard to have fortitude on, as it also requires not using the signal yourself, since people will figure that out and optimize for it even if they can't see it.
It's fraudulent only because GitHub uses "stars" to rank popular repos.
I think a repo should be ranked by code quality and update activity rather than "stars".
If "stars" are the only metric we can find, then people should stop complaining about "fake" stars.
Yeah, surely I can, but how about https://github.com/trending ?
It still uses "stars" as the sole marker, no? I don't think that's "good enough".
It is Microsoft.
Code quality is (Number of stars + number of forks * median number of commits per fork + 10 * number of closed issues + 100 * number of open issues + 3 * number of dependent packages + 0.01 * number of installs in the last 30 days) * number of commits in the last 100 days * number of core contributors active in the last 7 days / percentage of lines of code with no test coverage / (1+the number of open CVEs in dependencies) * the vibe factor
After noticing how many, many companies run many, many builds through their CI systems and (for a variety of reasons) end up re-downloading everything those builds require, regardless of whether or not it has changed since the last time they ran the build, I've come to the firm conclusion that these metrics are just plain bad if one uses them as a basis to make any significant decision.
> accounts in Cluster 1 come from merchants that only sell stars, while accounts in Cluster 2 come from merchants selling stars and forks simultaneously
The fact that so many people give those bookmarks so much value that an entire ecosystem was built around "fake" bookmarks is mind boggling.
If a project has 10,000 stars but 1 commit and a terrible README… then the star count doesn’t have as much weight…
You can’t trust any signal in isolation (like star count), but looking at many signals together is quite reliable
Star count shows how interested people are in a project; it does not signify much about its quality. I would not star the repo of a tool I use every day, but I would star some obscure project to try it out later.
What would you say is a good ratio?
On smaller solo-dev projects it's often single digit open to hundreds of closed - very good.
Each bot has 10k incoming stars from users who each have 10k incoming stars.
Remember, Google PageRank was bootstrapped with the N most visited websites.
I wrote that as a weekend project one day after seeing the "fake github stars" thing 5 years ago.
e.g. I might search for "TypeScript ORM", open five popular repos, and then compare the two libraries with the most stars to make my decision.
I also use stars to bookmark projects, and I'll usually unstar a project once I've tried it out/no longer need it.
So I star repos that are developed outside GitHub so my contributions show up on my profile.
What was ever so good about the "N strangers clicked the icon" metric? Even when those users were human with a higher probability?
> posing a security risk to all GitHub users
Please tell me no one takes the "N strangers clicked the icon in the past" as a signal of "today's releases won't harm my computer".
This is GitHub, not Facebook. Who cares how many stars your open source repo has, as long as it is useful to someone?
- let both users and repos opt out of / hide star ratings if they don't care about this popularity contest
- let bookmarks always be private to users, so users can peacefully organize their bookmarks for what they are
This has happened on many other sites, and now I suspect it's happening on GitHub.