Internet search tips (gwern.net)
226 points by herbertl 61 days ago | 83 comments



+1 for his tip of "Search The Internet Archive".

The other day I searched for the Tom & Jerry full episodes on the web to no avail (streaming platforms and video platforms like youtube only have rubbish cuts).

The internet archive has every episode starting from the first one in 1940, in an easily accessible player without any ads or recommendations: https://archive.org/details/tom-and-jerry-all-114-episodes


TIL: IA is more than Wayback Machine.

Something I'd love to learn to do better is search WBM. I use WBM only a couple of times per month, but when I need it it's the only tool that can do the job and is therefore very valuable. Trouble is, I don't really know how to search it unless I have a record of the exact URL I want, which isn't always possible.
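
One thing that may help when you don't have the exact URL: the Wayback Machine has a CDX index endpoint that can list captures under a URL prefix. The sketch below only builds the query URL and does not fetch anything. The helper name `cdx_prefix_query` and the `example.com` prefix are placeholders of mine, and the parameters shown are a small subset picked for illustration.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_prefix_query(prefix: str, limit: int = 50) -> str:
    """Build a CDX query URL listing captures whose URL starts with `prefix`."""
    params = {
        "url": prefix,
        "matchType": "prefix",  # match everything under the prefix
        "output": "json",       # ask for JSON rows instead of plain text
        "collapse": "urlkey",   # one row per distinct URL
        "limit": str(limit),
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


# Placeholder prefix; substitute the site you half-remember.
print(cdx_prefix_query("example.com/blog/"))
```

Opening the printed URL in a browser should give a list of captured URLs under that prefix, which you can then skim for the page you were after.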


Your comment just reminded me of how great the Internet Archive is, so I guess it's time to donate another 100 bucks to IA again.


Just as a quick link for others, it's possible to set up a single or monthly donation to the internet archive at this link: https://archive.org/donate/


I give from time to time to Wikipedia, when I get an appeal from Jimmy Wales. Even more important than IA to me, which is saying A LOT.


it’s within the realm of a few apple stores to serve the internet archive?


Internet Archive is generally invaluable for finding older media. I personally use it to find movies from the 40’s and earlier. There’s always a dozen different uploads of the movie regardless of whether it’s still under copyright protection. I guess the rights owners don’t care that much, but it could potentially be ammo to take the whole Archive down.


my problem is some operators just don't work any more, either on purpose or because of crappy quality control. for example, you used to be able to do:

    allintitle:Neil Diamond If You Go Away

on YouTube, and get exactly what you would think, results with all those words in the title. but now, you don't:

https://youtube.com/results?search_query=allintitle:Neil+Dia...

now, I get crap like this:

    Neil Diamond & Shirley Bassey - Play Me - "high quality"
    Barbra Streisand - If You Go Away (Ne Me Quitte Pas)

how is that what I searched for? also, what is this:

> A search for [site:nytimes.com] will work, but [site:nytimes.com] won't.

https://support.google.com/websearch/answer/2466433

did I just have a stroke? those two searches are exactly the same. I try to be understanding, but I am constantly tripping over big companies' glaring software and/or documentation issues, it gets old.
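
For reference, the expected behavior can be written down as a local check: every query word must appear in the title. This is a minimal sketch, assuming simple case-insensitive whole-word matching (real engines tokenize differently). `all_in_title` is a hypothetical helper, and the titles are the two results quoted above.

```python
import re


def all_in_title(query: str, title: str) -> bool:
    """Return True only if every word of `query` occurs as a word in `title`."""
    title_words = set(re.findall(r"\w+", title.lower()))
    query_words = re.findall(r"\w+", query.lower())
    return all(word in title_words for word in query_words)


query = "Neil Diamond If You Go Away"
results = [
    'Neil Diamond & Shirley Bassey - Play Me - "high quality"',
    "Barbra Streisand - If You Go Away (Ne Me Quitte Pas)",
]
for title in results:
    # Neither title contains all six query words, so both lines print False.
    print(all_in_title(query, title), title)
```

The first title lacks "If You Go Away" and the second lacks "Neil Diamond", so under this reading neither should have been returned.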


> https://support.google.com/websearch/answer/2466433

> those two searches are exactly the same

Yes, that appears to be a recently-introduced typo -- the archived version from April does have a space: https://web.archive.org/web/20230412181331/https://support.g...

I submitted a feedback comment, hopefully they'll fix the typo.

(For future reference, here is a snapshot of the current version without a space: https://web.archive.org/web/20230722085337/https://support.g...)


Has anyone else experienced DuckDuckGo ignoring the exclusion operator? For example, searching `kiwi -fruit`, with no space between the hyphen and second word, used to bring up results that did not include the word "fruit". This no longer seems to be the case.


> duckduckgo ... exclusion operator

Removed a few weeks ago.

Somebody posted in these pages the github diff showing the removal of the options.


Someone needs to build a meta search engine…


At least one existed in the dot com era. Forget the name.


metacrawler, IIRC.


searx/searxng?


> did I just have a stroke? those two searches are exactly the same.

I just had the exact same thought while reading the page.


The fact Google's own documentation is this poorly kept is astonishing.

Makes me a firm believer companies should have their documentation on GitHub (or similar) so anyone can make a PR to tidy these things up.


The one not working has a space after the colon. It's even in the section talking about this :)


no, it doesn't.


It looks like it's damaged in the English translation/version.

In a few other languages I checked it has :)


Agree. This even applies to simple Booleans, which are sometimes completely ignored.


I sometimes wonder about PMs who sign off on decisions like these, and the seeming lack of protest the developers put up.


I feel like Google became absolutely unusable as a search engine. You search for "keyword1 keyword2", and at least half of the results are either "Missing: keyword1" or "Missing: keyword2".


Until a year ago using DuckDuckGo instead of google felt like only an equal or inferior option. But at some recent point I have found that DDG has slightly improved and Google has gotten much, much worse. With DDG and Bing Chat, google's future is looking very Internet Explorer 6.


DDG is following Google's path, with removal of the exclusion operator. Nothing like searching for "foo -quux" and finding "quux" in ALL of my search results, on DDG and Google alike.
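
When the engine ignores the minus sign, one workaround is to filter the results yourself afterwards. A rough sketch, assuming you already have result titles or snippets as plain strings (how you obtain them is out of scope here). `split_query` and `matches` are made-up names, the sample snippets are invented, and matching is a plain case-insensitive word test.

```python
import re


def split_query(query: str) -> tuple[list[str], list[str]]:
    """Split 'foo -quux' into required terms ['foo'] and excluded terms ['quux']."""
    required, excluded = [], []
    for token in query.lower().split():
        if token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            required.append(token)
    return required, excluded


def matches(query: str, text: str) -> bool:
    """True if `text` has every required term and none of the excluded ones."""
    words = set(re.findall(r"\w+", text.lower()))
    required, excluded = split_query(query)
    has_required = all(term in words for term in required)
    has_excluded = any(term in words for term in excluded)
    return has_required and not has_excluded


# Invented sample snippets, purely for illustration.
snippets = ["foo bar basics", "foo versus quux", "all about foo"]
print([s for s in snippets if matches("foo -quux", s)])
```

This only trims what the engine already returned; it cannot bring back results the engine dropped in favor of the unwanted ones.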


Make your tool so idiots can use it and only idiots will use it


Google is especially bad at this. Often what gets removed is the most important keyword, I assume because that yields more results.


They moved over to really dumb word vector stuff, they mention it in the TPUv4 paper and it is pretty surprising but probably monetized better somehow.


Really good list. Here are some others I've discovered:

Some libgen clone sites like z-lib have fulltext search on books with support for exact matches: https://zlibrary-asia.se/fulltext/?q=%22frank+sinatra%22&typ...

Even if you are going to purchase a book on a subject, this finds so much stuff that is not in Google because of copyright delisting, and it is sometimes useful in knowing which books to consider.

Yandex results, especially the non-English ones, can be good if you are willing to use a translator.


If I had my way, I would introduce internet search techniques as a core module in all university programs.

Too often I see in my students' work evidence of lazy searching. It is as if they expect Google to be able to read their minds or even foresee their future intentions. Lack of variety of search terms is a key shortcoming. Also, lack of exploration of terms which are tangentially related to their search topic.

An internet search should be playful and exploratory. Above all, it should be understood that the internet is beyond simple linear indexing.


But search techniques change over time. You can see that in the responses in this thread. What used to work doesn't anymore.


> It is as if they expect Google to be able to read their minds or even foresee their future intentions.

That's exactly what Google and other interests want --- gullible, uncritical sheeple that can be exploited to extract $$$ and worse.


So I actually did have this as a class, or at least part of one.

I think there is / was an official standard or name, I can't remember it though. It never worked with Google really, but it worked on the search engines for our university and other academic sites.


Another useful Google search trick not mentioned in the article is numeric range queries.

You can use two numbers separated by two dots to represent all numbers in the range.

For example, a search for

  taki 100..200

gives me results for taki 183.

This is useful when you can't remember an exact year or number.
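
If you script your searches, the range is just part of the query string. A small sketch: `range_query` and `google_url` are illustrative helpers of mine, and the only assumption is the ordinary `https://www.google.com/search?q=...` URL shape.

```python
from urllib.parse import urlencode


def range_query(term: str, low: int, high: int) -> str:
    """Build a query like 'taki 100..200' (two dots, no spaces around them)."""
    if low > high:
        low, high = high, low  # tolerate reversed bounds
    return f"{term} {low}..{high}"


def google_url(query: str) -> str:
    """Wrap a query string in a regular Google search URL."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = range_query("taki", 100, 200)
print(query)              # taki 100..200
print(google_url(query))  # https://www.google.com/search?q=taki+100..200
```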


Related:

Internet Search Tips - https://news.ycombinator.com/item?id=26847596 - April 2021 (77 comments)

Internet Search Tips - https://news.ycombinator.com/item?id=18666574 - Dec 2018 (28 comments)


Gwern is the person I'd become if I could properly monetize my ADHD and rabbit hole seeking behavior. I absolutely love his website and everything he publishes, and I am always delighted to look over his new stuff. I wish there were more people like him to read.


His website is the best designed website I have genuinely ever visited, so fast, easy to navigate and very little blank space which makes it information dense. I also love the inline link opening, I definitely will be implementing that when I make my own website in the future.


>His website is the best designed website

The forced R and L margins suck for people with oldster eyes who have to increase the text size.

Kids these days!


I don't really understand your comment. The size of the margins doesn't really have anything to do with the size of the text. (If we increased the base font size, as we've done a few times already, the margins would probably be kept the same.) If you increase the text size with the standard browser zoom commands like Ctrl-+/-, the margins are not fixed or anything like that, and you will get bigger text & lines.


Yes, with Ctrl+ I get bigger text, but the margins stay the same and make the text flow further down the page so I have to scroll too often to read in a fluid manner. Compare to HN margins, which are smaller and when you enlarge beyond a certain point, the margins get even smaller.


That's odd, because that's not how browser zoom works. I checked in both Firefox & Chromium, and if you zoom in, the margins do get smaller & smaller and eventually go away completely when you reach mobile-mode where the text is edge to edge. (Are you sure you didn't do something like enable "Zoom text only" in your browser settings? That it works 'properly' on HN doesn't mean much because HN uses weird table layout for everything.)


"Zoom text only" was enabled. I checked a freshly installed Firefox and that is not the default so I am unsure how it was enabled, but it was. Thank you so much for helping me solve this!


Glad to hear that. I was worried that you would say you were using Safari. We keep having problems with Safari doing the wrong thing and we don't have convenient access to Safari instances.


I have not seen the page on PC, all I can say is that it works very well on mobile.


Are you perchance able to provide some kind of RSS feed to his website? I'm having a hard time finding his newest stuff. You can subscribe to his newsletters, but he stopped doing those two years ago.


There is the firehose from his patreon though that might not be exactly what you are looking for. Personally, I just check 'newest' on his frontpage every now and then.

https://www.patreon.com/gwern


Many good scarce tips in there (once past Google Scholar, unless you're monopoly-oriented). 'Dealing with paywalls' for example.

He did mention IA searches, but I didn't spot a mention of their https://scholar.archive.org/ with "over 25 million research articles and other scholarly documents preserved in the Internet Archive."

For book metadata, I find https://openlibrary.org/ search has a lot of 'MARC' type data. Esp useful for books with many editions.

Some services are getting more restrictive. For example, WorldCat recently got harder to use, rejecting many searches. But if you can find a book's OCLC ###, then https://www.worldcat.org/oclc/### works every time. Useful for finding local hardcopy if you've got your location turned on. (With an ISBN, Wikipedia will also do this for you at: https://en.wikipedia.org/wiki/Special:BookSources/ )
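
Those two URL patterns are easy to wrap in helpers if you look books up often. A minimal sketch that only builds the links: the function names are mine, the OCLC number in the example is a placeholder rather than a real record, and the ISBN is just a sample value.

```python
def worldcat_url(oclc_number: str) -> str:
    """Direct WorldCat record URL for an OCLC number, following the pattern above."""
    # Keep digits only, so inputs like "ocm12345678" also work.
    digits = "".join(ch for ch in oclc_number if ch.isdigit())
    if not digits:
        raise ValueError("OCLC number must contain digits")
    return f"https://www.worldcat.org/oclc/{digits}"


def booksources_url(isbn: str) -> str:
    """Wikipedia Special:BookSources URL for an ISBN, hyphens and spaces removed."""
    cleaned = "".join(ch for ch in isbn if ch.isdigit() or ch in "xX").upper()
    if len(cleaned) not in (10, 13):
        raise ValueError("expected a 10- or 13-character ISBN")
    return f"https://en.wikipedia.org/wiki/Special:BookSources/{cleaned}"


print(worldcat_url("12345678"))                # placeholder OCLC number
print(booksources_url("978-0-306-40615-7"))    # sample ISBN
```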


I also recommend https://www.semanticscholar.org/ because you can easily retrieve some of the PDFs.


There are a couple of comments on how search engine x dropped feature y on date z (i time periods after competitor j did).

Can anyone recommend a site that tracks those changes?

I find myself getting annoyed by the SERPs of a given vendor. I might even adapt to the changes, but I end up jumping ship to the next best thing if I can’t figure out a very low-effort way of influencing the results in my favor, because the flags I relied on suddenly stop working. All I notice is a significant degradation of result quality.


Wondering why we don't have web search engineers


We do have 'researchers' though


Mostly people don't think it's possible to scam your way into a job with that title.

But that is a great riposte for "prompt engineer"; my coworker and I joke about being prompt engineers, but really we are technical people putting new tools to use.


Years ago the topic of Internet search was one of the core philosophies of Search Lores:

https://web.archive.org/web/20191201105759/http://search.lor...

Gwern's site somehow reminds me of Fravia's site.



I do! RIP. But even while fravia was alive, his search tips were largely outdated, and that left a gap which needed to be filled. No one else had written up a search guide for the modern Internet which touched on even a tenth of what I routinely used in my own writing to find PDFs or repair dead links, so...


a mirror without archive.org: http://biostatisticien.eu/www.searchlores.org/indexo.htm

Thanks very much for this; I'm very much interested in searching techniques, but wasn't aware of this site.


Learned a lot of new tips today.

I have found a few obscure publications at the Defense Technical Information Center (https://discover.dtic.mil).


Great article, but I can't help but feel it just highlights how broken Google search is from a UX perspective.


Soon, there will be an LLM that can utilize indexes and large scale distributed databases, and that will replace every search engine on earth.

Google should be quaking, and pissing themselves. It won't be their bardum, and they won't have the financial power to purchase said LLM.

Poof. Goodbye useless SEO spam, advertisement hell. We hated your guts.


I'm not sure how LLMs are going to be immune to SEO spam and advertising. As if human nature would magically transform and people would stop buying stupid stuff.

LLMs are already being used to make the web even less useful, by shitting out vast amounts of meaningless and even outright wrong text for SEO purposes. And don't forget the systems are being trained on the web in the first place… an LLM that is able to utilize a web search already thinks listicles are useful information and not just a way to place affiliate links.

When the hype is over and the VC money has dried up, companies will find ways to make the LLM interfaces and outputs an 'advertising friendly' affair.


I think the idea is that, if we (some of us) can figure out when something is SEO spam or, rather, generally low quality, an LLM should be able to as well, only faster and with a more quantitative account of why.


And who, pray tell, would have an interest in giving us that? The internet overlords are advertising companies. Would you pay for such an LLM? Those companies have gotten so powerful because everyone wants stuff for “free”.


> Those companies have gotten so powerful because everyone wants stuff for “free”.

On the contrary, my experience of the Internet since pre-web is those companies have gotten so powerful because now everyone 'creating content' wants to get paid.

Put another way, the gwerns of the world may start by posting ad-free content just because they feel like sharing. "Those" companies can't profit from individual pamphleteering.

But usually, as soon as a gwern has a measurable readership, they imagine money, and start supporting the ad industry. Surprisingly quickly, chasing more ad revenue becomes the point of their content instead of just sharing whatever they had to say.

Today, it seems most people start by wanting to get paid, and come up with a type of content to create.

Curiously, and as a result of the type of targeted searching gwern describes here, you'll generally find the best content is that which is still published free (self hosted or as open papers) thanks to motivations of the content's creator.


> Goodbye useless SEO spam

You think LLMs won't also be used to generate more SEO spam? I think this will be an arms race, or a game of LLM-cat and LLM-mouse.


> Goodbye useless SEO spam, advertisement hell. We hated your guts.

That sounds great, but are you sure that there won't be legions of LLMs generating SEO spam and advertisement hell in warfare with the LLMs that do search?

Just as an aside: Sci-fi seems to have one superintelligence (one Skynet) fighting humankind. If superintelligence emerges, I think it's likely that we'll have thousands, millions, or billions of superintelligences fighting each other, each with its own agenda.


The searching of data doesn't need to be controlled by insanely powerful insanely rich entities. Ex: Wikipedia, Craigslist, polio vaccine. If this LLM arms race leads to an organic solution that anyone can grow their own, it might not have as many "This 1 Trick Your Grocer Doesn't Want You To Know" ads in it.

And for superintelligence, let's clarify that this "Artificial Intelligence" moniker has almost nothing to do with actual intelligence, so we're probably a few years off from that.


As a complement, I would recommend the Russian Yandex and vk.com (the Russian Facebook). They are like a snapshot of the internet from 2018, before everything was hit by massive censorship. They have some very niche communities that are no longer on the rest of the internet. Take ancient Egyptian history: the Western net is filled with alien theories, while Russians are doing experimental archaeology. Also repair, electronics...

Also Telegram is like second internet, nobody talks about!


Yandex image search is actually useful, especially after Google destroyed their image search.


Telegram? How so? I'm still only using it for chatting. How would you even find anything on Telegram? The simple global search? I wouldn't like to join random shady groups found through there.


Not sure either, but trying to follow the war in Ukraine has taught me both Ukrainians and Russians get a surprising amount of their news from various Telegram channels.


The search sucks, but you can find quite a lot of interesting stuff if you have a jumping off point by checking out the groups they forward from. I’ve discovered a few great groups like that.

Most of the stuff that comes up on search is spam or garbage though, agreed. I’ve found the majority of news groups I follow through twitter.


> Also Telegram is like second internet, nobody talks about!

Do you have any groups/channels that you can recommend?


For news?

https://t.me/CIG_telegram

https://t.me/BellumActaNews

Both provide decent coverage of global happenings.


The irony of someone recommending Russian platforms to evade "censorship" is astounding.


My thought as well. As a Russian I wouldn't touch them with a 10 foot pole. TG is full of CSAM and others are censored worse than anything in the US.


Yandex hits me with captchas every search and every 2-3 pages on Firefox.


How do you find good Telegram channels? For me most seem to be bot spam.


I found it useful to screenrecord all my computer activity. It is OCR searchable. This way if I vaguely remember some bits, I can reconstruct my sources. It also preserves original text, in case sources get edited.


I tried something similar, I had a text file called mylog.txt and whenever I hit alt + m it would append whatever was in the clipboard to the text file along with the date & time. But it turned out I was terrible at predicting what I would want to find again and it turned out to be not very useful to me.
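
For anyone who wants to try the same idea, the logging half is only a few lines. This is a sketch under assumptions, not the setup described above: reading the clipboard and binding alt + m are platform-specific and left out, so the text is passed in as a plain string. `append_entry` and `search_log` are names I made up.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def append_entry(log_path: Path, text: str, now: datetime) -> str:
    """Append one timestamped entry to the log file and return the line written."""
    # Collapse whitespace so each capture stays on a single searchable line.
    one_line = " ".join(text.split())
    line = f"{now.isoformat(timespec='seconds')}\t{one_line}\n"
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(line)
    return line


def search_log(log_path: Path, needle: str) -> list[str]:
    """Return log lines containing `needle`, case-insensitively."""
    needle = needle.lower()
    with log_path.open(encoding="utf-8") as fh:
        return [ln.rstrip("\n") for ln in fh if needle in ln.lower()]


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "mylog.txt"
    append_entry(log, "numeric range queries:\ntaki 100..200", datetime(2023, 7, 22, 9, 0, 0))
    print(search_log(log, "RANGE"))
```

It does not solve the harder problem mentioned above, which is knowing in advance what will be worth capturing.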


I've thought about this or something similar (e.g. ArchiveBox).

It seems like 95-99% of the content you archive will be junk that you never look at again, but that 1-4% really might be worth it. Especially if it gets taken down.

What do you do for storage, though? I haven't invested in a NAS or personal storage over 1TB.

I think I understand now why some of these archive utilities offer you the option to upload to the internet archive in addition to storing locally. You can build up a local cache and then start removing the oldest page snapshots (but keep the link trail) once you start to run out of space.
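
The "drop the oldest snapshots but keep the link trail" idea can be sketched in a few lines. This is an illustration under assumptions, not how ArchiveBox or any other tool actually does it: the `Snapshot` record, the `prune_oldest` function, and the byte budget are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    url: str
    captured: str    # sortable timestamp string, e.g. "20230722085337"
    size_bytes: int


def prune_oldest(snapshots, budget_bytes):
    """Drop the oldest snapshots until the total size fits the budget.

    Returns (kept, trail): `kept` holds the snapshots still stored locally,
    `trail` lists (url, captured) for pruned ones so the link history survives.
    """
    kept = sorted(snapshots, key=lambda s: s.captured)
    total = sum(s.size_bytes for s in kept)
    trail = []
    while kept and total > budget_bytes:
        oldest = kept.pop(0)
        total -= oldest.size_bytes
        trail.append((oldest.url, oldest.captured))
    return kept, trail


# Made-up snapshots: 1200 bytes total against a 900-byte budget.
snaps = [
    Snapshot("https://example.com/a", "20210101000000", 400),
    Snapshot("https://example.com/b", "20220101000000", 300),
    Snapshot("https://example.com/c", "20230101000000", 500),
]
kept, trail = prune_oldest(snaps, budget_bytes=900)
print([s.url for s in kept])
print(trail)
```

Here only the 2021 snapshot is pruned, and its URL and capture time remain in the trail so it can still be looked up in a remote archive later.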


I buy 4TB external HDDs and use them as archival tapes.


How often do you need to buy a new one?


What tool do you use?



