Something I'd love to learn to do better is search the Wayback Machine (WBM). I use it only a couple of times per month, but when I need it, it's the only tool that can do the job, which makes it very valuable. Trouble is, I don't really know how to search it unless I have a record of the exact URL I want, which isn't always possible.
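For what it's worth, the Wayback Machine exposes a CDX index endpoint that can list captures by URL prefix, which helps when you only remember the site or part of the path. Here is a minimal sketch that only builds the query URL and does not fetch anything. The helper name `cdx_query_url` and the example site are made up for illustration, and the parameters are as I understand the CDX server docs, so double-check them before relying on this:

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(url_prefix, year_from=None, year_to=None, limit=50):
    """Build a CDX query listing captures whose URL starts with url_prefix."""
    params = {
        "url": url_prefix,
        "matchType": "prefix",   # match everything under this prefix
        "output": "json",        # JSON rows instead of plain text
        "collapse": "urlkey",    # one row per distinct URL
        "limit": str(limit),
    }
    # Optional date bounds; the endpoint accepts partial timestamps such as a year.
    if year_from is not None:
        params["from"] = str(year_from)
    if year_to is not None:
        params["to"] = str(year_to)
    return CDX_ENDPOINT + "?" + urlencode(params)

# Hypothetical example: everything captured under example.com/blog/ in 2010-2012.
print(cdx_query_url("example.com/blog/", year_from=2010, year_to=2012))
```

Paste the printed URL into a browser and you get a list of captured URLs to pick from, instead of needing the exact address up front.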
Just as a quick link for others, it's possible to set up a one-time or monthly donation to the Internet Archive at this link: https://archive.org/donate/
Internet Archive is generally invaluable for finding older media. I personally use it to find movies from the 40’s and earlier. There’s always a dozen different uploads of the movie regardless of whether it’s still under copyright protection. I guess the rights owners don’t care that much, but it could potentially be ammo to take the whole Archive down.
Did I just have a stroke? Those two searches are exactly the same. I try to be understanding, but I am constantly tripping over big companies' glaring software and/or documentation issues, and it gets old.
Has anyone else experienced DuckDuckGo ignoring the exclusion operator? For example, searching `kiwi -fruit`, with no space between the hyphen and second word, used to bring up results that did not include the word "fruit". This no longer seems to be the case.
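One stopgap, if you are pulling results into a script anyway, is to re-apply the exclusion locally. This is only a sketch under my own assumptions: the function names are made up, the results are assumed to be plain title or snippet strings, and it matches whole words only (so `-fruit` will not drop "fruits"):

```python
import re

def split_query(query):
    """Split 'kiwi -fruit' into included and excluded terms."""
    include, exclude = [], []
    for token in query.split():
        if token.startswith("-") and len(token) > 1:
            exclude.append(token[1:].lower())
        else:
            include.append(token.lower())
    return include, exclude

def drop_excluded(results, query):
    """Keep only results whose text contains none of the excluded words."""
    _, exclude = split_query(query)
    kept = []
    for text in results:
        words = set(re.findall(r"\w+", text.lower()))
        if not words.intersection(exclude):
            kept.append(text)
    return kept

# Made-up result snippets, just to show the filtering.
results = [
    "Kiwi: a flightless bird native to New Zealand",
    "Kiwi fruit nutrition facts",
    "Kiwi.com cheap flights",
]
print(drop_excluded(results, "kiwi -fruit"))
```

It doesn't bring back results the engine never returned, of course. It only hides the ones that ignore the operator.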
I feel like Google became absolutely unusable as a search engine. You search for "keyword1 keyword2", and at least half of the results are either "Missing: keyword1" or "Missing: keyword2".
Until a year ago, using DuckDuckGo instead of Google felt like only an equal or inferior option. But at some recent point I found that DDG has slightly improved and Google has gotten much, much worse. With DDG and Bing Chat, Google's future is looking very Internet Explorer 6.
DDG is following Google's path, with removal of the exclusion operator. Nothing like searching for "foo -quux" and finding "quux" in ALL of my search results, on DDG and Google alike.
They moved over to really dumb word vector stuff. They mention it in the TPUv4 paper, and it is pretty surprising, but it probably monetizes better somehow.
Even if you are going to purchase a book on a subject, this finds so much stuff that is not in Google because of copyright delisting, and it is sometimes useful for knowing which books to consider.
Yandex results, especially the non-English ones, can be good if you are willing to use a translator.
If I had my way, I would introduce internet search techniques as a core module in all university programs.
Too often I see evidence of lazy searching in my students' work. It is as if they expect Google to be able to read their minds or even foresee their future intentions. Lack of variety in search terms is a key shortcoming, as is a lack of exploration of terms that are tangentially related to the search topic.
An internet search should be playful and exploratory. Above all, it should be understood that the internet is beyond simple linear indexing.
So I actually did have this as a class, or at least part of one.
I think there is / was an official standard or name, I can't remember it though. It never worked with Google really, but it worked on the search engines for our university and other academic sites.
Gwern is the person I'd become if I could properly monetize my ADHD and rabbit hole seeking behavior. I absolutely love his website and everything he publishes, and I am always delighted to look over his new stuff. I wish there were more people like him to read.
His website is genuinely the best-designed website I have ever visited: fast, easy to navigate, and with very little blank space, which makes it information dense. I also love the inline link opening, and I will definitely be implementing that when I make my own website in the future.
I don't really understand your comment. The size of the margins doesn't really have anything to do with the size of the text. (If we increased the base font size, as we've done a few times already, the margins would probably be kept the same.) If you increase the text size with the standard browser zoom commands like Ctrl-+/-, the margins are not fixed or anything like that, and you will get bigger text & lines.
Yes, with Ctrl+ I get bigger text, but the margins stay the same and make the text flow further down the page so I have to scroll too often to read in a fluid manner. Compare to HN margins, which are smaller and when you enlarge beyond a certain point, the margins get even smaller.
That's odd, because that's not how browser zoom works. I checked in both Firefox & Chromium, and if you zoom in, the margins do get smaller & smaller and eventually go away completely when you reach mobile-mode where the text is edge to edge. (Are you sure you didn't do something like enable "Zoom text only" in your browser settings? That it works 'properly' on HN doesn't mean much because HN uses weird table layout for everything.)
"Zoom text only" was enabled. I checked a freshly installed Firefox and that is not the default so I am unsure how it was enabled, but it was. Thank you so much for helping me solve this!
Glad to hear that. I was worried that you would say you were using Safari. We keep having problems with Safari doing the wrong thing and we don't have convenient access to Safari instances.
Are you perchance able to provide some kind of RSS feed for his website? I'm having a hard time finding his newest stuff. You can subscribe to his newsletters, but he stopped doing those two years ago.
There is the firehose from his Patreon, though that might not be exactly what you are looking for. Personally, I just check 'newest' on his frontpage every now and then.
Many good scarce tips in there (once past Google Scholar, unless you're monopoly-oriented). 'Dealing with paywalls' for example.
He did mention IA searches, but I didn't spot a mention of their https://scholar.archive.org/ with "over 25 million research articles and other scholarly documents preserved in the Internet Archive."
For book metadata, I find https://openlibrary.org/ search has a lot of 'MARC' type data. Esp useful for books with many editions.
Some services are getting more restrictive, e.g. WorldCat recently got harder to use, rejecting many searches. But if you can find a book's OCLC ###, then https://www.worldcat.org/oclc/### works every time. Useful for finding a local hardcopy if you've got your location turned on. (With an ISBN, Wikipedia will also do this for you at: https://en.wikipedia.org/wiki/Special:BookSources/ )
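If you do this a lot, the two URL patterns above are easy to script. A rough sketch follows. The helper names are my own, the OCLC number in the example is a placeholder rather than a real book, and the checksum is the standard ISBN-13 one (alternating weights of 1 and 3), added so that typos get caught before you build a link:

```python
def isbn13_is_valid(isbn):
    """Check the ISBN-13 checksum: weights alternate 1, 3 and the sum must end in 0."""
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0

def worldcat_url(oclc_number):
    """Direct WorldCat link for a known OCLC number."""
    return f"https://www.worldcat.org/oclc/{int(oclc_number)}"

def booksources_url(isbn):
    """Wikipedia's BookSources page for an ISBN (hyphens and spaces stripped)."""
    cleaned = "".join(c for c in isbn if c.isdigit() or c in "xX")
    return f"https://en.wikipedia.org/wiki/Special:BookSources/{cleaned}"

# "12345" is a placeholder OCLC number, not a specific book.
print(worldcat_url("12345"))
isbn = "978-0-306-40615-7"
if isbn13_is_valid(isbn):
    print(booksources_url(isbn))
```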
There are a couple of comments on how search engine x dropped feature y on date z (i time periods after competitor j did).
Can anyone recommend a site that tracks those changes?
I find myself getting annoyed by the SERPs of a given vendor. I might even adapt to the changes, but I ultimately jump ship to the next best thing if I can’t figure out a very low-effort way of influencing the results in my favor. When flags suddenly stop working, all I notice is a significant degradation of result quality.
Mostly people don't think it's possible to scam your way into a job with that title.
But that is a great riposte for "prompt engineer"; my coworker and I joke about being prompt engineers, but really we are technical people putting new tools to use.
I do! RIP. But even while fravia was alive, his search tips were largely outdated, and that left a gap which needed to be filled. No one else had written up a search guide for the modern Internet which touched on even a tenth of what I routinely used in my own writing to find PDFs or repair dead links, so...
I'm not sure how LLMs are going to be immune to SEO spam and advertising. As if human nature would magically transform and people would stop buying stupid stuff.
LLMs are already being used to make the web even less useful, by shitting out vast amounts of meaningless and even outright wrong text for SEO purposes. And don't forget the systems are being trained on the web in the first place… an LLM that is able to use web search already thinks listicles are useful information and not just a way to place affiliate links.
When the hype is over and the VC money has dried up, companies will find ways to make the LLM interfaces and outputs an 'advertising friendly' affair.
I think the idea is that, if we (some of us) can figure out when something is SEO spam or, more generally, low quality, an LLM should be able to as well, only faster, and it should be able to say more quantitatively why.
And who, pray tell, would have an interest in giving us that? The internet overlords are advertising companies. Would you pay for such an LLM? Those companies have gotten so powerful because everyone wants stuff for “free”.
> Those companies have gotten so powerful because everyone wants stuff for “free”.
On the contrary, my experience of the Internet since pre-web is those companies have gotten so powerful because now everyone 'creating content' wants to get paid.
Put another way, the gwerns of the world may start by posting ad-free content just because they feel like sharing. "Those" companies can't profit from individual pamphleteering.
But usually, as soon as a gwern has a measurable readership, they imagine money, and start supporting the ad industry. Surprisingly quickly, chasing more ad revenue becomes the point of their content instead of just sharing whatever they had to say.
Today, it seems most people start by wanting to get paid, and come up with a type of content to create.
Curiously, and as a result of the type of targeted searching gwern describes here, you'll generally find the best content is that which is still published free (self hosted or as open papers) thanks to motivations of the content's creator.
> Goodbye useless SEO spam, advertisement hell. We hated your guts.
That sounds great, but are you sure that there won't be legions of LLMs generating SEO spam and advertisement hell in warfare with the LLMs that do search?
Just as an aside: Sci-fi seems to have one superintelligence (one Skynet) fighting humankind. If superintelligence emerges, I think it's likely that we'll have thousands, millions, or billions of superintelligences fighting each other, each with its own agenda.
The searching of data doesn't need to be controlled by insanely powerful, insanely rich entities. Ex: Wikipedia, Craigslist, the polio vaccine. If this LLM arms race leads to an organic solution where anyone can grow their own, it might not have as many "This 1 Trick Your Grocer Doesn't Want You To Know" ads in it.
And for superintelligence, let's clarify that this "Artificial Intelligence" moniker has almost nothing to do with actual intelligence, so we're probably a few years off from that.
As a complement, I would recommend the Russian Yandex and vk.com (the Russian Facebook). It is like a snapshot of the internet from 2018, before everything was hit by massive censorship. It has some very niche communities that are no longer found elsewhere on the internet. Take Ancient Egyptian history, for example: the western net is filled with alien theories, while Russians are doing experimental archaeology. Also repair, electronics...
Also, Telegram is like a second internet that nobody talks about!
Telegram? How so? I'm still only using it for chatting. How would you even find anything on Telegram? The simple global search? I wouldn't like to join random shady groups found through there.
Not sure either, but trying to follow the war in Ukraine has taught me that both Ukrainians and Russians get a surprising amount of their news from various Telegram channels.
The search sucks, but you can find quite a lot of interesting stuff if you have a jumping off point by checking out the groups they forward from. I’ve discovered a few great groups like that.
Most of the stuff that comes up on search is spam or garbage though, agreed. I’ve found the majority of the news groups I follow through Twitter.
I found it useful to screenrecord all my computer activity. It is OCR searchable. This way if I vaguely remember some bits, I can reconstruct my sources. It also preserves original text, in case sources get edited.
I tried something similar, I had a text file called mylog.txt and whenever I hit alt + m it would append whatever was in the clipboard to the text file along with the date & time. But it turned out I was terrible at predicting what I would want to find again and it turned out to be not very useful to me.
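For anyone who wants to try the same thing, the logging half is only a few lines. This is a sketch of the idea rather than what I actually ran: the hotkey and clipboard access are left out because they depend on your OS, and `append_to_log` and `search_log` are names I made up here.

```python
import tempfile
from datetime import datetime
from pathlib import Path

def append_to_log(text, log_path, now=None):
    """Append one timestamped entry to the log file and return the line written."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    entry = f"[{stamp}] {text.strip()}\n"
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(entry)
    return entry

def search_log(term, log_path):
    """Return every log line containing term, case-insensitively."""
    term = term.lower()
    with Path(log_path).open(encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if term in line.lower()]

# Demo in a throwaway directory so it doesn't touch a real mylog.txt.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "mylog.txt"
    append_to_log("https://archive.org/donate/", log, now=datetime(2023, 5, 1, 9, 30))
    append_to_log("kiwi -fruit stopped working on DDG", log)
    hits = search_log("archive.org", log)

print(hits)
```

The hard part, as I said, isn't the code. It's guessing in advance what you will want to find later.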
I've thought about this or something similar (e.g. ArchiveBox).
It seems like 95-99% of the content you archive will be junk that you never look at again, but that 1-4% really might be worth it. Especially if it gets taken down.
What do you do for storage, though? I haven't invested in a NAS or personal storage over 1TB.
I think I understand now why some of these archive utilities offer you the option to upload to the Internet Archive in addition to storing locally. You can build up a local cache and then start removing the oldest page snapshots (but keep the link trail) once you start to run out of space.
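Roughly what I have in mind, as a toy sketch. This is not how ArchiveBox or any real tool does it. The `Snapshot` record and `prune_oldest` are hypothetical, and the "link trail" is simply the record kept with its content emptied out:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Snapshot:
    url: str
    saved_at: str              # ISO date, so plain string sorting is chronological
    content: Optional[bytes]   # None once the local copy has been pruned

def stored_bytes(snapshots: List[Snapshot]) -> int:
    """Total size of the content still held locally."""
    return sum(len(s.content) for s in snapshots if s.content is not None)

def prune_oldest(snapshots: List[Snapshot], max_bytes: int) -> int:
    """Drop content from the oldest snapshots until the cache fits max_bytes.

    Records are never removed, so the URL and date trail survives.
    Returns the number of bytes still stored.
    """
    total = stored_bytes(snapshots)
    for snap in sorted(snapshots, key=lambda s: s.saved_at):
        if total <= max_bytes:
            break
        if snap.content is not None:
            total -= len(snap.content)
            snap.content = None
    return total

# Made-up cache: 900 bytes stored against a 500-byte budget.
cache = [
    Snapshot("https://example.com/a", "2021-03-01", b"a" * 400),
    Snapshot("https://example.com/b", "2022-07-15", b"b" * 300),
    Snapshot("https://example.com/c", "2023-01-20", b"c" * 200),
]
remaining = prune_oldest(cache, max_bytes=500)
print(remaining, [(s.url, s.content is not None) for s in cache])
```

In practice you would upload a snapshot somewhere durable before dropping the local copy, so the kept URL still leads somewhere.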
The other day I searched for the Tom & Jerry full episodes on the web to no avail (streaming platforms and video platforms like YouTube only have rubbish cuts).
The Internet Archive has every episode starting from the first one in 1940, in an easily accessible player without any ads or recommendations: https://archive.org/details/tom-and-jerry-all-114-episodes