Post by account_disabled on Mar 10, 2024 23:42:36 GMT -6
At the peak we had two pages in the index with a very large number of links on each. These could have been quite legitimate pages, but it was hard to tell given the language barrier. However, in terms of SEO analysis, these pages were providing very little link equity and thus not contributing much to the index. This is not exclusively a problem with the .cn TLD; it happens on a lot of spammy sites. But we did find a huge cluster of sites in the .cn TLD that were close together lexicographically, causing a hot spot in our processing cluster. We also had a DNS outage that went unnoticed for hours. DNS is the backbone of the Internet.
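To illustrate why lexicographically close domains can create a hot spot, here is a minimal sketch. It assumes the processing cluster splits work by sorted key ranges, which the post does not actually state; the shard boundaries, the reversed-host key format, and the domain names are all hypothetical.

```python
import hashlib
from bisect import bisect_right
from collections import Counter


def range_shard(key: str, boundaries: list[str]) -> int:
    """Assign a key to a shard by where it falls among sorted boundaries."""
    return bisect_right(boundaries, key)


def hash_shard(key: str, shard_count: int) -> int:
    """Assign a key to a shard by a stable hash, ignoring sort order."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Hypothetical boundaries giving four key ranges: <g, g..n, n..t, >=t.
boundaries = ["g", "n", "t"]

# Hypothetical keys: a large cluster of near-identical .cn hosts plus a few others.
keys = [f"cn.spam{i:04d}" for i in range(1000)]
keys += ["com.example", "org.example", "net.example", "uk.co.example"]

range_load = Counter(range_shard(k, boundaries) for k in keys)
hash_load = Counter(hash_shard(k, 4) for k in keys)

print("range partitioning:", dict(range_load))
print("hash partitioning: ", dict(hash_load))
```

Under range partitioning every key in the cluster lands in the same shard, while hashing spreads them out at the cost of losing sorted locality.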
It should never die. If DNS fails, the Internet more or less dies, as it becomes impossible to look up the IP address of a domain. Our crawlers unfortunately experienced a DNS outage. The crawlers continued to crawl but marked all the pages they crawled as DNS failures. Generally, when we have a DNS failure, it is because the domain has died or been taken offline. Fun fact: the average life expectancy of a domain is measured in days. This information is passed back to the schedulers, and the domain is blacklisted for a number of days, then retried. If it fails again, we remove it from the schedulers. In an outage of that length we crawl a lot of sites.
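The blacklist-then-retry rule described above can be sketched roughly as follows. This is not the actual scheduler: the class name, method names, and the ban length are hypothetical (the post's real ban length is not given), and the sketch only mirrors the stated behavior of banning on a first DNS failure and removing on a second.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Hypothetical value; the ban length used in the post is not known.
BAN_PERIOD = timedelta(days=7)


@dataclass
class DomainScheduler:
    banned_until: dict = field(default_factory=dict)  # domain -> datetime
    failures: dict = field(default_factory=dict)      # domain -> failure count
    removed: set = field(default_factory=set)         # domains dropped for good

    def report_dns_failure(self, domain: str, now: datetime) -> None:
        """First failure bans the domain; a second failure removes it."""
        if domain in self.removed:
            return
        count = self.failures.get(domain, 0) + 1
        self.failures[domain] = count
        if count >= 2:
            self.removed.add(domain)
            self.banned_until.pop(domain, None)
        else:
            self.banned_until[domain] = now + BAN_PERIOD

    def report_success(self, domain: str) -> None:
        """A successful crawl clears any failure history."""
        self.failures.pop(domain, None)
        self.banned_until.pop(domain, None)

    def can_crawl(self, domain: str, now: datetime) -> bool:
        """A domain is crawlable unless removed or still inside its ban."""
        if domain in self.removed:
            return False
        until = self.banned_until.get(domain)
        return until is None or now >= until


scheduler = DomainScheduler()
start = datetime(2024, 1, 1)
scheduler.report_dns_failure("example.com", start)
print(scheduler.can_crawl("example.com", start + timedelta(days=1)))
```

Note that a rule like this cannot tell a dead domain from a resolver outage on the crawler's side: both arrive as a DNS failure, so healthy domains get banned too.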
We ended up banning a lot of sites from being recrawled for days, and many of them were high-value domains. Because we banned a lot of high-value domains, we filled that space with lower-quality domains for those days. This isn't a huge problem for the index: we use more days of data than the ban covered, so in the end we still included the quality domains. But it did cause a skew in what we crawled, and we took a deep dive into the .cn and .pw TLDs. This caused the perfect storm. We imported a lot of new domains that we had not seen previously and whose initial DA is unknown.