Why We Shouldn’t Save Our Scraped Data in MongoDB


  • 18/01/2023

Scrapinghub initially used MongoDB to store scraped data because it was convenient. Scraped data is represented as (potentially nested) JSON-serializable records, the schema is not known in advance and can vary from job to job, and the stored data must be available for browsing, querying, and downloading. MongoDB made all of this quite simple (far simpler than the options available a few years earlier), and it worked nicely for a while.
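For illustration, here is a minimal sketch of what storing and querying such schemaless records looks like with pymongo; the connection string, database and collection names, and the example record are all hypothetical.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection
items = client["scraped"]["items"]                  # hypothetical names

# Scraped records are schemaless, possibly nested JSON documents,
# so they can be inserted as-is without declaring a schema first.
items.insert_one({
    "url": "https://example.com/product/42",
    "title": "Example product",
    "price": {"amount": 19.99, "currency": "USD"},
    "reviews": [{"rating": 5, "text": "Great"}],
})

# ...and queried on any field, including nested ones.
for doc in items.find({"price.amount": {"$lt": 25}}):
    print(doc["url"])
```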

Our platform’s back end has grown far beyond a simple repository for scraped data used in a few projects. Now that we are running into problems with the current design, we have decided to move away from MongoDB in favor of a different solution (more on that in a later blog post). Many clients are surprised to hear that we are leaving MongoDB; I hope this post clarifies why it did not work for us.

Locking

We receive a large number of short queries, most of which come from web crawls. These rarely cause problems because they execute quickly and arrive at a steady rate. We do, however, have a smaller number of longer-running queries (exporting, filtering, bulk deleting, sorting, and so on), and lock contention occurs when several of these run concurrently.

MongoDB uses a readers-writer lock per database (prior to version 2.2 the lock was per process). Under lock contention, all the short queries have to wait longer, and the long-running queries get much longer still. Short queries start timing out and being retried because they take too long, requests from our website (such as visitors viewing data) take so long that all the worker threads in our web server end up blocked waiting on MongoDB, and eventually the website and all web crawls stop working!

To work around this, we:

  • Changed the MongoDB driver to time out operations and retry certain queries with an exponential backoff (see the sketch after this list)
  • Synced data to our new backend storage and now run some bulk queries there instead
  • Split the data across several separate MongoDB databases
  • Upgraded our servers
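As a rough sketch (not our actual code), the timeout-and-retry mitigation could look something like the following with pymongo; the connection details, collection names, and retry parameters are assumptions for illustration.

```python
import random
import time

from pymongo import MongoClient
from pymongo.errors import AutoReconnect, ExecutionTimeout

client = MongoClient("mongodb://localhost:27017", socketTimeoutMS=5000)  # hypothetical
items = client["scraped"]["items"]                                       # hypothetical

def find_with_backoff(query, max_retries=5, base_delay=0.5):
    """Run a query with a server-side time limit, retrying with
    exponential backoff (plus jitter) when it times out."""
    for attempt in range(max_retries):
        try:
            # max_time_ms asks the server to abort the query if it runs
            # too long, so short queries fail fast instead of piling up
            # behind long-running ones during lock contention.
            return list(items.find(query).max_time_ms(2000))
        except (ExecutionTimeout, AutoReconnect):
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
    raise RuntimeError("query failed after %d retries" % max_retries)
```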

A Lack of Space Efficiency

MongoDB does not automatically reclaim the disk space used by deleted objects, and reclaiming it manually would require significant downtime. It does try to reuse that space for newly inserted objects, but this often leaves the data heavily fragmented, and because of locking we cannot defragment without taking downtime.
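For context, manual compaction uses MongoDB’s compact command, sketched below with pymongo (database and collection names are hypothetical). On the MMAPv1 storage engine it blocks the database while it runs, which is exactly why it amounts to downtime for us.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection
db = client["scraped"]

# compact rewrites and defragments a collection in place. On MMAPv1 it
# blocks operations on the database while running, so for a store that
# must stay online this amounts to scheduled downtime.
db.command("compact", "items")
```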

Unfortunately, MongoDB has no built-in compression, even though scraped data typically compresses very well. We do not compress records before inserting them because we need to query the data, and individual records are usually small.
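To make that trade-off concrete, here is a sketch of what client-side compression would look like (zlib over the serialized record, with hypothetical connection and collection names): the space saving comes at the cost of turning each record into an opaque blob.

```python
import json
import zlib

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection
items = client["scraped"]["items_compressed"]      # hypothetical collection

record = {"url": "https://example.com/p/42", "title": "Example", "price": 19.99}

# Compressing before insert would save disk space...
blob = zlib.compress(json.dumps(record).encode("utf-8"))
items.insert_one({"data": blob})

# ...but the stored value is now an opaque binary blob: a query such as
# {"price": {"$lt": 25}} can no longer be evaluated by the server, and for
# small records the per-record compression overhead eats much of the gain.
```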

Storing the field names with every object is also wasteful, particularly in collections where the field names never change.

There is no cap on the number of items a single crawl job can write, and jobs with a few million items are not at all uncommon. To read data from the middle of a crawl job, MongoDB must walk the index from the start up to the requested offset, which makes paging deep into a large job slow.

Users download job data through our API by paginating the results. For large jobs (say, more than a million items) this is painfully slow, and some customers try to work around it by issuing many requests in parallel, which of course only increases server load and lock contention.
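The difference between offset-based and range-based pagination is easiest to see in code. Below is a hedged pymongo sketch (collection name and page size are hypothetical): skip() forces the index walk described above, whereas remembering the last _id seen lets each page start with a cheap index seek.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection
items = client["scraped"]["items"]                 # hypothetical collection

PAGE_SIZE = 1000

def page_with_skip(page_number):
    # skip() makes the server walk the index from the beginning up to the
    # offset, so page 1,000 of a large job is far slower than page 1.
    return list(items.find().sort("_id", 1)
                     .skip(page_number * PAGE_SIZE).limit(PAGE_SIZE))

def page_after(last_id=None):
    # Range-based pagination: ask for everything after the last _id seen.
    # The index seek costs the same no matter how deep into the job we are.
    query = {"_id": {"$gt": last_id}} if last_id is not None else {}
    return list(items.find(query).sort("_id", 1).limit(PAGE_SIZE))
```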

The working set cannot be kept in memory

We have several terabytes of data per node. The frequently accessed portion is small enough that it should fit in memory, while the infrequently accessed crawl data is typically scanned sequentially.

Since MongoDB gives us little control over data placement, frequently accessed data (or data that is scanned together) may be spread over a large area, and there is no way to prevent data that is scanned only once from evicting the more frequently used data from memory. Once the frequently accessed data no longer fits in memory, MongoDB becomes I/O bound and lock contention follows.
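One way to see the problem is to compare how much space the data and its indexes occupy against the RAM actually available. A minimal sketch using the collStats command (database and collection names are hypothetical):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection
db = client["scraped"]

# collStats reports how much space the data and its indexes occupy.
# If the indexes plus the "hot" slice of data exceed available RAM,
# the working set cannot stay resident and reads become I/O bound.
stats = db.command("collStats", "items")
print("data size:    %.1f GB" % (stats["size"] / 1e9))
print("storage size: %.1f GB" % (stats["storageSize"] / 1e9))
print("index size:   %.1f GB" % (stats["totalIndexSize"] / 1e9))
```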

Frequently asked questions:

What is lacking in MongoDB?

One of MongoDB’s historical drawbacks is its lack of multi-document transactions (these were only added in version 4.0). Although fewer applications rely on transactions these days, some still need them to update multiple documents or collections atomically; if your team depends on that feature, an older MongoDB release is not the right choice.

Is MongoDB a reliable place to save user data?

A software team can gain a lot from using MongoDB. Its flexible schema makes it easy to store data in a shape that is convenient for developers to work with, and it supports the key features of modern databases, including transactions, while being designed to scale out easily.
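For completeness, multi-document transactions (available since MongoDB 4.0, and only on replica sets) look roughly like this in pymongo; the connection details and collection names are invented for the example.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection
db = client["shop"]                                # hypothetical database

with client.start_session() as session:
    with session.start_transaction():
        # Both writes commit together or not at all.
        db["orders"].insert_one({"item": "abc", "qty": 1}, session=session)
        db["stock"].update_one({"item": "abc"},
                               {"$inc": {"qty": -1}}, session=session)
```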

What are the constraints placed on documents in MongoDB?

The maximum size of a BSON document is 16 megabytes. This limit ensures that a single document cannot use an excessive amount of RAM on the server or excessive bandwidth when transmitted. For documents larger than this, MongoDB provides the GridFS API.
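As a quick illustration of the GridFS route, here is a hedged pymongo sketch (connection details and file name are hypothetical): GridFS splits the payload into chunks stored across two collections, sidestepping the 16 MB limit.

```python
import gridfs
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection
db = client["scraped"]
fs = gridfs.GridFS(db)

# GridFS stores the payload as chunks in fs.chunks plus metadata in
# fs.files, so a single logical file can exceed the 16 MB BSON limit.
with open("large_crawl_export.json", "rb") as f:   # hypothetical file
    file_id = fs.put(f, filename="large_crawl_export.json")

# Read it back as a file-like object.
data = fs.get(file_id).read()
```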
