
· 11 min read
Shyam Sreevalsan


As systems increasingly shift towards distributed architectures to deliver application services, the roles of monitoring and observability have never been more crucial. Monitoring delivers the situational awareness you need to detect issues, while observability goes a step further, offering the analytical depth to understand the root cause of those issues.

Understanding the nuanced differences between monitoring and observability is essential for anyone responsible for system health and performance. In dissecting these methodologies, we'll explore their unique strengths, dive into practical applications, and show how to strategically employ each to improve operational outcomes.

To set the stage, consider a real-world scenario that many of us have encountered: It's 3 a.m., and you get an alert that a critical service is down. Traditional monitoring tools may tell you what's wrong, but they won't necessarily tell you why it's happening, leaving that part up to you. With observability, the tool enables you to explore your system's internal state and uncover the root cause faster and more easily.

The Conceptual Framework

Monitoring has its roots in the early days of computing, dating back to mainframes and the first networked systems. The primary objective was straightforward: keep the system up and running. Threshold-based alerts and basic metrics like CPU usage, memory consumption, and disk I/O were the mainstay. These metrics provided a snapshot but often lacked the context needed for debugging complex issues.

Observability, on the other hand, is a relatively new paradigm, inspired by control theory and complex systems theory. It came to prominence with the rise of microservices, container orchestration, and cloud-native technologies. Unlike monitoring, which focuses on known problems, observability is designed to help you understand unknown issues. The concept gained traction as systems became too complex to understand merely through predefined metrics or logs.

Monitoring: The Watchtower

Monitoring is about gathering data to answer known questions. These questions usually take the form of metrics, alerts, and logs configured ahead of time. In essence, monitoring systems act as a watchtower, constantly scanning for pre-defined conditions and alerting you when something goes awry. The approach is inherently reactive; you set up alerts based on what you think will go wrong and wait.

For instance, you might set an alert for when CPU usage exceeds 90% for a prolonged period. While this gives you valuable information, it doesn't offer insights into why this event is occurring. Was there a sudden spike in user traffic, or is there an inefficient code loop causing the CPU to max out?
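To make that concrete, here is a minimal Python sketch of the reactive pattern, using the same hypothetical 90% threshold (the sample values are invented for illustration):

```python
import statistics

def should_alert(cpu_samples, threshold=90.0):
    """Fire an alert when the average CPU usage over the sampled
    window (e.g. five minutes of per-minute readings) exceeds the
    threshold. This tells us *that* CPU is high, not *why*."""
    return statistics.mean(cpu_samples) > threshold

# A hypothetical five-minute window of per-minute CPU readings:
window = [88.0, 92.5, 95.1, 93.4, 96.2]
if should_alert(window):
    print("ALERT: average CPU usage above 90% for the window")
```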

Observability: The Explorer

Observability is a more dynamic concept, focusing on the ability to ask arbitrary questions about your system, especially questions you didn't know you needed to ask. Think of observability as an explorer equipped with a map, compass, and tools that allow you to discover and navigate unknown territories of your system. With observability, you can dig deeper into high-cardinality data, enabling you to explore the "why" behind the issues.

For example, you may notice that latency has increased for a particular service. Observability tools will allow you to drill down into granular data, like traces or event logs, to identify the root cause, whether it be an inefficient database query, network issues, or something else entirely.

Key Differences between Monitoring & Observability

Data

Both monitoring and observability rely heavily on three fundamental data types: metrics, logs, and traces. However, the approach each takes to collecting, examining, and utilizing this data can differ substantially.

Metrics in Monitoring vs Observability

Metrics serve as the backbone of both monitoring and observability, providing numerical data that is collected over time. However, the granularity, flexibility, and usage of these metrics differ substantially between the two paradigms.

Monitoring: Predefined and Aggregate Metrics

In a monitoring setup, metrics are often predefined and tend to be aggregate values, such as averages or sums calculated over a specific time window. These metrics are designed to trigger alerts based on known thresholds. For example, you might track the average CPU usage over a five-minute window and set an alert if it exceeds 90%. While this approach is effective for catching known issues, it lacks the context needed to understand why a problem is occurring.

Observability: High-Fidelity, High-Granularity and Context-Rich Metrics

Observability platforms go beyond merely collecting metrics; they focus on high-granularity, real-time metrics that can be dissected and queried in various ways. Here, you're not limited to predefined aggregate values. You can explore metrics like request latency at the 99th percentile over a one-second interval or look at the distribution of database query times for a particular set of conditions. This depth allows for a more nuanced understanding of system behavior, enabling you to pinpoint issues down to their root cause.

A critical aspect that is often overlooked is the need for real-time, high-fidelity metrics, which are metrics sampled at very high frequencies, often per second. In a system where millions of transactions are happening every minute, a five-minute average could hide critical spikes that may indicate system failure or degradation. Observability platforms are generally better suited to provide this level of granularity than traditional monitoring tools.
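To illustrate why that granularity matters, here is a small self-contained Python sketch, with invented numbers, showing how a five-minute average can completely hide a multi-second spike that per-second samples (or a high percentile) would expose:

```python
import statistics

# 300 hypothetical per-second latency samples (ms): mostly quiet,
# with a five-second spike hidden inside the window.
samples = [20.0] * 295 + [4000.0] * 5

print("5-minute average:", round(statistics.mean(samples), 1), "ms")  # ~86 ms, looks fine
print("worst second:   ", max(samples), "ms")                         # 4000 ms spike
# A high percentile over the raw samples also exposes the tail:
print("p99:            ", statistics.quantiles(samples, n=100)[98], "ms")
```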

Logs: Event-Driven in Monitoring vs Queryable in Observability

Logs provide a detailed account of events and are fundamental to both monitoring and observability. However, the treatment differs.

Monitoring: Event-Driven Logs

In monitoring systems, logs are often used for event-driven alerting. For instance, a log entry indicating a login with elevated permissions might trigger an alert for potential security concerns. These logs are essential but are typically consulted only when an issue has already been flagged by the monitoring system.
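A sketch of that event-driven style in Python, assuming a syslog-style auth log (both the path and the pattern are illustrative assumptions):

```python
import re

# Hypothetical pattern: flag sudo invocations in an auth log.
PATTERN = re.compile(r"sudo: .*COMMAND=")

def scan_auth_log(path="/var/log/auth.log"):
    with open(path, errors="replace") as f:
        for line in f:
            if PATTERN.search(line):
                print("ALERT (elevated permissions):", line.strip())

scan_auth_log()
```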

Observability: Queryable Logs

In observability platforms, logs are not just passive records; they are queryable data points that can be integrated with metrics and traces for a fuller picture of system behavior. You can dynamically query logs to investigate anomalies in real-time, correlating them with other high-cardinality data to understand the 'why' behind an issue.
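By contrast, here is a toy sketch of the queryable style: log events as structured records you can filter ad hoc, asking questions you hadn't planned for (the events and fields are invented):

```python
# Hypothetical structured log events; in practice these would come
# from a log store that supports ad-hoc queries.
events = [
    {"ts": 1, "service": "api", "route": "/checkout", "status": 500, "latency_ms": 950},
    {"ts": 2, "service": "api", "route": "/health",   "status": 200, "latency_ms": 3},
    {"ts": 3, "service": "db",  "route": None,        "status": 200, "latency_ms": 870},
]

# An ad-hoc question we didn't plan for: which error events were
# also slow, and therefore worth correlating with traces?
suspicious = [e for e in events if e["status"] >= 500 and e["latency_ms"] > 500]
print(suspicious)
```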

Proactive vs Reactive

The second key difference lies in how these approaches are generally used to interact with the system.

Monitoring: Set Alerts and React

Monitoring is generally reactive. You set up alerts for known issues, and when those alerts go off, you react. It’s like having a fire alarm; it will notify you when there’s a fire, but it won’t tell you how the fire started, or how to prevent it in the future.

Observability: Continuous Exploration

Observability, by contrast, is more proactive. With an observability platform, you’re not just waiting for things to break. You’re continually exploring your data to understand how your system behaves under different conditions. This allows for more preventive measures and enables engineers to understand the system’s behavior deeply.

Opinionated Dashboards and Charts

Navigating the sprawling landscape of system data can be a daunting task, particularly as systems scale and evolve. Both monitoring and observability tools offer dashboards and charts as a solution to this challenge, but the philosophy and functionality behind them can differ significantly.

Monitoring: Pre-Built and Prescriptive Dashboards

In the realm of monitoring, dashboards are often pre-built and prescriptive, designed to highlight key performance indicators (KPIs) and metrics that are generally considered important for the majority of use-cases. For instance, a pre-configured dashboard for a database might focus on query performance, CPU usage, and memory consumption. These dashboards serve as a quick way to gauge the health of specific components within your system.

  • Quick Setup: Pre-built dashboards require little to no configuration, making them quick to deploy.
  • Best Practices: These dashboards are often designed based on industry best practices, providing a tried-and-true set of metrics that most organizations should monitor.
  • Lack of Flexibility: Pre-built dashboards are not always tailored to your specific needs and might lack the ability to perform ad-hoc queries or deep dives.
  • Surface-Level Insights: While useful for a quick status check, these dashboards may not provide the contextual data needed to understand the root cause of an issue.

Observability: Customizable and Exploratory Dashboards

Contrastingly, observability platforms often allow for much greater customization and flexibility in dashboard creation. You can build your own dashboards that focus on the metrics most relevant to your specific application or business needs. Moreover, you can create ad-hoc queries to explore your data in real-time.

  • Deep Insights: Custom dashboards allow you to drill down into high-cardinality data, providing nuanced insights that can lead to effective problem-solving.
  • Contextual Understanding: Because you can tailor your dashboard to include a wide range of metrics, logs, and traces, you get a more contextual view of system behavior.
  • Complexity: The flexibility comes at the cost of complexity. Building custom dashboards often requires a deep understanding of the data model and query language of the observability platform.
  • Time-Consuming: Crafting a dashboard that provides valuable insights can be a time-consuming process, especially if you're starting from scratch.

Netdata aims to deliver the best of both worlds by giving you out-of-the-box opinionated, powerful, flexible, customizable dashboards for every single metric.


Real-World Applications: Monitoring vs Observability

Understanding the key differences between monitoring and observability is pivotal, but these concepts are best illustrated through real-world use cases. Below, we delve into some sample scenarios where each approach excels, offering insights into their practical applications.

Network Performance

Monitoring tools are incredibly effective for tracking network performance metrics like latency, packet loss, and throughput. These metrics are often predefined, allowing system administrators to quickly identify issues affecting network reliability. For example, if a VPN connection experiences high packet loss, monitoring tools can trigger an alert, prompting immediate action.

Debugging Microservices

In a microservices architecture, services are loosely coupled but have to work in harmony. When latency spikes in one service, it can be a herculean task to pinpoint the issue. This is where observability shines. By leveraging high-cardinality data and dynamic queries, engineers can dissect interactions between services at a granular level, identifying bottlenecks or failures that are not immediately obvious.

Case Study: Transitioning from Monitoring to Observability

Consider a real-world example of a SaaS company that initially relied solely on monitoring tools. As their application grew in complexity and customer base, they started noticing unexplained latency issues affecting their API. Traditional monitoring tools could indicate that latency had increased but couldn't offer insights into why it was happening.

The company then transitioned to an observability platform, enabling them to drill down into granular metrics and traces. They discovered that the latency was tied to a specific database query that only became problematic under certain conditions. Using observability, they could identify the issue, fix the inefficient query, and substantially improve their API response times. This transition not only solved their immediate problem but equipped them with the tools to proactively identify and address issues in the future.

Synergy and Evolution: The Future of Monitoring and Observability

The choice between monitoring and observability isn't binary; often, they can complement each other. Monitoring provides the guardrails that keep your system running smoothly, while observability gives you the tools to understand your system deeply, especially as it grows in complexity.

As we continue to push the boundaries of what's possible in software development and system architecture, both monitoring and observability paradigms are evolving to meet new challenges and leverage emerging technologies. The sheer volume of data generated by modern systems is often too vast for humans to analyze in real-time. AI and machine learning algorithms can sift through this sea of information to detect anomalies and even predict issues before they occur. For example, machine learning models can be trained to recognize the signs of an impending system failure, such as subtle but unusual patterns in request latency or CPU utilization, allowing for preemptive action.

Monitoring and observability serve distinct but complementary roles in the management of modern software systems. Monitoring provides a reactive approach to known issues, offering immediate alerts for predefined conditions. It excels in areas like network performance and infrastructure health, acting as a first line of defense against system failures. Observability, on the other hand, allows for a more proactive and exploratory interaction with your system. It shines in complex, dynamic environments, enabling teams to understand the 'why' behind system behavior, particularly in microservices architectures and real-world debugging scenarios.

Netdata: Real-Time Metrics Meet Deep Insights

Netdata offers capabilities that span both monitoring and observability. It delivers real-time, per-second metrics, making it a powerful resource for those in need of high-fidelity data. Netdata provides out-of-the-box dashboards for every single metric as well as the capability to build custom dashboards, bridging the gap between static monitoring views and the dynamic, exploratory nature of observability. Whether you're looking to simply keep an eye on key performance indicators or need to dig deep into system behavior, Netdata offers a balanced, versatile solution.

Check out Netdata's public demo space or sign up today for free, if you haven't already.

Happy Troubleshooting!


· 3 min read
Shyam Sreevalsan


Today, we released our systemd journal plugin for Netdata, allowing you to explore, view, search, filter and analyze systemd journal logs.

Like most things about Netdata, this is a zero-configuration plugin. You don’t have to do anything apart from installing Netdata on your systems. This is a key design direction for Netdata: we want Netdata to be able to help even if you install it mid-crisis, while you have an incident at hand.
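If you want to poke at the journal programmatically, outside Netdata, here is a rough sketch using the python-systemd bindings - assuming that package is installed, and with an illustrative unit name:

```python
from systemd import journal  # provided by the python-systemd package

j = journal.Reader()
j.this_boot()                               # restrict to the current boot
j.add_match(_SYSTEMD_UNIT="sshd.service")   # illustrative unit name

for entry in j:
    print(entry.get("__REALTIME_TIMESTAMP"), entry.get("MESSAGE"))
```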

· 3 min read
Costa Tsaousis


“Why bother with it? I let it run in the background and focus on more important DevOps work.” - a random DevOps engineer on Reddit, r/devops

In an era where technology is evolving at breakneck speed, it's easy to overlook the tools that are right under our noses. One such underutilized powerhouse is the systemd journal. For many, it's a mere tool to check the status of systemd service units or to tail the most recent events (journalctl -f). Others, who work mainly with containers, ignore its existence entirely.

What is the purpose of systemd-journal?

However, the systemd journal includes very important information: kernel errors, application crashes, out-of-memory process kills, storage-related anomalies, crucial security intel such as ssh or sudo attempts and security audit logs, connection/disconnection errors, network-related problems, and a lot more. The system journal is brimming with data that can offer deep insights into the health and security of our systems, and yet many professional system and DevOps engineers tend to ignore it.

· 11 min read
Satyadeep Ashwathnarayana


In this blog, we will walk you through the basics of getting Netdata, Prometheus, and Grafana working together to monitor your application servers. This article uses Docker on your local workstation. We will be working with Docker in an ad-hoc way, launching containers that run /bin/bash and attaching a TTY to them. We use Docker here in a purely academic fashion and do not condone running Netdata in a container. We picked this method so individuals without cloud accounts or access to VMs can try it out, and for its speed of deployment.

· 7 min read
Satyadeep Ashwathnarayana


Netdata reads /proc/<pid>/stat for all processes once per second and extracts utime and stime (user and system CPU utilization), much like all the console tools do.

But it also extracts cutime and cstime, which account for the user and system time of the exited children of each process. By keeping a map of the whole process tree in memory, it is capable of assigning the right time to every process, taking into account all its exited children.

It is tricky, since a process may run for an hour and, once it exits, its parent should not receive that whole hour of CPU time in just one second - you have to subtract the CPU time that has already been reported for it in prior iterations.

It is even trickier, because walking through the entire process tree takes some time itself. So, if you sum the CPU utilization of all processes, you might end up with more CPU time than the reported total CPU time of the system. Netdata solves this by scaling the per-process CPU utilization to the system total.
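For the curious, here is a simplified Python sketch of the accounting idea described above - an illustration only, not Netdata's actual C implementation:

```python
import os
import time

HZ = os.sysconf("SC_CLK_TCK")  # clock ticks per second

def cpu_ticks(pid):
    """Return own (utime+stime) and children (cutime+cstime) ticks.
    The comm field may contain spaces, so parse after the last ')'."""
    with open(f"/proc/{pid}/stat") as f:
        rest = f.read().rpartition(")")[2].split()
    # rest[0] is the state field (field 3); utime..cstime are fields 14-17.
    utime, stime, cutime, cstime = (int(x) for x in rest[11:15])
    return utime + stime + cutime + cstime

prev = {}
for _ in range(2):  # two iterations, one second apart
    for pid in (int(d) for d in os.listdir("/proc") if d.isdigit()):
        try:
            total = cpu_ticks(pid)
        except (FileNotFoundError, ProcessLookupError):
            continue  # the process exited while we walked the tree
        if pid in prev:  # subtract what was already reported last time
            pct = (total - prev[pid]) * 100.0 / HZ
            if pct > 0:
                print(f"pid {pid}: {pct:.1f}% CPU (incl. exited children)")
        prev[pid] = total
    time.sleep(1)
```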

· 9 min read
Satyadeep Ashwathnarayana


Netdata monitors tc QoS classes for all interfaces.

If you also use FireQOS, it will collect interface and class names as well.

There is a shell helper for this (all parsing is done by the plugin in C code - this shell script is just a configuration for the command to run to get tc output).

The source of the tc plugin is here. It is somewhat complex, because a state machine was needed to keep track of all the tc classes, including the pseudo classes tc dynamically creates.

You can see a live demo here.

· 4 min read
Andrew Maguire


Over the last few years we have slowly and methodically been building out the ML-based capabilities of the Netdata agent, dogfooding and iterating as we go. To date, these features have mostly been somewhat reactive: tools that aid you once you are already troubleshooting.

Now we feel we are ready to take a first gentle step into some more proactive use cases, starting with a simple node level anomaly rate alert.

· 3 min read
Andrew Maguire


We are always trying to lower the barrier to entry for monitoring and observability. One place we have consistently witnessed pain is in adopting and approaching configuration management tools and practices as your infrastructure grows and becomes more complex.

To that end, we have recently begun publishing our own little example Ansible project, used to maintain and manage the servers in our public Machine Learning Demo room.

This post introduces the project as a fairly simple example of using Ansible with Netdata. Read on to learn more but, more importantly, feel free to explore the repo and see how it all hangs together.

· 14 min read
Satyadeep Ashwathnarayana


What are they and why do we need them?

A “Parent” is a Netdata Agent, like the ones we install on all our systems, but is configured as a central node that receives, stores and processes metrics data from other Netdata “Child” nodes in our infrastructure.

Netdata Parents are flexible. You can have one big active-active cluster of Netdata Parents, or you can spread a lot of independent Parents across the infrastructure.

This “distributed still centralized” setup provides a lot of benefits. Let’s go through them one by one in this blog post.

· 11 min read

In this blog post, we will explore the importance of scalability, automation, and AI in the evolving landscape of infrastructure monitoring. We will examine how Netdata's innovative solution aligns with these emerging trends, and how it can empower organizations to effectively manage their modern IT infrastructure.

· 11 min read
Satyadeep Ashwathnarayana


In today's fast-paced digital landscape, 24-hour operations centers play a crucial role in managing and monitoring large-scale infrastructures. These centers must be equipped with an effective monitoring solution that addresses their unique needs, enabling them to respond quickly to incidents and maintain optimal system performance. Netdata, a comprehensive monitoring solution, has been designed to meet these critical requirements with its advanced capabilities and recent enhancements.

In this article, we will explore how Netdata's powerful features can transform the way 24-hour operations centers monitor and manage their complex environments, leading to improved incident detection, faster troubleshooting, and better overall system performance.

· 7 min read

The advent of multi-cloud and hybrid-cloud architectures has created new opportunities for organizations to leverage best-in-class features from various cloud service providers. However, these complex environments present their own unique challenges, especially when it comes to monitoring and managing performance.

· 13 min read
Satyadeep Ashwathnarayana


Netdata provides a comprehensive set of charts that can help you understand the workload, performance, utilization, saturation, latency, responsiveness, and maintenance activities of your disks. In this blog we will focus on monitoring disks as block devices, not as filesystems or mount points.

· 12 min read
Satyadeep Ashwathnarayana


Memory-intensive applications can benefit from improved performance by using huge pages, as they can reduce TLB pressure and memory fragmentation, and lower the memory management overhead overall. Developers should consider using HugeTLBfs in their mmap() and shmget() calls to take advantage of huge pages.
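As a hedged illustration, here is a Python sketch of requesting a huge-page mapping with mmap(); MAP_HUGETLB is not exposed by every Python build (the 0x40000 fallback is the Linux x86 value, an assumption), and the call fails unless the administrator has reserved huge pages:

```python
import mmap

# Fall back to the Linux x86 constant if this Python build does not
# expose MAP_HUGETLB (assumption: x86 Linux).
MAP_HUGETLB = getattr(mmap, "MAP_HUGETLB", 0x40000)
LENGTH = 2 * 1024 * 1024  # one 2 MiB huge page

try:
    buf = mmap.mmap(
        -1, LENGTH,
        flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS | MAP_HUGETLB,
        prot=mmap.PROT_READ | mmap.PROT_WRITE,
    )
    buf[:5] = b"hello"
    print("huge-page mapping OK")
    buf.close()
except OSError as e:
    print("no huge pages available (try sysctl vm.nr_hugepages):", e)
```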

Transparent Huge Pages (THP) is a Linux kernel feature that provides some of the benefits of huge pages without requiring any development effort. However, THP can cause latency in many applications. Although kernel developers are actively working to address these issues, many system administrators prefer to disable THP altogether.

Netdata can assist in determining whether THP is helpful or harmful to your applications, which can guide your decision regarding its use.
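For a quick manual check alongside what Netdata charts, here is a small sketch that reads the standard Linux sysfs and procfs entries for THP:

```python
def read(path):
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return "n/a"

# The bracketed value is the active policy, e.g. "always [madvise] never".
print("THP enabled:", read("/sys/kernel/mm/transparent_hugepage/enabled"))
print("THP defrag: ", read("/sys/kernel/mm/transparent_hugepage/defrag"))

# How much anonymous memory is currently backed by transparent huge pages.
for line in read("/proc/meminfo").splitlines():
    if line.startswith("AnonHugePages"):
        print(line)
```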

· 9 min read
Satyadeep Ashwathnarayana


The mem.kernel chart in Netdata provides insight into the memory usage of various kernel subsystems and mechanisms. By understanding these dimensions and their technical details, you can monitor your system's kernel memory usage and identify potential issues or inefficiencies. Monitoring these dimensions can help you ensure that your system is running efficiently and provide valuable insights into the performance of your kernel and memory subsystem.
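To pull these dimensions out programmatically, here is a sketch against the Netdata agent's REST API, assuming an agent listening on localhost:19999; the exact response shape is hedged with a fallback:

```python
import json
import urllib.request

URL = ("http://localhost:19999/api/v1/data"
       "?chart=mem.kernel&after=-60&format=json")

with urllib.request.urlopen(URL) as resp:
    payload = json.load(resp)

# Depending on agent version the points may sit under a "result" key.
result = payload.get("result", payload)
print("dimensions:", result["labels"][1:])  # first label is "time"
print("rows returned:", len(result["data"]))  # one row per point
```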


· 11 min read
Satyadeep Ashwathnarayana


Entropy is a measure of the randomness or unpredictability of data. In the context of cryptography, entropy is used to generate random numbers or keys that are essential for secure communication and encryption. Without a good source of entropy, cryptographic protocols can become vulnerable to attacks that exploit the predictability of the generated keys.
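On Linux, the kernel exposes its estimate of available entropy under procfs; a tiny sketch (the warning threshold is illustrative, not an official value):

```python
with open("/proc/sys/kernel/random/entropy_avail") as f:
    bits = int(f.read())

print(f"available entropy: {bits} bits")
if bits < 200:  # illustrative threshold, not an official one
    print("warning: entropy pool is running low")
```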

· 4 min read
Satyadeep Ashwathnarayana


Swap memory, often loosely called virtual memory, is space on a hard disk that is used to supplement the physical memory (RAM) of a computer. The swap space is used when the system runs out of physical memory: less frequently accessed data is moved from RAM to the hard disk, freeing up space in RAM for more frequently accessed data. But should swap be enabled on production systems and cloud-provided virtual machines (VMs)? Let's explore the pros and cons.

· 5 min read
Satyadeep Ashwathnarayana


Context switching is the process of switching the CPU from one process, task or thread to another. In a multitasking operating system, such as Linux, the CPU has to switch between multiple processes or threads in order to keep the system running smoothly. This is necessary because each CPU core without hyperthreading can only execute one process or thread at a time. If there are many processes or threads running simultaneously, and very few CPU cores available to handle them, the system is forced to make more context switches to balance the CPU resources among them.

Context switching is an essential function of any multitasking operating system, but it comes at a cost. Each context switch involves saving the current state of the CPU, loading the state of the new process or thread, and then resuming execution of the new process or thread. This takes time and consumes CPU resources, so the more context switches that occur, the slower the system becomes.

The impact of context switching on system performance can be significant, especially in systems with many processes or threads running simultaneously.
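To make that cost observable, here is a small sketch that measures the system-wide context-switch rate from the cumulative ctxt counter in /proc/stat:

```python
import time

def total_context_switches():
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("ctxt"):
                return int(line.split()[1])  # cumulative since boot
    raise RuntimeError("no ctxt line in /proc/stat")

before = total_context_switches()
time.sleep(1)
after = total_context_switches()
print(f"context switches/sec: {after - before}")
```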

· 14 min read
Satyadeep Ashwathnarayana


As a system administrator, understanding how your Linux system's CPU is being utilized is crucial for identifying bottlenecks and optimizing performance. In this blog post, we'll dive deep into the world of Linux CPU consumption, load, and pressure, and discuss how to use these metrics effectively to identify issues and improve your system's performance.

· 3 min read
Shyam Sreevalsan


Hello, fellow data enthusiasts and Google Colab aficionados! Today, we're going to explore how to monitor your Google Colab instances using Netdata. Colab is a fantastic platform for running Notebooks, developing ML models, and other data science and analytics tasks. But have you ever wondered how your Colab instance is performing under the hood? That's where Netdata comes into play!

· 6 min read
Austin S. Hemmelgarn

At Netdata, we’re committed to trying to make Netdata work as well as possible for our users. Sometimes though, that means changing things in ways that aren’t exactly seamless. Such a change is coming soon for users of our native DEB and RPM packages, and this blog post will explain what’s happening, why we’re doing it, and what it means for our users.

· 4 min read
Andrew Maguire


We have recently extended the native machine learning (ML) based anomaly detection capabilities of Netdata to support all metrics, regardless of their collection frequency (update every).

Previously only metrics collected every second were supported, but now Netdata can run anomaly detection out of the box with zero config on metrics with any collection frequency.

This post will illustrate an example of what this means using Prometheus metrics (via the Netdata Prometheus collector) since they typically have a default collection frequency of 10 seconds.

· 9 min read
Andrew Maguire


We recently got this great feedback from a dear user in our Discord:

“I would really like to use Netdata to monitor custom internal metrics that come from SQL, not a fan of having 10 diff systems doing essentially the same thing as is, Netdata is pretty much all there in that regard, just needs a few extra features.”

This is great and exactly what we want: a clear problem or improvement we could tackle to make that user's monitoring life a little easier.

This is also where the beauty of open source comes in, being able to build on the shoulders of giants - adding such a feature turned out to be pretty easy, simply by extending our existing Pandas collector to support SQL queries leveraging its read_sql() capabilities.

Here is the PR that was merged a few days later.

This blog post will cover an example of using the Pandas collector to monitor some custom SQL metrics from a WordPress MySQL database.
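As a flavor of what the post covers, here is a hedged sketch of pulling a custom metric from a WordPress MySQL database with pandas' read_sql(); the connection details are placeholders you would replace with your own:

```python
import pandas as pd
from sqlalchemy import create_engine

# Illustrative connection string - replace user, password, and host.
engine = create_engine("mysql+pymysql://user:password@localhost/wordpress")

query = """
    SELECT comment_approved AS status, COUNT(*) AS comments
    FROM wp_comments
    GROUP BY comment_approved
"""
df = pd.read_sql(query, engine)
print(df)
```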

· 9 min read
Andrew Maguire


We have been busy at work under the hood of the Netdata agent to introduce new capabilities that let you extend the "training window" used by Netdata's native anomaly detection capabilities.

This blog post will discuss one of these improvements, which helps you reduce "false positives" by essentially extending the training window, using the new (beautifully named) number of models per dimension configuration parameter.

· 38 min read

Another release of the Netdata Monitoring solution is here!

We focused on these key areas:

  • Infinite scalability of the Netdata Ecosystem.
  • Default Database Tiering, offering months of data retention for typical Netdata Agent installations with default settings and years of data retention for dedicated Netdata Parents.
  • A ton of improvements to the Overview Dashboards at Netdata Cloud, allowing slicing and dicing of data directly on the UI and overcoming the limitations of web technology when thousands of charts are presented on one page.
  • Integration with Grafana for custom dashboards, using Netdata Cloud as an infrastructure-wide time-series data source for metrics.
  • Completely rewritten PostgreSQL monitoring, offering state-of-the-art monitoring of database performance and health, even at the table and index level.

· 5 min read
Satyadeep Ashwathnarayana

Web servers are among the most important components in modern IT infrastructures. They host the websites, web services, and web applications that we use on a daily basis. Social networking, media streaming, software as a service (SaaS), and other activities wouldn’t be possible without them. And with the advent of cloud computing and the movement of more services online, web servers and their monitoring are only becoming more important. Given how extensively web servers are used, sysadmins and SREs should treat web server monitoring as a key aspect of performance.


· 3 min read
Chris Akritidis

The life of a sysadmin or SRE is often difficult, but occasionally very simple things can make a huge difference. Basic monitoring of your systemd services is one of those simple things, which we sometimes overlook. The simplest question one would want answered is whether the thing that’s supposed to be running is actually running at all. If you use systemd services, you can guarantee an answer to that question within minutes using Netdata.
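The same basic liveness question can also be asked from a script; here is a sketch that leans on the exit code of systemctl is-active (the unit name is illustrative):

```python
import subprocess

def is_running(unit):
    """systemctl is-active exits 0 only when the unit is active."""
    return subprocess.run(
        ["systemctl", "is-active", "--quiet", unit]
    ).returncode == 0

unit = "nginx.service"  # illustrative unit
print(f"{unit} is", "running" if is_running(unit) else "NOT running")
```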

· 5 min read
Chris Akritidis

The HTTP protocol has become the de facto standard application layer protocol of the internet. From publicly available web sites and APIs to “inter-process” communications in REST based microservice architectures or large Service Oriented Architectures based on SOAP, you find HTTP being used again and again, due to its simplicity and our familiarity with it. How many protocols can you name that have memes for their status codes? Of course, such a popular protocol has endless pages written about how to properly monitor the services that rely on it, with many options specific to every use case.
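A minimal HTTP check captures two of those signals, the status code and the response time; here is a stdlib-only Python sketch with a placeholder URL:

```python
import time
import urllib.error
import urllib.request

def check(url):
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            status = resp.status
    except urllib.error.HTTPError as e:
        status = e.code  # 4xx/5xx responses still carry a status code
    elapsed_ms = (time.perf_counter() - start) * 1000
    return status, elapsed_ms

status, ms = check("https://example.com/")  # placeholder URL
print(f"status={status} latency={ms:.0f}ms")
```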

· 3 min read
Satyadeep Ashwathnarayana

It is sometimes easy to get lost in the mountain of metrics and the infinite number of dimensions when working with an infrastructure monitoring tool. Being able to filter metrics by label and visualize only what is relevant to the current scope of monitoring and troubleshooting becomes absolutely crucial to the success of SREs, sysadmins, and DevOps professionals.