- 優先級 - 高
- 影響範圍 其他 - nlcp01,nlcp02,nlcp03,nlcp05,nlcp06,nlcp07,nlcp08,nlsh04,proxmox05,proxmox06,proxmox07
-
Between 10pm the 18th, and 2am the 19th of December, a datacenter migration took place.
Most systems were online within 2 hours (so around midnight).
nlcp06 took longer to bring back online due to a broken PSU; the PSU was replaced with an on-site spare.
nlsh04 had issues because the per-customer CPU, memory and disk limits were not being applied correctly. After we brought some internal systems back up, the limits were applied again. We've taken steps to ensure this will not happen in the future.
During January and February, we'll gradually increase the redundancy of the network configuration on each server. This requires reconfiguring each server's network interfaces, as well as the switches, to apply the redundancy.
This work will be done at night, after midnight. In some cases it will cause up to 5-10 minutes of downtime, depending on the system configuration; in many cases, however, it will be less than a minute.
- 日期 - 18/12/2024 22:00 - 19/12/2024 02:00
- 最後更新 - 19/12/2024 03:45
- 優先級 - 中
- 影響範圍 系統 - Redirector and URL Scheduler
-
As of 2am Amsterdam time, our Redirector and URL Scheduler for hosting-panel.net are unavailable. Both systems are hosted with the same upstream provider. A solution is being worked on as we speak.
3.10am: The Redirector is back online. The URL Scheduler remains unavailable at this time.
9.55am: The URL Scheduler is back online as well.
- 日期 - 02/09/2024 02:00 - 02/09/2024 09:55
- 最後更新 - 02/09/2024 16:18
- 優先級 - 高
- 影響範圍 系統 - .dk domain registrations
-
We're aware of delays in the processing and registration of .dk domains.
We're investigating the issue with our suppliers; it boils down to a change in the data required by the EPP service for .dk domains. We're working on resolving it as soon as possible.
Update 18/04/2024 1.30pm Amsterdam time: The problem has been resolved.
- 日期 - 17/04/2024 14:00 - 18/04/2024 13:30
- 最後更新 - 18/04/2024 13:30
- 優先級 - 重大
- 影響範圍 系統 - de-mail01
-
At 10.50pm we got a report of slow email sending on de-mail01
At 11.24pm, after investigating the issue, we decided to reboot the system due to indications that the systemd process (the main management process) was not functioning as it should, causing various issues with managing services.
Because of this, we performed a manual backup of the system, including email and configuration.
At 11.42pm the system was restarted and came back online as normal after a few minutes.
At the same time, we decided to run a disk integrity check, since we had to force-reboot the system. The check indicated multiple errors on both drives in the storage array.
With both disks affected, the top priority was to get a new system online and configure it to take over the email accounts located on the server.
At 01.05am we started testing the newly configured system.
At 01.34am we stopped Dovecot and Postfix on the old system and performed the final migration of files.
At 01.46am we switched the DNS, updated inbound mail routing and updated the records in hosting-panel.net to reflect the new system.
At 01.50am we ran an update to restore the correct permissions on all accounts.
At 02.15am we completed our checks.
Backups have been re-enabled for the new system, and the system has been added to monitoring.
The last remaining task, which will be done tomorrow, is restoring the full-text search index to speed up full-text searching in Dovecot.
- 日期 - 21/01/2024 22:50 - 22/01/2024 02:15
- 最後更新 - 22/01/2024 02:28
- 優先級 - 重大
- 影響範圍 伺服器 - nlsh04.h4r-infra.net
-
Between 1.55AM and 3.31AM UTC we experienced a total of 15 minutes of downtime on server NLSH04.
The downtime was caused by an unplanned, urgent reboot due to a broken kernel module.
While provisioning new users on the system, we noticed that some account limits were not being applied correctly. After investigating this, we found that the kernel module managing these limits had not loaded correctly, and due to the nature of the module it cannot be reloaded without rebooting the system.
The initial reboot resulted in the module being loaded correctly; the downtime for this reboot was 5 minutes according to our monitoring.
We then discovered that certain CloudLinux LVE features were no longer available due to a failure that caused certain software features to not work as expected. This prompted a second reboot; however, because the configuration had been updated as part of the software update, critical boot parameters were no longer present (parameters that effectively disable cgroups v2, which kmod-lve from CloudLinux is incompatible with). With the module not loading, the system could not place users into their virtual secure environment, effectively rendering the service unavailable. After fixing these parameters and performing one final reboot, the system came back up as expected.
The second outage lasted 10 minutes.
We'll be implementing additional metrics in our systems to catch issues like this faster, which should help prevent a similar incident in the future.
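One such check could be as simple as verifying that the expected kernel boot parameters are present after a reboot. A minimal sketch follows; the parameter name is an assumption based on common systemd setups, not our actual configuration.

```python
# Hypothetical post-reboot check: confirm the boot parameters that keep cgroups v1
# active (required by kmod-lve at the time) are still present in /proc/cmdline.
# The parameter name below is an assumption, not taken from our servers.
REQUIRED_PARAMS = {"systemd.unified_cgroup_hierarchy=0"}

def missing_boot_params(cmdline_path="/proc/cmdline"):
    with open(cmdline_path) as f:
        present = set(f.read().split())
    return REQUIRED_PARAMS - present

if __name__ == "__main__":
    missing = missing_boot_params()
    if missing:
        print("WARNING: missing kernel parameters:", ", ".join(sorted(missing)))
    else:
        print("All required boot parameters are present.")
```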
We're sorry for the inconvenience caused by the 15 minutes of downtime.
- 日期 - 18/11/2023 02:55 - 18/11/2023 04:31
- 最後更新 - 18/11/2023 04:09
- 優先級 - 重大
- 影響範圍 系統 - NLCP05, NLCP07
-
We're currently experiencing a loss of network connectivity on the servers NLCP05 and NLCP07; we're investigating the cause.
Update 4.55pm: The issue has been resolved after 41 minutes of downtime.
We're still looking into the definitive root cause of the problem and why it triggered in the first place; our current findings are as follows:
Servers nlcp05 and nlcp07, as well as a third server used for some internal services, lost their IPv4 connectivity at 4.14pm.
Upon investigation we saw that the switch had lost its ARP entries for IPv4, meaning it no longer knew where to forward packets. IPv6, however, worked without issue.
Ultimately we decided to disable an ARP-related protection feature on the switch, which immediately restored connectivity for all systems.
We're still investigating the actual root cause, but for now, the systems should be functioning.
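For future detection, a dual-stack probe is the kind of check that would have flagged "IPv4 down, IPv6 up" quickly. Below is a minimal sketch; the hostname is a placeholder and the check is illustrative, not our actual monitoring.

```python
# Probe the same host over IPv4 and IPv6 separately; a split result (one up, one
# down) points at an address-family-specific problem such as lost ARP entries.
import socket

def reachable(host, port=443, family=socket.AF_INET, timeout=5):
    try:
        addr = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)[0][4][0]
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        return False

host = "nlcp05.h4r-infra.net"  # placeholder hostname
print("IPv4:", "up" if reachable(host, family=socket.AF_INET) else "down")
print("IPv6:", "up" if reachable(host, family=socket.AF_INET6) else "down")
```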
- 日期 - 06/08/2023 16:22 - 06/08/2023 16:55
- 最後更新 - 06/08/2023 17:15
- 優先級 - 中
- 影響範圍 其他 - NLCP01 to NLCP08
-
For continued security and stability, we'll perform necessary software upgrades that require a system reboot. In addition, to further increase performance and capacity, we'll upgrade the systems with additional memory and storage space.
This means we will take each server offline for up to 30 minutes (the expected time being much shorter in reality). We'll perform the upgrade between 3.30am and 7.30am, one server at a time; if we do not manage to upgrade all 8 servers in one go, we'll schedule another timeframe for the remaining servers shortly after.
The software maintenance is required to ensure the continued security of our platform. We strive to keep the downtime as low as possible.
Update Aug4 1:23am: We'll start the maintenance at 3.30am
Update Aug4 3:57am: We'll start with nlcp08
Update Aug4 4.23am: nlcp08 up with 10 minutes of downtime. Proceeding with nlcp07 in a couple of minutes.
Update Aug4 4.36am: nlcp07 up with 9 minutes of downtime. Proceeding with nlcp06 in a couple of minutes.
Update Aug4 4.52am: nlcp06 up with 11 minutes of downtime. Proceeding with nlcp05 in a couple of minutes.
Update Aug4 5.05am: nlcp05 up with 11 minutes of downtime. We'll take a short break and continue with the last 4 shortly.
Update Aug4 5.23am: Proceeding with nlcp04
Update Aug4 5.36am: nlcp04 up with 12 minutes of downtime. Proceeding with nlcp03 in a couple of minutes.
Update Aug4 5.54am: nlcp03 up with 8 minutes of downtime. Proceeding with nlcp02 in a couple of minutes.
Update Aug4 6.11am: nlcp02 up with 8 minutes of downtime. Proceeding with nlcp01 in a couple of minutes.
Update Aug4 6.25am: nlcp01 up with 10 minutes of downtime. Maintenance complete.
- 日期 - 03/08/2023 03:30 - 03/08/2023 07:30
- 最後更新 - 04/08/2023 06:26
- 優先級 - 中
- 影響範圍 系統 - backup
-
We're running with a slightly reduced backup rotation while moving some systems around. Due to a fault in the backup system, we've had to shift backup traffic to a different system while we rebuild the main backup system. This means a limited number of days is available (still more than 7 days of data).
- 日期 - 17/06/2023 18:15 - 17/07/2023 00:00
- 最後更新 - 04/08/2023 01:26
- 優先級 - 重大
- 影響範圍 伺服器 - NLCP01
-
Between 03:45 and 03:55, we experienced 10 minutes of downtime on nlcp01.
Based on our investigation, the cause of the crash appears to be a kernel lockup, which resulted in the system freezing.
To resolve the issue, a hardware reset was performed and the system came back online as expected.
We have planned maintenance for later this month to upgrade the kernel, among other software, which should likely resolve these issues.
We're sorry for the inconvenience caused by this.
- 日期 - 09/07/2023 03:45 - 09/07/2023 03:55
- 最後更新 - 09/07/2023 04:03
- 優先級 - 低
- 影響範圍 其他 - Backup - jetbackup 5 - nlcp05/nlcp06
-
Customers on nlcp05 and nlcp06 might experience issues after restoring a backup via JetBackup 5.
The issues have been reported to JetApps (the developers of JetBackup), and we're waiting for a release that fixes them.
404 page after restore of subdomains or addon domains
JetBackup restores incorrect folder permissions for the "document root" of the domain.
The permissions are supposed to be "0755" but are restored as "0750" - this can be corrected after restoration in File Manager or via FTP by selecting the domain's folder and changing the permissions.
User should have read/write/execute permissions.
Group and World/Everyone should have read/execute permissions.
You can also create a ticket for us to correct it.
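For reference, a minimal sketch of the permission fix described above (the path is a placeholder, not a real account):

```python
# Reset a restored document root to 0755 (user rwx, group r-x, world r-x),
# matching the expected permissions listed above. The path is an example only.
import os
import stat

def fix_docroot_permissions(docroot):
    os.chmod(docroot, 0o755)
    print(f"{docroot}: now {stat.filemode(os.stat(docroot).st_mode)}")

# Hypothetical usage:
# fix_docroot_permissions("/home/exampleuser/public_html/example.com")
```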
Database restoration requires user restore
When restoring a database, JetBackup does not restore the user permissions for that database.
You either have to restore the corresponding DB user as part of the restore as well (this ensures that the "grants" are restored as they should be).
Alternatively, you can go to "MySQL® Databases" in cPanel, go to the "Add User to Database" section, select both the user and the database, click "Add" and assign all privileges to the database. This will restore the database permissions.
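The SQL equivalent of those cPanel steps looks roughly like the sketch below (it assumes the PyMySQL package and uses placeholder names and credentials; it is not a tool we provide):

```python
# Re-grant all privileges on a restored database to its (already existing) user,
# mirroring cPanel's "Add User to Database" with all privileges.
import pymysql  # assumed to be installed; run with MySQL admin credentials

def regrant_all(db_name, db_user, host="localhost"):
    conn = pymysql.connect(host="localhost", user="root",
                           password="CHANGE_ME", autocommit=True)
    try:
        with conn.cursor() as cur:
            cur.execute(
                f"GRANT ALL PRIVILEGES ON `{db_name}`.* TO %s@%s",
                (db_user, host),
            )
            cur.execute("FLUSH PRIVILEGES")
    finally:
        conn.close()

# Hypothetical usage:
# regrant_all("exampleuser_wp", "exampleuser_dbuser")
```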
- 日期 - 26/05/2021 10:44
- 最後更新 - 02/07/2023 17:00
- 優先級 - 高
-
Due to the failure of a top-of-rack switch, our backup system is currently unavailable. We're working on restoring connectivity as soon as possible.
Once we can access the system again, it will be configured with redundant uplinks.
Update 20:29: The system is available again
- 日期 - 01/02/2023 17:13 - 01/02/2023 20:29
- 最後更新 - 02/02/2023 01:16
- 優先級 - 高
- 影響範圍 系統 - de-mail01
-
We have a DIMM in our de-mail01 system that is producing correctable ECC errors. Since these usually turn into uncorrectable errors over time, we're proactively replacing the server's memory.
Since memory can't be replaced at runtime, we'll have to power off the system at 11pm this evening so the datacenter can swap the memory (this should be a relatively short process). We do, however, expect anywhere from 10 to 30 minutes of downtime before all services return to normal.
In the meantime, our inbound mail system will hold onto emails for the known email accounts on this server.
Access to email accounts located on the server (hosting-panel.net related accounts) will be unavailable during this time.
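For context, correctable ECC error counts are exposed by the Linux EDAC subsystem, which is the kind of counter that flags a failing DIMM. A minimal sketch (standard EDAC sysfs paths; treating any non-zero count as a warning is an assumption):

```python
# Read per-memory-controller correctable (ce_count) and uncorrectable (ue_count)
# error counters from the EDAC sysfs interface and warn when they are non-zero.
import glob
import os

def edac_counts():
    counts = {}
    for path in glob.glob("/sys/devices/system/edac/mc/mc*/ce_count"):
        mc = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            ce = int(f.read().strip())
        with open(os.path.join(os.path.dirname(path), "ue_count")) as f:
            ue = int(f.read().strip())
        counts[mc] = (ce, ue)
    return counts

for mc, (ce, ue) in edac_counts().items():
    status = "WARNING" if ce or ue else "ok"
    print(f"{mc}: correctable={ce} uncorrectable={ue} [{status}]")
```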
Update 10.47pm: We'll shut down the system in a couple of minutes.
Update 11.33pm: The system is back online
- 日期 - 13/01/2023 23:00 - 13/01/2023 23:59
- 最後更新 - 13/01/2023 23:37
- 優先級 - 低
- 影響範圍 系統 - nlcp01-nlcp04
-
Over the coming weeks we're slowly migrating the servers nlcp01, nlcp02, nlcp03 and nlcp04 from JetBackup 4 to JetBackup 5. Since the two versions are incompatible at the storage level, we're keeping both systems "live" for 28 days per server.
We've started the migration of nlcp01, nlcp02 and nlcp03 - you'll find both JetBackup 4 and JetBackup 5 in cPanel. Over time more and more recovery points will be available in JetBackup 5, and backups in JetBackup 4 will be rotated out.
- 日期 - 09/10/2021 17:43 - 09/11/2021 17:43
- 最後更新 - 18/11/2022 01:38
- 優先級 - 重大
- 影響範圍 伺服器 - NLCP02
-
Between 8.51am and 9.48am Amsterdam time today, we experienced a total of 12 minutes of downtime on nlcp02; some sites may have been affected for up to 20 minutes in total.
This event was caused by two issues:
- MySQL was reported down, which required us to forcefully kill the database process
- Subsequently the webserver locked up, resulting in 503 Service Unavailable errors being returned even after MySQL itself had recovered
We were unable to diagnose the actual MySQL issue since we could not query the server for any information. Attempting to stop it gracefully did nothing, so we eventually had to kill it forcefully. This, however, resulted in two databases each ending up with one table in an unhealthy state.
After MySQL was brought back online, we continued to experience roughly 5 minutes of issues with the webserver returning "503 Service Unavailable" errors. The cause was a request backlog that had to be processed, so some sites recovered immediately while others were down slightly longer, depending on how quickly the queue was cleared.
The two affected databases needed their tables repaired, which was successful and without any corruption or data loss.
- 日期 - 17/11/2022 08:51 - 17/11/2022 09:48
- 最後更新 - 17/11/2022 11:19
- 優先級 - 重大
- 影響範圍 系統 - NLSH01
-
We're investigating an issue with the networking on nlsh01, which might, in the worst case, require a reboot of the system.
We're sorry for the inconvenience.
- 日期 - 22/03/2022 09:22
- 最後更新 - 20/04/2022 21:47
- 優先級 - 低
- 影響範圍 其他 - Migrations
-
Between November 15th at 00:00 and December 17th at 00:00, we won't perform any incoming migrations, neither free nor paid.
All migrations that involve us will be planned either before or after this period.
- 日期 - 15/11/2021 00:00 - 17/12/2021 00:00
- 最後更新 - 11/02/2022 13:18
- 優先級 - 中
- 影響範圍 系統 - nlcp01 - nlcp06
-
Between 6 am and 8 am on August 25th, WorldStream, the data center we use for all shared hosting servers, will perform network maintenance which requires updating the core routers and one distribution router to a new software release.
During this timeframe, there may be a short moment (a matter of seconds) when traffic switches from one router to another. This can occur multiple times during maintenance.
The network is fully redundant, and our systems are connected to two different routers, so the impact should remain minimal. It does, however, put uptime at higher risk for the duration of the maintenance due to the temporary lack of redundancy.
The maintenance page at WorldStream will be kept up to date as well: https://noc.worldstream.nl/
- 日期 - 25/08/2021 06:00 - 25/08/2021 08:00
- 最後更新 - 14/09/2021 10:31
- 優先級 - 高
- 影響範圍 系統 - nlcp01 - nlcp06
-
Between 00:01 and 00:03 we experienced a network drop affecting all servers in the WorldStream datacenter, due to a connectivity issue between WorldStream and Nikhef (a core network location in the Netherlands) - the drop lasted less than a minute.
- 日期 - 24/08/2021 00:01 - 24/08/2021 00:03
- 最後更新 - 24/08/2021 11:23
- 優先級 - 中
- 影響範圍 系統 - nlcp01 - nlcp06
-
Between 6 am and 8 am on August 4th, WorldStream, the data center we use for all shared hosting servers, will perform network maintenance which requires updating the core routers and one distribution router to a new software release.
During this timeframe, there may be a short moment (a matter of seconds) when traffic switches from one router to another. This can occur multiple times during maintenance.
The network is fully redundant, and our systems are connected to two different routers, so the impact should remain minimal. It does, however, put uptime at higher risk for the duration of the maintenance due to the temporary lack of redundancy.
The maintenance page at WorldStream will be kept up to date as well: https://noc.worldstream.nl/
- 日期 - 04/08/2021 06:00 - 04/08/2021 08:00
- 最後更新 - 18/08/2021 14:46
- 優先級 - 高
- 影響範圍 系統 - All servers
-
On July 29th at 10.30 pm, we will perform a kernel update of all servers. This update is considered urgent due to the vulnerabilities CVE-2021-22555 and CVE-2021-33909.
We do expect up to 10-15 minutes of downtime per server.
The kernel is scheduled for release on July 27, so we're making room for a slight delay in the update being made available.
Update 29/07: The kernel update for servers nlcp01, nlcp02, nlcp03, and nlcp04 has been postponed until August 4th due to a discovered bug in the el7h kernel from CloudLinux. They're releasing a fix for this today; however, they only expect the full rollout of the kernel on August 3rd.
Servers nlcp05 and nlcp06 will still be updated today, since these rely on the el8 kernel which does not have this bug.
Update 29/07 10.29pm: We're starting with nlcp06
Update 29/07 10.36pm: We've completed nlcp06, starting nlcp05 shortly
Update 29/07 10.46pm: We've completed nlcp05.
Downtime for each server was about 5 minutes. nlcp01 to nlcp04 will be rebooted next week.
Update 04/08 10.23pm: We're starting with nlcp04 shortly.
Update 04/08 10.37pm: nlcp04 done, we're proceeding with nlcp03 shortly.
Update 04/08 10.47pm: nlcp03 done, we'll proceed with nlcp02 shortly.
Update 04/08 10.53pm: nlcp02 done, we'll proceed with nlcp01 shortly.
Update 04/08 10.59pm: nlcp01 done - all servers had a downtime of 3-4 minutes. Maintenance completed.
- 日期 - 29/07/2021 22:30 - 05/08/2021 03:00
- 最後更新 - 04/08/2021 22:59
- 優先級 - 中
- 影響範圍 其他 - DC1 - Dronten - internal + managed VMs
-
Between 4 AM and 6 AM on the 27th of July, network maintenance will be performed in the Dronten datacenter, which hosts some internal systems as well as some managed customer VMs.
This is a part of emergency maintenance work to resolve stability issues with the network.
A total downtime of up to 1 hour can be expected during this timeframe.
- 日期 - 27/07/2021 04:00 - 27/07/2021 06:00
- 最後更新 - 29/07/2021 10:30
- 優先級 - 重大
- 影響範圍 系統 - nlcp01, nlcp02, nlcp05
-
At 10:15 we saw nlcp01 and nlcp02 becoming unavailable, and nlcp05 became unavailable at 10:19.
The systems were rebooted, and the last one came back online at 10:24.
The crash was caused by KernelCare live patching, which triggered a so-called kernel panic.
We've disabled automatic patching for the time being and have informed KernelCare about the issues.
KernelCare has disabled the patch-set and estimates a fixed patch in the coming week.
- 日期 - 22/07/2021 10:15 - 22/07/2021 10:24
- 最後更新 - 22/07/2021 13:29
- 優先級 - 中
- 影響範圍 系統 - NLCP05,NLCP06
-
We're temporarily decreasing the backup schedule from every 6 hours to once per day.
Due to a bug in how backups are linked together to produce incremental backups, this currently doesn't function correctly.
The result is that a full backup is taken on every run, using 700GB of disk space every 6 hours. That doesn't scale; we'd run out of disk space on our storage server fairly quickly. We are, however, going to configure a second backup job to back up the databases every 6 hours for the time being.
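For context, the rough arithmetic behind that decision (the backup size is the figure quoted above; the retention period is an assumption used only for illustration):

```python
# Storage needed if every 6-hour run produces a full backup instead of an increment.
full_backup_gb = 700          # size of one full backup, as quoted above
runs_per_day = 24 // 6        # one run every 6 hours
retention_days = 7            # assumed retention for illustration

total_tb = full_backup_gb * runs_per_day * retention_days / 1000
print(f"~{total_tb:.1f} TB needed for {retention_days} days")  # ~19.6 TB
```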
As of 2.30pm, we've enabled full rotation again.
- 日期 - 04/05/2021 15:33 - 05/05/2021 14:30
- 最後更新 - 05/05/2021 15:59
- 優先級 - 重大
- 影響範圍 系統 - DNS
-
Between 15:59 and 16:02 we experienced complete unavailability of our DNS infrastructure due to an attack on a particular domain.
We've since implemented some additional measures to try to mitigate it in the future. We're still investigating why the attack happened in the first place.
We're sorry about the inconvenience caused by the outage.
- 日期 - 16/04/2021 15:59 - 16/04/2021 16:02
- 最後更新 - 16/04/2021 17:11
- 優先級 - 低
- 影響範圍 系統 - Backup server
-
Due to a recent change in our backup configuration that requires far more inodes than before, we need to reformat the backup system with another filesystem that allows a larger inode count than ext4 provides on the 20TB of storage we currently have.
As a result, as of today, we're performing backups to a different location than usual (two smaller physical servers). These servers are less powerful than the current system, so backups and restores may be slightly slower.
These two servers will be used as a temporary storage location for new snapshots.
Normally we store 28 days of recovery points - as a part of this backup maintenance, we'll lower the number to 14 days, so we can get back to the original storage system again within a decent timeframe (14 days instead of 28 days).
If you do wish to keep a few additional snapshots for longer than this period, then you'll have to download these snapshots via cPanel within 14 days.
This maintenance is strictly required since we're nearing the inode limit of the current filesystem.
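For reference, inode headroom can be watched in much the same way as disk space. A minimal sketch (the mount point and warning threshold are assumptions, not our production values):

```python
# Report inode usage for a filesystem and flag it when usage crosses a threshold.
import os

def inode_usage(mountpoint="/backup", warn_at=0.90):
    st = os.statvfs(mountpoint)
    used = st.f_files - st.f_ffree
    ratio = used / st.f_files if st.f_files else 0.0
    print(f"{mountpoint}: {used}/{st.f_files} inodes used ({ratio:.1%})")
    return ratio >= warn_at

if __name__ == "__main__":
    if inode_usage():
        print("WARNING: inode usage above threshold")
```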
- 日期 - 12/09/2020 12:09 - 15/01/2021 15:08
- 最後更新 - 15/01/2021 15:08
- 優先級 - 中
- 影響範圍 系統 - Acronis Backup
-
We're investigating issues with Acronis recovery points being unavailable on some systems.
We're able to do restorations of files (not databases) via the Acronis Console directly - we've upped the JetBackup storage time from 7 days to 28 days, and enabled 6-hour snapshots in JetBackup as well, in the meantime.
While Acronis continues to perform backups, we're not able to restore them directly from cPanel, and only files (not databases) can be restored this way. Databases can be restored in a disaster-recovery scenario, but it's not an easy task.
Please use JetBackup for restorations for the time being, while we recover the functionality in cPanel.
nlcp01 to nlcp10 and server9 are unaffected since they're using JetBackup by default as the main backup functionality.
ETA for resolution is currently unknown.
Update 7.18pm: After investigation together with the Acronis support and a bunch of debugging, the result so far is that some of the disksafes are corrupted after an attempted repair.
As a result, new disksafes have been created and are backing up again; however, the recovery points prior to today are lost. On all servers where the disksafes have been deleted, JetBackup has been backing up cleanly, so recovery is still possible, although with a shorter history than usual.
We're trying to get the disksafe for server16 to work properly, since in this particular case we only use Acronis backup. JetBackup has now been enabled on this server, but since there are only backups from today, recovery further back than that is currently not possible.
Update 12:07am: The server16 disksafe is corrupted, so recovery from it is not possible; we're still checking server15 and server8.
Update 8.24am: The disksafes for server15 and server8 are only functional directly within the Acronis console, so database restores are only available within the current JetBackup rotation.
Additionally, we've disabled new Acronis backups on server16, so only the old ones remain available; JetBackup is used as the primary backup source going forward.
We're closing the case, since backups are functional despite some missing history on multiple servers - disaster recoveries are possible, and backups are available.
nlcp01 to nlcp10 will only use JetBackup; despite being a bit more resource-heavy, it provides the most reliable restoration and storage capabilities.
- 日期 - 24/08/2020 08:00 - 26/08/2020 08:30
- 最後更新 - 28/08/2020 10:56
- 優先級 - 重大
- 影響範圍 伺服器 - FRA16
-
Server16 is currently unavailable. Status can be followed at https://twitter.com/Hosting4Real/status/1285506453604900864
Update 13:06 - the system is back online
- 日期 - 21/07/2020 11:25 - 21/07/2020 13:06
- 最後更新 - 21/07/2020 13:10
- 優先級 - 高
- 影響範圍 系統 - Spam Filter
-
Earlier today we received notifications from some customers that their own customers had informed them about DNS lookup failures.
Initially, the error message indicated a failure in DNS resolution for the domains themselves. Given the way our DNS servers are configured (they're located in 4 different data centers, with 4 different providers in 3 different countries), it's unlikely this was caused by our DNS servers being unavailable, which initially pointed towards a possible resolver issue at Microsoft.
Upon further investigation, we saw that the resolution error was not related to the domains themselves, but to the DNS of the mail exchangers (the NDR report from Microsoft didn't actually indicate this).
Testing delivery ourselves, we saw that connections towards mx01.hosting4real.net and mx02.hosting4real.net would hang, and DNS lookups for these two MX entries resulted in timeouts.
Further testing from multiple Microsoft Azure locations showed that the DNS provider (Zilore) we use for the domain hosting4real.net was not reachable from within Microsoft's network. The Zilore NOC team was informed of the findings; it turned out that only DNS queries routed to Zilore's South African datacenter had issues (Microsoft's queries were, for some reason, routed to South Africa).
While Zilore worked on the DNS issues, we decided to add a secondary MX domain to our spam filter solution and started updating DNS entries for the customers affected by this.
We've thus added mx01.h4r.eu and mx02.h4r.eu as DNS names for our spam filter servers; the h4r.eu domain uses a different set of DNS servers for resolution (our standard ones).
Overall this should improve availability even further, since we now have a secondary DNS provider available for the spam filtering as well.
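A resolver-level check like the sketch below is how such a location-specific failure can be reproduced (it assumes the dnspython package; the resolver IPs listed are public examples, not the resolvers Microsoft uses):

```python
# Resolve the MX hostnames through several resolvers to spot resolver-specific
# failures such as the one described above.
import dns.exception
import dns.resolver  # pip install dnspython

RESOLVERS = {"Cloudflare": "1.1.1.1", "Google": "8.8.8.8"}
MX_HOSTS = ["mx01.hosting4real.net", "mx02.hosting4real.net"]

def check(host):
    for name, ip in RESOLVERS.items():
        resolver = dns.resolver.Resolver()
        resolver.nameservers = [ip]
        resolver.lifetime = 5
        try:
            answers = resolver.resolve(host, "A")
            print(f"{name}: {host} -> {[a.to_text() for a in answers]}")
        except dns.exception.DNSException as exc:
            print(f"{name}: {host} FAILED ({exc})")

for mx in MX_HOSTS:
    check(mx)
```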
We're sorry about the inconvenience caused by this. The vast majority of the emails will still be delivered, since Microsoft retries; however, if the maximum retry limit has been reached, delivery will fail and the sender will need to send the email again.
- 日期 - 23/06/2020 12:40 - 23/06/2020 15:45
- 最後更新 - 23/06/2020 16:57
- 優先級 - 重大
- 影響範圍 其他 - RBX Datacenters
-
We're experiencing issues in Roubaix datacenters for server7, server8, server13, and server15 - we're investigating.
Update 5:34pm: The outage is caused by a network outage at the datacenters of Roubaix.
Update 5:43pm: The network seems to have returned to normal. We're still waiting for a reason for the outage to be provided by the datacenter. We continue to monitor the recovery of traffic to the affected servers.
Update 10.20pm: The network outage was caused by a router crash, the data center provider is investigating together with the network vendor to see what caused the issue. In the meantime parts of the router have been isolated and certain links (800gbps in total) have been reenabled to increase the capacity further towards Amsterdam and Frankfurt.
When the full investigation has been completed and this is announced, we'll update the post.
Update 11.10pm: As of 10.56pm the capacity has been increased to 2100gbps for the router.
RFO:
The root cause of the outage was a hardware failure of a daughter card (linecard) in the RBX-D1-A75 router, occurring at the RAM parity level. This linecard originally raised alerts on the 25th of March, which the manufacturer confirmed on the 27th of March were not critical; the advice was to simply reboot the card during the next maintenance.
On Monday the 30th of March the errors appeared again on the same card, leading to corruption of the software and preventing isolation of the card. The failed card propagated the corruption within the router, making it unstable and causing it to crash.
This means that 50% of the traffic passing through the Roubaix backbone was affected, since it passed through this router pair.
As an improvement, the provider is working on creating isolated availability zones to reduce the impact even further should this happen again.
Additionally, regular redundancy tests will be performed at the backbone level.
- 日期 - 30/03/2020 17:14 - 30/03/2020 17:43
- 最後更新 - 26/04/2020 20:12
- 優先級 - 低
- 影響範圍 系統 - Payment gateway
-
After the switch to Stripe, we saw that Nets, the Danish card issuer, started rejecting payments for our merchant ID. While Nets and Stripe work on this, we've switched back to Braintree Payments to allow customers to pay by card again.
- 日期 - 19/09/2019 09:31
- 最後更新 - 15/12/2019 21:45
- 優先級 - 低
- 影響範圍 系統 - Backup system
-
On Monday we'll reinstall our backup system due to an unrecoverable issue with the disk array; we have to destroy the array and create it again (with new disks in it).
In the meantime, we've enabled a secondary backup server to take over backups for the time being until the original system is back up and running.
This also means we lose some backup history: we'll have backups from the beginning of October (until the 7th or 8th of October), as well as from today, the 25th, onwards.
Restoring backups from the 8th-9th until the 24th will not be possible, since that data will be gone.
- 日期 - 28/10/2019 08:00
- 最後更新 - 15/12/2019 21:45
- 優先級 - 重大
- 影響範圍 伺服器 - RBX7
-
We're experiencing issues with server7.
At 00.18 the system rebooted; the reason is currently unknown.
The system is copying a large file from /tmp to /usr/tmp.secure which is blocking services from starting.
We're sorry for the inconvenience caused by this.
Update: The server is online as of 00.57. We'll continue to investigate the root cause.
Update: After investigating the issue, we believe it was caused by a combination of a MySQL lock held by our backup software and a system package being updated at the same time (a package which also affects the backup software).
We've gone through our infrastructure to ensure these two tasks don't run close to each other, and we've added additional logging to MySQL so that, should it happen again, we can see where the lock originates.
- 日期 - 03/10/2019 00:18 - 03/10/2019 00:57
- 最後更新 - 05/10/2019 17:27
- 優先級 - 低
-
We're in the process of changing payment gateway from Braintree Payments to Stripe.
We've tested the integration and we can confirm that it works as expected.
As a part of our maintenance upgrading our billing system software to the new major version, we'll do the change of the gateway in the same maintenance window.
We'll perform our maintenance Monday during the day. We expect the maintenance to last roughly 2 hours.
Update 07:09: We'll start the maintenance
Update 07.36: Maintenance has been completed and we have confirmed that our gateways can process payments as expected.
- 日期 - 01/09/2019 20:20 - 02/09/2019 07:37
- 最後更新 - 02/09/2019 07:37
- 優先級 - 重大
- 影響範圍 伺服器 - GRA10
-
We're currently experiencing issues with the availability of server10 - the datacenter is aware and working on a solution.
Update 25/07/2019 3.19pm: Since it's affecting the whole rack, we expect it to be the top-of-rack switches (both public and IPMI) that have shut down due to the heatwave currently hitting France.
Update 25/07/2019 3.31pm: The network has returned and the server is reachable again. We continue to monitor the situation.
Update 26/07/2019 07:14am: We've received the root cause of the problem as of 06.47am this morning.
Yesterday's issue was the result of a misconfiguration on some of the switch equipment within the data center: the maximum temperature threshold of the switches was simply set too low. This has been corrected for the switches that caused downtime, and over the coming days the DC provider will ensure the settings are consistent across their 25 data centers and thousands of switches.
OVH, the provider we use for our main operations such as web hosting, runs quite a unique setup when it comes to cooling data centers. Typical data centers are cooled by HVAC (heating, ventilation, and air conditioning); some use a mix of HVAC and free cooling, where outside air is used for cooling when it is cold enough. Some data centers use water cooling in their racks with an indirect method, where a loop in the rack chills the air that then provides cold air for the servers.
OVH does direct water cooling on their systems, meaning every server has its own water loop (connected to a bigger loop). Additionally, they have two circuits for this operation; the remaining cooling in the data centers is done on a per-room basis, with one water circuit cooling the air in the rooms.
Another loop will be added to the indirect air cooling system in every room, effectively doubling the capacity of the cooling system, and thus further lowering the temperature in the rooms.
In data centers located in cities where high temperatures and high humidity are a possibility, additional air cooling systems are installed to cope with the heat. Whether this is expanded to other cities will be decided on a site-by-site basis.
In early 2020, OVH will work on a new proof of concept to further improve the cooling capacity in their data centers.
- 日期 - 25/07/2019 14:57 - 25/07/2019 15:31
- 最後更新 - 26/07/2019 07:27
- 優先級 - 低
- 影響範圍 系統 - server12,server14
-
At 4.52pm we received an alert about server14 being unavailable - however the server continued to receive traffic.
After investigating, we saw that the monitoring system had been graylisted by our Imunify360 web application firewall, causing our check to fail and the server to be marked as down.
At 4.59pm server12 alerted about downtime.
At 5pm the servers were marked as "online" again, after we implemented a fix.
The fix has been implemented on all servers to avoid these false positives in the future.
- 日期 - 14/06/2019 16:52 - 14/06/2019 17:00
- 最後更新 - 14/06/2019 17:24
- 優先級 - 中
- 影響範圍 伺服器 - GRA14
-
We'll have to reboot server14 to fix a kernel bug - expected downtime will be roughly 5 minutes.
update 10.01pm: We'll reboot the server in a minute.
update 10.11pm: Server has been rebooted, total downtime being 5 minutes and 20 seconds.
- 日期 - 09/06/2019 22:00 - 09/06/2019 22:11
- 最後更新 - 09/06/2019 22:11
- 優先級 - 高
- 影響範圍 其他 - All servers
-
A recent vulnerability (Zombieload) in Intel CPUs requires that we reboot all systems to install microcode updates for the CPU.
We expect somewhere between 5 and 10 minutes of downtime per server.
In rare cases there can be a boot problem, which will be resolved as quickly as possible, but the risk is there.
This update comes on short notice, but due to the severity of the vulnerability, it cannot wait.
We're sorry about the inconvenience caused by this.
Update May 16: We'll be able to patch Zombieload without the need for reboots thanks to KernelCare. The patch is expected to arrive on Friday.
Update May 17: We have to reboot server7 to server12 this evening due to the CPU version used in those servers. We'll do one server at a time, starting with server7 at 8pm.
We're sorry for the inconvenience caused by this - however, the security of the systems is our number one priority.
We'll also have to migrate a few customers in the coming weeks to rebalance the CPU usage - those customers will be contacted.
Update May 17 8.05pm: We're rebooting server7
8.14pm: server7 done, proceeding with server8
8.27pm: server8 done, proceeding with server9
8.37pm: server9 done, proceeding with server10
8.52pm: server10 done, proceeding with server12
9.04pm: server12 done, proceeding with server11
9.16pm: server11 done
- 日期 - 17/05/2019 20:00 - 17/05/2019 21:16
- 最後更新 - 17/05/2019 21:16
- 優先級 - 低
- 影響範圍 系統 - ElasticSearch
-
Our statistics from ElasticSearch may be wrong due to a failure in data allocation. We've corrected the error on the cluster but had to throw some log data away; because it's only temporary data, we won't refill it into the cluster and will simply let it recover over the next 7 days.
- 日期 - 21/10/2018 22:05 - 21/10/2018 22:05
- 最後更新 - 21/10/2018 22:06
- 優先級 - 重大
- 影響範圍 伺服器 - GRA5
-
We have a failing disk in server5 (GRA5), and we have to replace the disk during the evening.
There will be downtime involved in the replacement.
We're scheduling the replacement somewhere around 10 pm and the disk will be replaced shortly after or during the night.
We're expecting the downtime to be roughly 30 minutes or less.
After the replacement, we'll rebuild the raid array.
We'll perform an additional dump of MySQL databases to our backup server prior to the replacement of the disk as a safety measure.
We're sorry for the inconvenience caused by this - but we need to ensure the availability of the raid array.
Update 20.15: We had a short lockup again, lasting for roughly 1 minute.
Update 21.00: We'll request a disk replacement in a few minutes.
Update 21.57: The server has been turned off, to get the disk replaced.
Update 22.09: The server is back online, services are stabilizing and raid rebuild is running
- 日期 - 20/09/2018 16:00 - 21/09/2018 13:00
- 最後更新 - 09/10/2018 13:02
- 優先級 - 低
- 影響範圍 系統 - Network
-
The data center will upgrade the top of rack switches in the Gravelines data center.
This will affect server10 (GRA10).
The maintenance will take place starting 11 pm the 18th of September and last until 6 am the 19th.
There will be a loss of network for up to 10 minutes.
The switch upgrades for server5 and server9 were completed during the night between September 12 and September 13.
- 日期 - 18/09/2018 23:00 - 19/09/2018 06:00
- 最後更新 - 19/09/2018 08:04
- 優先級 - 低
- 影響範圍 系統 - Network
-
The data center will upgrade the top of rack switches in the Roubaix data center.
This will affect server6 (RBX6), server7 (RBX7) and server8 (RBX8)
The maintenance will take place starting 11 pm the 13th of September and last until 6 am the 14th.
There will be a loss of network for up to 10 minutes.
Update 07.05: Maintenance completed as of 03.42 am.
- 日期 - 13/09/2018 23:00 - 14/09/2018 06:00
- 最後更新 - 14/09/2018 07:06
- 優先級 - 低
- 影響範圍 系統 - Network
-
The data center will upgrade the top of rack switches in the Roubaix data center.
This will affect server6 (RBX6)
The maintenance will take place starting 11 pm the 19th of September and last until 6 am the 20th.
There will be a loss of network for up to 10 minutes.
- 日期 - 19/09/2018 23:00 - 20/09/2018 06:00
- 最後更新 - 13/09/2018 08:00
- 優先級 - 低
- 影響範圍 系統 - Network
-
The data center is performing top-of-rack switch upgrades across the RBX and GRA data centers. This means the servers server5, server6, server7, server8, server9 and server10 might each be affected for up to 10 minutes at some point during the night.
We're sorry for the inconvenience caused by this.
Update 07.55 am:
The top-of-rack switches for server5 and server9 have been updated.
server10 is planned for the night between September 18 and September 19.
server6, server7 and server8 are planned for the night between September 13 and September 14.
- 日期 - 12/09/2018 23:00 - 19/09/2018 06:00
- 最後更新 - 13/09/2018 07:58
- 優先級 - 中
- 影響範圍 伺服器 - GRA4
-
We'll migrate this server to a MultiPHP setup to support future versions of PHP (7.0 and 7.1).
Currently the server runs something called "EasyApache 3" (provided by cPanel); we'll be upgrading to the new version, EasyApache 4, in our CloudLinux environment.
This also means that the PHP Selector will be deprecated, meaning that custom module support won't be available.
Since this means replacing the old PHP versions (previously compiled from source) with a new set based on yum, a short period of downtime is expected.
As with our other (new) servers, we're also switching from FastCGI to mod_lsapi, first of all to allow user.ini files and php_value settings, but more importantly because mod_lsapi isn't as buggy as FastCGI is known to be.
We've set a maintenance window of 2 hours; even though it shouldn't be needed, it should be sufficient in case any problems arise.
We're doing our best to keep the downtime as short as possible.
After this we'll be offering PHP version 5.6 (current version in use), 7.0 and 7.1.
We'll enable PHP 5.6 as the default on all sites after the upgrade.
We do advise upgrading to 7.0 in case your software supports it.
Update 9.01pm: We're starting the update in a few minutes.
Update 9.23pm: We've completed the maintenance, with a total downtime of 3-4 minutes while reinstalling the different PHP versions.
We're doing some small modifications which won't impact services.
- 日期 - 21/01/2017 21:00 - 21/01/2017 21:23
- 最後更新 - 14/08/2018 11:37
- 優先級 - 低
- 影響範圍 系統 - backup server
-
The backup server cdp03 will be moved to another data center; this means the server will be unavailable from 9.15 am on the 16th of July.
The server will come online again within 7 hours.
The server came online again at 11.49 this morning.
- 日期 - 16/07/2018 09:15 - 16/07/2018 11:49
- 最後更新 - 16/07/2018 17:22
- 優先級 - 重大
- 影響範圍 伺服器 - GRA5
-
09.45: Server5 is currently unavailable; we're investigating.
09.54: About 30 racks in the data center seem to be affected by this outage - we're waiting for an update from the data center.
09.59: gra1-sd4b-n9 is experiencing network issues and the data center is working on restoring connectivity - server5 (GRA5) routes traffic via this linecard, which is why it became unavailable.
10.03: Services are returning to normal - a total of 17 minutes downtime was experienced - the data center moved traffic to gra1-sd4a-n9
10.26: There's packet loss on the server, which can result in slower response times and possibly intermittent failing requests.
10.38: The high packet loss only affects the primary IP of the server; all customers are located on secondary IP addresses, meaning connectivity to websites still works.
Because the primary IP has packet loss, outgoing DNS resolution and email delivery might be temporarily unavailable until the loss returns to an acceptable level.
11.44: Connectivity to the primary IP has returned with 0% packet loss; email sending and delivery, DNS, etc. are once again working.
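As an aside, packet loss to a single address can be measured with a simple probe like the sketch below (it assumes Linux iputils ping; the IP is a documentation placeholder, not the server's real primary IP):

```python
# Send a series of single echo requests and report the percentage that got no reply.
import subprocess

def loss_percent(ip, count=20):
    lost = 0
    for _ in range(count):
        # One echo request with a 1-second timeout; a non-zero exit means no reply.
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "1", ip],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            lost += 1
    return 100.0 * lost / count

print(f"packet loss: {loss_percent('192.0.2.10'):.0f}%")
```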
15.53: The outage was caused by a software bug in the Cisco IOS version used on the routers; when the bug triggered, it caused all active sessions on the router to drop, killing traffic. The data center switched traffic to the standby router (gra1-sd4a-n9) so traffic could return, then upgraded router gra1-sd4b-n9 to a version that fixes the bug, switched traffic back to it, and performed the same update on gra1-sd4a-n9.
- 日期 - 16/07/2018 09:45 - 16/07/2018 11:44
- 最後更新 - 16/07/2018 15:58
- 優先級 - 低
- 影響範圍 伺服器 - GRA3
-
We'll migrate customers from server3a to new infrastructure.
16/06/2018 we'll migrate a batch of customers to server9 - starting at 8.45pm
17/06/2018 we'll migrate a batch of customers to server9 - starting at 8.45pm
18/06/2018 we'll migrate the remaining batch of customers to server10 - starting at 8.45pm
Update 16/06/2018 8.36pm: We'll start migration in about 10 minutes.
Update 16/06/2018 9.49pm: migration for today has been completed.
Update 17/06/2018 8.36pm: We'll start migration in about 10 minutes.
Update 17/06/2018 9.35pm: migration for today has been completed.
Update 18/06/2018 8.41pm: We'll start migration in about 5 minutes.
Update 18/06/2018 9.35pm: migration for today has been completed. This means all accounts have been migrated from server3a.
- 日期 - 16/06/2018 20:45 - 18/06/2018 23:59
- 最後更新 - 18/06/2018 21:35
- 優先級 - 低
- 影響範圍 其他 - cPanel
-
We're currently slower to accept new orders due to capacity constraints on the existing servers.
We're in the process of setting up new infrastructure to accommodate the orders.
We expect orders to be accepted again by the end of today (Saturday 2nd June)
Update: 9.24 pm - A new server has been put into production
- 日期 - 02/06/2018 09:41
- 最後更新 - 02/06/2018 21:25
- 優先級 - 中
-
We'll replace our backup server; this means we'll have to redo backups.
As a result older backups won't be restorable directly from cPanel, however, we can manually restore these if you create a ticket at support@hosting4real.net
16/04/2018:
We've started backups for all servers on the new backup system - we will keep the old backup server alive for another 14 days, after which we will start scrubbing the server for data.
03/04/2018:
We've decommissioned the old system.
- 日期 - 16/04/2018 12:45 - 03/05/2018 10:32
- 最後更新 - 03/05/2018 10:32
- 優先級 - 重大
- 影響範圍 系統 - All servers
-
Original:
Due to recently discovered security vulnerabilities in many x86 CPUs, we'll have to upgrade kernels across our infrastructure and reboot our systems.
We've already patched a few systems where the software update is available - we're waiting a bit with our hosting infrastructure until the kernel has gone to "production" and has been in production for roughly 48 hours, to ensure stability.
We'll reboot systems one by one during the evenings - we have no specific start date yet, but downtime is to be expected, hopefully only 5-10 minutes per server if no issues occur.
Servers might be down for longer depending on how the system behaves during the reboot, but we'll do everything we can to prevent reboot issues like the ones we recently had with server3a.
This post will be updated as we patch our webservers; other infrastructure gets patched in the background where there's no direct customer impact.
The patching does bring a slight performance degradation to the kernel. The actual degradation varies depending on the workload of the servers, so we're unsure what effect it will have for individual customers; it's something we will monitor post-patching.
Update 05/01/2018 5.23pm:
We'll update a few servers this evening. Two of the three vulnerabilities will be fixed by this update, so we'll have to perform another reboot of the servers during next week as well, when the remaining update is available.
We do try to keep downtime to an absolute minimum, but due to the impact these vulnerabilities have, we'd rather perform the additional reboot of our infrastructure to keep the systems secure.
We're sorry for the inconvenience caused by this.
Update 05/01/2018 6.25pm:
We'll do as many servers as possible this evening. If there are no surprises (e.g. non-bootable servers), everything should be patched fairly quickly. We start from the highest number towards the lowest, as follows:
server8.hosting4real.net
server7.hosting4real.net
server6.hosting4real.net
server5.hosting4real.net
server4.hosting4real.net
server3a.hosting4real.net
These 6 servers are the only ones that directly impact customers - for the same reason, these restarts are performed during the evening (after 10pm) to minimize the impact on visitors.
Other services such as the support system, mail relays, statistics and backups will be rebooted as well - we redirect traffic to other systems where possible.
Expected downtime per host should be roughly 5 minutes if the kernel upgrades go as planned; longer downtime can occur if a system enters a state where we have to recover it manually afterwards.
Update 05/01/2018 8.34pm:
server4.hosting4real.net will be postponed until tomorrow (06/01/2018) at the earliest, since the kernel from CloudLinux is still in "beta". Depending on the outcome, we'll decide to either perform the upgrade tomorrow or postpone it to Sunday.
For the other servers, we plan to start today at 10pm with server8, after which we'll proceed with server7 and so on.
Update 05/01/2018 9.57pm: We start with server8 in a few minutes.
Update 05/01/2018 10.07pm: Server8 done, with 4 minutes downtime - we proceed with server7.
Update 05/01/2018 10.15pm: Server7 done, with 3-4 minutes downtime - we proceed with server6.
Update 05/01/2018 10.39pm: Server6 done, with 9 minutes downtime (high php/apache load) - we have to redo server7 since the microcode didn't get applied.
Update 05/01/2018 11.00pm: Server5 done, with 3 minutes downtime - proceeding with server3a.
Update 05/01/2018 11.13pm: Server3a done with 5 minutes of downtime - we'll proceed with server4 tomorrow when the CloudLinux 6 patch should be available.
Update 05/01/2018 11.49pm: Server5 experienced an issue with MySQL. The issue was caused by the LVE mounts being mounted before the MySQL partition (/var/lib/mysql) was mounted as it should be. This left MySQL in a state where sites connecting via a socket (which most sites do) could not connect, while sites connecting via 127.0.0.1 could connect just fine.
The monitoring site we run on every server does not check that both TCP and socket connections towards MySQL are available; as a result, the monitoring system didn't see this error directly and thus didn't trigger an alarm.
We'll change our monitoring page to perform an additional check, connecting both via TCP and via socket - we expect this change to be completed by noon tomorrow.
We're sorry for the inconvenience caused by the extended downtime on server5.
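A minimal sketch of that kind of dual check is shown below (it assumes the PyMySQL package; the credentials and socket path are placeholders, and this is illustrative rather than our actual monitoring code):

```python
# Check MySQL availability over TCP and over the Unix socket separately, since a
# mount-ordering issue can break one path while the other keeps working.
import pymysql

def mysql_reachable(**conn_kwargs):
    try:
        conn = pymysql.connect(connect_timeout=5, **conn_kwargs)
        conn.close()
        return True
    except pymysql.err.MySQLError:
        return False

tcp_ok = mysql_reachable(host="127.0.0.1", user="monitor", password="CHANGE_ME")
sock_ok = mysql_reachable(unix_socket="/var/lib/mysql/mysql.sock",
                          user="monitor", password="CHANGE_ME")
print(f"TCP: {'up' if tcp_ok else 'down'}, socket: {'up' if sock_ok else 'down'}")
```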
Update 08/01/2018 8.47pm: We'll patch server4 today, starting at 10pm. We'll try to keep downtime as short as possible, however - the change required here is slightly more complicated which increases the risk.
We're still waiting for some microcode updates that we have to apply to all servers once they're available - we're hoping for them to arrive by the end of the week.
Update 08/01/2018 9.58pm: We'll start the update of server4 in a few minutes.
Update 08/01/2018 10.15pm: We're reverting the kernel to the old one, since the new kernel has issues with booting. Current status is loading up the rescue image to boot the old kernel.
Update 08/01/2018 11.02pm: Meanwhile we're trying to get server4 back online, we've initialized our backup procedure and started to restore accounts from the latest backup onto another server to ensure customers getting online as fast as possible.
Update 08/01/2018 11.22pm: We've restored about 10% of the accounts on a new server.
Update 09/01/2018 00.56am: Information about the outage of server4 can be found here: https://shop.hosting4real.net/serverstatus.php?view=resolved - with title "Outage of server4 (Resolved)"
Update 10/01/2018 06.31am: A new version of the microcode will soon be released to fix more vulnerabilities. When the version is ready, we'll update a single server (server3a) to verify that it enables the new features.
If the features are enabled, we'll upgrade the remaining servers (excluding server4) 24 hours later.
Update 16/01/2018 8.32am: We will perform a microcode update today on server3a to implement a fix for the CPU. This means we'll have to reboot the server, with an expected ~3-5 minutes of downtime. We will do the reboot after today's hosting account migrations, which start at 9pm, so the server3a update will happen around 9.30 or 10pm.
Update 16/01/2018 9.28pm: We'll reboot server3a.
Update 16/01/2018 9.33pm: Server has been rebooted - total downtime of 2 minutes.
- 日期 - 05/01/2018 07:00 - 19/01/2018 23:59
- 最後更新 - 16/04/2018 17:21
- 優先級 - 低
- 影響範圍 伺服器 - RBX8
-
We've blocked outgoing port 25/26 on RBX8 / server8 to prevent some spam sending.
Normal email flow won't be affected since it's going via another port towards our outgoing mail-relay.
If you're connecting to external SMTP servers - please use port 587.
- 日期 - 13/03/2018 10:59
- 最後更新 - 16/04/2018 17:21
- 優先級 - 高
- 影響範圍 系統 - Mail cluster
-
We experienced a mail routing issue for certain domains towards our mail platform, which caused some incoming email to bounce (soft or hard, depending on configuration).
This was caused by one of our mailservers rejecting email due to an issue, without the mailserver marking itself properly as down.
We've temporarily removed the mailserver while we investigate the issue, and we will develop a check for this mail-routing issue to prevent it in the future.
- 日期 - 28/01/2018 13:54 - 28/01/2018 14:55
- 最後更新 - 28/01/2018 15:30
- 優先級 - 低
- 影響範圍 伺服器 - GRA4
-
We'll migrate the remaining customers off server4 after the recent outage on the server.
We'll move the customers in batches:
Customers migrated on following days, will be migrated from server4 to server8:
15/01/2018
16/01/2018
17/01/2018
18/01/2018
Customers migrated on following days, will be migrated from server4 to server9:
22/01/2018
23/01/2018
24/01/2018
25/01/2018
26/01/2018
All migrations start at 9pm in the evening - accounts containing domains with external DNS will be migrated at specific times informed by emails.
Remaining customers will be migrated after 9pm and before midnight - each day.
Update 15/01/2018 9.00pm: We start the migration
Update 15/01/2018 9.37pm: Migration for today has been completed
Update 16/01/2018 9.00pm: We start today's migrations
Update 16/01/2018 9.24pm: Migrations for today have finished
Update 17/01/2018 9.00pm: We start today's migrations
Update 17/01/2018 10.04pm: Migrations for today have finished
Update 18/01/2018 9.00pm: We start today's migrations
Update 18/01/2018 9.19pm: Migrations for today have finished
Update 22/01/2018 9.00pm: We start today's migrations
Update 22/01/2018 9.45pm: Migrations for today have finished
Update 23/01/2018 9.00pm: We start today's migrations
Update 23/01/2018 9.34pm: Migrations for today have finished
Update 24/01/2018 9.00pm: We start today's migrations
Update 24/01/2018 9.27pm: Migrations for today have finished
Update 25/01/2018 9.00pm: We start today's migrations
Update 25/01/2018 9.32pm: Migrations for today have finished
Update 26/01/2018 9.00pm: We start today's migrations
Update 26/01/2018 9.02pm: Migrations for today have finished
We'll shut down server4 (GRA4) tomorrow (28-01-2018) at 2pm Europe/Amsterdam time and start wiping the disks on Monday morning.
- 日期 - 15/01/2018 21:00 - 26/02/2018 12:50
- 最後更新 - 27/01/2018 20:32
- 優先級 - 高
- 影響範圍 伺服器 - GRA4
-
Between 08/01/2018 10.04pm and 09/01/2018 00.13am we experienced a lengthy downtime on server4.
Below you'll find a detailed description of the outage and the actions we took during the downtime.
Background:
At the beginning of January it became publicly known that CPUs contain major vulnerabilities that put the security of all computer systems at risk. Hardware vendors and operating system vendors have been working around the clock to provide mitigations that prevent these security holes from being exploited.
The vulnerabilities are called "Meltdown" and "Spectre", if you want to read more about them.
As a service provider, we're committed to providing a secure hosting platform, which means we had to apply these patches as well.
Overall we've been able to mitigate the vulnerabilities on the majority of our platform, however we had a single machine (server4) which ran an old version of CloudLinux 6.
Fixing the vulnerabilities requires a few things: updating the kernel (the brain of the operating system) and something called "microcode". Microcode consists of small pieces of code that allow the CPU to talk to the hardware, and allow specific features or technologies to be enabled or disabled.
Both of these can be risky to upgrade, and there's always a chance of a system not booting correctly afterwards, which is usually why we use software such as KernelCare to avoid rebooting systems and thus mitigate downtime.
However, fixing these specific security vulnerabilities requires extensive changes to how the kernel of the operating system works, and patching a running system with such big changes can result in a lot of problems, such as crashing infrastructure, instability or serious bugs.
For the same reason, we - and many others - decided to do an actual reboot of our infrastructure, since that is generally the safer approach.
Today's events (08/01/2018):
10.00pm: We started upgrading the software packages on server4; more specifically, we had to upgrade the kernel packages of the system as well as install new microcode for the CPU.
10.04pm: We verify that the grub loader is present in the MBR of all 4 hard drives in the server.
10.09pm: We had an issue booting the server, so we did another restart and tried to manually boot from the drives one by one to see if the cause was a corrupt grub loader on one of the disks.
10.12pm: We decide to revert to the old version of the kernel that we started out with, basically cancelling the maintenance.
10.15pm: Booting into CloudLinux rescue mode takes longer than expected due to slow transfer speeds to the IPMI device on the server.
10.23pm: We started reverting to the old kernel; however, during the reinstallation, the specific kernel we wanted to revert to wasn't available on the install media.
10.32pm: We reinitialize the install media with the specific kernel available, to revert the system.
10.44pm: While the install media loads, we start preparing account restores to another server to get people back online faster.
10.50pm: The installer media contains a bug in the rescue image that prevents us from actually continuing the rollback of the kernel. We opened a ticket with CloudLinux - they're currently investigating the cause of the issue we saw.
10.54pm: We start restoring accounts from server4 to server8, prioritizing smaller accounts (disk-space wise) first to get the highest percentage online.
11.22pm: 10% of the accounts have been restored - we continue the work on server4.
11.42pm: 30% of the accounts have been restored.
11.58pm: 43% of the accounts have been restored.
12.13am: Server4 comes back online; we cancel all remaining account restorations, with 49% restored on server8.
The root cause of the system not booting was corruption of the grub stage1 and stage2 files during the upgrade. It was resolved by booting the rescue-mode image from the datacenter, manually regenerating the stage1 and stage2 files, and downloading the kernel files we originally used.
This allowed us to get the system back online afterwards.
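For those curious, regenerating the boot loader on each disk of the array is typically done along the lines below. This is a hedged sketch only - the device names are assumptions, and on a GRUB Legacy system such as CloudLinux 6 it would normally be run from a rescue environment with the real root filesystem mounted:

```python
import subprocess

# Member disks of the boot array - example names, adjust to the actual system
DISKS = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]

for disk in DISKS:
    # grub-install rewrites stage1 in the MBR of the disk and copies the
    # stage1.5/stage2 files into /boot/grub on GRUB Legacy systems
    subprocess.run(["grub-install", disk], check=True)
```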
Current status:
Currently about half of the accounts have been restored to server8; these accounts will continue to be located on this server, since both the hardware and software are newer as well.
We're in the process of checking for inconsistencies in the accounts that were migrated to the other server.
We'll also update the records in our billing system to reflect the server change, as well as email all customers moved to the new server about the new IPs.
Upcoming changes:
We'll leave the current server "as is", and we plan to migrate the remaining accounts to a new server in the following weeks, to ensure the security of our customers' data.
- 日期 - 08/01/2018 22:04 - 09/01/2018 00:13
- 最後更新 - 09/01/2018 00:55
- 優先級 - 低
- 影響範圍 系統 - Support / billing system
-
During the upcoming weekend (22-24 December), we'll migrate our support and billing system to new infrastructure.
We're doing this to consolidate a few systems, but also to move our support system out of our general network due to the recent outage that also affected our support system.
The migration will happen during daytime. There will be some unavailability of our support and billing system during the migration - we're expecting about 5-15 minutes of total downtime as a final step; we do this to ensure consistency of our data during the migration.
In the period just around moving the actual database of the system to the new database server, we will put the system into maintenance mode, stop any import of tickets into our system, and then enable it again shortly after the migration has been completed.
This also means that ticket import might be delayed for up to 15 minutes.
We do monitor our support email during the event, so if anything urgent comes up during the migration, we will see those emails.
We expect to complete the migration on the 23rd of December, with the possibility of postponing it until the 24th of December.
Our email support@hosting4real.net will continue to work during the whole process.
Update 23/12/2017 11.15: We start the migration
Update 23/12/2017 11:49: Migration has been completed - we're doing additional checks to ensure everything works.
Update 23/12/2017 12:49: We've verified that payments work, and that single sign-on directly to cPanel works as well. - 日期 - 23/12/2017 07:00 - 24/12/2017 17:00
- 最後更新 - 05/01/2018 07:05
- 優先級 - 低
- 影響範圍 系統 - CDN Dashboard / management infra
-
We'll be migrating our CDN dashboard and management infrastructure to a new setup.
The migration will be made to speed up the dashboard and to simplify scaling the backend system in the future.
During the migration we'll disable the old system completely to ensure integrity of data and prevent duplicate systems from doing updates.
CDN traffic won't be affected by this migration; however, the dashboard, API and purging functionality won't be available for a large part of the migration.
Statistics will be delayed and reprocessed afterwards to also avoid any duplicate or missing data.
[Update 15/12/2017 20:05]: We start the migration
[Update 15/12/2017 21:12]: Migration has been completed and DNS switched to the new infrastructure. - 日期 - 15/12/2017 20:00 - 15/12/2017 21:12
- 最後更新 - 15/12/2017 21:13
- 優先級 - 低
- 影響範圍 伺服器 - GRA4
-
From December 12 at 10pm until 6am on the 13th of December, the data center we use will perform a network equipment upgrade on the switch which server4 (GRA4) uses. The expected downtime for this maintenance is about 1 minute (moving the ethernet cable from one port to another), and since it happens during the night it shouldn't impact customer traffic too much.
The upgrade of equipment is to support growth and network offerings at the datacenter.
At 22.59 the system went offline (confirmed by monitoring)
At 23.01 the system came back online and most services returned to normal
A few IP addresses still didn't "ping" as they should, so we issued manual pings from both ends to ensure the ARP entries were renewed on the routers; all services were confirmed working at 23.05.
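The manual checks were simple reachability probes from both sides; a small sketch of that kind of loop (the addresses are placeholders):

```python
import subprocess

# Placeholder addresses - in practice the IPs bound to the affected server
ADDRESSES = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]

for ip in ADDRESSES:
    # A few echo requests are enough to repopulate ARP/neighbour tables upstream
    result = subprocess.run(["ping", "-c", "3", "-W", "2", ip],
                            capture_output=True, text=True)
    print(f"{ip}: {'OK' if result.returncode == 0 else 'NO REPLY'}")
```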
We're closing the maintenance as of 23.11 - 日期 - 12/12/2017 22:00 - 12/12/2017 23:11
- 最後更新 - 12/12/2017 23:11
- 優先級 - 重大
- 影響範圍 伺服器 - GRA3
-
We experienced extended downtime on server3a (GRA3) yesterday between 3.37 pm and 5.14pm - lasting a total of 96 minutes.
The issue was initially caused by a kernel panic (basically making the core of the operating system confused, and thus causing a crash) - the kernel panic itself was caused by a kernel update provided by KernelCare which we use to patch kernels for security updates without having to reboot the systems.
The update that KernelCare issued contained a bug for very specific kernel versions - which affected one of our servers.
Normally when a kernel panic happens, the system will automatically reboot and come online again a few minutes later - however, in our case the downtime got rather lengthy because the system did not come online again afterwards.
The boot issue was related to UEFI - the system couldn't find the information it needed to actually boot into the operating system. After trying multiple solutions we found one that worked and got the system back online.
The specific error message we got usually has multiple possible solutions, because it can be caused by several things, such as a missing boot loader, corrupt kernel files or missing/misplaced EFI configuration - and that's why the downtime lasted as long as it did. - 日期 - 07/12/2017 15:37 - 07/12/2017 17:14
- 最後更新 - 08/12/2017 08:36
- 優先級 - 重大
- 影響範圍 系統 - RBX6,RBX7,RBX8,shop.hosting4real.net
-
Today between 08.07 and 10.38 Europe/Amsterdam timezone we experienced a complete outage on servers: RBX6 (Server6), RBX7 (Server7), RBX8 (Server8) as well as our client area/support system.
Timeline:
07:20: We receive an alert that ns4.dk - one of our four nameservers - is down; nothing extraordinary, since we run our nameservers at multiple data centers and providers, so downtime of a single one is acceptable.
08.07: We receive alerts about a single IP on RBX7 being down; we immediately start investigating.
08.09: We receive alerts about the complete server RBX7 being down.
08.12: We receive alerts about shop.hosting4real.net, RBX6 and RBX8 being down, and we realize it's a complete datacenter location outage (Roubaix, France) since it affects RBX6, RBX7, RBX8 and shop.hosting4real.net, which are entirely separate environments.
08.15: We see that traffic towards all servers in RBX isn't reaching the data center location as it should, and meanwhile verify whether it affects our other location, GRA, where we have the other half of our servers.
08.27: We're informed that the outage affecting RBX is due to a fiber optics issue causing routing problems towards the RBX datacenters (7 datacenters in total, with a capacity of roughly 150.000-170.000 servers).
08.50: It's confirmed that all 100-gigabit fiber links towards RBX from TH2 (Telehouse 2, Paris), GSW (Globalswitch Paris), AMS, FRA, LDN, and BRU are affected.
10.18: ETA for bringing up the RBX network is 30 minutes - the cause was corruption/data loss on the optical nodes within RBX, which cleared their configuration; the configuration is being restored from backups.
10.25: Restore of the connectivity is in progress.
10.29: All connectivity to RBX has been restored, BGP is recalculating to bring up the network across the world.
10.33: We confirm that RBX7, RBX8, and shop.hosting4real.net once again have connectivity.
10.38: RBX6 comes online with connectivity
What we know so far:
- The downtime for ns4.dk was caused by a power outage in the SBG datacenter - generators usually kick in, but two generators didn't work - causing the routing room to lose its power
- The downtime between SBG and RBX was not related - it just happened to be Murphy's law coming into effect.
- The outage in RBX was caused by a software bug that made the optical nodes in RBX lose their configuration - the DC provider is working together with the vendor of the optical nodes to fix the bug
We're still awaiting a full postmortem with the details from our provider, once we have it - an email will be sent to all customers that were affected by the outage.
There wasn't much we could do as a shared hosting provider to prevent this - we do try to keep downtime as minimal as possible for all customers. However, issues do happen: networks or parts of them can die, power can be lost and data centers can go down.
In this specific case the problem was out of our hands - something we have no influence on - and the software bug at the provider could hardly have been prevented by us.
We're sorry for the issues caused by the downtime today, and we do hope no similar case will happen in the future.
One additional step we will work on is completely moving shop.hosting4real.net away from our provider and locating it in another datacenter - this ensures that our ticketing system stays online during complete outages. Our support emails did work during the outage, but we'd like to make it easier for customers to get in contact with us.
This post will be updated as we have more information available.
Postmortem from us and partially the provider:
This morning (9/11/2017) there was an incident with the optical network that interconnects the datacenter (RBX) with 6 of the 33 points of presence that power the network: Paris (TH2, GSW), Frankfurt (FRA), Amsterdam (AMS), London (LDN) and Brussels (BRU).
The data centers are connected via six optical fibers, and those six fibers are connected to optical node systems (DWDM) that allow 80 wavelengths of 100 gigabits per second on each fiber.
Each 100 gigabit link is connected to the routers over two geographically distinct optical paths; in case of a fiber cut, the system fails over within 50 milliseconds. RBX is connected with a total of 4.4Tbps (44x100G): 12x100G to Paris, 8x100G to London, 2x100G to Brussels, 8x100G to Amsterdam, 10x100G to Frankfurt, 2x100G to the Gravelines data centers and 2x100G to the SBG data centers.
At 08.01 all 44x100G links were lost. Given the redundancy in place, it could not be a physical cut of the 6 optical fibers, and remote diagnostics weren't possible because the management links were lost as well - so manual intervention in the routing rooms had to take place: disconnecting cables and rebooting the systems to run diagnostics with the equipment manufacturer.
Each chassis takes roughly 10-12 minutes to boot which is why the incident took so long.
The diagnostics:
All the transponder cards in use (ncs2k-400g-lk9, ncs2k-200g-cklc) were in "standby" state. One of the possible origins for this is the loss of the configuration. The configuration got recovered from a backup, which allowed the systems to reconfigure all the transponder cards. The 100G links came back naturally, and the connection from RBX to all 6 POPs was restored at 10.34.
The issue lies in a software bug on the optical equipment. The database with the configuration is saved 3 times and copied to 2 supervision cards. Despite all the security measures the configuration disappeared. The provider will work with the vendor to find the source of the problem and to help fix the bug. The provider does not question the equipment manufacturer even though the bug is particularly critical. The uptime is a matter of design that has to be taken into account, including when nothing else works. The provider promises to be even more paranoid in terms of network design.
Bugs can exist, but they shouldn't impact customers. Despite investments in network, fibers, and technologies, 2 hours of downtime isn't acceptable.
One of the two solutions being worked on is to create a secondary setup for the optical nodes: this means two independent databases, so in case of configuration loss only one system will be down, and only half the capacity will be affected. This project was started one month ago, and hardware has been ordered and will arrive in the coming days. The configuration and migration work will take roughly two weeks. Given today's incident, this will be handled at a higher priority for all infrastructure in all data centers.
We are sorry for the 2H and 33 minutes of downtime in RBX.
The root cause in RBX:
1: Node controller CPU overload in the master frame
Each optical node has a master frame that allows exchanging information between nodes. On the master frame, the database is saved on two controller cards.
At 7.50 am, communication problems were detected with nodes connected to the master frame, which caused a CPU overload.
2: Cascade switchover
Following the CPU overload of the node, the master frame made a switchover of the controller boards - the switchover and CPU overload triggered a bug in the Cisco software; it happens on large nodes and results in a switchover every 30 seconds. The bug has been fixed in Cisco software release 10.8, which will be available at the end of November.
3: Loss of the database
At 8 am, following the cascade switchover events, another software bug was hit, one that de-synchronizes timing between the two controller cards of the master frame. This caused a command to be sent to the controller cards to set the database to 0, which effectively wiped out the database.
The action plan is as follows:
- Replace the controllers with TNCS instead of TNCE - this doubles CPU and RAM power - replacement will be done for Strasbourg and Frankfurt as well.
- Prepare to upgrade all equipment to Cisco software release 10.8
- Intermediate upgrade to 10.5.2.7 and then upgrade to 10.8
- Split large nodes to have two separate nodes
Compensation:
10/11/2017 20.24: Accounts on Server8 (RBX8) have been compensated
10/11/2017 20.51: Accounts on Server7 (RBX7) have been compensated
10/11/2017 22.43: Accounts on Server6 (RBX6) have been compensated - 日期 - 09/11/2017 08:07 - 09/11/2017 10:38
- 最後更新 - 13/11/2017 11:26
- 優先級 - 重大
- 影響範圍 伺服器 - RBX6
-
We detected an issue with the water cooling of our server, meaning the CPU runs a lot hotter than it should (currently at 90C), which also causes performance throttling. We've scheduled an intervention with the datacenter engineers around midnight today (between the 3rd and 4th of November) - depending on the outcome we might have to temporarily shut down the server for the cooling block to be replaced.
In case the issue lies in the cabling going from the cooling loop to the server itself, then this can often be replaced without shutting down the server.
We're sorry about the inconvenience and the short notice.
Update 03/11/2017 11.53pm: Intervention will start shortly
Update 04/11/2017 12.08am: The machine has been shut down for the datacenter to perform the intervention.
Update 04/11/2017 12.25am: Machine back online
Update 04/11/2017 12.31am: Load on system became normal, verified that temperatures are now correct. - 日期 - 04/11/2017 00:00 - 04/11/2017 00:31
- 最後更新 - 04/11/2017 00:32
- 優先級 - 高
- 影響範圍 伺服器 - GRA4
-
Today we experienced a total of 13 minutes of downtime on server4.
At 10.28 the number of open/waiting connections on Apache on server4 increased from the average of about 20 to a bit above 300.
We've been aware of the issue since it has happened before, but we hadn't yet been able to find the root cause of it.
However, today our monitoring notifications were delayed, meaning we were first informed about the downtime at 10.40, so 12 minutes after the webserver stopped responding to traffic.
We resolved the issue by forcing an Apache restart. We found out that the issue is caused by some crawlers opening connections to the server but never closing them again - this usually only happens with rogue crawlers, but it turns out it also happens in some cases with legitimate SEO crawlers such as Screaming Frog SEO Spider.
Our immediate fix is to block anything containing the Screaming Frog user-agent; the agent can easily be changed when you're paying for the software, so the solution isn't exactly bullet-proof.
The issue happens because the application opens up connections for every request it makes, and it doesn't close the connection correctly again.
This results in all workers on the webserver being used up by a crawler.
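This kind of worker exhaustion is visible in Apache's mod_status output well before the site goes down. A minimal monitoring sketch (assuming mod_status is enabled and its machine-readable /server-status?auto endpoint is reachable locally; the threshold is an example value):

```python
import urllib.request

STATUS_URL = "http://localhost/server-status?auto"  # mod_status machine-readable output
THRESHOLD = 200                                      # alert well below the worker limit

def busy_workers(url: str = STATUS_URL) -> int:
    with urllib.request.urlopen(url, timeout=5) as resp:
        for line in resp.read().decode().splitlines():
            if line.startswith("BusyWorkers:"):
                return int(line.split(":")[1])
    raise RuntimeError("BusyWorkers not found in server-status output")

if __name__ == "__main__":
    busy = busy_workers()
    print(f"Busy Apache workers: {busy}")
    if busy > THRESHOLD:
        print("WARNING: worker pool close to exhaustion")
```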
We've informed the company behind Screaming Frog SEO Spider about the issue and what causes the exact result and advised them to implement keep-alive support or actually close connections correctly.
During further investigation we saw that this only happens when using very specific settings in the software.
We're sorry for the long downtime caused by this. We have additional measures we can use against it, such as rate-limiting IPs to a set number of connections, however we'd prefer not to do this since it can cause issues with legitimate traffic as well. - 日期 - 06/09/2017 10:28 - 06/09/2017 10:41
- 最後更新 - 06/09/2017 16:10
- 優先級 - 中
- 影響範圍 伺服器 - GRA3
-
We have a failed disk in server3a (GRA3), we'll ask the datacenter for a replacement.
There will be a downtime with this intervention since the provider will have to reboot the system.
We're performing a backup of all customer data and databases before requesting the replacement.
Update 00:20: The server is getting a disk replacement now.
Update 01:00: We experienced some problems so intervention is still ongoing.
Update 01:25: System is back up using another kernel, we're going to reboot again after some investigation to bring up all sites.
Update 01:35: The system is back online serving traffic as it should. The raid will rebuild for the next 40-60 minutes.
Complete history below:
- 11.37 PM the 27th of August we received an alert from our monitoring system that a drive had started to report errors, meaning a drive failure can happen within minutes, days or sometimes even months. Since we want to avoid any possibility of data loss, we decided to schedule an immediate replacement of the drive.
- 12.01 AM the 28th of August we finished a complete backup of the system; we perform this backup since there's always a risk that a disk replacement goes wrong or that a RAID rebuild fails.
- 12.15 AM We receive a notification from the datacenter that they'll start the disk replacement 15 minutes later (automated email)
- 12.22 AM We see the server go offline for the intervention, disk gets replaced.
- 12.40 AM We receive an email from the datacenter that the intervention has been completed and our system is booted into rescue mode
- 12.42 AM We change the system to boot from the normal disk to get services back online, however due to a fault with the IPMI device (basically a remote console to access the server), we couldn't bring the service back online.
- 12.44 AM We call the datacenter to request a new intervention, which is already being taken care of.
- 12.50 AM to 01.23 AM The engineer intervenes and spots that there's a fault on the IPMI interface and has to perform a hard reset from the motherboard; at the same time the engineer realizes that the boot order has to be changed in the BIOS to boot from the correct disk, which wasn't done during the first intervention.
- 01.25 AM The server comes back online with a default CentOS kernel to ensure the system would boot, this action was performed by the datacenter engineer.
- 01.34 AM We restart the server to boot from the correct CloudLinux LVE kernel.
- 01.35 AM All services restored.
- 01.36 AM We start the rebuild of the RAID array (progress tracked via /proc/mdstat, as sketched after this timeline).
- 01.52 AM Rebuild currently at 39%
- 01.56 AM Rebuild currently at 51%
- 02.08 AM Rebuild currently at 79.5%
- 02.16 AM Rebuild completed
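The rebuild percentages above come straight from the kernel's software-RAID status file; a small sketch of reading it (assuming Linux md/mdadm software RAID, which is what this check applies to):

```python
import re

def rebuild_progress(path: str = "/proc/mdstat") -> None:
    # /proc/mdstat reports software-RAID state, including resync/recovery progress
    with open(path) as f:
        text = f.read()
    matches = re.findall(r"(recovery|resync)\s*=\s*([0-9.]+)%", text)
    if not matches:
        print("No rebuild in progress")
        return
    for kind, percent in matches:
        print(f"{kind}: {percent}% complete")

if __name__ == "__main__":
    rebuild_progress()
```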
We're sorry about the issues caused and the increased downtime. Usually these hardware interventions are performed with minimal downtime (about 15 minutes start to finish); however, due to a mistake on the engineer's side and the fact that the system had an issue with the IPMI interface, the downtime sadly became 1 hour and 15 minutes. - 日期 - 27/08/2017 23:22 - 28/08/2017 01:52
- 最後更新 - 28/08/2017 02:16
- 優先級 - 中
- 影響範圍 伺服器 - RBX8
-
The downtime on server8 (RBX8) was caused by an automatic software update:
Usually software updates happen smoothly, only causing a few seconds of unavailability in certain cases, however today a new version of cPanel got released - version 66.
Together with this version CloudLinux released an update to their Apache module mod_lsapi which offers PHP litespeed support for Apache.
The update of this module changes how PHP handlers are configured; it used to be done in a specific config file, but has now moved to the MultiPHP handler setup within cPanel itself.
The update removed the handler from the lsapi.conf file (lsapi_engine on) but did not update the handlers within the MultiPHP handler configuration in cPanel.
This caused the system to stop serving PHP files after an Apache restart. The quick fix we did was to enable PHP-FPM for all 4 accounts on the server, which brought sites back online, while we investigated on a test domain on the same server what caused the downtime.
We did manage to fix it in an odd way without really knowing what fixed it. We continued with another server, but enabled PHP-FPM on all accounts before performing the update to ensure sites wouldn't go down.
We were able to reproduce the problem, and therefore created a ticket with CloudLinux to further investigate and the above findings were discovered.
We've updated about half of our servers to the new version; the rest of them currently have automatic updates disabled, and we'll continue to update these servers on Thursday evening.
We're sorry for the issues caused by the downtime, we've sadly had some issues lately with CloudLinux pushing updates out containing bugs. - 日期 - 22/08/2017 22:14 - 22/08/2017 22:20
- 最後更新 - 23/08/2017 01:26
- 優先級 - 中
- 影響範圍 伺服器 - GRA5
-
We'll perform a major software maintenance on server5 (GRA5) on July 29th starting 10pm.
This server currently uses PHP Selector offered by CloudLinux as well as EasyApache 3 to perform software updates for PHP and Apache.
Since EasyApache some time ago got a major upgrade to "EasyApache 4", making EasyApache 3 obsolete within a short period of time, we find it necessary to perform this software upgrade.
The upgrade brings a few new possibilities which are already available on all other servers we have:
- Different PHP versions (5.6, 7.0 and 7.1) per website and not only per account - this means it will be possible to run PHP version 5.6 on one website and 7.0 on another.
- Possibility to use http2 - this can greatly improve the website performance if you have a large amount of static files.
At the same time, EasyApache 4 makes use of "yum" to install packages, whereas EasyApache 3 had to manually compile new versions of PHP and Apache each time - the change ensures faster updates to newer Apache and PHP versions, and allows us to offer more functionality.
We're sending this email since there will be a minor downtime during the migration to the new software - generally speaking about 1 minute for all customers. Since we have to manually switch the PHP version for all customers using a version that is not the server default, those accounts might experience slightly longer downtime in case the website doesn't support PHP 5.6 (most software does).
In case you have any questions, please do not hesitate to contact us at support@hosting4real.net
[Update 21.51]:
We start the update shortly
[Update 22:07]:
The software upgrade has been completed - continuing with finalizing configurations
[Update 22:34]:
PHP versions on hosting accounts have been reset to what they were originally configured to
PHP 7.0 has been set as default
[Update 22:55]:
All modules have been configured as they should.
Maintenance has been completed.
[Update 02:20]:
We experienced a short downtime shortly after the migration:
Due to some race conditions in the way the systems get converted, changing the default PHP handler caused PHP handling to stop working as it should.
At the same time we saw that certain PHP versions would load from the new system while other versions would load from the old one. This in itself doesn't do any harm, but it would mean we'd have to maintain two different configurations, and due to a rather complex matrix of how PHP versions are selected, we wanted to correct this to simplify management.
After consulting with CloudLinux we resolved both issues, and at the same time kindly asked CloudLinux to further improve their documentation on the subject, since there are plenty of pitfalls that can lead to strange results.
We're sorry for the additional downtime caused after the update. - 日期 - 29/07/2017 22:00 - 29/07/2017 23:59
- 最後更新 - 30/07/2017 02:27
- 優先級 - 高
- 影響範圍 系統 - server3a,server4,server5
-
Between 11.15 am and 11.20 am we experienced a network outage on server3a, server4 and server5 - all located in the Gravelines datacenter - which resulted in a 90% drop in traffic (some traffic could still pass through). The issue was quickly resolved by our datacenter provider. We're waiting for an update from the datacenter provider about what happened, and we'll update this post accordingly.
- 日期 - 26/07/2017 11:15 - 26/07/2017 11:20
- 最後更新 - 26/07/2017 11:31
- 優先級 - 低
- 影響範圍 伺服器 - RBX2
-
Over the next few days we'll be migrating customers from RBX2 to RBX7
01/06/2017 21.00: We're starting migrations for today
01/06/2017 22.28: Migrations have been completed
02/06/2017 20.55: We will begin migrations at 21.00
02/06/2017 22:23: Migrations have been completed
03/06/2017 20.50: We will begin migrations at 21.00
03/06/2017 22.17: Migrations have been completed
04/06/2017 20:56: We'll begin migrations shortly
04/06/2017 21:51: Migrations have been completed
05/06/2017 20.55: We will begin migrations shortly
05/06/2017 21.31: Migrations for today have been completed
06/06/2017 20.55: We will begin the migration for today shortly
06/06/2017 21.14: Migrations have been completed
09/06/2017 21.00: We start the migration
09/06/2017 21.48: All migrations have been completed. - 日期 - 01/06/2017 21:00 - 09/06/2017 22:00
- 最後更新 - 09/06/2017 21:48
- 優先級 - 重大
- 影響範圍 伺服器 - RBX7
-
Server7 went down this morning; it randomly rebooted and went into a grub console.
After investigation we managed to boot the server.
We're currently performing a reboot test to see if the same issue would happen again.
We'll keep you updated.
Update:
The system still drops into a grub console; we'll migrate the very few customers on the server (it's completely new) to one of our older servers and get it fully repaired.
When the job is done we'll move the customers back again.
Update 2:
After further investigation, we found that the issue was caused by an operating system update - normally when kernels are updated, the system has to rewrite a file called grub.cfg to include the new kernel. That happens as it should, but since we're using EFI boot on the new server, a bug in the operating system caused it to not write grub.cfg correctly to the location where EFI looks for the config.
We happened to have a kernel crash this morning, which caused the system to restart in the first place and surfaced the error we had.
After creating the grub.cfg manually and doing a few reboots, we confirmed that it was the cause.
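For reference, regenerating grub.cfg on an EFI system usually comes down to pointing the config generator at the location on the EFI system partition. A hedged sketch (the path is an assumption for illustration and differs per distribution; this is not necessarily the exact command we ran):

```python
import subprocess

# On EFI systems grub.cfg lives on the EFI system partition rather than
# in /boot/grub2; the exact vendor directory is an assumption here.
EFI_GRUB_CFG = "/boot/efi/EFI/cloudlinux/grub.cfg"

# grub2-mkconfig regenerates the config so the EFI loader can find the new kernel
subprocess.run(["grub2-mkconfig", "-o", EFI_GRUB_CFG], check=True)
```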
The vendor has been contacted and acknowledged the problem.
On the bright side, this crash happened just days before we actually started our migration, meaning it only caused downtime for a few customers and not everyone from server2a.
We will continue with the migration as normal. - 日期 - 28/05/2017 08:09
- 最後更新 - 28/05/2017 11:28
- 優先級 - 低
- 影響範圍 其他 - General
-
From May 5 until May 22, no new customer site migrations will be performed.
This means in case you want to move to us, and want to get migrated by our staff, you have to wait until the 22nd of May. - 日期 - 05/05/2017 17:00 - 22/05/2017 06:00
- 最後更新 - 28/05/2017 08:51
- 優先級 - 高
- 影響範圍 伺服器 - RBX6
-
Around noon we experienced two short outages right after each other on server6.
The total downtime registered was 2 minutes and 42 seconds.
At 12.12 a large number of requests started to come in; this quickly used up all available Apache workers.
As a way to try to resolve it, we increased the number of workers allowed to process requests, but due to the backlog of requests coming in, the load was already too high to gracefully reload Apache.
As a result we decided to stop all Apache processes - basically killing all incoming traffic (yes, we know it's not nice).
We switched back to mpm_event in Apache, which is known to handle a large number of requests better; right after this, traffic came back with no issues whatsoever.
On all our servers we usually run mpm_event by default, but due to a bug between mod_lsphp and mpm_event that was discovered a few weeks back, we switched mpm_event to mpm_worker on all servers to prevent random crashes from happening.
This bug is still present, but has been fixed upstream; we're just waiting for the packages to be pushed towards general availability, which is why we haven't yet officially planned to switch back to mpm_event.
Using mpm_worker in general isn't an issue as long as the traffic pattern is quite predictable - it can handle large amounts of traffic if the traffic ramps up gradually.
mpm_worker is known to get overloaded when you go from a small number of requests to a high number in a matter of seconds (which is what happened today) - this caused Apache to simply stop accepting more traffic.
Since the bugfix for mpm_event has not officially been released, we won't switch servers back to mpm_event until it has been released.
But in this specific case we had to make the change to cope with the traffic spikes that happen from time to time.
We've made a few changes to our systems to prevent the bug from happening, but it's by no means a permanent solution until the fix gets released - however, if we do hit the bug, it will cause about 1 minute of downtime until we manage to access the machine and manually kill and start the webserver again.
We're sorry about the issues caused by this outage. - 日期 - 02/05/2017 12:12 - 02/05/2017 12:16
- 最後更新 - 02/05/2017 12:45
- 優先級 - 高
- 影響範圍 伺服器 - RBX6
-
On March 29th, there will be a planned network maintenance that will impact connectivity towards server6 (RBX6).
Our datacenter provider is performing the network maintenance to ensure a good quality of service in terms of networking. For this to be carried out, they will have to upgrade some network equipment located above our server.
The equipment update is being performed to keep up with the innovation happening in the network, and to allow our datacenter provider to introduce new features in the future.
To be more exact, the "FEX" (Fabric Extender) will have to be upgraded, meaning they'll have to move cables from one FEX to another one, which does involve a short downtime.
There will be a total of two network "drops":
- The first one is expected to last for about 1-2 minutes, since the datacenter engineers have to physically connect the server to the new FEX.
- The second one happens 45 minutes after the first one and is expected to last for only a few seconds.
The maintenance has a window of 5 hours, since there's multiple upgrades being performed during the night, and it's unclear how long each upgrade will take, therefore we cannot give more specific timings than the announced maintenance window from our datacenter provider.
The maintenance performed by the provider can be followed at their network status page: http://travaux.ovh.net/?do=details&id=23831
Update 29/03/2017 22:00: OVH has started the maintenance.
Update 30/03/2017 02:41: They managed to finish the first rack, but due to an issue with their monitoring systems and extended maintenance time, the second rack (which we're located in) has been postponed. When we know the new date, it will be published.
Update 30/03/2017 12:53: The maintenance for our rack has been replanned for April 5th - between 10pm and 3am (+1 day)
Update 05/04/2017 21:56: The maintenance for our rack has been postponed again - we're waiting for a new date for the FEX replacement.
Update 21/04/2017 19:03: Info from DC: The first intervention will impact the public network and will take place on the 26th of April 2017 between 10pm and 6am.
During the time as informed earlier, there will be two small drops in the networking during that time.
Update 27/04/2017 22:22: Maintenance on our rack has begun
Update 28/04/2017 00:05: Maintenance has been completed
Best Regards,
Hosting4Real - 日期 - 26/04/2017 22:00 - 26/04/2017 00:05
- 最後更新 - 27/04/2017 00:52
- 優先級 - 中
- 影響範圍 伺服器 - RBX6
-
Between 00:50:34 and 00:54:00 we experienced an outage on server6.
During these 3.5 minutes, Apache wasn't running, meaning sites hosted weren't accessible.
After investigating the issue we saw that the system in cPanel that maintains the mod_security ruleset updates was updating the ruleset we use - the system also schedules a graceful restart of Apache in the process, but due to a race condition, the restart of Apache happened while it was still rewriting some of the configuration files.
This meant that during the restart the Apache configuration was invalid - which resulted in Apache not being able to start again, thus causing the downtime.
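The usual guard against exactly this failure mode is to validate the configuration before asking Apache to restart. A minimal sketch (assuming the Apache binary is available as httpd, as on cPanel systems; this is illustrative rather than the workaround we deployed):

```python
import subprocess

def safe_apache_graceful() -> None:
    # "httpd -t" parses the configuration and exits non-zero if it's invalid,
    # so a restart is only attempted when the config on disk is consistent.
    check = subprocess.run(["httpd", "-t"], capture_output=True, text=True)
    if check.returncode != 0:
        print("Config test failed, refusing to restart:")
        print(check.stderr.strip())
        return
    subprocess.run(["apachectl", "graceful"], check=True)

if __name__ == "__main__":
    safe_apache_graceful()
```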
The issue is known by cPanel, and they're working on fixing it.
As a temporary workaround, we've disabled auto-update of mod_security rules across our servers, to prevent this from happening again.
We will enable the automatic update of the rulesets when the bug has been fixed.
We're sorry about the issues caused in the event of this outage. - 日期 - 14/04/2017 00:50 - 14/04/2017 00:54
- 最後更新 - 14/04/2017 14:55
- 優先級 - 中
- 影響範圍 伺服器 - GRA3
-
We're currently experiencing a DDoS attack on server3a. Our DDoS filtering works, but it might cause a bit of packet loss and/or false positives for a few monitoring systems.
Response times on websites might be a bit higher since all traffic has to be filtered; other than that, traffic flows, and we do not see any general drop in traffic at the webserver level.
Update 19:10: Attack has stopped - we move the IPs out of mitigation again. - 日期 - 10/04/2017 14:30 - 10/04/2017 19:10
- 最後更新 - 10/04/2017 21:49
- 優先級 - 中
- 影響範圍 系統 - All systems
-
Over the past few days we experienced short outages on server4 (GRA4), and this morning a short outage on server2a (RBX2)
After further investigation, we narrowed down the problem to be an issue with the configuration of MySQL.
Basically, the maximum query size we allow in the query cache was set to a rather large value, which generally does not cause any issues, except in a very few edge cases.
For the same reason we mostly saw this happening on a single server, because it happens to execute these very extensive queries, affecting performance and causing some rather bad locks on the system.
We applied the changes to both server2a and server4.
We're currently applying the same settings to every other server; this will result in restarting MySQL on each of them - it only takes a few seconds, so no downtime is expected during the restart.
Be aware that if you monitor your site and the monitoring system happens to run a check right as we restart, it might trigger a false positive.
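For reference, the query cache behaviour is controlled by a small set of server variables, which can be inspected like this (a sketch only; PyMySQL and the local credentials are assumptions, and the exact variable we changed isn't named above):

```python
import pymysql  # assumption: the PyMySQL client library is available

# Placeholder credentials for illustration only
conn = pymysql.connect(host="localhost", user="monitor", password="secret")
try:
    with conn.cursor() as cur:
        # query_cache_size, query_cache_limit, etc. govern how results are cached
        cur.execute("SHOW GLOBAL VARIABLES LIKE 'query_cache%'")
        for name, value in cur.fetchall():
            print(f"{name} = {value}")
finally:
    conn.close()
```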
Update 14.00: Server5 has been updated.
Update 14:03: Server6 has been updated.
Update 14:04: Server3a has been updated. - 日期 - 04/04/2017 13:52 - 04/04/2017 14:04
- 最後更新 - 04/04/2017 14:04
- 優先級 - 高
- 影響範圍 伺服器 - GRA4
-
We have to do an emergency maintenance on server4/GRA4 tonight starting at 9PM.
The maintenance carried out will fix a critical issue on the system that we've detected.
During the maintenance we might experience a short period of website unavailability; we do try to keep the downtime as low as possible.
We're sorry to give such short notice.
Update 21.00: Maintenance starting
Update 21.17: We've finished the maintenance, there was no impact during the fix - 日期 - 21/03/2017 21:00 - 21/03/2017 21:17
- 最後更新 - 21/03/2017 22:03
- 優先級 - 中
- 影響範圍 伺服器 - PAR3
-
As informed by email - all customers on PAR3 (server3) will be migrated to GRA3 (server3a).
We'll migrate customers over a period of about 2 weeks:
- 17/02/2017: Starting at 21.00
- 18/02/2017: Starting at 21.00
- 19/02/2017: Starting at 21.00
- 24/02/2017: Starting at 21.00
- 25/02/2017: Starting at 21.00
- 26/02/2017: Starting at 21.00
- 03/03/2017: Starting at 21.00
Starting tonight we have a number of customers with external DNS whom we'll migrate at very specific times. All other customers will be informed shortly before their migration, and shortly after the account has been moved to the new server.
Please make sure if you're using server3.hosting4real.net as your incoming/outgoing mail-server or FTP server, that you update this to server3a.hosting4real.net, or even better - using your own domain such as mail.<domain.com> and ftp.<domain.com>.
Update 17/02/2017 21.00: We start the migration
Update 17/02/2017 22.13: We've migrated the first batch of accounts, we'll continue tomorrow evening. We haven't experienced any issues during the migration.
Update 18/02/2017 20.58: We start the migration in a few minutes.
Update 18/02/2017 21:30: We're done with migrations for today.
Update 19/02/2017 20:56: We'll start the migration in a few minutes.
Update 19/02/2017 21:07: We're done with migrations for today.
Update 24/02/2017 20:58: We'll start the migration in a few minutes.
Update 24/02/2017 21:28: We're done with migrations for today.
Update 25/02/2017 20.57: We'll start the migration in a few minutes.
Update 25/02/2017 22:06: We're done with migrations for today.
Update 26/02/2017 20.58: We'll start the migration in a few minutes.
Update 26/02/2017 21:06: We're done with migrations for today.
Update 03/03/2017 20.55: We'll start the migration in a few minutes.
Update 03/03/2017 22:03: We're done with all migrations. - 日期 - 17/02/2017 21:00 - 03/03/2017 23:30
- 最後更新 - 03/03/2017 23:19
- 優先級 - 中
- 影響範圍 系統 - Registrar system
-
The service through which we register .dk domains is currently having an outage affecting only this specific TLD.
We're waiting for the system to come back online to process .dk registrations again.
Update 14:35: The issue was caused by a routing issue between the registrars network and the registry. - 日期 - 22/02/2017 12:00
- 最後更新 - 22/02/2017 14:42
- 優先級 - 重大
- 影響範圍 其他 - All servers
-
At 09.19 we received alerts from our monitoring about unreachability to multiple servers in multiple datacenters.
After a quick investigation we saw between 40 and 90% packet loss to multiple machines - meaning the majority of the traffic would still go through, but with really high response times.
Our network traces showed that traffic was being routed via Poland (which it usually shouldn't be), so we knew it was a network fault.
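Spotting an unexpected detour like that is just a matter of comparing the hop list against the normal path; a small sketch of gathering it (the target address is a placeholder):

```python
import subprocess

# Placeholder target - in practice one of the affected server IPs
TARGET = "192.0.2.20"

# "traceroute -n" prints the hop-by-hop path numerically, which quickly shows
# whether packets are taking an unexpected route
hops = subprocess.run(["traceroute", "-n", TARGET],
                      capture_output=True, text=True)
print(hops.stdout)
```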
After updates from the datacenter, it seems to be caused by an issue with GRE tunnels. We're waiting for further updates from the datacenter.
Some servers were affected more than others, and some customers might not have experienced issues at all - but the problem started at 09.19 and recovered fully at 09.30, so a maximum downtime of 11 minutes.
We'll update this accordingly when we have more information. - 日期 - 08/02/2017 09:19 - 08/02/2017 09:30
- 最後更新 - 08/02/2017 10:06
- 優先級 - 重大
- 影響範圍 其他 - All systems
-
We experienced an outage across multiple systems.
The outage was caused by a faulty IP announcement (64.0.0.0/2) on the global network.
This caused a BGP routing issue and thus the outage.
A small subset of customers were still able to connect to the servers based on which subnets they were located in.
The incident happened from 9.22 am to 9.37 am
Our apologies for the issues caused on behalf of our supplier. - 日期 - 06/01/2017 09:22 - 06/01/2017 09:37
- 最後更新 - 06/01/2017 10:07
- 優先級 - 高
- 影響範圍 伺服器 - RBX2
-
Starting today at 09.58 until 10.48 we experienced a total downtime of 13 minutes on server2a
This was over a course of 3-4 downtimes, the longest lasting for 10 minutes.
The root cause was an unusually high amount of traffic to a specific site, which caused all Apache processes and CPU to be eaten up.
This caused high load on our systems which resulted in connections being blocked.
We've worked on a solution with the customer to minimize the load on the system.
We know the downtime, especially on this day, is not acceptable. The box is one of our older ones, which we will soon replace.
The new box will be like our others, which includes software to stabilize systems in case of high load.
We're sorry for the problems and downtime caused today. - 日期 - 25/11/2016 09:58
- 最後更新 - 25/11/2016 11:33
- 優先級 - 低
- 影響範圍 系統 - GRA-1
-
We've received notification of a possible network maintenance tonight at 22.00 Europe/Amsterdam time in one of the locations.
This might affect the availability of GRA4 (server4) and GRA5 (server5) for up to 15 minutes.
The maintenance may still be cancelled, since it currently depends on another maintenance.
There are currently updates happening in the network, which means firmware needs to be upgraded on something called a fabric extender - meaning the extender has to be reloaded, which can cause a minor outage towards the outside world.
We'll update this page as soon as we know if the maintenance will happen tonight, and what the exact impact will be.
Update: We got confirmation that our IP subnets won't be affected in any way during this maintenance. - 日期 - 23/06/2016 22:00 - 23/06/2016 22:30
- 最後更新 - 23/06/2016 19:32
- 優先級 - 高
- 影響範圍 伺服器 - GRA4
-
At 17/06/2016 01:44 GMT+2 server4 (GRA4) experienced a network outage.
At 17/06/2016 01:59 GMT+2 server4 (GRA4) came back online after 15 minutes downtime.
Due to a critical upgrade for software on two switches at the datacenter provider we use, we experienced an outage of 15 minutes.
This was a critical update to fix multiple bugs within the switch; the expected downtime was less than 5 minutes because these updates happen in a rolling manner (meaning 1 switch at a time) and are usually so short that most people don't even notice.
Sadly the upgrade caused minor issues and extended the downtime to a total of 15 minutes, which surely affected the availability of the services behind these switches.
We're sorry about any issues you've experienced during the upgrade, but even with a fully redundant network sometimes network issues happen. - 日期 - 17/06/2016 01:44 - 17/06/2016 01:59
- 最後更新 - 17/06/2016 02:13
- 優先級 - 高
- 影響範圍 伺服器 - PAR3
-
After we've deployed our Arbor DDoS mitigation solution for this server, we have to perform a test to make sure that every service works as expected.
This test might cause small outages of up to 5 minutes at a time.
To make the least impact, we'll perform this test starting at 10pm today.
What will be done is that we enable the mitigation for a period of 5 minutes and test whether services such as email and HTTP traffic work as expected.
If it doesn't, the mitigation will disable itself after 5 minutes.
Update 21:57:
We'll start mitigation test in a few minutes
Update 22:19:
Since services seem to be working fine, we'll be disabling the mitigation test in a few minutes.
A few minutes after the mitigation turns off, services may reset their connections.
-----
All timestamps are in Europe/Amsterdam - 日期 - 08/06/2016 22:00 - 08/06/2016 22:05
- 最後更新 - 08/06/2016 22:24
- 優先級 - 中
- 影響範圍 伺服器 - PAR3
-
Server3 / PAR3 is currently under a DDoS attack; the attack is being filtered by an anti-DDoS solution.
Be aware that it can affect normal traffic as well, sometimes rejecting connections or responding slowly.
We're working together with the datacenter to make as much clean traffic pass through to the server as possible.
update 20.10:
The server is still under attack. We've made a few changes which should result in more clean traffic reaching the server; sadly it's still blocking about 20% of the traffic.
At times the traffic reaches the server as expected, and at other times it's extremely slow.
We've planned with the datacenter tomorrow morning to deploy additional protection measures. We do hope that the attacks will stop as soon as possible.
update 21.05:
The attack is still ongoing, but we've managed to restore full traffic to the server, every site should be working now.
update 03.43:
The attack stopped
update 07.48:
We've reverted our fix, and will continue to deploy Arbor protection today. - 日期 - 07/06/2016 17:59
- 最後更新 - 08/06/2016 07:49
- 優先級 - 中
- 影響範圍 系統 - Backup
-
We're setting up a new backup server for backups for every server we have, and because of this we'll have a period where backups will be taken on the new system but not on the old one.
We'll switch to the new system as soon as the initial backup has been completed. If you need backups older than today restored, please create a support ticket, and we'll be able to restore them until 14 days have passed.
Update April 16 at 4pm: All servers have been moved to the new backup server; the old server will keep running for another month, after which we'll turn it off. - 日期 - 15/04/2016 17:08 - 16/04/2016 16:00
- 最後更新 - 17/04/2016 17:56
- 優先級 - 低
- 影響範圍 其他 - AMS1, RBX2, PAR3
-
We'll be upgrading PHP versions to 5.6 on the servers, AMS1 (server1), RBX2 (server2a), and PAR3 (server3).
This upgrade will start around 10pm on March 12.
The plan for the maintenance is as follows:
Server1 (Starting at 10pm):
- Upgrade PHP 5.4 to PHP 5.6
- Recompile a few PHP modules to be compatible with 5.6
Server2a:
- Upgrade PHP 5.4 to PHP 5.6
- Recompile a few PHP modules to be compatible with 5.6
Server3:
- Upgrade PHP 5.5 to PHP 5.6
- Recompile a few PHP modules to be compatible with 5.6
All times are in Europe/Amsterdam timezone.
Since we take one box at a time, we can't give specifics about when we'll proceed with Server2a and Server3, but do expect to be affected between 10pm and 2am.
We'll try to keep the downtime as low as possible.
In the worst case, if the upgrade of a specific server takes too long, we might postpone one or more servers until either Sunday or another date, which would be announced if that happens.
Best Regards,
Hosting4Real
UPDATE 22:09:
We're starting the upgrade of server1.
UPDATE 22:22:
Server1 completed.
We'll proceed with server2a and server3.
UPDATE 22:36:
Server3 has been completed.
We're still waiting for server2a to complete downloading some updates.
UPDATE 23:12:
We had to temporarily stop the webserver on server2a since it was overloading the system while doing the upgrade. This means your site is currently not accessible.
UPDATE 23:19:
Upgrade completed - we experienced 8 minutes of downtime on server2a.
- 日期 - 12/03/2016 22:00 - 12/03/2016 23:19
- 最後更新 - 12/03/2016 23:20
- 優先級 - 高
-
The data center will do a network maintenance on February 17th from 4AM to 8AM CET.
This maintenance will affect the connectivity to server1.hosting4real.net (AMS1) for up to 4 hours, but they try to keep it minimal.
This means that your websites will be affected during this period; the datacenter tries to keep the downtime as low as possible. The NOC post from the data center can be found below:
Dear Customer,
As part of our commitment to continually improve the level of service you receive from LeaseWeb we are informing you about this maintenance window.
We will upgrade the software on one of our AMS-01 distribution routers on Thursday 17-02-2016 between 04:00 and 08:00 CET. There will be an interruption in your services, however we will do our utmost best to keep it as low as possible.
Affected prefixes:
37.48.98.0/23
95.211.94.0/23
95.211.168.0/21
95.211.184.0/21
95.211.192.0/21
95.211.224.0/21
95.211.240.0/22
185.17.184.0/23
185.17.186.0/23
37.48.100.0/23
89.144.18.0/24
103.41.176.0/23
104.200.78.0/24
179.43.174.128/26
191.101.16.0/24
193.151.91.0/24
212.7.198.0/24
212.114.56.0/23
2001:1af8:4020::/44
We're working on a solution to prevent this in the future by moving the server to another data center, to increase the stability and uptime of the network.
The new server won't arrive for about another 2 months, so we won't be able to complete the move before this scheduled maintenance.
We're sorry for the inconvenience and downtime caused by this.
Be aware that this maintenance will also result in emails not being delivered to any hosting4real.net email address - if you want to create tickets during this timeframe, please create them directly at https://shop.hosting4real.net/
[UPDATE 04.00]
The maintenance has started. This will cause downtime on server1.
We'll update this when we have more news.
[UPDATE 07:37]
Server offline - this is due to a reboot of the distribution router.
[UPDATE 07.55]
The maintenance has been completed - there should be no more downtime due to this.
[UPDATE 09.06]
There's an issue after the maintenance which has caused connectivity to disappear; a network engineer is on site working on the problem.
We'll keep you updated.
[UPDATE 09:30]
The issue has been resolved - it was due to a hardware problem after the maintenance which was fixed. - 日期 - 17/02/2016 04:00 - 17/02/2016 09:30
- 最後更新 - 17/02/2016 10:21
- 優先級 - 高
- 影響範圍 伺服器 - RBX2
-
We have to do a reboot of server2a due to a bug in the backup software we use which prevents the server from being backed up.
We've investigated together with R1Soft how to resolve this, but since we haven't found a solution for it yet, and there's no ETA for when it will be resolved, we have to reboot the system to be able to back it up again.
We're sorry about the short notice. We'll do it at 3 AM since this is when the server serves the least traffic, so as to affect as few customers as possible, and it also allows us to get the server back online faster.
The reboot is expected to take 1-5 minutes, but in case of failure to boot we've put the timeframe of 30 minutes which should be enough to resolve possible boot errors.
We're sorry for the inconvenience.
UPDATE 03:04:
We start the reboot of the server
UPDATE 03:08:
Server back online - 日期 - 07/01/2016 03:00 - 06/01/2016 03:30
- 最後更新 - 07/01/2016 03:10
- 優先級 - 重大
- 影響範圍 伺服器 - RBX2
-
We're currently experiencing either full or high packet loss on server2a - this is due to two routers having crashed; the datacenter has found the cause to be related to IPv6 killing the routers due to a bug.
They're working on resolving this as we speak, but it might result in connectivity returning and disappearing a few times.
We're sorry for the issues caused, and we'll update when we know more.
So far the routers are running at 100% CPU usage - there's 75% connectivity again for both routers, and the last 25% will be back online shortly.
UPDATE 5:55PM:
All network has been stabilized after monitoring for around 10-15 minutes.
The cause was an overload in the IPv6 configuration of the routers. A quick fix was made by fully disabling IPv6 on the routers, which resulted in traffic slowly returning; after things stabilized, IPv6 traffic was enabled again, except for a specific segment which seems to be the cause of the problem.
Since we do not run IPv6 on our servers, we're not affected by this.
We're sorry for the issues caused by the downtime you experienced. We do the best we can to ensure all our providers run a redundant setup, which is also the case here; it usually prevents a lot of issues, but in this case the bug caused both routers to crash at the same time.
- 日期 - 04/01/2016 17:16
- 最後更新 - 04/01/2016 18:01
- 優先級 - 高
- 影響範圍 伺服器 - RBX2
-
Hello,
We had an account that was defaced, which caused a so-called "darkmailer" to be triggered. Usually we block these kinds of attacks at the mail server level, but since this bypasses the mail server, it allowed the server to send out around 400 emails before we stopped the attack.
This sadly caused us to be listed on a few blacklists, some of which we've already been delisted from again at our request.
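Checking whether an address has landed on the common DNS blacklists boils down to a reversed-octet DNS lookup against each blacklist zone; a small sketch (the IP and the zones here are examples only, not a statement of which lists were involved):

```python
import socket

IP = "198.51.100.25"                             # example address only
ZONES = ["zen.spamhaus.org", "bl.spamcop.net"]   # examples of public DNSBL zones

reversed_ip = ".".join(reversed(IP.split(".")))
for zone in ZONES:
    query = f"{reversed_ip}.{zone}"
    try:
        socket.gethostbyname(query)              # any answer means the IP is listed
        print(f"{IP} is LISTED on {zone}")
    except socket.gaierror:
        print(f"{IP} is not listed on {zone}")
```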
We tested sending to major email service providers, and saw that Microsoft has temporarily blocked email sending to their service.
This means all email towards Microsoft (including outlook.com, hotmail.com, live.com) is currently being blocked by Microsoft.
We've opened a case with Microsoft to get our IPs delisted from their service and are awaiting a response from their abuse team for this to happen.
We do not expect this to be an issue for too long.
We're sorry for the issues caused by this block, and we're working on alternative solutions meanwhile.
INFO 20.00:
We're currently ensuring that no emails are being sent out that could cause another block.
UPDATE 21.00:
Request sent to Microsoft to request delisting after confirming that the issue was resolved.
UPDATE 21:15:
Microsoft has removed the block from our IP, and emails should be sent as usual to Microsoft, with a few delays. - 日期 - 25/09/2015 20:00 - 25/09/2015 21:15
- 最後更新 - 25/09/2015 21:24
- 優先級 - 高
- 影響範圍 伺服器 - RBX2
-
Today we had an outage on server2a, and we wanted to explain what happened.
One of our datacenter providers made a human error in an OSPF configuration, which cut off a router.
This usually isn't a problem beyond some minor routing issues, or in the worst case a small amount of traffic being impacted.
But the issue here was that some of the route reflectors didn't communicate that the router that was cut off was actually down, so the network saw the router as still active.
This resulted in bad routing in the network, which was later fixed by taking down all BGP sessions for that specific segment of the network.
Later they saw that there was a bug in one of the reflectors which is why the route reflectors didn't communicate in the first place.
They reset the broken route reflector, which solved the communication problem.
After this the traffic was enabled again and traffic started to flow as normal.
The reason it resulted in connectivity issues to the server was that it impacted connectivity from the datacenter to the Cogent, Tata, Level3 and Telia networks. - 日期 - 29/07/2015 16:00 - 29/07/2015 16:10
- 最後更新 - 29/07/2015 20:49
- 優先級 - 重大
- 影響範圍 伺服器 - RBX2
-
We're currently experiencing an outage on server2a due to a webserver crash; it has caused high load on our system, and we're working as fast as possible to solve the issue.
Update 0015:
The machine is back online after forcing a reboot of the system.
The system that we use to force the reboot seemed to have some issues, meaning we needed to call the technical support department at the datacenter to force the reboot from their end.
This meant it took a bit longer.
We're currently investigating what went wrong, and how we can prevent it in the future.
Until then we'll leave this ticket open.
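As an example of the kind of prevention we are considering, the sketch below is a minimal watchdog that checks the local webserver and the load average, and restarts the webserver if it stops responding. The URL, service name and load threshold are assumptions for illustration, not our actual configuration.

import os
import subprocess
import urllib.request

CHECK_URL = "http://127.0.0.1/"  # hypothetical local health-check URL
SERVICE_NAME = "httpd"           # assumed webserver service name
LOAD_LIMIT = 50.0                # 1-minute load average considered critical (illustrative)

def webserver_responds(url=CHECK_URL, timeout=10):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except OSError:
        return False

def main():
    load1, _, _ = os.getloadavg()
    if not webserver_responds() or load1 > LOAD_LIMIT:
        # Restart the webserver; `service` on older CentOS, `systemctl restart` on newer systems.
        subprocess.run(["service", SERVICE_NAME, "restart"], check=False)

if __name__ == "__main__":
    main()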
For the next few hours the server will have longer response times, because we need to verify that all data on the system is intact.
We're sorry for the issues caused. - 日期 - 04/06/2015 23:56 - 05/06/2015 00:15
- 最後更新 - 19/06/2015 11:20
- 優先級 - 高
-
We're currently experiencing issues with our Frankfurt-A POP for some cloud servers.
The issue is being worked on; there is currently no ETA for when it will be fixed.
The issue was resolved at 13:00 - 日期 - 09/06/2015 12:10 - 09/06/2015 13:00
- 最後更新 - 19/06/2015 11:20
- 優先級 - 低
- 影響範圍 系統 - All servers
-
We're currently updating all services that use the *.hosting4real.net SSL certificate, since it's about to expire. This means you might need to accept the new certificate if you force TLS over your own domain.
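If you want to verify which certificate a service presents and when it expires, a quick check can be done with Python's standard library. The hostname below is a placeholder.

import socket
import ssl
from datetime import datetime, timezone

def cert_expiry(host, port=443, timeout=10):
    """Return the notAfter expiry time of the certificate presented by host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc)

if __name__ == "__main__":
    host = "www.hosting4real.net"  # placeholder hostname
    print(host, "certificate expires", cert_expiry(host))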
[UPDATE 07:42PM]
All certificates have been updated.
- 日期 - 14/05/2015 19:07 - 14/05/2015 19:42
- 最後更新 - 14/05/2015 19:45
- 優先級 - 重大
- 影響範圍 伺服器 - RBX2
-
This morning at 6 AM we had downtime on server2a (RBX2) due to a kernel panic on the machine.
We're working on finding the root cause of this kernel panic.
We rebooted the system to get it back online, and after around 10 minutes of downtime all our checks reported the server as up again.
The server is currently resyncing the whole RAID array to ensure data integrity, which causes somewhat higher load than usual. Due to the size of our disks this will take quite some time, but we'll try to speed up the process by giving it higher priority without affecting performance too much.
If you experience any problems with your website, please contact us at support@hosting4real.net.
[UPDATE 1:30PM]
We had a permission issue with a few files that prevented some customers from uploading files in a range of CMSs, depending on how the CMS handles file uploads.
This was reported by a customer and has been fixed.
[UPDATE 9:19PM]
There are still around 70-110 minutes left of the RAID synchronization. We let the sync run slowly during most of the day to affect performance as little as possible, but since traffic on the server is quite low at this point, we increased the synchronization speed to around 70-90 megabytes per second whenever possible. We still have another 400 gigabytes of disk to verify.
We still give the process low priority, meaning all other workloads keep the highest priority and may therefore affect the synchronization time.
We'll post an update when the synchronization finishes.
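For reference, this kind of tuning can be observed and adjusted on Linux software RAID through /proc. The sketch below reads the resync progress from /proc/mdstat and can raise the kernel's minimum resync speed; the target speed is an illustrative assumption, and writing the tunable requires root.

import re

MDSTAT = "/proc/mdstat"
SPEED_MIN = "/proc/sys/dev/raid/speed_limit_min"  # KB/s, kernel-wide minimum resync speed

def resync_progress():
    """Return a list of (device, percent_done) for arrays currently resyncing."""
    progress = []
    current = None
    with open(MDSTAT) as fh:
        for line in fh:
            m = re.match(r"(md\d+)\s*:", line)
            if m:
                current = m.group(1)
                continue
            m = re.search(r"(?:resync|recovery)\s*=\s*([\d.]+)%", line)
            if m and current:
                progress.append((current, float(m.group(1))))
    return progress

def raise_min_speed(kb_per_sec=70000):
    """Raise the minimum resync speed to roughly 70 MB/s (illustrative value, needs root)."""
    with open(SPEED_MIN, "w") as fh:
        fh.write(str(kb_per_sec))

if __name__ == "__main__":
    for device, percent in resync_progress():
        print(f"{device}: {percent:.1f}% done")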
[UPDATE 11.47PM]
The RAID synchronization has now finished, and performance is 100% back to normal.
The last remaining step is that our backup system needs to verify backup integrity as well, which means it will also scan the data on the disks. This takes a few hours and will start at 4 AM.
[UPDATE 7:49AM]
The backup finished as normal, which means everything is back to normal.
Best Regards,
Hosting4Real - 日期 - 13/05/2015 06:00 - 15/05/2015 07:49
- 最後更新 - 14/05/2015 08:00
- 優先級 - 高
- 影響範圍 系統 - Network
-
Some Danish customers might experience connectivity issues to our servers, as well as to other hosting providers in Denmark and the rest of the world, due to an outage at TDC.
ISPs are doing their best to route traffic via other providers, but be aware that customers on the TDC network might experience a complete outage towards a lot of websites.
We're sorry for the inconvenience.
UPDATE 16:13:
TDC appears to have resolved their issues. - 日期 - 22/04/2015 15:07 - 22/04/2015 16:13
- 最後更新 - 22/04/2015 16:14
- 優先級 - 重大
- 影響範圍 系統 - All servers
-
Due to a major security issue with glibc on Linux, we're required to reboot all servers in our infrastructure.
This is expected to take only 5-10 minutes per server, but we've scheduled a 2-hour maintenance window in case of problems.
We're sorry for the short notice, but the update only became available this morning, and it requires rebooting all systems.
We'll start with server1 at 09:30; when it's back up we will proceed to server2a, and after that the remaining nameservers and backup servers, which don't affect customers directly.
We're very sorry for the inconvenience.
- Hosting4Real
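Given the date, this was most likely the GHOST vulnerability (CVE-2015-0235), though the notice does not name it, so treat that as an assumption. A quick way to record which glibc version each server runs before and after patching is sketched below, using only standard tools.

import platform
import subprocess

def glibc_version():
    """Best-effort glibc version: try the Python helper first, then `ldd --version`."""
    name, version = platform.libc_ver()
    if version:
        return f"{name} {version}"
    # Fallback: the first line of `ldd --version` reports the glibc release.
    out = subprocess.run(["ldd", "--version"], capture_output=True, text=True, check=False)
    return out.stdout.splitlines()[0] if out.stdout else "unknown"

if __name__ == "__main__":
    print(glibc_version())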
UPDATE 09:35:
server1 was rebooted, had 239 seconds of downtime.
Proceeding with server2a.
UPDATE 09:41:
Server2 was rebooted, and had 173 seconds of downtime.
We will proceed with nameservers, backup servers, etc.
These won't have direct impact on customers. We will keep you updated.
UPDATE 09:53:
mx2 (ns3) was rebooted, and had 92 seconds of downtime.
UPDATE 09:54:
ns4 was rebooted and had 14 seconds of downtime.
backup server was rebooted and had 191 seconds of downtime.
Updating is done for today. - 日期 - 28/01/2015 09:30 - 28/01/2015 09:54
- 最後更新 - 28/01/2015 10:25
- 優先級 - 高
- 影響範圍 系統 - All servers
-
You may currently be hearing about the POODLE SSL attack, which affects SSLv3.
The CVE assigned to this attack is CVE-2014-3566. We've decided to disable SSLv3 on all our servers and only allow TLS 1.0, 1.1 and 1.2, which don't have this issue.
Dropping SSLv3 also means that IE6 and other very old browsers can no longer visit sites using SSL on our servers. Given the age of these browsers and the tiny amount of traffic we see from them, we don't consider this a problem.
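To confirm that a server negotiates TLS rather than SSLv3, you can inspect the negotiated protocol from a client. A minimal sketch using Python's standard library is shown below; the hostname is a placeholder, and note that most modern OpenSSL builds no longer offer SSLv3 at all.

import socket
import ssl

def negotiated_protocol(host, port=443, timeout=10):
    """Connect with default client settings and report the negotiated protocol version."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version()  # e.g. "TLSv1.2"

if __name__ == "__main__":
    host = "www.hosting4real.net"  # placeholder hostname
    print(host, "negotiated", negotiated_protocol(host))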
If you have any questions, please contact support@hosting4real.net.
Best Regards,
Hosting4Real - 日期 - 15/10/2014 22:00 - 16/10/2014 22:30
- 最後更新 - 09/11/2014 00:48
- 優先級 - 高
- 影響範圍 系統 - Backup server2a
-
Due to the amount of storage used by the disk safe for server2a, we need to recreate it.
This means we'll delete the old disk safe, including its backups, create a new one and start it immediately. We also store one weekly backup on other servers, so if you need a backup restored, please contact our support.
Sorry for the inconvenience.
[UPDATE 08:40]
The disk safe has been recreated, and we've now queued the server for backup.
[UPDATE 12:04]
The first few backups are now done, and backups will continue as usual. - 日期 - 08/10/2014 07:47 - 08/10/2014 12:04
- 最後更新 - 08/10/2014 19:40
- 優先級 - 高
-
Our datacenter provider will perform maintenance on their network in the AMS-01 datacenter. This maintenance will impact IP connectivity; they will be working on multiple routers and expect that each segment can be affected for up to two hours.
There will be periods of service unavailability. While connectivity is being restored, you might also experience higher latency and/or packet loss.
The IP ranges affected by this maintenance are:
91.215.156.0/23 91.215.158.0/23 95.211.94.0/23 95.211.168.0/21 95.211.184.0/21 95.211.192.0/21 95.211.224.0/21 95.211.240.0/22 179.43.174.128/26 185.17.184.0/23 185.17.186.0/23 188.0.225.0/24 192.162.136.0/23 192.162.139.0/24 193.151.91.0/24 212.7.198.0/24 212.114.56.0/23 2001:1af8:4020::/44
This will also affect connectivity for Server1.
Sorry for the inconvenience.
UPDATE 03-09-2014 18:01:
We received an update from the datacenter: the expected downtime should be no longer than 30 minutes.
All the IPs are undergoing an AS change, so the BGP AS number needs to be changed on the routers.
Due to the nature of how the internet works, many routers need to pick up these changes, resulting in connectivity issues.
UPDATE 16-09-2014 08:05:
The maintenance is still ongoing; we're awaiting an update from LeaseWeb with more information. Sorry for the inconvenience.
We see pings going through from time to time, so the network is being restored at this very moment.
Sincerely,
Hosting4Real - 日期 - 16/09/2014 06:00 - 16/09/2014 08:07
- 最後更新 - 17/09/2014 08:30
- 優先級 - 重大
-
The IPs 95.211.199.40, .41 and .42, which serve our main site and server1, are currently blacklisted on the Spamhaus SBL.
This is because the IPs are part of the subnet 95.211.192.0/20, owned by LeaseWeb, which has been blacklisted as a whole.
We're awaiting a reply from LeaseWeb with a status update and an estimated resolution time.
You can find more about the blacklisting here: http://www.spamhaus.org/sbl/query/SBL230484
Due to the good reputation of the IPs assigned to us, this blacklisting will only affect our customers if the receiving ISP relies entirely on Spamhaus for its spam filtering.
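A listing like this can be checked directly against the DNSBL: reverse the IP's octets, prepend them to the blocklist zone and resolve the name; any answer means listed, NXDOMAIN means not listed. A minimal sketch, using the IPs from this notice (note that Spamhaus may not answer queries sent via some large public resolvers):

import socket

def is_listed(ip, zone="sbl.spamhaus.org"):
    """Return True if `ip` is listed in the given DNSBL zone."""
    reversed_ip = ".".join(reversed(ip.split(".")))
    query = f"{reversed_ip}.{zone}"
    try:
        socket.gethostbyname(query)  # any A record means the IP is listed
        return True
    except socket.gaierror:
        return False                 # NXDOMAIN -> not listed

if __name__ == "__main__":
    for ip in ["95.211.199.40", "95.211.199.41", "95.211.199.42"]:
        print(ip, "listed" if is_listed(ip) else "not listed")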
We're truly sorry for the inconvenience caused by this issue.
UPDATE 13/08/2014 22:16
We've received notification from LeaseWeb that their abuse team is aware of the issue and is awaiting a reply from Spamhaus.
UPDATE 15/08/2014 09:47
The IP subnet has now been removed from the Spamhaus blacklist.
We haven't received any email bounces, so no customers appear to have been affected. - 日期 - 13/08/2014 21:23 - 15/08/2014 09:47
- 最後更新 - 15/08/2014 16:32
- 優先級 - 高
- 影響範圍 系統 - WHMCS
-
We're currently experiencing some issues retrieving the correct data for customer VPSs, which means you might not see your VPS in the control panel at the moment.
We're working on a fix.
UPDATE 17/07/2014 19:47:
All issues should be resolved, meaning all nodes should be visible to customers.
One thing you might notice is that there have been some changes to how your VMs are shown in your product list. This change was made because some customers found it confusing that all existing VMs were grouped under the same product.
You'll now see your 'cloud nodes' - this is the main product that you pay for on a monthly basis.
The other products you'll see are the actual VPS servers you've deployed; these have a price of 0, since you already pay for the resources through the main product.
From there you can manage your VPS and see the storage, compute and memory resources you've allocated, as well as the location and operating system. - 日期 - 16/07/2014 11:32 - 17/07/2014 19:47
- 最後更新 - 17/07/2014 19:52
- 優先級 - 中
-
We are moving a number of customers from Server2 (RBX1) to a new machine, to improve performance even further with new hardware and networking.
During the migration there may be some brief outages, so the affected customers are asked not to make too many changes to their hosting accounts between 21:00 and 07:00.
An email will be sent out for each hosting account before it is moved and again when the move is complete, so you are kept up to date on the status of your migration.
UPDATE 23.17:
We have moved the majority of customers from the old server to the new one and are done for this weekend.
We will continue with more migrations next weekend, possibly already during the week; existing customers will be informed before they are moved.
UPDATE 4 July 21.23:
All customers have been moved from the old server2 to the new server2a. The old server will be powered off on 20 July, after which we will change the hostname from server2a back to server2. - 日期 - 27/06/2014 21:00 - 28/06/2014 07:00
- 最後更新 - 06/07/2014 22:38
- 優先級 - 重大
-
Server2 is currently down due to a hardware failure. The defective part is being replaced.
Sorry for the inconvenience.
UPDATE 09.07:
Server2 is back online. We had issues with some customers' MySQL databases due to the way MySQL crashed, which left some databases corrupt. MySQL recovered them automatically from the InnoDB log files.
The system is slower than usual, because we're resyncing all data in the RAID array to ensure no data is lost.
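After a crash recovery like this, the recovered databases should be verified for consistency. As a hedged illustration (not our exact tooling), the sketch below runs MySQL's own table check across all databases and reports anything that is not OK; it assumes credentials are available to the user running it, for example via a client config file.

import subprocess

def check_all_databases():
    """Run `mysqlcheck --check` on all databases and return lines that are not OK."""
    result = subprocess.run(
        ["mysqlcheck", "--all-databases", "--check"],
        capture_output=True, text=True, check=False,
    )
    return [line for line in result.stdout.splitlines()
            if line.strip() and not line.rstrip().endswith("OK")]

if __name__ == "__main__":
    for line in check_all_databases():
        print("NOT OK:", line)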
UPDATE 11.00:
The server went down again; we performed a reboot and it was back online after about 4 minutes.
We've investigated the issue and the problem should be resolved.
We'll monitor the server extensively the rest of the day.
Sorry for the inconvenience. - 日期 - 03/03/2014 07:20 - 03/03/2014 11:12
- 最後更新 - 06/03/2014 10:21
- 優先級 - 高
- 影響範圍 系統 - All servers
-
Back in January we were performing upgrades, which were cancelled due to issues with one of our servers.
We'll now be performing these upgrades, which will require all servers to restart; the estimated downtime per server is approx. 5-10 minutes.
We apologise for any inconvenience.
We'll do our best to keep the downtime as low as possible.
Due to unforeseen circumstances, we've decided to postpone the upgrade until next week. We're sorry about the short notice.
- 日期 - 07/02/2014 22:00 - 07/02/2014 23:59
- 最後更新 - 07/02/2014 13:36
- 優先級 - 高
- 影響範圍 系統 - All servers
-
During this time we'll be upgrading all our servers with the latest security patches for all services.
This requires a restart of all servers; the estimated downtime for each server is around 5 minutes if everything goes as planned.
This affects all of our infrastructure, except VPS customers.
We'll keep all clients informed on Twitter when we begin the update, when we finish, and of any issues that might occur during this 2-hour timeframe.
We'll try to keep the downtime as low as possible.
UPDATE 1:
During the update of server2, an error occurred while loading the kernel.
It is being restarted again.
UPDATE 2 (00.59):
Due to the extended downtime on server2, we have decided to stop the update of the remaining servers and postpone the upgrade for all our services. Because of the long downtime, we are extending all hosting accounts on server2 with 1 month free of charge.
We will write a blog post tomorrow about the problems we experienced this evening.
We apologise many times over for the extra downtime this caused. - 日期 - 27/12/2013 21:59 - 28/12/2013 00:58
- 最後更新 - 05/02/2014 21:01
- 優先級 - 重大
-
We need to perform an upgrade of nginx on server 2. We've found a small issue in the current version which is fixed in the new one.
We're sorry for the short notice.
- Hosting4Real - 日期 - 02/10/2013 22:30 - 02/10/2013 22:59
- 最後更新 - 03/10/2013 00:41
- 優先級 - 高
-
We'll be upgrading our MySQL version from 5.1 to 5.5.
This upgrade will improve the performance of MySQL on server2.
We've planned 2 hours for the upgrade, which should be more than enough.
We'll back up all databases before we start. We might run into short timeouts or brief downtime for some websites, but we will try to keep this to a minimum.
This is part of our normal service upgrades, and it's a must that we do it.
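For reference, a dump-based backup of all databases before an upgrade like this can be as simple as the sketch below. The output path and the assumption that credentials come from a client config file are illustrative, not a description of our exact procedure.

import os
import subprocess
from datetime import date

def dump_all_databases(output_dir="/root/mysql-backups"):
    """Dump every database to a single SQL file before the upgrade."""
    os.makedirs(output_dir, exist_ok=True)
    outfile = f"{output_dir}/all-databases-{date.today().isoformat()}.sql"
    with open(outfile, "w") as fh:
        subprocess.run(
            ["mysqldump", "--all-databases", "--single-transaction", "--routines"],
            stdout=fh, check=True,
        )
    return outfile

if __name__ == "__main__":
    print("Backup written to", dump_all_databases())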
Best regards,
Hosting4Real
UPDATE 22.11:
Everything went as expected. - 日期 - 07/09/2013 22:00 - 07/09/2013 22:11
- 最後更新 - 07/09/2013 22:16
- 優先級 - 高
-
Tonight at 01:09 CEST we saw a small outage of 7-10 minutes on server2 (RBX1). The problem was that all route reflectors crashed at the same time. The datacenter is working on finding a solution; as a test, some of the reflectors were downgraded to an earlier version of the software, but this didn't solve the problem.
These outages are very rare, because the setup is redundant: one reflector can go down and another will still handle the routing without causing any downtime.
The problem affected the global network, a total of 7 datacenters in 3 different locations. Those datacenters have a total of 10 route reflectors, which all went down at almost the same time. The datacenter is working together with Cisco to find the root cause of the problem and will fix it as soon as possible.
- Hosting4Real - 日期 - 18/07/2013 01:09 - 01/08/2013 00:00
- 最後更新 - 05/08/2013 07:36
- 優先級 - 重大
-
We'll be changing the webserver setup on RBX1 (Server2) this evening due to an upcoming cPanel release.
We will remove nginx until cPanel has been upgraded and we've tested the webserver properly with the new version.
During the uninstall, all websites will be unavailable for a very short time, since we need to stop nginx and start Apache again.
Sorry for the short notice, but it was only announced this morning that cPanel 11.38 will be released within a few days, possibly today, so we want to make sure everything works.
- Hosting4Real
UPDATE 1:
We've successfully made the change, and everything is working. We'll get nginx back on the server when we've tested it properly with the new release.
UPDATE 2:
Nginx is up and running again. The change caused 1-2 minutes of downtime for a few people, due to a small mistake with the log formatting in the nginx configuration file. - 日期 - 24/05/2013 21:00 - 24/05/2013 21:02
- 最後更新 - 15/06/2013 15:16
- 優先級 - 中
- 影響範圍 系統 - All servers
-
We'll be performing software upgrades affecting PHP and nginx.
This upgrade has been tested in our development environment and resulted in 3 seconds of downtime. We'll try to keep it to the same amount, if not avoid downtime entirely.
During the PHP upgrade, the memcache, htscanner and New Relic modules will be removed for a short amount of time. This means that if you use any of these modules, your code might return errors during the upgrade.
If you make use of htscanner, please make sure you use the correct directives in your .htaccess files, otherwise this will result in internal server errors for your sites until the module is recompiled (2-3 minutes).
We've set the maintenance window to two hours. Downtime will be minimal.
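If you want to confirm whether these modules are loaded again after the upgrade, you can list the compiled-in PHP modules from the command line. A small sketch, using the module names mentioned above, is shown here.

import subprocess

MODULES = ["memcache", "htscanner", "newrelic"]  # modules mentioned in this notice

def loaded_php_modules():
    """Return the set of module names reported by `php -m`."""
    out = subprocess.run(["php", "-m"], capture_output=True, text=True, check=True)
    return {line.strip().lower() for line in out.stdout.splitlines() if line.strip()}

if __name__ == "__main__":
    loaded = loaded_php_modules()
    for module in MODULES:
        print(f"{module}: {'loaded' if module in loaded else 'missing'}")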
- Hosting4Real
UPDATE 1:
Server1 is now fully updated; the next server will be server2.
UPDATE 2:
All servers are now up to date; everything went as it should.
Thank you for your patience. - 日期 - 17/05/2013 22:59 - 18/05/2013 00:01
- 最後更新 - 18/05/2013 00:05
- 優先級 - 重大
-
We're currently seeing high packet loss on server2, affecting all of its IPs: 94.23.28.169, 94.23.147.217 and 94.23.148.5.
We've narrowed the problem down to the parts of the network called vss-1-6k and eur-1-1c. The packet loss is currently between 53% and 98%, which means sites may only be reachable for short periods.
Sorry for the downtime this causes; we will try to get things back to normal as fast as possible.
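Packet loss figures like these can be measured from any host with a simple ping loop. The sketch below parses the summary line of Linux iputils ping; the target IP is one of the addresses listed above, and the output-format assumption may need adjusting on other systems.

import re
import subprocess

def packet_loss(host, count=50):
    """Return the packet loss percentage reported by `ping` for `host`."""
    out = subprocess.run(
        ["ping", "-c", str(count), "-q", host],
        capture_output=True, text=True, check=False,
    )
    match = re.search(r"([\d.]+)% packet loss", out.stdout)
    return float(match.group(1)) if match else None

if __name__ == "__main__":
    print("Loss towards 94.23.28.169:", packet_loss("94.23.28.169"), "%")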
UPDATE: The network is back to normal.
- 日期 - 17/05/2013 04:35 - 17/06/2013 06:19
- 最後更新 - 17/05/2013 06:35
- 優先級 - 重大
-
We had a small outage on RBX1 for a short period of 6-7 minutes. The error occurred while generating vhosts; we're looking into why this happened and will do our best to prevent it in the future. We're sorry for the small outage.
- 日期 - 13/02/2013 17:24 - 13/02/2013 17:52
- 最後更新 - 13/02/2013 17:51
- 優先級 - 低
-
We're making a small update to some software on AMS1, which means there may be a few minutes of downtime. We'll do our best to avoid this.
- Hosting4Real
- 日期 - 08/02/2013 23:50 - 08/02/2013 23:59
- 最後更新 - 09/02/2013 00:08
- 優先級 - 中
-
Customers located on AMS1 (Server1) are currently unable to log into cPanel. We found out that the license renewal didn't work as expected; this should be working again soon.
Sorry for the time you were unable to log in. Email and websites will keep functioning as normal.
UPDATE: The problem is fixed. Everything should now function normally. - 日期 - 02/02/2013 12:16 - 02/02/2013 14:24
- 最後更新 - 02/02/2013 14:24
- 優先級 - 低
-
During the night between December 14 and December 15, we'll carry out server maintenance. This requires powering off the affected server; our estimate is 2 hours, but we've reserved 4 hours for this maintenance.
During this time, services will not be available. We're sorry for the downtime this may cause, but we'll do our best to keep it as short as possible! - 日期 - 14/12/2012 23:59 - 15/12/2012 03:29
- 最後更新 - 04/12/2012 15:53
- 優先級 - 中
- 影響範圍 其他 - Network
-
The core routers in front of the VPSs will be updated with new software due to a new security release. This means VPSs in Amsterdam are expected to be down for about 10 minutes.
- 日期 - 14/11/2012 20:00 - 14/11/2012 20:10
- 最後更新 - 17/11/2012 16:12
- 優先級 - 中
- 影響範圍 其他 - Network
-
The core routers in front of the VPSs will be updated with new software due to a new security release. This means VPSs in Frankfurt are expected to be down for about 10 minutes.
- 日期 - 13/11/2012 20:00 - 13/11/2012 20:10
- 最後更新 - 17/11/2012 16:12
- 優先級 - 中
- 影響範圍 其他 - Network
-
The core routers in front of the VPSs will be updated with new software due to a new security release. This means VPSs in Paris are expected to be down for about 10 minutes.
- 日期 - 12/11/2012 20:00 - 12/11/2012 20:10
- 最後更新 - 17/11/2012 16:12