CN114546644B - Cluster resource scheduling method, device, software program, electronic device and storage medium

Info

Publication number: CN114546644B
Application number: CN202210144907.XA
Authority: CN
Inventors: 方睿
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2022-02-17
Filing date: 2022-02-17
Publication date: 2025-06-13
Anticipated expiration: 2042-02-17
Also published as: CN114546644A

Abstract

The present invention provides a cluster resource scheduling method, device, software program, electronic device and storage medium, the method comprising: configuring a workload in a cluster resource scheduling environment, and determining a timeout queue that carries the workload; when the workload in the timeout queue reaches a timeout state, the controller component adjusts the state of the workload to a secondary scheduling state; based on the workload information, a failed copy number detection request is sent to the corresponding estimator component; the estimator component responds to the failed copy number detection request to determine the number of copies that failed to schedule, and the scheduler component executes the cluster resource scheduling program based on the number of copies that failed to schedule. In this way, the availability of the workload can be guaranteed, the accuracy and reliability of cluster resource scheduling can be improved, the utilization efficiency of cluster resources can be improved, and the data processing speed of cloud server users can be guaranteed. The embodiments of the present invention can be applied to various scenarios such as cloud technology, artificial intelligence, smart transportation, and assisted driving.

Description

Cluster resource scheduling method, device, software program, electronic device and storage medium

Technical Field

The present invention relates to a cluster resource scheduling technology of a cloud network, and in particular, to a cluster resource scheduling method, a device, a software program, an electronic device, and a storage medium.

Background

With the continuous development of computer technology, a cloud server (Cloud Virtual Machine, CVM) can provide secure and reliable elastic computing services, and can also provide different instance types to meet the requirements of user-specific usage scenarios. An instance type is a particular combination of CPU, memory, storage and network. When cluster resources are scheduled while the cloud server is running, the scheduling speed directly affects the resource utilization and the user experience of the cloud data center. Scheduling first guarantees that a user can be allocated resources, and then optimizes the allocation, that is, improves resource utilization. However, in the related art, a node may still exit abnormally while the cluster is running, leaving copies with no resources on which to be scheduled. In addition, in a multi-cluster environment, sub-clusters compete for resources during task processing, which affects the accuracy and reliability of cluster resource scheduling.

Disclosure of Invention

In view of the above, embodiments of the present invention provide a cluster resource scheduling method, device, software program, electronic device, and storage medium that execute a cluster resource scheduling program based on the number of copies that failed to schedule, so as to utilize the maximum number of available copies in the cluster resource scheduling environment. This ensures the availability of the workload, improves the accuracy and reliability of cluster resource scheduling, improves the availability of cluster resources, ensures the data processing speed of cloud server users, and improves the user experience.

The technical scheme of the embodiment of the invention is realized as follows:

the embodiment of the invention provides a cluster resource scheduling method, which comprises the following steps:

configuring a workload in a cluster resource scheduling environment, and determining a timeout queue carrying the workload;

when the workload in the timeout queue reaches a timeout state, the controller component adjusts the state of the workload to a secondary scheduling state;

when the scheduler component determines that the state of the workload is the secondary scheduling state, the scheduler component sends a failed copy number detection request to the corresponding estimator component based on the information of the workload;

the estimator component responds to the failed copy number detection request, determines the number of copies that failed to schedule, and sends that number to the scheduler component;

the scheduler component executes the cluster resource scheduling program based on the number of copies that failed to schedule, so as to utilize the maximum number of available copies in the cluster resource scheduling environment.
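The steps above can be sketched end to end as follows. This is a minimal illustration, not the patented implementation: the state name `SecondaryScheduling`, the `Estimator` interface, the `available` capacity field and the policy of moving failed copies to other sub-clusters are all assumptions made for the example.

```python
from dataclasses import dataclass

SECONDARY = "SecondaryScheduling"  # hypothetical state name


@dataclass
class Workload:
    name: str
    state: str = "Scheduled"


@dataclass
class Estimator:
    """Stand-in for the estimator of one sub-cluster (hypothetical interface)."""
    failed: dict      # workload name -> copies that failed to schedule here
    available: int    # copies this sub-cluster could still host

    def failed_copies(self, workload):
        return self.failed.get(workload.name, 0)


def controller_on_timeout(workload):
    """Controller: a workload that timed out in the queue is marked for rescheduling."""
    workload.state = SECONDARY


def scheduler_reschedule(workload, estimators):
    """Scheduler: ask each estimator for failed copies, then place them elsewhere.

    Returns {cluster: delta}; negative means copies removed, positive means added.
    """
    if workload.state != SECONDARY:
        return {}
    failed = {c: e.failed_copies(workload) for c, e in estimators.items()}
    plan = {c: -n for c, n in failed.items() if n > 0}
    remaining = sum(failed.values())
    for cluster, est in estimators.items():
        if remaining == 0:
            break
        if failed[cluster] > 0:
            continue  # assumption: do not retry in a cluster that just failed
        take = min(remaining, est.available)
        if take > 0:
            plan[cluster] = take
            remaining -= take
    if remaining == 0:
        workload.state = "Scheduled"
    return plan


wl = Workload("web")
estimators = {"cluster-a": Estimator({"web": 2}, 0), "cluster-b": Estimator({}, 3)}
controller_on_timeout(wl)
plan = scheduler_reschedule(wl, estimators)
print(plan)
```

In this sketch the two copies that could not be scheduled in `cluster-a` are moved to `cluster-b`, which still has capacity; a real scheduler would apply its own placement policy at that point.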

The embodiment of the invention also provides a cluster resource scheduling device, which comprises:

an information transmission device, configured to configure a workload in a cluster resource scheduling environment and determine a timeout queue carrying the workload;

an information processing device, configured to adjust, through the controller component, the state of the workload to a secondary scheduling state when the workload in the timeout queue reaches a timeout state;

the information processing device is configured to send, when the scheduler component determines that the state of the workload is the secondary scheduling state, a failed copy number detection request to the corresponding estimator component based on the information of the workload;

the information processing device is configured to determine, through the estimator component in response to the failed copy number detection request, the number of copies that failed to schedule, and to send that number to the scheduler component;

the information processing device is configured to execute, through the scheduler component and based on the number of copies that failed to schedule, the cluster resource scheduling program so as to utilize the maximum number of available copies in the cluster resource scheduling environment.

In the above solution, the information processing device is configured to determine, through the controller component, the expected number of copies in the cluster resource scheduling environment;

The information processing device is used for the controller component to detect the workload in real time based on the expected copy number;

The information processing device is used for adjusting the workload to the timeout queue when the number of copies in the workload is smaller than the expected number of copies.

In the above aspect, the information processing apparatus is configured to detect, when a copy number of the workload in the timeout queue changes, the workload in the timeout queue based on an expected copy number by the controller component;

The information processing device is configured to keep the workload in the timeout queue when the number of copies in the workload is smaller than the expected number of copies;

the information processing device is configured to delete the workload from the timeout queue when the number of copies in the workload is greater than or equal to the expected number of copies.
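The enqueue, hold and delete rules for the timeout queue can be illustrated with a small sketch. The class name `TimeoutQueue`, its methods and the 30-second timeout are hypothetical; the source does not specify a data structure or a timeout value.

```python
import time


class TimeoutQueue:
    """Tracks workloads that are below their expected number of copies."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the example is deterministic
        self._since = {}            # workload name -> time it first fell short

    def observe(self, name, current, expected):
        """Call whenever the copy count of a workload changes."""
        if current < expected:
            # Enqueue on first shortfall, otherwise keep the original timestamp.
            self._since.setdefault(name, self.clock())
        else:
            # Expected count reached: remove the workload from the queue.
            self._since.pop(name, None)

    def timed_out(self):
        """Names of workloads that have been short for at least timeout_s."""
        now = self.clock()
        return [n for n, t in self._since.items() if now - t >= self.timeout_s]


now = [0.0]
queue = TimeoutQueue(timeout_s=30, clock=lambda: now[0])
queue.observe("web", current=2, expected=3)   # enters the queue at t=0
now[0] = 10.0
queue.observe("web", current=2, expected=3)   # still short: held, timer not reset
now[0] = 30.0
print(queue.timed_out())
queue.observe("web", current=3, expected=3)   # recovered: deleted from the queue
print(queue.timed_out())
```

Keeping the first timestamp while a workload stays short means that repeated copy-count changes do not postpone the timeout, which is one plausible reading of "holding" the workload in the queue.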

In the above scheme, the information processing device is configured to obtain node information and container group information of all nodes in a sub-cluster in the cluster resource scheduling environment by using the estimator component;

The information processing device is used for responding to the failed copy number detection request by the estimator component, inquiring the container group associated with the working copy in the sub-cluster, and determining a container group list corresponding to the container group;

the information processing device is configured to query, through the estimator component, the container groups that failed to schedule from the container group list, and to calculate the number of copies that failed to schedule from those container groups.
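Counting the copies that failed to schedule from a container group list might look like the sketch below. It assumes that container groups are represented as plain dictionaries shaped like Kubernetes Pod objects, and that a failed scheduling attempt is identified by the `PodScheduled` condition being `False` with reason `Unschedulable`; the source does not state which field the estimator inspects.

```python
def is_unschedulable(pod):
    """True if the Pod-like dict reports that scheduling failed."""
    for cond in pod.get("status", {}).get("conditions", []):
        if (
            cond.get("type") == "PodScheduled"
            and cond.get("status") == "False"
            and cond.get("reason") == "Unschedulable"
        ):
            return True
    return False


def count_failed_copies(pods):
    """Number of container groups in the list that failed to schedule."""
    return sum(1 for pod in pods if is_unschedulable(pod))


pods = [
    {"name": "web-0", "status": {"conditions": [
        {"type": "PodScheduled", "status": "True"}]}},
    {"name": "web-1", "status": {"conditions": [
        {"type": "PodScheduled", "status": "False", "reason": "Unschedulable"}]}},
    {"name": "web-2", "status": {}},  # no conditions reported yet
]
print(count_failed_copies(pods))
```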

In the above solution, the information processing apparatus is configured to determine a copy controller object list corresponding to the working copy when the type of the working copy is a resource type;

The information processing device is used for searching a container group list associated with the working copy through the cache of the copy controller object list;

The information processing device is configured to search the cache of the working copy for the container group list associated with the working copy when the type of the working copy is a stateful copy set type.
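One possible reading of this lookup is sketched below. It assumes that the "resource type" corresponds to a Deployment-like workload whose container groups are owned indirectly through copy controller (ReplicaSet-like) objects, and that the stateful copy set type corresponds to a StatefulSet-like workload that owns its container groups directly. The caches are plain lists, and the `owner` tuples are a hypothetical simplification of owner references.

```python
def pods_for_workload(workload, replica_sets, pods):
    """Return the container groups associated with a workload.

    workload:     {"kind": ..., "name": ...}
    replica_sets: cached copy controller objects, each {"name", "owner": (kind, name)}
    pods:         cached container groups, each {"name", "owner": (kind, name)}
    """
    kind, name = workload["kind"], workload["name"]
    if kind == "StatefulSet":
        # Stateful workloads own their container groups directly.
        owners = {("StatefulSet", name)}
    else:
        # Otherwise go through the copy controller objects owned by the workload.
        owners = {
            ("ReplicaSet", rs["name"])
            for rs in replica_sets
            if rs["owner"] == (kind, name)
        }
    return [p for p in pods if p["owner"] in owners]


replica_sets = [{"name": "web-7d9", "owner": ("Deployment", "web")}]
pods = [
    {"name": "web-7d9-abc", "owner": ("ReplicaSet", "web-7d9")},
    {"name": "db-0", "owner": ("StatefulSet", "db")},
]
web_pods = pods_for_workload({"kind": "Deployment", "name": "web"}, replica_sets, pods)
db_pods = pods_for_workload({"kind": "StatefulSet", "name": "db"}, replica_sets, pods)
print([p["name"] for p in web_pods], [p["name"] for p in db_pods])
```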

In the above solution, the information processing apparatus is configured to, when the estimator component is started, obtain node information and container group information of all nodes in a sub-cluster in a cluster resource scheduling environment;

the information processing device is used for responding to the maximum available copy number estimation request by the estimator component, and screening nodes matched with the workload from all nodes in the cluster resource;

The information processing device is used for determining container group information corresponding to each node matched with the workload, and determining the maximum available copy number of each node matched with the workload based on the container group information;

The information processing device is used for determining the maximum available copy number in the cluster resource scheduling environment based on the maximum available copy number of each node matched with the workload.
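The maximum available copy number estimation can be sketched as follows. The example assumes that node matching is done by label selector, that a node's free capacity is its allocatable resources minus the requests of the container groups already on it, and that the cluster-wide figure is the sum of the per-node figures. None of these details is fixed by the source. Resource quantities are integers (CPU in millicores, memory in MiB) chosen for the example.

```python
def max_copies_on_node(allocatable, pod_requests, per_copy):
    """How many more copies fit on one node, limited by the scarcest resource."""
    free = dict(allocatable)
    for req in pod_requests:
        for res, amount in req.items():
            free[res] = free.get(res, 0) - amount
    fits = min(
        (free.get(res, 0) // need for res, need in per_copy.items() if need > 0),
        default=0,
    )
    return max(0, fits)


def max_copies_in_cluster(nodes, per_copy, selector=None):
    """Sum the per-node maxima over the nodes whose labels match the selector."""
    selector = selector or {}
    total = 0
    for node in nodes:
        labels = node.get("labels", {})
        if any(labels.get(k) != v for k, v in selector.items()):
            continue  # node does not match the workload
        total += max_copies_on_node(node["allocatable"], node.get("pods", []), per_copy)
    return total


nodes = [
    {"labels": {"zone": "a"},
     "allocatable": {"cpu": 4000, "memory": 8192},
     "pods": [{"cpu": 1000, "memory": 2048}]},
    {"labels": {"zone": "b"},
     "allocatable": {"cpu": 2000, "memory": 4096},
     "pods": []},
]
per_copy = {"cpu": 500, "memory": 1024}
print(max_copies_in_cluster(nodes, per_copy))
print(max_copies_in_cluster(nodes, per_copy, selector={"zone": "a"}))
```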

The embodiment of the invention also provides an electronic device, which comprises:

a memory for storing executable instructions;

a processor for implementing the foregoing cluster resource scheduling method when running the executable instructions stored in the memory.

The embodiment of the invention also provides a computer readable storage medium which stores executable instructions, wherein the executable instructions implement the foregoing cluster resource scheduling method when executed by a processor.

Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the cluster resource scheduling method provided by the embodiment of the application.

The embodiment of the invention has the following beneficial effects:

In the embodiment of the invention, a workload is configured in a cluster resource scheduling environment and a timeout queue carrying the workload is determined. When the workload in the timeout queue reaches the timeout state, the controller component adjusts the state of the workload to the secondary scheduling state. When the scheduler component determines that the state of the workload is the secondary scheduling state, it sends a failed copy number detection request to the corresponding estimator component based on the information of the workload. The estimator component responds to the request, determines the number of copies that failed to schedule, and sends that number to the scheduler component. Based on the number of copies that failed to schedule, the scheduler component executes the cluster resource scheduling program so as to utilize the maximum number of available copies in the cluster resource scheduling environment.

Drawings

Fig. 1 is a schematic diagram of a usage scenario of a cluster resource scheduling method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a composition structure of an electronic device according to an embodiment of the present invention;

Fig. 3 is an optional schematic flowchart of a cluster resource scheduling method according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of a cluster resource scheduling device according to an embodiment of the present invention;

Fig. 5 is an optional schematic flowchart of a cluster resource scheduling method according to an embodiment of the present invention;

Fig. 6 is an optional schematic flowchart of a cluster resource scheduling method according to an embodiment of the present invention;

Fig. 7 is an optional schematic flowchart of a cluster resource scheduling method according to an embodiment of the present invention.

Detailed Description

To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of protection of the present invention.

In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.

Before the embodiments of the present invention are described in further detail, the terms involved in the embodiments are explained; the following explanations apply to these terms.

1) In response to: indicates the condition or state on which a performed operation depends. When the condition or state is satisfied, the one or more operations may be performed in real time or with a set delay; unless otherwise specified, there is no restriction on the order in which multiple operations are performed.

2) Terminal: includes, but is not limited to, an ordinary terminal that maintains long and/or short connections with a transmission channel, and a dedicated terminal that maintains a long connection with the transmission channel.

3) Client: the carrier that implements a specific function in a terminal. For example, a mobile client (APP) is the carrier of a specific function in a mobile terminal, such as generating a report or displaying a report.

4) Component: a functional module of an applet view, also called a front-end component. Buttons, headings, tables, sidebars, content, footers and the like in a page are components; each contains modular code so that it can be reused in different pages of the applet.

5) Server cluster: a collection of servers that together perform the same service and that appears to the client as a single server. A server cluster can use multiple computers for parallel computation to obtain high computation speed, and can also use multiple computers for backup, so that the whole system continues to operate normally if any single machine fails. The server cluster hard disk fault handling method provided by the application can be applied to cloud server scenarios and distributed server scenarios to detect the state of, and repair faults in, server hard disks in different scenarios. Specifically, a cloud server (Cloud Virtual Machine, CVM) is a simple, efficient, secure and reliable computing service with elastically scalable processing capacity, and its management is simpler and more efficient than that of a traditional physical server. Without purchasing hardware in advance, a user can quickly create or release any number of cloud servers for the user's business and store the data of the cloud server user. In a distributed server environment, the data and programs of a user may be spread over multiple servers instead of residing on one server. Likewise, a distributed server environment requires a large number of hard disks, and state detection and fault repair of those hard disks also rely on the server cluster hard disk fault handling method.

6) Kubernetes (also called K8S): an open-source container cluster management system and container operation platform. It can combine multiple containers into one service, dynamically allocate the hosts on which containers run, and so on, which makes containers much more convenient to use. Through Kubernetes, applications can be deployed and scaled quickly, new application functions can be integrated seamlessly, and the use of hardware resources can be optimized.

Nodes are the basic elements of a container cluster. Depending on the service, a node can be a virtual machine or a physical machine. Each node contains the basic components required to run container groups (Pods), including kubelet (the container management component), kube-proxy (the network proxy component), and so on.

The master node (Master) is the cluster control node. It manages and controls the entire cluster; all Kubernetes control commands are issued to it, and it is responsible for carrying them out. The kube-apiserver (resource access component), kube-controller-manager (operations management controller component) and kube-scheduler (scheduling component) running on the master node keep the entire cluster in a healthy operating state by constantly communicating with the kubelet and kube-proxy on the worker nodes (Node). If the services on the master node cannot reach a node, that node is marked as unavailable and newly created Pods (container groups) are not scheduled to it. The master itself, however, requires additional monitoring so that it does not become a single point of failure of the cluster; the master services therefore also need a highly available deployment.

Nodes other than the master are called Nodes or worker nodes. The nodes in a cluster can be listed from the master with the node view command (kubectl get nodes). The master node assigns each node some workload (Docker containers), and when a node goes down, the master node automatically moves the workload on that node to other nodes.

Pod (container group): the smallest and simplest basic unit that Kubernetes creates or deploys. One Pod represents a microservice process running on the cluster. A microservice process encapsulates one edge container (or several edge containers) that provides the microservice application, storage resources, an independent network IP, and policy options that govern how the containers run.

7) Workload: a type of application, which may contain multiple copy instances.

8) Copy: an instance unit of a workload; each copy instance is a separate container.

9) Secondary scheduling: scheduling that reallocates cluster resources during task processing to adapt to task demands.

Before the cluster resource scheduling method provided by the application is introduced, the drawbacks of the related art are briefly described. In the related art, resource scheduling in a cloud network generally uses one of the following approaches:

1) A heavy-load node detection step flags nodes whose resource utilization exceeds 90% as heavily loaded, and copies are then scheduled and migrated, so that load balance across the nodes of the whole storage cluster is eventually achieved. The drawback of this approach is that it is only suitable for a single cluster and cannot be extended to multiple clusters.

2) An application container deployment instruction sent by a user and cluster resource usage information uploaded by a federated cluster are first received. The number of copies to deploy in each sub-cluster of the federated cluster is then determined from the total copy number of the application template in the deployment instruction and the cluster resource usage information, which finally achieves copy scheduling that takes the running condition of sub-cluster resources into account.

3) When a new node is added or at fixed intervals, the Pods that need to be scheduled are selected based on the resource utilization of each node and the average resource utilization of all nodes in the cluster, and these Pods are migrated to nodes with lower resource utilization.

To overcome the above drawbacks, the present application provides a cluster resource scheduling method, device, software program, electronic device, and storage medium. Fig. 1 is a schematic diagram of a usage scenario of the cluster resource scheduling method according to an embodiment of the present application. Referring to Fig. 1, with the continuous development of computer technology, a cloud server (Cloud Virtual Machine, CVM) can provide secure and reliable elastic computing services, and can also provide different instance types to suit user-specific usage scenarios. The terminals (including the terminal 10-1 and the terminal 10-2) are provided with clients capable of executing different functions. Through these clients, the terminals acquire different information from the corresponding cloud server 200 over the network 300 and can deploy different services in the cloud server. The terminals are connected to the server 200 through the network 300, which may be a wide area network, a local area network, or a combination of the two, and which uses wireless links for data transmission. The instance types provided by the cloud server are different combinations of CPU, memory, storage and network, and the service data of a user is stored on the hard disk of the cloud server. While the cloud server is running, however, task processing generates a large number of resource fragments. This leads to redundant resources, lowers the processing speed of the cloud server network, slows task processing, and degrades the usability of the cloud server network.

In the embodiment provided by the present application, the cloud server application running in the cloud server 200 may be written in software code environments of different programming languages, and the code objects may be different types of code entities. For example, in C code a code object may be a function; in Java code a code object may be a class; in Objective-C (OC) code on iOS it may be a piece of object code; and in C++ code a code object may be a class or a function, used to execute processing instructions from different terminals. The present application does not further distinguish the source of the compilation environment of the cloud server.

The following describes the structure of the cluster resource scheduling device in detail in the embodiment of the present invention, and the cluster resource scheduling device may be implemented in various forms, such as a dedicated terminal with a processing function of the cluster resource scheduling device, or may be a server provided with a processing function of the cluster resource scheduling device, for example, the server 200 in fig. 1. Fig. 2 is a schematic diagram of a composition structure of a cluster resource scheduling device according to an embodiment of the present invention, and it can be understood that fig. 2 only shows an exemplary structure of the cluster resource scheduling device, but not all the structures, and some or all of the structures shown in fig. 2 may be implemented as needed.

The cluster resource scheduling device provided by the embodiment of the invention comprises at least one processor 201, a memory 202, a user interface 203 and at least one network interface 204. The components of the cluster resource scheduling device are coupled together by a bus system 205. It can be understood that the bus system 205 enables connection and communication between these components. In addition to a data bus, the bus system 205 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as the bus system 205 in Fig. 2.

The user interface 203 may include, among other things, a display, keyboard, mouse, trackball, click wheel, keys, buttons, touch pad, or touch screen, etc.

It will be appreciated that the memory 202 may be either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The memory 202 in embodiments of the present invention is capable of storing data to support operation of the terminal (e.g., 10-1). Examples of such data include any computer program for operating on a terminal (e.g., 10-1), such as an operating system and application programs. The operating system includes various system programs, such as a framework layer, a core library layer, a driving layer, and the like, for implementing various basic services and processing tasks based on hardware, and the application programs may include various application programs. The embodiment of the invention can be applied to various scenes, including but not limited to cloud technology, artificial intelligence, intelligent traffic, auxiliary driving and the like, so as to realize the execution of the cluster resource scheduling method provided by the invention in various scenes.

In some embodiments, the cluster resource scheduling device provided by the embodiment of the present invention may be implemented by combining software and hardware. As an example, the device may be a processor in the form of a hardware decoding processor that is programmed to execute the cluster resource scheduling method provided by the embodiment of the present invention. For example, a processor in the form of a hardware decoding processor may employ one or more application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field-programmable gate arrays (FPGAs), or other electronic components.

As an example of implementation of the cluster resource scheduling device provided by the embodiment of the present invention by combining software and hardware, the cluster resource scheduling device provided by the embodiment of the present invention may be directly embodied as a combination of software modules executed by the processor 201, where the software modules may be located in a storage medium, and the storage medium is located in the memory 202, and the processor 201 reads executable instructions included in the software modules in the memory 202, and performs the cluster resource scheduling method provided by the embodiment of the present invention in combination with necessary hardware (including, for example, the processor 201 and other components connected to the bus 205).

By way of example, the processor 201 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), or another programmable logic device, discrete gate or transistor logic device, or discrete hardware component. The general-purpose processor may be a microprocessor or any conventional processor.

As an example of a hardware implementation of the cluster resource scheduling device provided by the embodiment of the present invention, the device may be implemented directly by the processor 201 in the form of a hardware decoding processor; for example, one or more application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field-programmable gate arrays (FPGAs), or other electronic components may be used to implement the cluster resource scheduling method provided by the embodiment of the present invention.

The memory 202 in embodiments of the present invention is used to store various types of data to support the operation of the cluster resource scheduling device. Examples of such data include any executable instructions for operating on the cluster resource scheduling device; a program implementing the cluster resource scheduling method of an embodiment of the invention may be included in these executable instructions.

In other embodiments, the cluster resource scheduling device provided by the embodiments of the present invention may be implemented in software. Fig. 2 shows the cluster resource scheduling device stored in the memory 202, which may be software in the form of a program, a plug-in, or the like, and which includes a series of modules. As an example of a program stored in the memory 202, the cluster resource scheduling device includes the following software modules: an information transmission module 2081 and an information processing module 2082. When the software modules in the cluster resource scheduling device are read into RAM by the processor 201 and executed, the cluster resource scheduling method provided by the embodiment of the present invention is implemented. The functions of the software modules in the cluster resource scheduling device are as follows:

The information transmission module 2081 is configured to configure a workload in a cluster resource scheduling environment and determine a timeout queue carrying the workload.

The information processing module 2082 is configured to adjust the state of the workload to a secondary scheduling state when the workload in the timeout queue reaches a timeout state.

The information processing module 2082 is configured to send a failed copy number detection request to the corresponding estimator component based on the information of the workload when the scheduler component determines that the state of the workload is the secondary scheduling state.

The information processing module 2082 is configured to determine, in response to the failed copy number detection request, the number of copies that failed to schedule, and to send that number to the scheduler component.

The information processing module 2082 is configured to execute, based on the number of copies that failed to schedule, the cluster resource scheduling program so as to utilize the maximum number of available copies in the cluster resource scheduling environment.

The information processing module 2082 is configured to respond to a cluster resource scheduling mode matched with the cluster resource, and to configure corresponding cluster resources for the task to be processed according to the priority of the task to be processed.

According to the electronic device shown in fig. 2, in one aspect of the application, the application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from the computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform the methods provided in various alternative implementations of the cluster resource scheduling method described above.

Referring to fig. 3, fig. 3 is an optional flowchart of a cluster resource scheduling method provided by the embodiment of the present invention, and it can be understood that the steps shown in fig. 3 may be performed by various electronic devices running the cluster resource scheduling apparatus, for example, may be a dedicated terminal with a cluster resource scheduling function, a server, or a control terminal of a server cluster controller, or a cloud network server. The dedicated terminal with the cluster resource scheduling device may be encapsulated in the server 200 shown in fig. 1 to execute the corresponding software module in the cluster resource scheduling device shown in fig. 2. The following is a description of the steps shown in fig. 3.

Step 301, a cluster resource scheduling device configures a workload in a cluster resource scheduling environment and determines a timeout queue carrying the workload.

In some embodiments of the present invention, for a cloud server cluster environment, the cluster resource scheduling device may include different types of components, for example a controller component, a scheduler component, and an estimator component. Referring to fig. 4, fig. 4 is a schematic architecture diagram of the cluster resource scheduling device in the embodiment of the present invention. The controller component monitors all workloads and places only those workloads that have not reached the desired number of copies in a timeout queue for timing. When a timeout event occurs, the controller component updates the state of the workload to mark that it needs secondary scheduling. The scheduler component continuously monitors all workloads; when a workload enters the state requiring secondary scheduling, the scheduler component sends a request to the estimator component to obtain the number of the workload's copies that failed to be scheduled in each sub-cluster, and performs secondary scheduling according to the result. The estimator component monitors the copies and nodes of a sub-cluster to track cluster resource usage; when it receives a failed copy count request from the scheduler component, it calculates the number of copies that failed to be scheduled in that sub-cluster and returns it in real time. The controller component, the scheduler component, the estimator component, and the user-created workloads together form the control plane.

The embodiment of the invention can be implemented in combination with cloud technology. Cloud technology refers to a hosting technology that integrates hardware, software, network, and other resources in a wide area network or a local area network to realize the computation, storage, processing, and sharing of data. It can also be understood as a general term for the network technology, information technology, integration technology, management platform technology, application technology, and the like that are applied on the basis of the cloud computing business model. The background services of technical network systems, such as video websites, picture websites, and portal websites, require a large amount of computing and storage resources, so cloud technology needs to be supported by cloud computing.

It should be noted that cloud computing is a computing mode that distributes computing tasks over a resource pool formed by a large number of computers, so that various application systems can acquire computing power, storage space, and information services as required. The network that provides the resources is referred to as the "cloud". From the user's perspective, resources in the cloud appear infinitely expandable and can be acquired at any time, used on demand, expanded at any time, and paid for according to use. As the basic capability provider of cloud computing, a cloud computing resource pool platform (cloud platform for short), generally called Infrastructure as a Service (IaaS), is established, and multiple types of virtual resources are deployed in the resource pool for external clients to select and use. The cloud computing resource pool mainly comprises computing devices (which may be virtualized machines, including an operating system), storage devices, and network devices. When a user uses the cloud server to store data or deploy different application processes, the operating parameters of the server cluster's hard disks are monitored, so that potential hard disk faults can be found in time and user data loss caused by hard disk faults that give failure warnings is avoided.

Cloud storage is a new concept extended and developed from the concept of cloud computing. A distributed cloud storage system (hereinafter referred to as a storage system) is a storage system that, through functions such as cluster applications, grid technology, and distributed storage file systems, uses application software or application interfaces to make a large number of storage devices of various types in a network (storage devices are also referred to as storage nodes) work cooperatively, and that provides data storage and service access functions to the outside. At present, the storage method of a storage system is as follows: logical volumes are created, and when a logical volume is created, physical storage space is allocated to it; this physical storage space may consist of the disks of one storage device or of several storage devices. A client stores data on a logical volume, that is, on a file system. The file system divides the data into a plurality of parts, each of which is an object; an object contains not only the data but also additional information such as a data identification (ID). The file system writes each object into the physical storage space of the logical volume and records the storage location information of each object, so that when the client requests access to the data, the file system can let the client access the data according to the storage location information of each object.
The storage system allocates physical storage space for a logical volume as follows: the physical storage space is divided into stripes in advance according to the capacity estimate for the objects to be stored in the logical volume (the estimate often leaves a large margin relative to the capacity of the objects actually stored) and according to the redundant array of independent disks (RAID) group. One logical volume can be understood as one stripe, and physical storage space is thereby allocated to the logical volume.

The cluster resource scheduling method is implemented through a cloud server network as follows. A workload is configured in a cluster resource scheduling environment, and a timeout queue carrying the workload is determined. When the workload in the timeout queue reaches a timeout state, the controller component adjusts the state of the workload to a secondary scheduling state. When the scheduler component determines that the state of the workload is the secondary scheduling state, it sends a failed copy number detection request to the corresponding estimator component based on the information of the workload. The estimator component responds to the failed copy number detection request, determines the number of copies that failed to be scheduled, and sends that number to the scheduler component. The scheduler component then executes the cluster resource scheduling program based on the number of copies that failed to be scheduled, so as to utilize the maximum available copy number in the cluster resource scheduling environment. In this way, the availability of the workload is guaranteed, the accuracy and reliability of cluster resource scheduling are improved, and the utilization efficiency of cluster resources is improved.

When the method is applied to cloud products, the front end of the cloud product may be a Web UI component, which receives the Spark-related parameters filled in by users and generates job data from them. The cluster manager (Cluster Manager) may be an open-source cluster resource scheduling platform such as YARN, Mesos, or Kubernetes. Spark already supports these open-source platforms; that is, the protocols between Spark and the Cluster Manager component are compatible. The Driver is the job driver, a Worker Node is a work node, an Executor is a task execution component, and a Task is the smallest execution unit. Further, the structured data package (Spark SQL) is the package Spark uses to manipulate structured data; through it, data can be queried using the SQL language, and it supports a variety of data sources such as data warehouse tool (Hive) tables. The streaming component is a component provided by Spark for stream processing of real-time data, and it provides an application programming interface (API) for manipulating data streams.

Step 302, when the workload in the timeout queue reaches a timeout state, the controller component adjusts the state of the workload to a secondary scheduling state.

In some embodiments of the present invention, configuring a workload in a clustered resource scheduling environment and determining a timeout queue carrying the workload includes:

The controller component determines the desired copy number of the cluster resource scheduling environment, monitors the workload in real time based on the desired copy number, and moves the workload into the timeout queue when the number of copies in the workload is smaller than the desired copy number. Because cloud server clusters are used in many different environments, the value of the desired copy number can be set flexibly according to the usage environment. For example, when the cloud server cluster processes financial payments made in an instant messaging client, or information about borrowing funds and purchasing goods in the instant messaging client, the number of tasks is large and the desired copy number can be set to 10000. For video processing tasks that can be completed by a single server cluster, the desired copy number can be set to 100, so as to make full use of the resources of the server cluster and reduce resource waste.

In some embodiments of the invention, the controller component detects a workload in the timeout queue based on a desired number of copies when a change occurs in the number of copies of the workload in the timeout queue, maintains the workload in the timeout queue when the number of copies in the workload is less than the desired number of copies, and deletes the workload in the timeout queue when the number of copies in the workload is greater than or equal to the desired number of copies. Because the number of copies of the workload in the timeout queue is dynamically changed data, the workload in the timeout queue can be timely adjusted by detecting the workload in the timeout queue in real time, and the number of the workload for secondary resource adjustment is reduced.
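The timeout-queue bookkeeping described above can be illustrated with a minimal Python sketch. The class and field names (`Workload`, `TimeoutQueue`, `needs_secondary_scheduling`), the use of an explicit `now` timestamp, and the choice to keep the original deadline when a workload is still short of copies are all assumptions made for illustration; the source does not specify the timeout value or how the timer is kept.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Workload:
    # Hypothetical, simplified view of a workload as seen by the controller.
    name: str
    desired_copies: int
    ready_copies: int = 0
    needs_secondary_scheduling: bool = False


class TimeoutQueue:
    """Tracks workloads that have not yet reached their desired copy number."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self._deadlines: dict[str, float] = {}  # workload name -> deadline

    def __contains__(self, name: str) -> bool:
        return name in self._deadlines

    def on_copy_update(self, workload: Workload, now: float) -> None:
        """Re-check a workload whenever its copy number changes."""
        if workload.ready_copies >= workload.desired_copies:
            # Desired copy number reached: delete the workload from the queue.
            self._deadlines.pop(workload.name, None)
        else:
            # Still short of copies: keep it queued (assumption: the deadline
            # set when it first entered the queue is preserved).
            self._deadlines.setdefault(workload.name, now + self.timeout_seconds)

    def expire(self, workloads: dict[str, Workload], now: float) -> list[str]:
        """Mark every timed-out workload as needing secondary scheduling."""
        expired = sorted(n for n, d in self._deadlines.items() if d <= now)
        for name in expired:
            del self._deadlines[name]
            workloads[name].needs_secondary_scheduling = True
        return expired
```

In this sketch a controller loop would call `on_copy_update` on each copy number change event and `expire` periodically; the scheduler component would then pick up any workload whose `needs_secondary_scheduling` flag is set.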

Step 303, when the scheduler component of the cluster resource scheduling device determines that the state of the workload is the secondary scheduling state, it sends a failed copy number detection request to the corresponding estimator component based on the information of the workload.

The information of the workload (Workload) may include StatefulSet, Deployment, ReplicaSet, and DaemonSet resources. The resource information includes the number of application instances, the affinity rules for the application instances, and the like. Only application instances that satisfy the affinity rules of the workload can be deployed on a computing node. The resource objects in the Kubernetes cluster may be applications (APP) in the Kubernetes cluster, for example one or more of a deployment (Deployment) or a stateful set (StatefulSet), as well as resources such as a route (Ingress), a container group (Pod), a container (Container), a service (Service), and a replication controller (RC, ReplicationController).

Step 304, the estimator component of the cluster resource scheduling device determines, in response to the failed copy number detection request, the number of copies that failed to be scheduled, and sends that number to the scheduler component.

The operation of determining the number of copies that failed to schedule is further described below with respect to fig. 5.

Referring to fig. 5, fig. 5 is an optional flowchart of a cluster resource scheduling method provided by the embodiment of the present invention, and it can be understood that the steps shown in fig. 5 may be performed by various electronic devices running the cluster resource scheduling apparatus, for example, may be a dedicated terminal with a cluster resource scheduling function, a server, or a control terminal of a server cluster controller, or a cloud network server. The dedicated terminal with the cluster resource scheduling device may be encapsulated in the server 200 shown in fig. 1 to execute the corresponding software module in the cluster resource scheduling device shown in fig. 2. The following is a description of the steps shown in fig. 5.

Step 501, the estimator component obtains node information and container group information for all nodes of the sub-cluster in the cluster resource scheduling environment.

Step 502, the estimator component, in response to the failed copy number detection request, queries the sub-cluster for the container groups associated with a working copy and determines a container group list corresponding to those container groups.

In some embodiments of the present invention, when the type of the working copy is the resource type, a copy controller object list corresponding to the working copy is determined, and the container group list associated with the working copy is looked up in a cache through the copy controller object list. Taking Kubernetes (K8S) as an example, a Kubernetes cluster generally comprises a master node (Master) and a plurality of computing nodes (Node), each communicatively connected to the master node. The master node manages and controls the computing nodes. The computing nodes act as workload nodes and host both native applications deployed directly on the node and a plurality of container groups (Pod); each container group encapsulates one or more containers (Container) that carry applications. The Pod is the basic operation unit of Kubernetes and the smallest deployable unit that can be created and managed. A working copy of the resource type (Deployment type) can deploy one type of task. A Deployment integrates functions such as rollout, rolling upgrade, copy creation, pausing a rollout, resuming a rollout, and rolling back to a previous (successful or stable) Deployment version. To a certain extent, a Deployment enables unattended rollout and greatly reduces the communication overhead and operational risk of the rollout process. For a working copy of the Deployment type, the ReplicaSet object list associated with the Deployment is determined first, and the associated Pod list is then found in the cache through the ReplicaSet copy controller. ReplicaSet is one of the Kubernetes copy controllers and is mainly used to control the Pods it manages, so that the number of Pod copies is always kept at a preset number.

In some embodiments of the present invention, when the type of the working copy is the stateful set (StatefulSet) type, the container group list associated with the working copy is looked up in the cache of the working copy. For a working copy of the StatefulSet type, the associated Pod object list can be found directly in the cache, which saves lookup time and improves resource scheduling speed.

Step 503, the estimator component queries the container group with failed dispatch from the container group list and calculates the copy number with failed dispatch according to the container group with failed dispatch.
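Steps 501 to 503 can be sketched in Python as follows. This is an illustrative model only: `Pod`, `ReplicaSet`, and `SubClusterCache` are hypothetical in-memory stand-ins for the estimator's cache rather than Kubernetes client types, and a pod is treated as a scheduling failure when its `scheduled` flag is false (in a real cluster this would presumably be derived from the pod's scheduling condition, which is an assumption here).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ReplicaSet:
    name: str
    owner: str  # name of the owning Deployment


@dataclass
class Pod:
    name: str
    owner: str       # name of the owning ReplicaSet or StatefulSet
    scheduled: bool  # False when the pod could not be placed on any node


class SubClusterCache:
    """Hypothetical estimator-side cache of one sub-cluster's objects."""

    def __init__(self, replica_sets: list[ReplicaSet], pods: list[Pod]) -> None:
        self._replica_sets_by_owner: dict[str, list[ReplicaSet]] = {}
        for rs in replica_sets:
            self._replica_sets_by_owner.setdefault(rs.owner, []).append(rs)
        self._pods_by_owner: dict[str, list[Pod]] = {}
        for pod in pods:
            self._pods_by_owner.setdefault(pod.owner, []).append(pod)

    def pods_for(self, kind: str, name: str) -> list[Pod]:
        """Step 502: build the container group list for a working copy."""
        if kind == "Deployment":
            # Deployment -> ReplicaSet list -> Pod list.
            owners = [rs.name for rs in self._replica_sets_by_owner.get(name, [])]
        elif kind == "StatefulSet":
            # A StatefulSet owns its Pods directly.
            owners = [name]
        else:
            raise ValueError(f"unsupported working copy type: {kind}")
        return [pod for owner in owners for pod in self._pods_by_owner.get(owner, [])]


def count_failed_copies(cache: SubClusterCache, kind: str, name: str) -> int:
    """Step 503: count the container groups that failed to be scheduled."""
    return sum(1 for pod in cache.pods_for(kind, name) if not pod.scheduled)
```

The two branches in `pods_for` mirror the two cases in the text: the Deployment path goes through an intermediate ReplicaSet lookup, while the StatefulSet path reads the Pod list in a single step.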

Step 305, the scheduler component of the cluster resource scheduling device executes a cluster resource scheduling program based on the number of copies that failed to be scheduled, so as to utilize the maximum available copy number in the cluster resource scheduling environment.

In some embodiments of the present invention, the maximum available copy number needs to be determined before the cluster resource scheduling program is executed. Specifically, referring to fig. 6, fig. 6 is an optional flowchart of the cluster resource scheduling method provided by the embodiment of the present invention. It can be understood that the steps shown in fig. 6 may be executed by various electronic devices running the cluster resource scheduling apparatus, for example a dedicated terminal with a cluster resource scheduling function, a server, a control terminal of a server cluster controller, or a cloud network server. The dedicated terminal with the cluster resource scheduling device may be encapsulated in the server 200 shown in fig. 1 to execute the corresponding software modules in the cluster resource scheduling device shown in fig. 2. The steps shown in fig. 6 are described below.

Step 601, when an estimator component starts, the estimator component acquires node information and container group information of all nodes of a sub-cluster in a cluster resource scheduling environment.

Step 602, the estimator component, in response to the maximum available copy number estimation request, filters the nodes matching the workload from among all nodes in the cluster resource.

Step 603, determining container group information corresponding to each node matched with the workload, and determining the maximum available copy number of each node matched with the workload based on the container group information.

Step 604, determining the maximum number of available copies in the cluster resource scheduling environment based on the maximum number of available copies of each node matching the workload.
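A possible reading of steps 601 to 604 is sketched below in Python. The source does not give the per-node formula, so the following are assumptions for illustration: a node matches a workload when its labels satisfy a simple label selector, a node's maximum available copy number is the smallest value of floor(free / requested) over the requested resources, and the cluster-wide figure is the sum over matching nodes. All names (`Node`, `max_available_copies`, the resource keys) are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Node:
    name: str
    labels: dict[str, str]
    allocatable: dict[str, int]  # resource name -> allocatable amount


def node_matches(node: Node, selector: dict[str, str]) -> bool:
    """Step 602: keep only nodes whose labels satisfy the selector."""
    return all(node.labels.get(key) == value for key, value in selector.items())


def max_copies_on_node(node: Node, used: dict[str, int], request: dict[str, int]) -> int:
    """Step 603: copies that still fit on one node (assumed formula)."""
    fits = []
    for resource, needed in request.items():
        if needed <= 0:
            continue  # a resource that is not requested does not constrain placement
        free = node.allocatable.get(resource, 0) - used.get(resource, 0)
        fits.append(max(free, 0) // needed)
    return min(fits) if fits else 0


def max_available_copies(
    nodes: list[Node],
    pod_requests_by_node: dict[str, list[dict[str, int]]],
    selector: dict[str, str],
    request: dict[str, int],
) -> int:
    """Step 604: aggregate the per-node results (assumed to be a sum)."""
    total = 0
    for node in nodes:
        if not node_matches(node, selector):
            continue
        # Resources already consumed by the container groups on this node.
        used: dict[str, int] = {}
        for pod_request in pod_requests_by_node.get(node.name, []):
            for resource, amount in pod_request.items():
                used[resource] = used.get(resource, 0) + amount
        total += max_copies_on_node(node, used, request)
    return total
```

Here `pod_requests_by_node` plays the role of the container group information collected in step 601; a production estimator would likely also account for taints, pod count limits, and other constraints that this sketch ignores.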

The cluster resource scheduling method provided by the present invention is described below, taking as an example a cluster resource manager that serves as the resource manager of a WeChat server, in combination with the usage environment of the cluster resource scheduling method shown in fig. 1. The terminals (including the terminal 11-1 and the terminal 11-2) are provided with clients capable of executing different functions; through the WeChat application, these clients obtain different information from the corresponding server 200 over the network 300 for browsing. The terminals are connected to the server 200 through the network 300, which may be a wide area network, a local area network, or a combination of the two, and which uses wireless links for data transmission. The server 200 runs a cluster resource manager matched with the WeChat application to perform resource scheduling. The terminals (such as the terminal 10-1 and the terminal 10-2 in fig. 1) may further be provided with clients or plug-ins that display software for financial activities, such as financial activities conducted with virtual resources or physical resources, or financial loans obtained through virtual resources. Through the corresponding client, a user can reach a financial institution or platform to make financial payments in the instant messaging client, or to borrow funds and purchase goods in the instant messaging client. A server (e.g., the server 300 of fig. 1) is a server of an enterprise, such as a bank, a securities firm, or a mutual fund, that provides financial services such as payment, lending, and wealth management. When a user who needs to transact related financial services uses a client device to access a service provided by the enterprise's client server, the client server can issue payment tasks by triggering an applet in the instant messaging client on the user terminal. Because the number of tasks is large, the desired copy number can be set to 10000. Sub-cluster resource contention while the server cluster processes the payment tasks would affect the accuracy and reliability of cluster resource scheduling; to avoid this problem, refer to fig. 7. Fig. 7 is an optional flowchart of the cluster resource scheduling method provided by the embodiment of the invention; the architecture of cluster resource scheduling is shown in fig. 4, and the steps shown in fig. 7 are described below.

Step 701, after a workload is created, the controller component continuously checks whether the workload has reached the desired number of copies.

Step 702, when the workload has not reached the desired number of copies, the controller component stores the workload in a timeout queue.

Step 703, when a copy number update event occurs for a workload in the timeout queue, the controller component determines whether the desired number of copies has been reached; if so, the controller component deletes the workload from the timeout queue, and otherwise the workload is re-added to the timeout queue.

Step 704, when a workload in the timeout queue triggers a timeout, the controller component adjusts the state of the workload to the secondary scheduling state and writes that state to the workload.

Step 705, the scheduler component detects that the state of the workload is the secondary scheduling state.

Step 706, the scheduler component sends a failed copy number detection request to the sub-cluster estimator component according to the information of the workload.

Step 707, the estimator component continuously monitors the cluster copies and nodes to track cluster resource usage.

Step 708, the estimator component calculates the total number of copies that failed to be scheduled based on the request and returns it to the scheduler component.

Step 709, the scheduler component re-executes the scheduling program according to the total number of copies that failed to be scheduled, to generate a secondary scheduling result.

In some embodiments of the present invention, as shown in fig. 4, suppose that the current workload A triggers secondary scheduling and that a copy scheduling failure exists in sub-cluster 1. The desired copy numbers of workload A in sub-cluster 1, sub-cluster 2, and sub-cluster 3 are r1, r2, and r3 respectively; the number of copies that failed to be scheduled in sub-cluster 1 is f1; and the maximum available copy numbers in sub-cluster 2 and sub-cluster 3 are m2 and m3 respectively.

In the process of generating the secondary scheduling result, when m2 + m3 >= f1, the scheduler component fixes the desired copy number of workload A in sub-cluster 1 at r1 - f1, then takes sub-cluster 2 and sub-cluster 3 as candidate clusters and distributes the f1 failed copies between them in proportion to their maximum available copy numbers, m2:m3; this distribution is the secondary scheduling result for the failed copies.

In the process of generating the secondary scheduling result, when m2 + m3 < f1, the secondary scheduling is determined to have failed, and the scheduler component waits for the next scheduling round until the secondary scheduling succeeds.
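The proportional distribution rule above can be made concrete with a short Python sketch. The source states only that the f1 failed copies are split in the ratio m2:m3; how fractional shares are rounded is not specified, so the largest-remainder rounding used here, the function name `distribute_failed_copies`, and the cluster names are assumptions for illustration.

```python
from __future__ import annotations

from typing import Optional


def distribute_failed_copies(
    failed: int, max_available: dict[str, int]
) -> Optional[dict[str, int]]:
    """Split `failed` copies across candidate clusters in proportion to capacity.

    Returns None when the candidates cannot absorb all failed copies, which
    corresponds to a failed secondary scheduling that is retried later.
    """
    if failed < 0:
        raise ValueError("failed must be non-negative")
    if failed == 0:
        return {cluster: 0 for cluster in max_available}

    capacity = sum(max_available.values())
    if capacity < failed:
        return None  # m2 + m3 < f1: wait for the next scheduling round

    # Integer part of each proportional share: floor(failed * m / capacity).
    allocation = {c: failed * m // capacity for c, m in max_available.items()}

    # Hand out the copies lost to rounding, largest remainder first
    # (assumed tie-break: cluster name, to keep the result deterministic).
    leftover = failed - sum(allocation.values())
    by_remainder = sorted(
        max_available,
        key=lambda c: (-(failed * max_available[c] % capacity), c),
    )
    for cluster in by_remainder[:leftover]:
        allocation[cluster] += 1
    return allocation
```

For example, with r1 = 5, f1 = 3, m2 = 4, and m3 = 2, the desired copy number in sub-cluster 1 is fixed at r1 - f1 = 2, and `distribute_failed_copies(3, {"sub-cluster-2": 4, "sub-cluster-3": 2})` assigns two of the failed copies to sub-cluster 2 and one to sub-cluster 3. Because f1 never exceeds m2 + m3 on this path, no candidate is assigned more copies than its maximum available copy number.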

The invention has the following beneficial technical effects:

A workload is configured in a cluster resource scheduling environment, and a timeout queue carrying the workload is determined. When the workload in the timeout queue reaches a timeout state, the controller component adjusts the state of the workload to a secondary scheduling state. When the scheduler component determines that the state of the workload is the secondary scheduling state, it sends a failed copy number detection request to the corresponding estimator component based on the information of the workload. The estimator component determines, in response to the failed copy number detection request, the number of copies that failed to be scheduled and sends that number to the scheduler component. The scheduler component then executes the cluster resource scheduling program based on the number of copies that failed to be scheduled, so as to utilize the maximum available copy number in the cluster resource scheduling environment. In this way, the availability of the workload is guaranteed, the accuracy and reliability of cluster resource scheduling are improved, and the utilization efficiency of cluster resources is improved.

The foregoing description of the embodiments of the invention is not intended to limit the scope of the invention, but is intended to cover any modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims

1. A method for scheduling cluster resources, the method comprising:

when the estimator component is started, the estimator component acquires node information and container group information of nodes of sub-clusters in the cluster resource scheduling environment;

The estimator component is responsive to a maximum available copy number estimation request to screen nodes matching the workload from among nodes of the cluster resource based on the node information;

determining container group information corresponding to each node matched with the workload, and determining the maximum available copy number of each node matched with the workload based on the container group information;

determining a maximum number of available copies in the clustered resource scheduling environment based on the maximum number of available copies for each node that matches the workload;

2. The method of claim 1, wherein configuring a workload in a clustered resource scheduling environment and determining a timeout queue to carry the workload comprises:

The controller component determining a desired number of copies of the clustered resource scheduling environment;

The controller component detects the workload in real time based on the desired number of copies;

And when the number of copies in the workload is smaller than the expected number of copies, adjusting the workload into the timeout queue.

3. The method according to claim 2, wherein the method further comprises:

when the number of copies of the workload in the timeout queue changes, the controller component detects the workload in the timeout queue based on the expected number of copies;

Maintaining the workload in the timeout queue when the number of copies in the workload is less than the desired number of copies;

and deleting the workload in the timeout queue when the number of copies in the workload is greater than or equal to the expected number of copies.

4. The method of claim 1, wherein the estimator component determining the number of copies failed to schedule in response to the failed copy number probe request comprises:

The estimator component acquires node information and container group information of all nodes of a sub-cluster in a cluster resource scheduling environment;

The estimator component queries the sub-cluster for a container group associated with a working copy in response to the failed copy number detection request and determines a container group list corresponding to the container group;

The estimator component queries a container group with failed dispatch from the container group list and calculates the number of copies with failed dispatch based on the container group with failed dispatch.

5. The method of claim 4, wherein the estimator component, in response to the failed copy number probe request, queries the sub-cluster for a container group associated with a working copy and determines a container group list corresponding to the container group, comprising:

when the type of the working copy is a resource type, determining a copy controller object list corresponding to the working copy;

searching a container group list associated with the working copy through the cache of the copy controller object list;

and when the type of the working copy is the state copy set type, searching a container group list associated with the working copy in a cache of the working copy.

6. A cluster resource scheduling apparatus, the apparatus comprising:

The information processing device is used for sending a failure copy number detection request to a corresponding estimator component based on information of the workload when the scheduler component determines that the state of the workload is a secondary scheduling state, acquiring node information and container group information of nodes of a sub-cluster in the cluster resource scheduling environment when the estimator component is started, responding to the maximum available copy number estimation request, screening nodes matched with the workload from the nodes of the cluster resource based on the node information, determining container group information corresponding to each node matched with the workload, determining the maximum available copy number of each node matched with the workload based on the container group information, and determining the maximum available copy number in the cluster resource scheduling environment based on the maximum available copy number of each node matched with the workload;

7. An electronic device, the electronic device comprising:

A memory for storing executable instructions;

A processor, configured to implement the cluster resource scheduling method according to any one of claims 1 to 5 when executing the executable instructions stored in the memory.

8. A computer readable storage medium storing executable instructions which, when executed by a processor, implement the cluster resource scheduling method of any one of claims 1 to 5.

9. A computer program product comprising a computer program or instructions which, when executed by a processor, implements the cluster resource scheduling method of any one of claims 1 to 5.