UL HPC Tutorial: Singularity with Infiniband
Copyright (c) 2018-2021 UL HPC Team <hpc-team@uni.lu>
Singularity setups
Please refer to the Singularity introduction tutorial for setup instructions.
Singularity with Infiniband
- Objectives: obtain a Singularity configuration that runs over Infiniband (tested on the Iris cluster). In this example we:
- Create a container with the required Infiniband libraries on Ubuntu18.04
- Install KerA (stream storage, similar to Apache Kafka)
- Install DFI (data flow interface over RDMA; see https://doi.org/10.1145/3448016.3452816)
- Run a distributed KerA (1 coordinator and 2 brokers)
- Configuration validated on the Iris cluster
Step 1: Required software
- Set up a VM (user zetta) running Ubuntu 18.04 with Docker and Singularity installed.
- Clone KerA from https://gitlab.uni.lu/ocm/kera.git and follow Step 1 of that repository's instructions (git submodule update --init --recursive)
- The kera project contains a Dockerfile and its docker-entrypoint.sh
Step 2: Create the docker container
- Clean up Docker and build the kera image that Singularity will use later
sudo docker system prune -a
# Before building, set the JAVAHOME variable in kera/GNUmakefile to: JAVAHOME := /root/.sdkman/candidates/java/current
sudo docker build . --tag kera
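The JAVAHOME change mentioned in the comment above can be scripted instead of edited by hand. The following is a minimal sketch, not part of the KerA repository: set_javahome is a hypothetical helper, and it assumes that JAVAHOME is assigned on a single line that starts with JAVAHOME followed by := or = (the layout of kera/GNUmakefile is not shown in this tutorial). It relies on GNU sed, as shipped with Ubuntu 18.04.

```bash
#!/bin/bash
# Hypothetical helper (an assumption, not part of KerA):
# rewrite the JAVAHOME assignment in a makefile in place.
# Usage: set_javahome <makefile> <java_home_path>
set_javahome() {
    local makefile="$1" java_home="$2"

    # Refuse to continue if the makefile has no JAVAHOME assignment,
    # so that a silent no-op cannot go unnoticed.
    grep -Eq '^JAVAHOME[[:space:]]*:?=' "$makefile" || return 1

    # Replace the whole assignment line. '|' is used as the sed delimiter
    # because the path contains '/' characters.
    sed -i -E "s|^JAVAHOME[[:space:]]*:?=.*|JAVAHOME := ${java_home}|" "$makefile"
}
```

It would be called as set_javahome kera/GNUmakefile /root/.sdkman/candidates/java/current before running sudo docker build.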
- The Dockerfile contains steps to install required libraries for compiling KerA and DFI, including the right Mellanox driver for Iris
# ##############################################################################
#
# Builder stage
#
# ##############################################################################
FROM ubuntu:18.04 AS builder
ARG JAVA_VERSION=11.0.2-open
ENV SDKMAN_DIR=/root/.sdkman
WORKDIR /opt
COPY . /opt/kera
#
# Install dependencies
#
RUN apt-get update \
&& DEBIAN_FRONTEND="noninteractive" apt-get -y install tzdata
RUN apt-get update \
&& apt-get install --yes \
apt-transport-https \
build-essential \
apt-utils \
ca-certificates \
curl \
doxygen \
g++ \
gdb \
git \
libboost-filesystem-dev \
libboost-program-options-dev \
libboost-system-dev \
libibverbs-dev \
libpcre++-dev \
libssl-dev \
libzookeeper-mt-dev \
procps \
protobuf-compiler \
python3 \
python3-pip \
software-properties-common \
unzip \
wget \
zip \
gcc \
make \
perl \
dkms \
linux-headers-$(uname -r) \
gnupg \
lsb-release \
libprotobuf-dev \
libcrypto++-dev \
libevent-dev \
libboost-all-dev \
libpcre3-dev \
libgtest-dev \
zookeeper \
tk \
libnl-3-dev \
udev \
tcl \
libnl-route-3-dev \
bison \
flex \
libmnl0 \
gfortran \
libgfortran4 \
cmake libtool pkg-config autoconf automake libzmq3-dev libgtest-dev libnuma-dev libcppunit-dev numactl libaio-dev libevent-dev \
&& rm -rf /var/lib/apt/lists/*
RUN cd /usr/src/gtest && cmake CMakeLists.txt && make && cp *.a /usr/lib
RUN cd kera \
&& mnt/MLNX_OFED_LINUX-5.1-2.5.8.0-ubuntu18.04-x86_64/mlnxofedinstall --skip-unsupported-devices-check --without-dkms --add-kernel-support --kernel 5.4.0-42-generic --kernel-sources /usr/src/linux-headers-5.4.0-42-generic/ --without-fw-update --force
RUN cd kera/DFI \
&& mkdir release \
&& cd release/ \
&& cmake .. -DCMAKE_BUILD_TYPE=Release \
&& make \
&& make install
#
# Install Java
#
RUN curl -s "https://get.sdkman.io" | bash \
&& echo "sdkman_auto_answer=true" > $SDKMAN_DIR/etc/config \
&& echo "sdkman_auto_selfupdate=false" >> $SDKMAN_DIR/etc/config
# Source sdkman to make the sdk command available and install java candidate
RUN bash -c "source $SDKMAN_DIR/bin/sdkman-init.sh && sdk install java $JAVA_VERSION"
# Add candidate path to $PATH environment variable
ENV JAVA_HOME="$SDKMAN_DIR/candidates/java/current"
ENV PATH="$JAVA_HOME/bin:$PATH"
#
# Install cmake
#
RUN curl -sSL https://cmake.org/files/v3.11/cmake-3.11.1-Linux-x86_64.tar.gz | tar -xzC . \
&& mv cmake-3.11.1-Linux-x86_64 cmake
ENV PATH="/opt/cmake/bin/:$PATH"
#
# Compile KerArrow
#
RUN cd /opt/kera/kerarrow \
&& mkdir -p cpp/release \
&& cd cpp/release/ \
&& cmake .. -DCMAKE_BUILD_TYPE=Release -DARROW_PLASMA=on \
&& make -j12 \
&& make install
#
# Compile KerA
#
RUN cd kera \
&& make clean \
&& make -j12 DEBUG=no INFINIBAND=yes \
&& make install
# ##############################################################################
#
# Final stage
# No JDK is included in the final image.
#
# ##############################################################################
FROM ubuntu:18.04
#
# Installing binaries
#
COPY --from=builder \
/opt/kera/install/bin /opt/kera/install/bin
COPY --from=builder \
/etc/init.d/* /etc/init.d/
COPY --from=builder \
/usr/local/bin/plasma_store \
/opt/kera/install/bin/
#
# Installing libraries
#
COPY --from=builder \
/opt/kera/install/lib/kera/* \
/opt/kera/install/lib/kera/
COPY --from=builder \
/etc/* /etc/
COPY --from=builder \
/usr/lib/x86_64-linux-gnu/libibverbs/libmlx5-rdmav25.so /usr/lib/x86_64-linux-gnu/libibverbs/
COPY --from=builder \
/opt/kera/install/lib/kera/lib* \
/opt/kera/kerarrow/cpp/release/release/lib* \
/usr/local/lib/
COPY --from=builder \
/usr/lib/x86_64-linux-gnu/libzookeeper* \
/usr/lib/x86_64-linux-gnu/libboost* \
/usr/lib/x86_64-linux-gnu/libprotobuf* \
/usr/lib/x86_64-linux-gnu/libpcre* \
/usr/lib/x86_64-linux-gnu/libibverbs* \
/usr/lib/x86_64-linux-gnu/libssl* \
/usr/lib/x86_64-linux-gnu/libcrypto* \
/usr/lib/x86_64-linux-gnu/lib* \
/usr/lib/x86_64-linux-gnu/
COPY --from=builder \
/lib/x86_64-linux-gnu/* /lib/x86_64-linux-gnu/
COPY --from=builder \
/usr/include/infiniband/* /usr/include/infiniband/
COPY --from=builder \
/usr/include/* /usr/include/
COPY --from=builder \
/lib/modules/* /lib/modules/
COPY --from=builder \
/usr/local/lib/lib* /usr/local/lib/
COPY --from=builder \
/usr/local/include/dfi /usr/local/include/dfi
ENV LD_LIBRARY_PATH="/usr/local/lib/:$LD_LIBRARY_PATH"
RUN ldd /usr/local/lib/libkera.so \
&& ldd /opt/kera/install/bin/coordinator \
&& ldd /opt/kera/install/bin/server \
&& ldd /usr/local/lib/libdfi.so
COPY ./docker-entrypoint.sh /
ENTRYPOINT ["/docker-entrypoint.sh"]
- The docker-entrypoint.sh will later be used when running the singularity container on the Iris cluster.
#!/bin/bash
set -e
case "$1" in
sh|bash)
set -- "$@"
exec "$@"
;;
coordinator)
shift
echo "Running coordinator on `hostname` ${SLURM_PROCID}"
/opt/kera/install/bin/coordinator "$@" 1>$HOME/coordinator.out 2>&1 &
status=$?
if [ $status -ne 0 ]; then
echo "Failed to start KerA coordinator: $status"
exit $status
fi
exec tail -f $(ls -Art $HOME/coordinator.out | tail -n 1)
;;
broker)
if [[ ${SLURM_PROCID} -eq 0 ]]; then
echo "Doing nothing `hostname` $HOSTNAME"
else
echo "Installing on `hostname` $HOSTNAME"
# The plasma store creates the /tmp/plasma socket on startup
/opt/kera/install/bin/plasma_store -m 1000000000 -s /tmp/plasma 1>$HOME/plasma-${SLURM_PROCID}.out 2>&1 &
status=$?
if [ $status -ne 0 ]; then
echo "Failed to start plasma store: $status"
exit $status
fi
sleep 2
shift
/opt/kera/install/bin/server "$@" 1>$HOME/server-${SLURM_PROCID}.out 2>&1 &
status=$?
if [ $status -ne 0 ]; then
echo "Failed to start KerA server: $status"
exit $status
fi
exec tail -f $(ls -Art $HOME/plasma-${SLURM_PROCID}.out | tail -n 1) & tail -f $(ls -Art $HOME/server-${SLURM_PROCID}.out | tail -n 1)
fi
;;
esac
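One caveat about the entrypoint above: when a command is launched with &, the $? read on the next line only reports whether the shell could start the background job, so it is almost always 0 and the "Failed to start" branches are unlikely to trigger. The following is a minimal sketch of one alternative, not the KerA authors' method: start_and_check is a hypothetical helper that launches the command, waits for a grace period, and then checks with kill -0 whether the process is still alive.

```bash
#!/bin/bash
# Hypothetical helper (an assumption, not part of the KerA entrypoint).
# Starts a long-running command in the background and reports failure
# if it has already exited after a short grace period.
# Usage: start_and_check <grace_seconds> <logfile> <command> [args...]
start_and_check() {
    local grace="$1" log="$2"
    shift 2

    # Launch in the background, sending stdout and stderr to the log file.
    "$@" >"$log" 2>&1 &
    local pid=$!

    # Give the process time to fail on bad arguments, missing devices, etc.
    sleep "$grace"

    # kill -0 sends no signal; it only tests whether the PID still exists.
    if kill -0 "$pid" 2>/dev/null; then
        return 0
    fi

    # The process is gone: collect its real exit status for the message.
    local status=0
    wait "$pid" || status=$?
    echo "'$1' exited early with status $status" >&2
    return 1
}
```

In the coordinator branch this could be used as start_and_check 2 $HOME/coordinator.out /opt/kera/install/bin/coordinator "$@" followed by exit 1 on failure. A daemon that exits within the grace period is treated as failed even if its exit status is 0.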
Step 3: Create the singularity container
- Either create the kera.sif Singularity container directly, or build a sandbox first so that you can modify or add files before exporting to the SIF format
# Create the Singularity container directly
sudo singularity build kera.sif docker-daemon://kera:latest
# Or: create the sandbox directory from the existing kera Docker image, then build kera.sif from it
sudo singularity build --sandbox keraUbuntu1804 docker-daemon://kera:latest
sudo singularity build kera.sif keraUbuntu1804/
Step 4: Create a script to install KerA on Iris
- The following script, runKerA.sh, runs the kera.sif Singularity container to start 1 coordinator and 2 brokers, each in its own container instance
#!/bin/bash -l
#SBATCH -J Singularity_KerA_Coord
#SBATCH -N 3 # Nodes
#SBATCH -n 3 # Tasks
#SBATCH --ntasks-per-node=1
#SBATCH --mem=9GB
#SBATCH -c 3 # Cores assigned to each task
#SBATCH --time=0-00:15:00
#SBATCH -p batch
#SBATCH --qos=normal
#SBATCH --mail-user=firstname.lastname@uni.lu
#SBATCH --mail-type=BEGIN,END
module load tools/Singularity
hostName="$(hostname -s)-ib0"
IP=$(getent hosts $hostName | awk '{print $1}')
echo "On your laptop: ssh -p 8022 -NL 8889:$hostName:8889 ${USER}@access-iris.uni.lu"
echo "SLURM_JOBID = ${SLURM_JOBID}"
echo "SLURM_JOB_NODELIST = ${SLURM_JOB_NODELIST}"
echo "SLURM_NNODES = ${SLURM_NNODES}"
echo "SLURM_NTASKS = ${SLURM_NTASKS}"
echo "Submission directory = ${SLURM_SUBMIT_DIR}"
export KERA_WORKER_CORES=${SLURM_CPUS_PER_TASK:-1}
echo "Cores: $KERA_WORKER_CORES"
export DAEMON_MEM=${SLURM_MEM_PER_CPU:=2048}
export KERA_MEM=$(( ${DAEMON_MEM}*${KERA_WORKER_CORES} ))
export KERA_MASTER_HOST=$(hostname -s)
export NWLOCATOR="infrc:host"
#srun --exclusive -N 1 -n 1 -l -o $HOME/coordout \
singularity run --bind /dev/infiniband,/etc/libibverbs.d kera.sif \
coordinator -C $NWLOCATOR=${hostName},port=11100,dev=mlx5_0 --maxCores ${SLURM_CPUS_PER_TASK:-2} -n --reset --clusterName test &
pid=$!
sleep 10s
echo "Starting brokers"
KERA_BROKER_LAUNCHER=${HOME}/kera-start-brokers-${SLURM_JOBID}.sh
echo " - create broker launcher script '${KERA_BROKER_LAUNCHER}'"
cat << 'EOF' > ${KERA_BROKER_LAUNCHER}
#!/bin/bash
echo "I am ${SLURM_PROCID} running on:"
hostname
KERA_WORKER_CORES=${SLURM_CPUS_PER_TASK:-1}
echo "Cores: $KERA_WORKER_CORES"
DAEMON_MEM=${SLURM_MEM_PER_CPU:=2048}
KERA_MEM=$(( ${DAEMON_MEM}*${KERA_WORKER_CORES} ))
echo "memory: $KERA_MEM"
KERA_WORKER_HOST="`hostname`-ib0"
echo ${KERA_WORKER_HOST}
# --bind /opt/mellanox,/sys/class/infiniband \
# --bind /sys/class/infiniband_cm,/sys/class/infiniband_mad \
# --bind /sys/class/infiniband_verbs
singularity run --bind /dev/infiniband,/etc/libibverbs.d kera.sif \
broker -L $NWLOCATOR=${KERA_WORKER_HOST},port=11101,dev=mlx5_0 --totalMasterMemory ${KERA_MEM} \
-C $NWLOCATOR=$1,port=11100,dev=mlx5_0 --cleanerBalancer fixed:50 -D -d --detectFailures 0 -h 1 \
-f /tmp/storagemaster1 --maxCores ${SLURM_CPUS_PER_TASK:-2} --clusterName test -r 0 --masterOnly \
--numberActiveGroupsPerStreamlet 1 --masterActiveGroupsPerStreamlet 1
EOF
chmod +x ${KERA_BROKER_LAUNCHER}
# Start the KerA brokers; pass the coordinator hostname as a parameter (for srun: --bind /etc/init.d/)
srun --exclusive -N 3 -n 3 --ntasks-per-node=1 -l -o $HOME/broker-$(hostname -s).out \
${KERA_BROKER_LAUNCHER} ${hostName} &
sleep 900s
wait $pid
echo $HOME
echo "Ready Stopping instance"
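The broker memory in runKerA.sh is the product of the per-CPU memory (default 2048 when SLURM_MEM_PER_CPU is unset) and the number of cores per task. The sketch below isolates that arithmetic; kera_mem_mb and mb_to_bytes are hypothetical helper names, and reading the value as MiB is an inference from the Step 5 logs rather than a documented KerA fact. In that run the brokers received --totalMasterMemory 6144 with --maxCores 3 and reported 6442450944 bytes allocated, which matches 2048 x 3 MiB.

```bash
#!/bin/bash
# Hypothetical helpers (assumptions, not part of runKerA.sh).

# Memory handed to a broker: per-CPU memory times cores per task.
# The defaults mirror the script: 2048 per CPU and 1 core.
# Usage: kera_mem_mb [mem_per_cpu] [cores]
kera_mem_mb() {
    local mem_per_cpu="${1:-2048}" cores="${2:-1}"
    echo $(( mem_per_cpu * cores ))
}

# Convert a MiB value to bytes, to compare with the
# "Master to allocate ... bytes total" line in the broker log.
# Usage: mb_to_bytes <mib>
mb_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}
```

With -c 3 and the default of 2048, kera_mem_mb 2048 3 gives the 6144 seen on the broker command line.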
- Now you can deploy KerA on Iris. Singularity run executes docker-entrypoint.sh; in broker mode it starts a broker on every Slurm task except task 0 (SLURM_PROCID 0), which is left to the coordinator.
Login to Iris
sbatch runKerA.sh
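Both singularity run commands in runKerA.sh hard-code dev=mlx5_0, so it can help to confirm that this device exists on a compute node before submitting. On Linux, RDMA devices appear as entries under /sys/class/infiniband. The following is a minimal sketch with hypothetical helper names; the optional directory argument exists only so the functions can be tried against a test directory.

```bash
#!/bin/bash
# Hypothetical pre-flight helpers (assumptions, not part of the tutorial scripts).

# Succeeds if the named RDMA device is listed in sysfs.
# Usage: has_ib_device <device> [sysfs_dir]
has_ib_device() {
    local dev="$1" sysfs="${2:-/sys/class/infiniband}"
    [ -e "${sysfs}/${dev}" ]
}

# Prints one device name per line; fails if the sysfs directory is absent,
# which usually means no RDMA driver is loaded on this node.
# Usage: list_ib_devices [sysfs_dir]
list_ib_devices() {
    local sysfs="${1:-/sys/class/infiniband}"
    [ -d "$sysfs" ] || return 1
    ls -1 "$sysfs"
}
```

For example, has_ib_device mlx5_0 || echo "mlx5_0 not found" could be run in an interactive job on Iris; if the device has a different name, the dev= value in the locators would need to change accordingly.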
Step 5: How the output looks
- The Coordinator (coordinator.out)
1629490163.314847961 CoordinatorMain.cc:110 in main NOTICE[1]: Command line: /opt/kera/install/bin/coordinator -C infrc:host=iris-079-ib0,port=11100,dev=mlx5_0 --maxCores 3 -n --reset --clusterName test
1629490163.314866331 CoordinatorMain.cc:111 in main NOTICE[1]: Coordinator process id: 23390
1629490163.422875345 Infiniband.h:106 in DeviceList WARNING[1]: identified infiniband device: mlx5_0
1629490163.422888057 Infiniband.h:117 in lookup WARNING[1]: looking to open infiniband device: mlx5_0 searching mlx5_0
1629490163.442540847 InfRcTransport.cc:263 in InfRcTransport NOTICE[1]: InfRc listening on UDP: 172.19.6.79:11100
1629490163.444650651 InfRcTransport.cc:272 in InfRcTransport NOTICE[1]: Local Infiniband lid is 108
1629490163.621857932 CoordinatorMain.cc:122 in main NOTICE[1]: coordinator: Listening on infrc:host=iris-079-ib0,port=11100,dev=mlx5_0
1629490163.621874294 CoordinatorMain.cc:125 in main NOTICE[1]: PortTimeOut=-1
1629490163.621876873 PortAlarm.cc:174 in setPortTimeout NOTICE[1]: Set PortTimeout to -1 (ms: -1 to disable.)
1629490163.621883782 CoordinatorMain.cc:146 in main WARNING[1]: Reset requested: deleting external storage for workspace '/ramcloud/test'
1629490163.621891380 CoordinatorMain.cc:151 in main NOTICE[1]: Cluster name is 'test', external storage workspace is '/ramcloud/test/'
1629490163.623065380 CoordinatorClusterClock.cc:170 in recoverClusterTime WARNING[1]: couldn't find "coordinatorClusterClock" object in external storage; starting new clock from zero; benign if starting new cluster from scratch, may cause linearizability failures otherwise
1629490163.623074634 CoordinatorClusterClock.cc:176 in recoverClusterTime NOTICE[1]: initializing CoordinatorClusterClock: startingClusterTime = 0
1629490163.624238702 CoordinatorUpdateManager.cc:82 in init WARNING[7]: couldn't find "coordinatorUpdateManager" object in external storage; starting new cluster from scratch
1629490163.625497673 CoordinatorServerList.cc:412 in recover NOTICE[7]: CoordinatorServerList recovery completed: 0 master(s), 0 backup(s), 0 update(s) to disseminate, server list version is 0
1629490163.625508879 TableManager.cc:808 in recover NOTICE[7]: Table recovery complete: 0 table(s)
1629490163.625516433 CoordinatorService.cc:125 in init NOTICE[7]: Coordinator state has been recovered from external storage; starting service
1629490163.627406243 MemoryMonitor.cc:76 in handleTimerEvent NOTICE[8]: Memory usage now 676 MB (increased 676 MB)
1629490180.817172953 CoordinatorServerList.cc:160 in enlistServer NOTICE[5]: Enlisting server at infrc:host=iris-080-ib0,port=11101,dev=mlx5_0 (server id 1.0) supporting services: MASTER_SERVICE, ADMIN_SERVICE
1629490181.047367269 CoordinatorServerList.cc:160 in enlistServer NOTICE[5]: Enlisting server at infrc:host=iris-082-ib0,port=11101,dev=mlx5_0 (server id 2.0) supporting services: MASTER_SERVICE, ADMIN_SERVICE
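The last two coordinator lines above show one "Enlisting server" entry per broker, which gives a simple readiness check for the cluster. The following is a minimal sketch with hypothetical helper names; it assumes the message format stays as in the log above.

```bash
#!/bin/bash
# Hypothetical helpers (assumptions, not part of KerA).

# Print how many brokers the coordinator has enlisted so far.
# Usage: count_enlisted <coordinator_log>
count_enlisted() {
    local log="$1"
    [ -r "$log" ] || { echo 0; return 1; }
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so the exit status is neutralised with '|| true'.
    grep -c 'in enlistServer .*Enlisting server at' "$log" || true
}

# Poll the log once per second until the expected number of brokers
# has enlisted, or fail after the timeout (in seconds).
# Usage: wait_for_brokers <coordinator_log> <expected> [timeout]
wait_for_brokers() {
    local log="$1" expected="$2" timeout="${3:-60}" waited=0
    while [ "$(count_enlisted "$log")" -lt "$expected" ]; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$(( waited + 1 ))
    done
    return 0
}
```

For the job in this tutorial, wait_for_brokers $HOME/coordinator.out 2 would return once both brokers appear in the coordinator log.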
- The Brokers (server-1.out, server-2.out)
1629490175.584982556 ServerMain.cc:289 in main NOTICE[1]: Command line: /opt/kera/install/bin/server -L infrc:host=iris-080-ib0,port=11101,dev=mlx5_0 --totalMasterMemory 6144 -C infrc:host=iris-079-ib0,port=11100,dev=mlx5_0 --cleanerBalancer fixed:50 -D -d --detectFailures 0 -h 1 -f /tmp/storagemaster1 --maxCores 3 --clusterName test -r 0 --masterOnly --numberActiveGroupsPerStreamlet 1 --masterActiveGroupsPerStreamlet 1
1629490175.585006261 ServerMain.cc:290 in main NOTICE[1]: Server process id: 22528
1629490175.633273359 Infiniband.h:106 in DeviceList WARNING[1]: identified infiniband device: mlx5_0
1629490175.633290399 Infiniband.h:117 in lookup WARNING[1]: looking to open infiniband device: mlx5_0 searching mlx5_0
1629490175.650759076 InfRcTransport.cc:263 in InfRcTransport NOTICE[1]: InfRc listening on UDP: 172.19.6.80:11101
1629490175.652109813 InfRcTransport.cc:272 in InfRcTransport NOTICE[1]: Local Infiniband lid is 110
1629490175.824960501 ServerMain.cc:319 in main NOTICE[1]: MASTER_SERVICE, ADMIN_SERVICE: Listening on infrc:host=iris-080-ib0,port=11101,dev=mlx5_0
1629490175.825434143 ServerMain.cc:360 in main NOTICE[1]: Using 0 backups
1629490175.825442948 ServerConfig.h:619 in setLogAndHashTableSize NOTICE[1]: Master to allocate 6442450944 bytes total, 1048576 of which are for the hash table
1629490175.825444554 ServerConfig.h:621 in setLogAndHashTableSize NOTICE[1]: Master will have 767 segments and 16384 lines in the hash table
1629490175.825445520 ServerConfig.h:625 in setLogAndHashTableSize NOTICE[1]: Hash table will have one entry for every 49144 bytes in the log
1629490175.825460967 ServerMain.cc:365 in main NOTICE[1]: PortTimeOut=-1
1629490175.825462037 PortAlarm.cc:174 in setPortTimeout NOTICE[1]: Set PortTimeout to -1 (ms: -1 to disable.)
1629490175.825756452 MemoryMonitor.cc:76 in handleTimerEvent NOTICE[4]: Memory usage now 647 MB (increased 647 MB)
1629490175.825781222 Server.cc:101 in run NOTICE[1]: Starting services
1629490175.825795738 Server.cc:165 in createAndRegisterServices NOTICE[1]: Starting master service
1629490175.825796613 Server.cc:166 in createAndRegisterServices NOTICE[1]: Master is using 0 backups
1629490175.825849161 LargeBlockOfMemory.h:255 in mmapGigabyteAligned NOTICE[1]: Populating pages; progress 0 of 6143 MB
1629490176.442748331 LargeBlockOfMemory.h:255 in mmapGigabyteAligned NOTICE[1]: Populating pages; progress 1024 of 6143 MB
1629490177.050605836 LargeBlockOfMemory.h:255 in mmapGigabyteAligned NOTICE[1]: Populating pages; progress 2048 of 6143 MB
1629490177.636708533 LargeBlockOfMemory.h:255 in mmapGigabyteAligned NOTICE[1]: Populating pages; progress 3072 of 6143 MB
1629490178.230278663 LargeBlockOfMemory.h:255 in mmapGigabyteAligned NOTICE[1]: Populating pages; progress 4096 of 6143 MB
1629490178.811673036 LargeBlockOfMemory.h:255 in mmapGigabyteAligned NOTICE[1]: Populating pages; progress 5120 of 6143 MB
1629490179.396368849 SegletAllocator.cc:171 in initializeEmergencyHeadReserve NOTICE[1]: Reserved 2 seglets for emergency head segments (16 MB). 765 seglets (6120 MB) left in default pool.
1629490179.657004084 InfRcTransport.h:118 in registerMemory NOTICE[1]: Registered 6441402368 bytes at 0x40000000
1629490179.657069138 LargeBlockOfMemory.h:255 in mmapGigabyteAligned NOTICE[1]: Populating pages; progress 0 of 1 MB
1629490179.658002890 SegletAllocator.cc:206 in initializeCleanerReserve NOTICE[1]: Reserved 1 seglets for the cleaner (8 MB). 764 seglets (6112 MB) left in default pool.
1629490179.658010446 LogCleaner.cc:898 in FixedBalancer NOTICE[1]: Using fixed balancer with 50% disk cleaning
1629490179.659099100 MultiFileStorage.cc:1063 in MultiFileStorage NOTICE[1]: Backup storage opened with 4294967296 bytes available; allocated 512 frame(s) across 1 file(s) with 8388608 bytes per frame
1629490179.740875053 BackupStorage.cc:82 in benchmark NOTICE[1]: Backup storage speeds (min): 1251 MB/s read
1629490179.740897688 BackupStorage.cc:83 in benchmark NOTICE[1]: Backup storage speeds (avg): 1613 MB/s read,
1629490179.740898718 BackupStorage.cc:89 in benchmark NOTICE[1]: RANDOM_REFINE_AVG BackupStrategy selected
1629490179.740931574 MultiFileStorage.cc:1556 in tryLoadSuperblock NOTICE[1]: Stored superblock had a bad checksum: stored checksum was 0, but stored data had checksum 88a5c087
1629490179.740935386 MultiFileStorage.cc:1556 in tryLoadSuperblock NOTICE[1]: Stored superblock had a bad checksum: stored checksum was 0, but stored data had checksum 88a5c087
1629490179.740936381 MultiFileStorage.cc:1342 in loadSuperblock WARNING[1]: Backup couldn't find existing superblock; starting as fresh backup.
1629490179.740939997 PersistenceManager.cc:88 in PersistenceManager NOTICE[1]: Backup storing replicas with clusterName 'test'. Future backups must be restarted with the same clusterName for replicas stored on this backup to be reused.
1629490179.740941776 PersistenceManager.cc:102 in PersistenceManager NOTICE[1]: Replicas stored on disk have a different clusterName ('__unnamed__'). Scribbling storage to ensure any stale replicas left behind by old backups aren't used by future backups
socket() suceeded for pathname /tmp/plasma
Socket pathname is ok.
socket_fd: 11
1629490179.888345368 Server.cc:168 in createAndRegisterServices NOTICE[1]: Master service started
1629490179.888356298 Server.cc:180 in createAndRegisterServices NOTICE[1]: Starting admin service
1629490179.888359239 Server.cc:184 in createAndRegisterServices NOTICE[1]: Admin service started
1629490179.888360194 Server.cc:103 in run NOTICE[1]: Services started
1629490179.888361061 Server.cc:108 in run NOTICE[1]: Pinning memory
1629490180.794964742 Server.cc:110 in run NOTICE[1]: Memory pinned
1629490180.795133039 MemoryMonitor.cc:76 in handleTimerEvent NOTICE[4]: Memory usage now 7844 MB (increased 7197 MB)
1629490180.795160786 Server.cc:211 in enlist NOTICE[4]: Enlisting with coordinator
1629490180.813952484 CoordinatorSession.cc:119 in getSession NOTICE[4]: Opened session with coordinator at infrc:host=iris-079-ib0,port=11100,dev=mlx5_0
1629490180.814215878 Server.cc:218 in enlist NOTICE[4]: Enlisted; serverId 1.0
socket() suceeded for pathname /tmp/plasma
Socket pathname is ok.
socket_fd: 12
1629490180.814895298 Server.cc:229 in enlist NOTICE[4]: Created objectServerId 1 ? 1
1629490180.814918048 MasterService.cc:935 in initOnceEnlisted NOTICE[4]: My server ID is 1.0
1629490180.814923772 PersistenceManager.cc:150 in initOnceEnlisted NOTICE[4]: pm My server ID is 1.0
1629490180.821292852 ServerList.cc:200 in applyServerList NOTICE[7]: Server 1.0 is up (server list version 1)
1629490180.836741479 PersistenceManager.cc:156 in initOnceEnlisted NOTICE[4]: PersistenceManager 1.0 will store replicas under cluster name 'test'
1629490181.044582504 ServerList.cc:200 in applyServerList NOTICE[7]: Server 2.0 is up (server list version 2)
1629490175.568762266 ServerMain.cc:289 in main NOTICE[1]: Command line: /opt/kera/install/bin/server -L infrc:host=iris-082-ib0,port=11101,dev=mlx5_0 --totalMasterMemory 6144 -C infrc:host=iris-079-ib0,port=11100,dev=mlx5_0 --cleanerBalancer fixed:50 -D -d --detectFailures 0 -h 1 -f /tmp/storagemaster1 --maxCores 3 --clusterName test -r 0 --masterOnly --numberActiveGroupsPerStreamlet 1 --masterActiveGroupsPerStreamlet 1
1629490175.568805199 ServerMain.cc:290 in main NOTICE[1]: Server process id: 4420
1629490175.635965138 Infiniband.h:106 in DeviceList WARNING[1]: identified infiniband device: mlx5_0
1629490175.635992010 Infiniband.h:117 in lookup WARNING[1]: looking to open infiniband device: mlx5_0 searching mlx5_0
1629490175.656650536 InfRcTransport.cc:263 in InfRcTransport NOTICE[1]: InfRc listening on UDP: 172.19.6.82:11101
1629490175.657721727 InfRcTransport.cc:272 in InfRcTransport NOTICE[1]: Local Infiniband lid is 112
1629490175.926597137 ServerMain.cc:319 in main NOTICE[1]: MASTER_SERVICE, ADMIN_SERVICE: Listening on infrc:host=iris-082-ib0,port=11101,dev=mlx5_0
1629490175.927065771 ServerMain.cc:360 in main NOTICE[1]: Using 0 backups
1629490175.927074036 ServerConfig.h:619 in setLogAndHashTableSize NOTICE[1]: Master to allocate 6442450944 bytes total, 1048576 of which are for the hash table
1629490175.927075508 ServerConfig.h:621 in setLogAndHashTableSize NOTICE[1]: Master will have 767 segments and 16384 lines in the hash table
1629490175.927076437 ServerConfig.h:625 in setLogAndHashTableSize NOTICE[1]: Hash table will have one entry for every 49144 bytes in the log
1629490175.927092323 ServerMain.cc:365 in main NOTICE[1]: PortTimeOut=-1
1629490175.927093398 PortAlarm.cc:174 in setPortTimeout NOTICE[1]: Set PortTimeout to -1 (ms: -1 to disable.)
1629490175.930640196 MemoryMonitor.cc:76 in handleTimerEvent NOTICE[5]: Memory usage now 647 MB (increased 647 MB)
1629490175.930684453 Server.cc:101 in run NOTICE[1]: Starting services
1629490175.930690202 Server.cc:165 in createAndRegisterServices NOTICE[1]: Starting master service
1629490175.930691081 Server.cc:166 in createAndRegisterServices NOTICE[1]: Master is using 0 backups
1629490175.930716322 LargeBlockOfMemory.h:255 in mmapGigabyteAligned NOTICE[1]: Populating pages; progress 0 of 6143 MB
1629490176.512783068 LargeBlockOfMemory.h:255 in mmapGigabyteAligned NOTICE[1]: Populating pages; progress 1024 of 6143 MB
1629490177.123410103 LargeBlockOfMemory.h:255 in mmapGigabyteAligned NOTICE[1]: Populating pages; progress 2048 of 6143 MB
1629490177.728711052 LargeBlockOfMemory.h:255 in mmapGigabyteAligned NOTICE[1]: Populating pages; progress 3072 of 6143 MB
1629490178.323984649 LargeBlockOfMemory.h:255 in mmapGigabyteAligned NOTICE[1]: Populating pages; progress 4096 of 6143 MB
1629490178.923948175 LargeBlockOfMemory.h:255 in mmapGigabyteAligned NOTICE[1]: Populating pages; progress 5120 of 6143 MB
1629490179.526665376 SegletAllocator.cc:171 in initializeEmergencyHeadReserve NOTICE[1]: Reserved 2 seglets for emergency head segments (16 MB). 765 seglets (6120 MB) left in default pool.
1629490179.771735859 InfRcTransport.h:118 in registerMemory NOTICE[1]: Registered 6441402368 bytes at 0x40000000
1629490179.771841667 LargeBlockOfMemory.h:255 in mmapGigabyteAligned NOTICE[1]: Populating pages; progress 0 of 1 MB
1629490179.773289133 SegletAllocator.cc:206 in initializeCleanerReserve NOTICE[1]: Reserved 1 seglets for the cleaner (8 MB). 764 seglets (6112 MB) left in default pool.
1629490179.773299815 LogCleaner.cc:898 in FixedBalancer NOTICE[1]: Using fixed balancer with 50% disk cleaning
1629490179.774910848 MultiFileStorage.cc:1063 in MultiFileStorage NOTICE[1]: Backup storage opened with 4294967296 bytes available; allocated 512 frame(s) across 1 file(s) with 8388608 bytes per frame
1629490179.873532112 BackupStorage.cc:82 in benchmark NOTICE[1]: Backup storage speeds (min): 1006 MB/s read
1629490179.873542937 BackupStorage.cc:83 in benchmark NOTICE[1]: Backup storage speeds (avg): 1335 MB/s read,
1629490179.873543851 BackupStorage.cc:89 in benchmark NOTICE[1]: RANDOM_REFINE_AVG BackupStrategy selected
1629490179.873580991 MultiFileStorage.cc:1556 in tryLoadSuperblock NOTICE[1]: Stored superblock had a bad checksum: stored checksum was 0, but stored data had checksum 88a5c087
1629490179.873585051 MultiFileStorage.cc:1556 in tryLoadSuperblock NOTICE[1]: Stored superblock had a bad checksum: stored checksum was 0, but stored data had checksum 88a5c087
1629490179.873586077 MultiFileStorage.cc:1342 in loadSuperblock WARNING[1]: Backup couldn't find existing superblock; starting as fresh backup.
1629490179.873588877 PersistenceManager.cc:88 in PersistenceManager NOTICE[1]: Backup storing replicas with clusterName 'test'. Future backups must be restarted with the same clusterName for replicas stored on this backup to be reused.
1629490179.873590437 PersistenceManager.cc:102 in PersistenceManager NOTICE[1]: Replicas stored on disk have a different clusterName ('__unnamed__'). Scribbling storage to ensure any stale replicas left behind by old backups aren't used by future backups
socket() suceeded for pathname /tmp/plasma
Socket pathname is ok.
socket_fd: 11
1629490180.006412927 Server.cc:168 in createAndRegisterServices NOTICE[1]: Master service started
1629490180.006427969 Server.cc:180 in createAndRegisterServices NOTICE[1]: Starting admin service
1629490180.006431221 Server.cc:184 in createAndRegisterServices NOTICE[1]: Admin service started
1629490180.006432209 Server.cc:103 in run NOTICE[1]: Services started
1629490180.006433238 Server.cc:108 in run NOTICE[1]: Pinning memory
1629490181.028210772 Server.cc:110 in run NOTICE[1]: Memory pinned
1629490181.028383832 MemoryMonitor.cc:76 in handleTimerEvent NOTICE[5]: Memory usage now 7844 MB (increased 7197 MB)
1629490181.028405111 Server.cc:211 in enlist NOTICE[5]: Enlisting with coordinator
1629490181.044703448 CoordinatorSession.cc:119 in getSession NOTICE[5]: Opened session with coordinator at infrc:host=iris-079-ib0,port=11100,dev=mlx5_0
1629490181.044828091 Server.cc:218 in enlist NOTICE[5]: Enlisted; serverId 2.0
socket() suceeded for pathname /tmp/plasma
Socket pathname is ok.
socket_fd: 12
1629490181.045516643 Server.cc:229 in enlist NOTICE[5]: Created objectServerId 2 ? 1
1629490181.045539241 MasterService.cc:935 in initOnceEnlisted NOTICE[5]: My server ID is 2.0
1629490181.045545091 PersistenceManager.cc:150 in initOnceEnlisted NOTICE[5]: pm My server ID is 2.0
1629490181.046144290 PersistenceManager.cc:156 in initOnceEnlisted NOTICE[5]: PersistenceManager 2.0 will store replicas under cluster name 'test'
1629490181.051408793 ServerList.cc:200 in applyServerList NOTICE[6]: Server 1.0 is up (server list version 2)
1629490181.051427966 ServerList.cc:200 in applyServerList NOTICE[6]: Server 2.0 is up (server list version 2)
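On the broker side, the line "Enlisted; serverId N" marks the point where a broker has joined the cluster. The following is a minimal sketch that extracts that ID from a broker log; broker_server_id is a hypothetical helper, and it assumes the message format shown in the logs above.

```bash
#!/bin/bash
# Hypothetical helper (an assumption, not part of KerA).
# Print the server ID a broker obtained from the coordinator, or fail
# if the log does not (yet) contain an "Enlisted; serverId" line.
# Usage: broker_server_id <server_log>
broker_server_id() {
    local id
    # -n with the 'p' flag prints only lines where the substitution matched.
    id=$(sed -n 's/.*Enlisted; serverId \([0-9][0-9.]*\).*/\1/p' "$1" | head -n 1)
    [ -n "$id" ] || return 1
    echo "$id"
}
```

Run against the two logs above, broker_server_id $HOME/server-1.out and broker_server_id $HOME/server-2.out would print 1.0 and 2.0 respectively.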